
Options for programmatically logging into a website.

Discussion in 'General Programming Chat' started by fourth, Jan 15, 2014.

  1. fourth

    fourth Newbie

    Joined:
    Jan 15, 2014
    Messages:
    1
    Likes Received:
    0
    I am trying to write a bot in Java to perform regular daily data-scraping operations. I am having trouble with sites that require me to log in before I can access the important pages. I have tried HtmlUnit, but it has too much trouble handling JavaScript. In many cases, after logging in I also need to navigate by clicking links and buttons. I am looking for other ways to get access to the pages I need.
     
  2. mypmmail

    mypmmail Junior Member

    Joined:
    Jan 31, 2008
    Messages:
    111
    Likes Received:
    27
    I think there are two different issues in your question.

    If you want to access an authentication-protected page, you basically need to send the session ID (or cookie) with each subsequent request after a successful login.
    You can do this by hand or with a library such as Apache HttpClient:
    hxxp://hc.apache.org/httpclient-3.x/authentication.html
    hxxp://hc.apache.org/httpcomponents-client-ga/tutorial/html/fluent.html (under executor)
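
    As a rough sketch of the "log in, then reuse the cookie" idea, here is one way it could look. Note that this uses the JDK's built-in java.net.http.HttpClient (Java 11+) rather than Apache HttpClient, just to avoid an extra dependency. The URLs and the form field names ("username", "password") are made up; the real ones have to come from the login form of the site you are scraping.

```java
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

class LoginSketch {

    // Encode form fields as application/x-www-form-urlencoded.
    static String formEncode(Map<String, String> fields) {
        return fields.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8) + "="
                        + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    // A client with a cookie jar: cookies set by the login response
    // are sent back automatically on later requests.
    static HttpClient newSessionClient() {
        return HttpClient.newBuilder()
                .cookieHandler(new CookieManager(null, CookiePolicy.ACCEPT_ALL))
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
    }

    // POST the login form, then GET a protected page with the same client.
    static String loginAndFetch(HttpClient client, String loginUrl,
                                Map<String, String> fields, String protectedUrl)
            throws Exception {
        HttpRequest login = HttpRequest.newBuilder(URI.create(loginUrl))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(formEncode(fields)))
                .build();
        client.send(login, HttpResponse.BodyHandlers.ofString());

        HttpRequest page = HttpRequest.newBuilder(URI.create(protectedUrl)).GET().build();
        return client.send(page, HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 4) {
            System.out.println("usage: LoginSketch <loginUrl> <user> <password> <protectedUrl>");
            return;
        }
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("username", args[1]); // hypothetical field name
        fields.put("password", args[2]); // hypothetical field name
        System.out.println(loginAndFetch(newSessionClient(), args[0], fields, args[3]));
    }
}
```

    Many login forms also carry hidden fields (a CSRF token, for example), so in practice you may need to GET the login page first and copy those values into the POST.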

    The other question is: why do you need to deal with the JavaScript at all?
    Once you have retrieved the page, whether to interpret the JavaScript is up to the client (that is, you), unless the page makes an AJAX call.
    In any case, you can record all the calls back to the website and mimic them using HttpClient.

    So, if the question is how to record all the calls, one solution is to install a TCP monitor or proxy and record every request between the browser and the server. Then you know exactly what goes over the wire and can mimic the calls as you deem necessary.
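
    Once a call has been captured, mimicking it mostly means copying its method, URL, headers, and body into a request of your own. A minimal sketch of that, again assuming the JDK's java.net.http client (Java 11+) instead of Apache HttpClient; the class and method names are my own invention, not part of any library:

```java
import java.net.InetSocketAddress;
import java.net.ProxySelector;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.util.Map;

class ReplaySketch {

    // Rebuild a request from values copied out of a proxy capture.
    static HttpRequest fromRecording(String method, String url,
                                     Map<String, String> headers, String body) {
        HttpRequest.BodyPublisher publisher = (body == null || body.isEmpty())
                ? HttpRequest.BodyPublishers.noBody()
                : HttpRequest.BodyPublishers.ofString(body);
        HttpRequest.Builder builder = HttpRequest.newBuilder(URI.create(url))
                .method(method, publisher);
        for (Map.Entry<String, String> h : headers.entrySet()) {
            try {
                builder.header(h.getKey(), h.getValue());
            } catch (IllegalArgumentException restricted) {
                // The JDK client manages headers such as Host and
                // Content-Length itself and rejects them, so skip those.
            }
        }
        return builder.build();
    }

    // Optional: route the Java client through the same local proxy,
    // so its traffic can be compared with the browser's capture.
    static HttpClient viaProxy(String host, int port) {
        return HttpClient.newBuilder()
                .proxy(ProxySelector.of(new InetSocketAddress(host, port)))
                .build();
    }
}
```

    Sending the rebuilt request with a cookie-aware client then reproduces the AJAX call without running any JavaScript.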


    The other, simpler way is to use HTTrack (which is free) to download all the pages you want to your local machine; your Java program then just crawls the locally saved pages.
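
    For the local-crawl half of that idea, something like the following might be enough to get started. It assumes the mirror is a plain directory tree of UTF-8 .html/.htm files (that layout is my assumption, so check what your mirror actually contains), and it pulls out only the page title as a stand-in for whatever data you are after.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class LocalMirrorSketch {

    private static final Pattern TITLE = Pattern.compile(
            "<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Collect every .html/.htm file under the mirror directory, sorted by path.
    static List<Path> htmlFiles(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                    .filter(p -> {
                        String name = p.getFileName().toString().toLowerCase();
                        return name.endsWith(".html") || name.endsWith(".htm");
                    })
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    // Return the trimmed <title> text, or an empty string if there is none.
    static String title(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : "";
    }

    public static void main(String[] args) throws IOException {
        // Mirror directory given as the first argument; defaults to the current one.
        Path root = Path.of(args.length > 0 ? args[0] : ".");
        for (Path page : htmlFiles(root)) {
            System.out.println(page + " -> " + title(Files.readString(page)));
        }
    }
}
```

    A regex is fine for a quick title grab, but for anything more structured a real HTML parser is the safer choice.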

    hth.
     
  3. Raffy

    Raffy Regular Member

    Joined:
    Nov 30, 2012
    Messages:
    212
    Likes Received:
    613
    A headless browser is what you need (the same kind of bot Google uses). It interacts with jQuery, AJAX, and Flash and mimics a real person using a browser, but without loading a GUI.

    If you don't have to use Java, I'd recommend PhantomJS or CasperJS (JavaScript).

    Watir-Webdriver (Ruby) is another option and is the easiest to use without coding knowledge. With Watir you simply perform the actions you want to automate in Firefox with the TestWise Recorder plugin, and it spits out Ruby code that you copy and paste into a .rb file (tutorial).

    If you must use Java, there's probably a Java solution; I'm just not familiar with it, so you'll have to do the research yourself.
     
  4. negligence

    negligence Regular Member

    Joined:
    Jan 3, 2010
    Messages:
    256
    Likes Received:
    331
    I'd recommend Python for this project, or hell, I've done similar things in PHP, since it's focused on web-based data. Where are you outputting it all? It might be easier to have a script scrape everything and dump it into a file, then have your Java software display it.
     
  5. Pornguy

    Pornguy Regular Member

    Joined:
    Nov 29, 2012
    Messages:
    320
    Likes Received:
    106
    My developer builds most of our scrapers in PHP. I have lots of them that require logging in to change something or manage a profile, and they work great.
     
  6. uyuyuy99

    uyuyuy99 Junior Member

    Joined:
    Jul 10, 2012
    Messages:
    129
    Likes Received:
    107
    Location:
    U.S.
    I know what you mean; HtmlUnit can be unbelievably stupid when it comes to JS. I use it for simple scraping and operations on more static websites, but for websites with JS, HTML5, and other dynamic elements, I use Selenium. Just google it and download the Java API. I like to run my Selenium bots on cheap VPSes with headless browsers installed.