Whois Scraper Challenge

Discussion in 'General Programming Chat' started by orizon, Apr 21, 2014.

  1. orizon

    orizon Newbie

    I am trying to build a Java whois scraper for
    grweb . ics . forth . gr/public/Whois?lang=el

    My first approach was to use HtmlUnit; however, it seems I should be able to get by with a simple HttpsURLConnection.
    Here is how my program works:

    Sends a GET request to the URL above
    REQUEST
    Code:
    GET /public/Whois?lang=el HTTP/1.1
    Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
    Accept-Encoding: gzip, deflate
    Accept-Language: en-US,en;q=0.5
    User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:28.0) Gecko/20100101 Firefox/28.0
    Host: grweb.ics.forth.gr
    Connection: keep-alive
    
    RESPONSE
    Code:
    HTTP/1.1 200 OK
    Server: Apache-Coyote/1.1
    Set-Cookie: JSESSIONID=B9097AFADF5D3761E11559F867481CB4; Path=/public/; Secure; HttpOnly
    Content-Type: text/html;charset=UTF-8
    Date: Mon, 21 Apr 2014 07:39:58 GMT
    Content-Length: 16754
    
    Parses the HTML response, extracts the captcha and solves it with an external API
    It also stores the cookie from the response headers
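
A minimal sketch of the cookie step, assuming the `Set-Cookie` values have already been read from the response as a list of strings. The class and method names (`SessionCookie`, `extract`) are made up for illustration; only the header value comes from the response dump above.

```java
import java.util.List;
import java.util.Optional;

final class SessionCookie {
    private SessionCookie() {}

    // Returns "JSESSIONID=<value>" from the Set-Cookie header values,
    // dropping attributes such as Path, Secure and HttpOnly.
    static Optional<String> extract(List<String> setCookieValues) {
        if (setCookieValues == null) {
            return Optional.empty();
        }
        for (String value : setCookieValues) {
            // The name=value pair is everything before the first ';'.
            String pair = value.split(";", 2)[0].trim();
            if (pair.startsWith("JSESSIONID=")) {
                return Optional.of(pair);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Header value copied from the response shown in the post.
        List<String> headers = List.of(
                "JSESSIONID=B9097AFADF5D3761E11559F867481CB4; Path=/public/; Secure; HttpOnly");
        System.out.println(extract(headers).orElse("no session cookie"));
    }
}
```

In a real run the list would come from something like `connection.getHeaderFields().get("Set-Cookie")`, and the returned pair is what goes into the `Cookie` header of the POST.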

    Sends a POST request to the same URL with the cookie saved from the first request
    The request also includes the POST data:
    domainName=forthnet.gr&lang=el&submit=%CE%88%CE%BB%CE%B5%CE%B3%CF%87%CE%BF%CF%82&j_captcha_response=80243
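
A sketch of how that body could be built and measured, assuming Java 10 or later for `URLEncoder.encode(String, Charset)`. The class name `WhoisFormBody` and its methods are hypothetical; the field names and values are the ones quoted above, and the `trim()` calls are an added precaution against stray whitespace, not something the form is known to require.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

final class WhoisFormBody {
    private WhoisFormBody() {}

    // Builds the application/x-www-form-urlencoded body in the same field
    // order as the POST data quoted in the post.
    static String build(String domainName, String captchaAnswer) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("domainName", domainName.trim());
        fields.put("lang", "el");
        // Greek label of the submit button, written with Unicode escapes.
        fields.put("submit", "\u0388\u03BB\u03B5\u03B3\u03C7\u03BF\u03C2");
        fields.put("j_captcha_response", captchaAnswer.trim());

        StringJoiner body = new StringJoiner("&");
        for (Map.Entry<String, String> field : fields.entrySet()) {
            body.add(URLEncoder.encode(field.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(field.getValue(), StandardCharsets.UTF_8));
        }
        return body.toString();
    }

    // Content-Length counts the bytes of the encoded body, not characters.
    static int contentLength(String body) {
        return body.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String body = build("forthnet.gr", " 80243 ");
        System.out.println(body);
        System.out.println(contentLength(body));
    }
}
```

For the values above this reproduces the quoted POST data, which is 105 bytes long. The request dump shows `Content-Length: 103`; that may simply come from a different run, but it is cheap to confirm that the length sent matches the final encoded body.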

    REQUEST
    Code:
    POST /public/Whois?lang=el HTTP/1.1
    Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
    Accept-Encoding: gzip, deflate
    Accept-Language: en-US,en;q=0.5
    Cookie: JSESSIONID=B9097AFADF5D3761E11559F867481CB4
    Referer: (actual url - cannot post)
    User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:28.0) Gecko/20100101 Firefox/28.0
    Host: grweb.ics.forth.gr
    Connection: keep-alive
    Content-type: application/x-www-form-urlencoded
    Content-Length: 103
    
    RESPONSE
    Code:
    HTTP/1.1 500 Internal Server Error
    Server: Apache-Coyote/1.1
    Content-Type: text/html;charset=utf-8
    Date: Mon, 21 Apr 2014 07:40:04 GMT
    Connection: close
    Content-Length: 538
    
    Unfortunately, the response I get is 500 Internal Server Error.

    I have checked the headers one by one, and they are exactly the same as the ones I see in a Firefox session that completes without problems.
    Note that you can get a 500 error in Firefox if you delete the JSESSIONID cookie before submitting the form. However, I do submit the cookie when I use the Java bot.

    I have run out of ideas in terms of what to check next. The page does not seem to be running any fancy JavaScript :/

    Heeeelp! :cow04:
     
    Last edited: Apr 21, 2014
  2. Quebeck

    Quebeck Newbie

    Compared with what Firebug shows, your data seems okay. I'm not sure whether you
    copied it directly out of your program, but if you did, there is the bug:
    the captcha response contains a whitespace.
     
  3. jazzc

    jazzc Moderator Staff Member Moderator Jr. VIP

    Let's get the most common thing out of the way first, which is that you missed something. Post the full POST headers/body from both environments (FF, bot) so we can take a look.
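
One way to make that comparison mechanical is sketched below, assuming the headers from each environment have been copied into maps by hand. `HeaderDiff` is a hypothetical helper, and the header values in `main` are placeholders rather than captured traffic. Names are compared case-insensitively, since HTTP header names are not case-sensitive.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.TreeMap;
import java.util.TreeSet;

final class HeaderDiff {
    private HeaderDiff() {}

    // Lists every header whose value differs between the two captures,
    // including headers present on one side only.
    static List<String> diff(Map<String, String> browser, Map<String, String> bot) {
        Map<String, String> left = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        left.putAll(browser);
        Map<String, String> right = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        right.putAll(bot);

        TreeSet<String> names = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
        names.addAll(left.keySet());
        names.addAll(right.keySet());

        List<String> differences = new ArrayList<>();
        for (String name : names) {
            String a = left.get(name);
            String b = right.get(name);
            if (!Objects.equals(a, b)) {
                differences.add(name + ": browser=" + (a == null ? "<missing>" : a)
                        + " | bot=" + (b == null ? "<missing>" : b));
            }
        }
        return differences;
    }

    public static void main(String[] args) {
        // Placeholder values for illustration only.
        Map<String, String> browser = new LinkedHashMap<>();
        browser.put("Content-Type", "application/x-www-form-urlencoded");
        browser.put("Content-Length", "105");
        Map<String, String> bot = new LinkedHashMap<>();
        bot.put("Content-type", "application/x-www-form-urlencoded");
        bot.put("Content-Length", "103");
        diff(browser, bot).forEach(System.out::println);
    }
}
```

The same idea applies to the body: printing both bodies byte for byte usually exposes stray whitespace or an encoding difference faster than reading them by eye.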