
Whois Scraper Challenge

Discussion in 'General Programming Chat' started by orizon, Apr 21, 2014.

  1. orizon

    orizon Newbie

    Joined:
    Mar 29, 2014
    Messages:
    23
    Likes Received:
    0
    Location:
    Athens, Greece
    I am trying to build a Java whois scraper for
    grweb . ics . forth . gr/public/Whois?lang=el

    My first approach was to use HtmlUnit; however, it seems I should be able to get by with a simple HttpsURLConnection.
    Here is how my program works:

    It sends a GET request to the URL above.
    REQUEST
    Code:
    GET /public/Whois?lang=el HTTP/1.1
    Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
    Accept-Encoding: gzip, deflate
    Accept-Language: en-US,en;q=0.5
    User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:28.0) Gecko/20100101 Firefox/28.0
    Host: grweb.ics.forth.gr
    Connection: keep-alive
    
    RESPONSE
    Code:
    HTTP/1.1 200 OK
    Server: Apache-Coyote/1.1
    Set-Cookie: JSESSIONID=B9097AFADF5D3761E11559F867481CB4; Path=/public/; Secure; HttpOnly
    Content-Type: text/html;charset=UTF-8
    Date: Mon, 21 Apr 2014 07:39:58 GMT
    Content-Length: 16754
    
    It parses the HTML response, extracts the captcha and solves it with an external API.
    It also stores the cookie from the response headers.
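
    The cookie step above can be sketched roughly as follows. This is only an illustration, not the code of the actual bot: the class and method names (SessionCookie, extract) are made up, and it assumes the Set-Cookie values are read from the connection's response headers as a list of strings.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not taken from the original bot.
class SessionCookie {

    // Returns the "name=value" pair for the given cookie name, or null if it is absent.
    // In a real run the list would typically come from
    // conn.getHeaderFields().get("Set-Cookie") on the GET response.
    public static String extract(List<String> setCookieHeaders, String name) {
        if (setCookieHeaders == null) {
            return null;
        }
        for (String header : setCookieHeaders) {
            // Attributes such as Path, Secure and HttpOnly are not sent back to the server,
            // so only the part before the first ';' is kept.
            String pair = header.split(";", 2)[0].trim();
            if (pair.startsWith(name + "=")) {
                return pair;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String header =
            "JSESSIONID=B9097AFADF5D3761E11559F867481CB4; Path=/public/; Secure; HttpOnly";
        // The result is what would go into the Cookie request header of the POST.
        System.out.println(extract(Collections.singletonList(header), "JSESSIONID"));
    }
}
```

    The returned string is the value to send in the Cookie header of the follow-up request, for example via setRequestProperty("Cookie", ...).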

    It sends a POST request to the same URL with the cookie saved from the first request.
    The request also includes the POST data:
    domainName=forthnet.gr&lang=el&submit=%CE%88%CE%BB%CE%B5%CE%B3%CF%87%CE%BF%CF%82&j_captcha_response=80243
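
    For reference, a body like the one above can be produced with java.net.URLEncoder. The sketch below is an assumption about how the encoding might be done, not the bot's real code; WhoisForm and its methods are hypothetical names, and the Greek submit label is written with Unicode escapes so the source file encoding does not matter.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not taken from the original bot.
class WhoisForm {

    // Encodes the fields as application/x-www-form-urlencoded, keeping insertion order.
    public static String encode(Map<String, String> fields) {
        StringBuilder body = new StringBuilder();
        try {
            for (Map.Entry<String, String> field : fields.entrySet()) {
                if (body.length() > 0) {
                    body.append('&');
                }
                body.append(URLEncoder.encode(field.getKey(), "UTF-8"))
                    .append('=')
                    .append(URLEncoder.encode(field.getValue(), "UTF-8"));
            }
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
        return body.toString();
    }

    // Builds the four fields shown in the post; the captcha value is supplied by the caller.
    public static Map<String, String> fields(String domain, String captcha) {
        Map<String, String> fields = new LinkedHashMap<String, String>();
        fields.put("domainName", domain);
        fields.put("lang", "el");
        // Label of the submit button, spelled with Unicode escapes.
        fields.put("submit", "\u0388\u03BB\u03B5\u03B3\u03C7\u03BF\u03C2");
        fields.put("j_captcha_response", captcha);
        return fields;
    }

    public static void main(String[] args) {
        String body = encode(fields("forthnet.gr", "80243"));
        System.out.println(body);
        // Byte count of the body, which is what Content-Length has to match.
        System.out.println(body.getBytes(StandardCharsets.UTF_8).length);
    }
}
```

    If the Content-Length header is ever set by hand, it is safest to derive it from the encoded bytes as in main. With the values shown in this post the body comes to 105 bytes, while the request dump lists Content-Length: 103, so that may be worth double-checking.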

    REQUEST
    Code:
    POST /public/Whois?lang=el HTTP/1.1
    Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
    Accept-Encoding: gzip, deflate
    Accept-Language: en-US,en;q=0.5
    Cookie: JSESSIONID=B9097AFADF5D3761E11559F867481CB4
    Referer: (actual url - cannot post)
    User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:28.0) Gecko/20100101 Firefox/28.0
    Host: grweb.ics.forth.gr
    Connection: keep-alive
    Content-type: application/x-www-form-urlencoded
    Content-Length: 103
    
    RESPONSE
    Code:
    HTTP/1.1 500 Internal Server Error
    Server: Apache-Coyote/1.1
    Content-Type: text/html;charset=utf-8
    Date: Mon, 21 Apr 2014 07:40:04 GMT
    Connection: close
    Content-Length: 538
    
    Unfortunately, the response I get is 500 Internal Server Error.

    I have checked the headers one by one, and they are exactly the same as the ones I see in a Firefox session that completes without problems.
    Note that you can get a 500 error in Firefox if you delete the JSESSIONID cookie before submitting the form. However, I do submit the cookie when I use the Java bot.

    I have run out of ideas about what to check next. The page does not seem to be running any fancy JavaScript :/

    Heeeelp!
     
    Last edited: Apr 21, 2014
  2. Quebeck

    Quebeck Newbie

    Joined:
    Jun 29, 2013
    Messages:
    11
    Likes Received:
    2
    Compared with what Firebug shows, your data seems okay. I'm not sure whether you copied it
    directly out of your program, but if you did, that is where the bug is:
    the captcha response contains a whitespace character.
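
    If that is the cause, one way to guard against it is to strip whitespace from the solver's answer before it is encoded into the POST body. This is only a sketch: CaptchaAnswer and normalize are hypothetical names, and it assumes a valid answer for this form never contains meaningful whitespace.

```java
// Hypothetical helper, not taken from the original bot.
class CaptchaAnswer {

    // Removes all whitespace (spaces, tabs, line breaks) from the raw solver output.
    // Assumption: whitespace is never part of a correct answer for this captcha.
    public static String normalize(String raw) {
        if (raw == null) {
            return "";
        }
        return raw.replaceAll("\\s+", "");
    }

    public static void main(String[] args) {
        // A made-up solver reply with a stray space and a trailing newline.
        System.out.println("[" + normalize(" 80 243\n") + "]");
    }
}
```

    Printing the value between brackets, as in main, also makes stray leading or trailing whitespace easy to spot in the logs.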
     
  3. jazzc

    jazzc Moderator Staff Member Moderator Jr. VIP

    Joined:
    Jan 27, 2009
    Messages:
    2,468
    Likes Received:
    10,148
    Let's rule out the most common cause first, which is that you missed something. Post the full POST headers and body from both environments (Firefox and the bot) so we can take a look.
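
    To make that comparison less error-prone than reading the two dumps by eye, something along these lines could help. It is a rough sketch with made-up names (HeaderDiff, diff) and made-up sample values; it assumes both sets of headers have already been collected into maps, and it compares header names case-insensitively.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper for comparing two captured requests.
class HeaderDiff {

    // Returns one line per header whose value differs or that is missing on one side.
    public static List<String> diff(Map<String, String> browser, Map<String, String> bot) {
        // Header names are case-insensitive, so both maps are re-keyed accordingly.
        Map<String, String> a = new TreeMap<String, String>(String.CASE_INSENSITIVE_ORDER);
        a.putAll(browser);
        Map<String, String> b = new TreeMap<String, String>(String.CASE_INSENSITIVE_ORDER);
        b.putAll(bot);

        TreeSet<String> names = new TreeSet<String>(String.CASE_INSENSITIVE_ORDER);
        names.addAll(a.keySet());
        names.addAll(b.keySet());

        List<String> differences = new ArrayList<String>();
        for (String name : names) {
            String left = a.get(name);
            String right = b.get(name);
            if (!Objects.equals(left, right)) {
                differences.add(name + ": browser=" + left + " bot=" + right);
            }
        }
        return differences;
    }

    public static void main(String[] args) {
        // Made-up sample values, only to show the output format.
        Map<String, String> browser = new HashMap<String, String>();
        browser.put("Content-Type", "application/x-www-form-urlencoded");
        browser.put("Content-Length", "105");
        Map<String, String> bot = new HashMap<String, String>();
        bot.put("content-type", "application/x-www-form-urlencoded");
        bot.put("Content-Length", "103");
        for (String line : diff(browser, bot)) {
            System.out.println(line);
        }
    }
}
```

    The same idea applies to the body: comparing the two byte strings directly will show differences (such as an extra space) that are easy to miss visually.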