Whois Scraper Challenge

orizon

I am trying to build a Java whois scraper for
grweb . ics . forth . gr/public/Whois?lang=el

My first approach was to use HtmlUnit; however, it seems I should be able to get by with a simple HttpsURLConnection.
Here is how my program works:

Sends a GET request to the URL above
REQUEST
Code:
GET /public/Whois?lang=el HTTP/1.1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Encoding: gzip, deflate
Accept-Language: en-US,en;q=0.5
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:28.0) Gecko/20100101 Firefox/28.0
Host: grweb.ics.forth.gr
Connection: keep-alive
RESPONSE
Code:
HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
Set-Cookie: JSESSIONID=B9097AFADF5D3761E11559F867481CB4; Path=/public/; Secure; HttpOnly
Content-Type: text/html;charset=UTF-8
Date: Mon, 21 Apr 2014 07:39:58 GMT
Content-Length: 16754

Parses the HTML response, gets the captcha and solves it with an external API
Also stores the session cookie from the response headers
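
As an illustration of the cookie step, here is a minimal sketch of pulling the JSESSIONID pair out of a Set-Cookie header value. The class and method names are hypothetical and not taken from the bot; with HttpsURLConnection, the list of header values would typically come from getHeaderFields().get("Set-Cookie").

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the original bot.
final class SessionCookie {
    // Returns the "NAME=value" part of a raw Set-Cookie header value,
    // dropping attributes such as Path, Secure and HttpOnly.
    static String nameValuePair(String setCookieHeader) {
        int semicolon = setCookieHeader.indexOf(';');
        String pair = semicolon >= 0
                ? setCookieHeader.substring(0, semicolon)
                : setCookieHeader;
        return pair.trim();
    }

    // Returns the first "NAME=value" pair whose name matches, or null if none does.
    static String find(List<String> setCookieHeaders, String name) {
        if (setCookieHeaders == null) {
            return null;
        }
        for (String header : setCookieHeaders) {
            String pair = nameValuePair(header);
            if (pair.startsWith(name + "=")) {
                return pair;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Header value copied from the GET response shown in the post.
        String header = "JSESSIONID=B9097AFADF5D3761E11559F867481CB4; Path=/public/; Secure; HttpOnly";
        System.out.println(find(Arrays.asList(header), "JSESSIONID"));
    }
}
```

The returned pair is what would go into the Cookie header of the follow-up POST, without the Path, Secure and HttpOnly attributes.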

Sends a POST request to the same URL with the cookie saved from the first request
The request also includes the POST data:
domainName=forthnet.gr&lang=el&submit=%CE%88%CE%BB%CE%B5%CE%B3%CF%87%CE%BF%CF%82&j_captcha_response=80243

REQUEST
Code:
POST /public/Whois?lang=el HTTP/1.1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Encoding: gzip, deflate
Accept-Language: en-US,en;q=0.5
Cookie: JSESSIONID=B9097AFADF5D3761E11559F867481CB4
Referer: (actual url - cannot post)
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:28.0) Gecko/20100101 Firefox/28.0
Host: grweb.ics.forth.gr
Connection: keep-alive
Content-type: application/x-www-form-urlencoded
Content-Length: 103
RESPONSE
Code:
HTTP/1.1 500 Internal Server Error
Server: Apache-Coyote/1.1
Content-Type: text/html;charset=utf-8
Date: Mon, 21 Apr 2014 07:40:04 GMT
Connection: close
Content-Length: 538
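
One thing that can be checked mechanically is whether the POST body and its Content-Length agree. The sketch below rebuilds the form body from its fields with URLEncoder and counts the bytes. The class name is hypothetical; the field names and values are the ones from the POST data above, with the submit label (the Greek word Έλεγχος) written as Unicode escapes.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the original bot.
final class WhoisFormBody {
    // Encodes the fields as application/x-www-form-urlencoded, in insertion order.
    static String encode(Map<String, String> fields) {
        StringBuilder body = new StringBuilder();
        try {
            for (Map.Entry<String, String> field : fields.entrySet()) {
                if (body.length() > 0) {
                    body.append('&');
                }
                body.append(URLEncoder.encode(field.getKey(), "UTF-8"))
                    .append('=')
                    .append(URLEncoder.encode(field.getValue(), "UTF-8"));
            }
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a Java platform.
            throw new IllegalStateException(e);
        }
        return body.toString();
    }

    // The value to send as Content-Length: the number of bytes, not characters.
    static int contentLength(String body) {
        return body.getBytes(StandardCharsets.US_ASCII).length;
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("domainName", "forthnet.gr");
        fields.put("lang", "el");
        // Label of the submit button, in Greek, as Unicode escapes.
        fields.put("submit", "\u0388\u03BB\u03B5\u03B3\u03C7\u03BF\u03C2");
        fields.put("j_captcha_response", "80243");
        String body = encode(fields);
        System.out.println(body);
        System.out.println(contentLength(body));
    }
}
```

For the body quoted in the post, this sketch reports 105 bytes, while the request dump shows Content-Length: 103. The two may simply come from different runs, but the mismatch seems worth ruling out.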

Unfortunately, the response I get is 500 Internal Server Error.

I have checked the headers one by one, and they are exactly the same as the ones I see in a Firefox session that completes without problems.
Note that you can get a 500 error in Firefox if you delete the JSESSIONID cookie before submitting the form. However, I do submit the cookie when I use the Java bot.

I have run out of ideas about what to check next. The page does not seem to be running any fancy JavaScript :/

Heeeelp! :cow04:
 
Compared with Firebug, your data seems okay. I'm not sure whether you copied
domainName=forthnet.gr&lang=el&submit=%CE%88%CE%BB %CE%B5%CE%B3%CF%87%CE%BF%CF%82&j_captcha_response= 80243
directly out of your program, but if you did, there is the bug:
the captcha response contains whitespace.
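
A quick way to test this in the bot is to scan the body for raw whitespace right before it is written to the connection. A correctly URL-encoded body should contain none, since spaces are encoded as + or %20. This is only a sketch with hypothetical names:

```java
// Hypothetical helper, not part of the original bot.
final class BodyWhitespaceCheck {
    // Returns the index of the first whitespace character in the body, or -1 if there is none.
    static int firstWhitespaceIndex(String body) {
        for (int i = 0; i < body.length(); i++) {
            if (Character.isWhitespace(body.charAt(i))) {
                return i;
            }
        }
        return -1;
    }

    // Removes every run of whitespace. Only appropriate for an already-encoded body,
    // where any raw whitespace is a copy/paste or line-wrapping artifact.
    static String stripWhitespace(String body) {
        return body.replaceAll("\\s+", "");
    }

    public static void main(String[] args) {
        // Body as quoted in this reply, including the suspicious spaces.
        String body = "domainName=forthnet.gr&lang=el&submit=%CE%88%CE%BB %CE%B5%CE%B3%CF%87%CE%BF%CF%82"
                + "&j_captcha_response= 80243";
        System.out.println(firstWhitespaceIndex(body));
        System.out.println(stripWhitespace(body));
    }
}
```

If the check returns -1 for the string the bot actually sends, the spaces were most likely introduced when the text was pasted into the forum.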
 
I have checked the headers one by one, and they are exactly the same as the ones I see in a Firefox session that completes without problems.

Let's get the most common thing out of the way first, which is that you missed something. Post the full POST headers and body from both environments (Firefox and the bot) so we can take a look.
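
If comparing the two dumps by eye gets tedious, something like the following sketch can do it. It parses two raw header dumps and lists every header that is missing on one side or has a different value. The class name and the sample values in main are made up for illustration; header names are compared case-insensitively.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper for comparing a browser capture against the bot's request.
final class HeaderDiff {
    // Parses "Name: value" lines into a map with case-insensitive header names.
    static Map<String, String> parse(String rawHeaders) {
        Map<String, String> headers = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        for (String line : rawHeaders.split("\\r?\\n")) {
            int colon = line.indexOf(':');
            // Skip blank lines and the request line (e.g. "POST /path HTTP/1.1").
            if (colon <= 0 || line.substring(0, colon).contains(" ")) {
                continue;
            }
            headers.put(line.substring(0, colon).trim(), line.substring(colon + 1).trim());
        }
        return headers;
    }

    // Lists headers that exist on only one side or whose values differ.
    static List<String> diff(Map<String, String> browser, Map<String, String> bot) {
        TreeSet<String> names = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
        names.addAll(browser.keySet());
        names.addAll(bot.keySet());
        List<String> differences = new ArrayList<>();
        for (String name : names) {
            String inBrowser = browser.get(name);
            String inBot = bot.get(name);
            if (inBrowser == null) {
                differences.add(name + ": only in bot (" + inBot + ")");
            } else if (inBot == null) {
                differences.add(name + ": only in browser (" + inBrowser + ")");
            } else if (!inBrowser.equals(inBot)) {
                differences.add(name + ": browser=" + inBrowser + " bot=" + inBot);
            }
        }
        return differences;
    }

    public static void main(String[] args) {
        // Made-up sample captures, for illustration only.
        String browser = "POST /form HTTP/1.1\nHost: example.org\nCookie: ID=1\nContent-Length: 10";
        String bot = "POST /form HTTP/1.1\nhost: example.org\nContent-Length: 12";
        for (String difference : diff(parse(browser), parse(bot))) {
            System.out.println(difference);
        }
    }
}
```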
 