I am trying to build a Java whois scraper for
grweb . ics . forth . gr/public/Whois?lang=el
My first approach was to use HtmlUnit; however, it seems I should be able to get by with a simple HttpsURLConnection.
Here is how my program works:
Sends a GET request to the URL above
REQUEST
Code:
GET /public/Whois?lang=el HTTP/1.1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Encoding: gzip, deflate
Accept-Language: en-US,en;q=0.5
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:28.0) Gecko/20100101 Firefox/28.0
Host: grweb.ics.forth.gr
Connection: keep-alive
RESPONSE
Code:
HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
Set-Cookie: JSESSIONID=B9097AFADF5D3761E11559F867481CB4; Path=/public/; Secure; HttpOnly
Content-Type: text/html;charset=UTF-8
Date: Mon, 21 Apr 2014 07:39:58 GMT
Content-Length: 16754
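The Set-Cookie header in the response above has to be reduced to a plain name=value pair before it can go into the Cookie header of the later POST. A minimal sketch of that step, using java.net.HttpCookie from the JDK; the class name SessionCookie and the method cookieHeaderFrom are made up for illustration and are not part of my actual bot:

```java
import java.net.HttpCookie;
import java.util.List;

class SessionCookie {
    /**
     * Takes the value of a Set-Cookie response header and returns
     * "JSESSIONID=<value>" for the Cookie request header,
     * or null if the header does not carry a JSESSIONID cookie.
     */
    static String cookieHeaderFrom(String setCookieValue) {
        // HttpCookie.parse drops attributes such as Path, Secure and HttpOnly
        // from the value and exposes them through getters instead.
        List<HttpCookie> cookies = HttpCookie.parse(setCookieValue);
        for (HttpCookie cookie : cookies) {
            if ("JSESSIONID".equals(cookie.getName())) {
                return cookie.getName() + "=" + cookie.getValue();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Header value copied from the response shown above.
        String header = "JSESSIONID=B9097AFADF5D3761E11559F867481CB4; Path=/public/; Secure; HttpOnly";
        System.out.println(cookieHeaderFrom(header));
    }
}
```

In a real run the header value would come from the connection's getHeaderField("Set-Cookie") rather than a literal.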
Parses the HTML response, gets the CAPTCHA and solves it with an external API
Also stores the cookie from the response headers
Sends a POST request to the same URL with the cookie saved from the first request
The request also includes the POST data:
domainName=forthnet.gr&lang=el&submit=%CE%88%CE%BB%CE%B5%CE%B3%CF%87%CE%BF%CF%82&j_captcha_response=80243
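For reference, the POST data above can be produced with java.net.URLEncoder. This is only a sketch: the class name WhoisForm and the method buildBody are hypothetical, the field names and values are the ones from the captured form data, and the Greek submit label is written with Unicode escapes so the source file encoding does not matter.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

class WhoisForm {
    /** URL-encodes one form key or value as UTF-8. */
    static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a required charset on every JVM, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    /** Builds the application/x-www-form-urlencoded body in the same field order as the form. */
    static String buildBody(String domain, String captchaAnswer) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("domainName", domain);
        fields.put("lang", "el");
        // Label of the submit button, encoded as %CE%88%CE%BB%CE%B5%CE%B3%CF%87%CE%BF%CF%82.
        fields.put("submit", "\u0388\u03BB\u03B5\u03B3\u03C7\u03BF\u03C2");
        fields.put("j_captcha_response", captchaAnswer);

        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> field : fields.entrySet()) {
            if (body.length() > 0) {
                body.append('&');
            }
            body.append(encode(field.getKey())).append('=').append(encode(field.getValue()));
        }
        return body.toString();
    }

    public static void main(String[] args) {
        String body = buildBody("forthnet.gr", "80243");
        System.out.println(body);
        // The Content-Length header has to carry this byte count.
        System.out.println(body.getBytes(StandardCharsets.UTF_8).length + " bytes");
    }
}
```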
REQUEST
Code:
POST /public/Whois?lang=el HTTP/1.1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Encoding: gzip, deflate
Accept-Language: en-US,en;q=0.5
Cookie: JSESSIONID=B9097AFADF5D3761E11559F867481CB4
Referer: (actual url - cannot post)
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:28.0) Gecko/20100101 Firefox/28.0
Host: grweb.ics.forth.gr
Connection: keep-alive
Content-type: application/x-www-form-urlencoded
Content-Length: 103
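Roughly, the POST above could be set up like this with HttpURLConnection (HttpsURLConnection is the subclass you get back for an https URL). This is a sketch under assumptions, not my exact code: the class WhoisPost and its methods are hypothetical, the URL is assembled from the Host header and request line, and the Referer header is left out. Host and Content-Length are not set by hand because the connection fills them in itself.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

class WhoisPost {
    /** Configures the POST without opening the network connection yet. */
    static HttpURLConnection prepare(String url, String cookie, byte[] body) {
        try {
            HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setRequestMethod("POST");
            conn.setDoOutput(true);                 // this request has a body
            conn.setInstanceFollowRedirects(false); // see the raw response, not a redirect target
            conn.setRequestProperty("Cookie", cookie);
            conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
            conn.setRequestProperty("Accept-Language", "en-US,en;q=0.5");
            conn.setRequestProperty("User-Agent",
                    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:28.0) Gecko/20100101 Firefox/28.0");
            // Content-Length is derived from this value by the connection.
            conn.setFixedLengthStreamingMode(body.length);
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes the body and returns the HTTP status code; this performs the real request. */
    static int send(HttpURLConnection conn, byte[] body) throws IOException {
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body);
        }
        return conn.getResponseCode();
    }

    public static void main(String[] args) {
        byte[] body = "domainName=forthnet.gr&lang=el".getBytes(StandardCharsets.UTF_8);
        HttpURLConnection conn = prepare("https://grweb.ics.forth.gr/public/Whois?lang=el",
                "JSESSIONID=B9097AFADF5D3761E11559F867481CB4", body);
        // Nothing has been sent at this point; send(conn, body) would do that.
        System.out.println(conn.getRequestMethod() + " " + conn.getURL());
    }
}
```

The body passed to send must be the same byte array that was passed to prepare, otherwise the fixed-length streaming mode rejects the write.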
RESPONSE
Code:
HTTP/1.1 500 Internal Server Error
Server: Apache-Coyote/1.1
Content-Type: text/html;charset=utf-8
Date: Mon, 21 Apr 2014 07:40:04 GMT
Connection: close
Content-Length: 538
Unfortunately, the response I get is 500 Internal Server Error.
I have checked the headers one by one, and they are exactly the same as the ones Firefox sends in a session that completes without problems.
Note that you can get a 500 error in Firefox if you delete the JSESSIONID cookie before submitting the form. However, I do submit the cookie when I use the Java bot.
I have run out of ideas about what to check next. The page does not seem to be running any fancy JavaScript :/
Heeeelp!