Can anyone solve this puzzle!?!

MarketerX · Feb 20, 2012

PROBLEM: I want to scrape text from a website, but the text is not actually in the page source; it is generated by obfuscated JavaScript. (Obviously they are trying to prevent what I am trying to do.)

The only solution I was able to come up with would be to take a screenshot of the page, crop out the section with the text I want to scrape, then run it through an OCR engine.

Is this the only solution, guys? The js is well obfuscated...

Helpppppp :y:

jazzc · Feb 20, 2012

Use .NET with a webbrowser control or anything equivalent in other languages. It will let JS run and then you can scrape it.

gr8divas · Feb 20, 2012

Use the text-only version of Google's cache and scrape the text.

MarketerX · Feb 20, 2012

Thank you guys. I might do it the hard way just for fun...probably not though.

Really appreciate it.

Humble · Feb 20, 2012

jazzc said:
Use .NET with a webbrowser control or anything equivalent in other languages. It will let JS run and then you can scrape it.

I suggest jazzc's idea; this way is the most convenient in my opinion. Just create a web control and scrape the element's contents.

Good luck!

Cynikal · Feb 21, 2012

Why not use the .NET WebClient in the code-behind and call DownloadString? That should give you the source code as well as the JavaScript-generated text, since it IS accessing the page?

What page are you trying to scrape? I can take a look to see if there's a more efficient way of doing things than opening a resource-hungry web browser control (after all, it is IE).

MarketerX · Feb 21, 2012

UPDATE: The webBrowser control is not returning de-obfuscated JS

Here are the pages I am trying to scrape...

http://freeproxylists.com/elite.html

Check any of the 10 daily lists on this page. When you load one, the IPs/ports are not in the page source, even if it is loaded with the WebBrowser control.

Maybe I am missing something??

MarketerX · Feb 21, 2012

jazzc said:
Use .NET with a webbrowser control or anything equivalent in other languages. It will let JS run and then you can scrape it.

The WebBrowser control isn't returning the de-obfuscated plaintext I am trying to scrape.

healzer · Feb 21, 2012

What kind of text do you want to scrape?
The proxy IPs?

lancis · Feb 21, 2012

It's hard to judge without seeing the page itself. But the most common scheme works like this:

Browser version:

- You load an HTML page with JavaScript
- The JavaScript makes an AJAX request to the server
- When the server responds, some fields are updated

Scraper version:

- Simulate the AJAX request to the server with cURL
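
As a rough illustration of the scraper version above, here is a minimal Python sketch that uses only the standard library in place of cURL. The endpoint and referer URLs are hypothetical placeholders, and the X-Requested-With header is an assumption: some servers check it to recognise AJAX calls, others ignore it.

```python
import urllib.request


def build_ajax_request(endpoint, referer):
    """Build a GET request that resembles the browser's AJAX call."""
    headers = {
        # Assumption: many JS libraries send this header; the target may not need it.
        "X-Requested-With": "XMLHttpRequest",
        "Referer": referer,
        "User-Agent": "Mozilla/5.0",
    }
    return urllib.request.Request(endpoint, headers=headers)


def fetch(request, timeout=10):
    """Send the request and return the response body as text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

You would normally find the real endpoint by watching the network requests in the browser's developer tools, then call fetch(build_ajax_request(endpoint, page_url)) and parse the result.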

Cynikal · Feb 21, 2012

I sent you a response to solve your problem, but I'm not sure if it went through.

Did it go through for you? If not, I'll paste it here.

MarketerX · Feb 21, 2012

Hey guys, Cynikal solved this one via PM, but he can post it in here if he wants. It's actually pretty simple to bypass this particular site's technique, but hopefully he can help me with another one that I think will probably be harder.

P.S. Yes I am trying to scrape their daily proxy lists.

Cynikal · Feb 21, 2012

Ah, well, since you posted the site, I'll post it up here since you're not worried about this "secret".

MarketerX said:
Hey... here is the page I am trying to scrape. I have a few others that also use this same method to keep their proxies out of the page source (in plain text, that is).

(I'm not allowed to post links yet) but it's the elite.html

Go there and click any of the 10 daily lists.

I actually just tried the WebBrowser control and the JS is still obfuscated. I can't find any of the IPs in the page source :\

Help would be appreciated

I see what you mean about the source. However, this is what you've got to do.

Load the page and look in the source for:

<body link="#111111" vlink="#111111" alink="#000000" onload="loadData('dataID', '/load_elite_1329825631.html');">

The "load_elite_numbers.html" path is what you want to scrape.

Once you have that, load the site: proxysite_dot_com/load_elite_1329825631.html (replace the numbers...)

Then parse that for your IPs.
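
The steps above could be sketched in Python roughly like this. It is only an illustration, not the code anyone in the thread used: the function names are made up, the regex assumes the loadData call looks exactly like the one in the body tag above, and the IP pattern is a loose guess since the format of the load_elite file isn't shown here.

```python
import re
import urllib.request
from urllib.parse import urljoin

# Matches loadData('dataID', '/load_elite_1329825631.html') and captures the path.
LOAD_DATA_RE = re.compile(r"loadData\(\s*'[^']*'\s*,\s*'([^']+)'\s*\)")
# Loose IPv4 pattern; it does not check that each octet is <= 255.
IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def find_data_url(page_html, page_url):
    """Return the absolute URL passed to loadData, or None if it is absent."""
    match = LOAD_DATA_RE.search(page_html)
    if match is None:
        return None
    return urljoin(page_url, match.group(1))


def find_ips(text):
    """Return every dotted-quad string found in the text, in order."""
    return IP_RE.findall(text)


def fetch(url, timeout=10):
    """Download a URL and return its body as text (assumes UTF-8)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

In use, you would fetch the list page, pass its HTML to find_data_url, fetch the URL it returns, and run find_ips over that second response.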

confined · Feb 21, 2012

it's actually not even obfuscated.

for example:

http://freeproxylists.com/load_elite_1329864626.html

is the source for http://freeproxylists.com/elite/1329864626.html
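
Given that pairing, the data URL can be derived from the list URL without loading the list page at all. A small Python sketch of that shortcut follows; the function name is hypothetical, and treating "elite" as an interchangeable category name is an assumption, since only the elite lists are shown here.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Path shape seen above: /elite/1329864626.html
LIST_PATH_RE = re.compile(r"/([a-z]+)/(\d+)\.html")


def data_url_for(list_url):
    """Map a daily list URL to its load_<category>_<number>.html data URL."""
    parts = urlsplit(list_url)
    match = LIST_PATH_RE.fullmatch(parts.path)
    if match is None:
        return None  # not a daily list URL of the expected shape
    category, stamp = match.groups()
    data_path = f"/load_{category}_{stamp}.html"
    return urlunsplit((parts.scheme, parts.netloc, data_path, "", ""))
```

This saves one request per list, but it breaks silently if the site changes its naming, so reading the path out of the page source is the safer route.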

MarketerX · Feb 21, 2012

confined said:
it's actually not even obfuscated.

for example:

http://freeproxylists.com/load_elite_1329864626.html

is the source for http://freeproxylists.com/elite/1329864626.html

Yeah, that's exactly what Cynikal noticed. I missed it because I am trying to scrape a bunch of sites and was just going to skip this one since it was giving me problems.

Problem solved, but let's see if he can help me on http://seprox.ru/en/proxy_filter/0_0_0_0_0_0_0_0_0_0.html

Or anyone

This one should be harder.

fuserleer · Feb 21, 2012

If you are working in Java, get HtmlUnit.

It has a JavaScript engine built in, so you can traverse the resulting DOM as if you were looking at it in Firebug or whatever.

fuserleer · Feb 21, 2012

Oops, I didn't see this was in .NET... ignore me, sorry!

MarketerX · Feb 22, 2012

fuserleer said:
Oops, I didn't see this was in .NET... ignore me, sorry!

It's ok, I appreciate the response.

If anyone can point me in the right direction for scraping the list on this page, I would be extremely grateful: http://seprox.ru/en/proxy_filter/0_0_0_0_0_0_0_0_0_0.html

I want to do it for the learning experience, since they took measures to prevent people from scraping their list. The first one turned out to have an easy solution; this one should be more difficult.

MarketerX · Feb 22, 2012

I figured it out
