Can anyone solve this puzzle!?!

Discussion in 'Visual Basic .NET' started by MarketerX, Feb 20, 2012.

  1. MarketerX

    MarketerX Regular Member

    Joined:
    Mar 7, 2010
    Messages:
    398
    Likes Received:
    120
    PROBLEM: I want to scrape text from a website, but the text is not actually in the page source; it is being generated by obfuscated JavaScript. (Obviously they are trying to prevent what I am trying to do.) :p

    The only solution I was able to come up with would be to take a screenshot of the page, crop out the section with the text I want to scrape, then run it through an OCR engine.

    Is this the only solution, guys? The JS is well obfuscated...

    Helpppppp :y:
     
  2. jazzc

    jazzc Moderator Staff Member Moderator Jr. VIP

    Joined:
    Jan 27, 2009
    Messages:
    2,468
    Likes Received:
    10,143
    Use .NET with a WebBrowser control, or anything equivalent in other languages. It will let the JS run, and then you can scrape the result.
     
  3. gr8divas

    gr8divas Registered Member

    Joined:
    Mar 28, 2009
    Messages:
    57
    Likes Received:
    7
    Use the text-only version of Google's cache and scrape the text.
     
  4. MarketerX

    MarketerX Regular Member

    Joined:
    Mar 7, 2010
    Messages:
    398
    Likes Received:
    120
    Thank you guys. I might do it the hard way just for fun...probably not though.

    Really appreciate it.
     
  5. Humble

    Humble Registered Member

    Joined:
    Jul 17, 2010
    Messages:
    81
    Likes Received:
    51
    Occupation:
    Human
    Location:
    North American
    I suggest jazzc's idea; it is the most convenient way, in my opinion. Just create a web control and scrape the element's contents.

    Good luck!
     
    Last edited: Feb 20, 2012
  6. Cynikal

    Cynikal Newbie

    Joined:
    May 11, 2010
    Messages:
    32
    Likes Received:
    8
    Why not use the .NET WebClient in the code-behind and call DownloadString? That should give you the source code as well as the new JavaScript-generated text, since it IS accessing the page?

    What page are you trying to scrape? I can take a look to see if there's a more efficient way of doing things than opening a resource-hungry web browser control (after all, it is IE).
     
  7. MarketerX

    MarketerX Regular Member

    Joined:
    Mar 7, 2010
    Messages:
    398
    Likes Received:
    120
    UPDATE: The webBrowser control is not returning de-obfuscated JS :(

    Here are the pages I am trying to scrape...

    http://freeproxylists.com/elite.html

    Check any of the 10 daily lists on this page. When you load one, the IPs/ports are not in the page source, even if it is loaded with the WebBrowser control :(

    Maybe I am missing something??
     
  8. MarketerX

    MarketerX Regular Member

    Joined:
    Mar 7, 2010
    Messages:
    398
    Likes Received:
    120
    wb control isn't returning the de-obfuscated plaintext I am trying to scrape :(
     
  9. healzer

    healzer Jr. Executive VIP Jr. VIP Premium Member

    Joined:
    Jun 26, 2011
    Messages:
    2,363
    Likes Received:
    1,966
    Gender:
    Male
    Occupation:
    Marketing automation tools
    Location:
    Somewhere in Europe
    What kind of text do you want to scrape?
    The proxies/IPs?
     
  10. lancis

    lancis Elite Member

    Joined:
    Jul 31, 2010
    Messages:
    1,632
    Likes Received:
    2,384
    Occupation:
    Entrepreneur
    Location:
    Milky Way
    It's hard to judge without seeing the page itself, but the most common scheme works like this:

    Browser version:

    - You load an HTML page containing JavaScript
    - The JavaScript makes an AJAX request to the server
    - When the server responds, some fields are updated

    Scraper version:

    - Simulate the AJAX request to the server with cURL
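
The scraper version above can be sketched in Python rather than cURL. This is only a rough illustration: the function names and the example.com URLs are made up, and the `X-Requested-With` header is a convention that many JavaScript libraries set and some servers check, not something confirmed for any site in this thread.

```python
from urllib.request import Request, urlopen


def build_ajax_request(data_url, referer):
    """Build a GET request that resembles one sent by a page's JavaScript.

    Both arguments are placeholders: data_url is whatever endpoint the
    page's script calls, referer is the page that would normally call it.
    """
    headers = {
        # Assumption: the server may look for this conventional AJAX marker.
        "X-Requested-With": "XMLHttpRequest",
        "Referer": referer,
        "User-Agent": "Mozilla/5.0",
    }
    return Request(data_url, headers=headers)


def fetch_text(request, timeout=10):
    """Send the request and return the response body as text."""
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Usage (hypothetical URLs, not run here because it needs network access):
# req = build_ajax_request("http://example.com/data.html",
#                          "http://example.com/page.html")
# body = fetch_text(req)
```

The endpoint itself usually has to be found first, by reading the page's script or by watching the browser's network traffic while the page loads.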
     
  11. Cynikal

    Cynikal Newbie

    Joined:
    May 11, 2010
    Messages:
    32
    Likes Received:
    8
    I sent you a response to solve your problem, but I'm not sure if it went through.

    Did it go through for you? If not, I'll paste it here.
     
  12. MarketerX

    MarketerX Regular Member

    Joined:
    Mar 7, 2010
    Messages:
    398
    Likes Received:
    120
    Hey guys, Cynikal solved this one via PM, but he can post it in here if he wants. It's actually pretty simple to bypass this particular site's technique, but hopefully he can help me with another one that I think will probably be harder :p

    P.S. Yes I am trying to scrape their daily proxy lists.
     
  13. Cynikal

    Cynikal Newbie

    Joined:
    May 11, 2010
    Messages:
    32
    Likes Received:
    8
    Ah, well, since you posted the site, I'll post it up here, since you're not worried about this "secret".


    I see what you mean about the source. However, this is what you've got to do.

    Load the page and look in the source for:

    <body link="#111111" vlink="#111111" alink="#000000" onload="loadData('dataID', '/load_elite_1329825631.html');">

    The "load_elite_numbers.html" file is what you want to scrape.

    Once you have that, load the site: proxysite_dot_com/load_elite_1329825631.html (replace the numbers with the ones from the page you loaded...)

    Then parse that for your IPs.
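
The steps above could be sketched in Python roughly as follows. Treat it as an illustration under assumptions: the function names are made up, the regex only matches an `onload` attribute shaped like the one quoted above, and the layout of the data file is not shown in this thread, so `find_ips` simply collects anything that looks like an IPv4 address.

```python
import re
from urllib.parse import urljoin

# Captures the second argument of loadData('...', '...') as seen in the
# quoted <body onload=...> tag.
LOAD_DATA_RE = re.compile(r"loadData\(\s*'[^']*'\s*,\s*'([^']+)'\s*\)")

# Assumption: IPs appear as plain dotted-quad text in the data file.
IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def extract_data_path(page_html):
    """Return the path passed to loadData(), or None if it is not found."""
    match = LOAD_DATA_RE.search(page_html)
    return match.group(1) if match else None


def build_data_url(page_url, data_path):
    """Resolve the data path against the URL of the page it came from."""
    return urljoin(page_url, data_path)


def find_ips(text):
    """Collect every IPv4-looking string from the downloaded data."""
    return IP_RE.findall(text)


# Usage with the tag quoted above; the page URL is the one from the thread.
body_tag = (
    '<body link="#111111" vlink="#111111" alink="#000000" '
    "onload=\"loadData('dataID', '/load_elite_1329825631.html');\">"
)
path = extract_data_path(body_tag)
data_url = build_data_url("http://freeproxylists.com/elite.html", path)
```

Downloading `data_url` and passing the response text to `find_ips` would complete the loop; the ports would need their own pattern once the real file layout is known.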
     
  14. confined

    confined Regular Member

    Joined:
    Jan 4, 2009
    Messages:
    216
    Likes Received:
    91
  15. MarketerX

    MarketerX Regular Member

    Joined:
    Mar 7, 2010
    Messages:
    398
    Likes Received:
    120
    Yeah, exactly what Cynikal noticed and I didn't because I am trying to scrape a bunch of sites and was just going to skip this one because it was giving me problems.

    Problem solved, but let's see if he can help me on http://seprox.ru/en/proxy_filter/0_0_0_0_0_0_0_0_0_0.html

    Or anyone :p This one should be harder.
     
  16. fuserleer

    fuserleer BANNED BANNED

    Joined:
    Sep 22, 2011
    Messages:
    92
    Likes Received:
    466
    If you are working in Java, get HtmlUnit.

    It has a JavaScript processor built in, so you can then traverse the resulting DOM as if you were looking at it in Firebug or whatever.
     
  17. fuserleer

    fuserleer BANNED BANNED

    Joined:
    Sep 22, 2011
    Messages:
    92
    Likes Received:
    466
    Oops, I didn't see this was in .NET... ignore me, sorry!
     
  18. MarketerX

    MarketerX Regular Member

    Joined:
    Mar 7, 2010
    Messages:
    398
    Likes Received:
    120
    It's ok, I appreciate the response.

    If anyone can lead me in the right direction towards scraping the list on this page http://seprox.ru/en/proxy_filter/0_0_0_0_0_0_0_0_0_0.html

    I would be extremely grateful. I want to do it for the learning experience, since they took measures to prevent people from scraping their list. The first one was actually an easy solution; this one should be more difficult.
     
  19. MarketerX

    MarketerX Regular Member

    Joined:
    Mar 7, 2010
    Messages:
    398
    Likes Received:
    120
    I figured it out :p