
What site is considered too hard to scrape?

Discussion in 'Black Hat SEO' started by huyvun, Oct 31, 2012.

  1. huyvun

    huyvun Newbie

    Joined:
    May 1, 2012
    Messages:
    21
    Likes Received:
    2
    Occupation:
    coder
    Any thoughts on which sites are considered far too difficult to scrape?

    Could it be because the site has very good security or anti-scraping technology?

    A mix of HTML and JS that is too obfuscated?

    Security measures such as rotating cookies, detection of browser/cURL header order, etc.?
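
To make the header-order idea concrete, here is a minimal sketch of how a site might compare the order of incoming request headers against a reference profile. The `EXPECTED_ORDER` list and the `header_order_matches` function are hypothetical names made up for illustration; the profile is not a verified header order for any real browser.

```python
# Illustrative profile only: NOT a verified header order for any real browser.
EXPECTED_ORDER = [
    "host",
    "connection",
    "user-agent",
    "accept",
    "accept-encoding",
    "accept-language",
    "cookie",
]


def header_order_matches(observed, expected=EXPECTED_ORDER):
    """Return True if headers present in both lists share the same relative order.

    observed: header names in the order they arrived on the wire.
    Headers that are not part of the profile are ignored.
    """
    observed_norm = [name.lower() for name in observed]
    expected_set = set(expected)
    observed_set = set(observed_norm)
    # Keep only the headers both sides know about, each in its own order.
    seen = [name for name in observed_norm if name in expected_set]
    wanted = [name for name in expected if name in observed_set]
    return seen == wanted
```

A request sending `["User-Agent", "Host", "Accept"]` would fail this check because `Host` comes first in the assumed profile. A real system would presumably keep one profile per browser family and treat a mismatch as one signal among several.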
     
  2. tpickett

    tpickett Newbie

    Joined:
    Feb 1, 2012
    Messages:
    30
    Likes Received:
    8
    Occupation:
    SEO *****
    Location:
    Kansas City
    Nothing is "too difficult" to scrape. You just have to know how to parse with regex. Even Google's SERPs can be scraped with PHP and cURL...
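
As a rough illustration of the regex parsing mentioned above, here is a minimal Python sketch that pulls link URLs and link text out of a small HTML string. The `SAMPLE_HTML` markup and the `extract_links` helper are assumptions made up for the example, not the markup of any real site. The pattern only handles double-quoted `href` attributes.

```python
import re

# Hypothetical HTML snippet standing in for a page that was already fetched.
SAMPLE_HTML = """
<div class="result"><a href="https://example.com/a">First result</a></div>
<div class="result"><a href="https://example.com/b">Second <b>result</b></a></div>
"""

# Matches an anchor tag, capturing the href value and the inner markup.
LINK_RE = re.compile(
    r'<a\s+[^>]*href="([^"]+)"[^>]*>(.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)
# Matches any remaining tag so it can be stripped from the link text.
TAG_RE = re.compile(r"<[^>]+>")


def extract_links(html):
    """Return a list of (url, text) pairs for every anchor the pattern matches."""
    links = []
    for url, inner in LINK_RE.findall(html):
        text = TAG_RE.sub("", inner).strip()
        links.append((url, text))
    return links


if __name__ == "__main__":
    for url, text in extract_links(SAMPLE_HTML):
        print(url, text)
```

Regex works for simple, predictable markup like this. It tends to break on nested or malformed HTML, where a real HTML parser is usually the safer choice.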
     
  3. huyvun

    huyvun Newbie

    Joined:
    May 1, 2012
    Messages:
    21
    Likes Received:
    2
    Occupation:
    coder
    Well, FB for example has measures to analyze the frequency of requests (and the intervals between them),
    the path of links crawled, etc., and it very quickly tells bot from human, unless the scraping software is very complex.
    (I'm talking about when you start scraping thousands of links; then FB shuts you down.)

    LinkedIn looks pretty hard to scrape,
    as well as the Skype forums.
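
The request-frequency and interval analysis described in this post could look something like the sketch below. It is only a guess at the general idea, not how any particular site does it. The `looks_automated` function and both thresholds are hypothetical.

```python
from statistics import mean, pstdev


def looks_automated(timestamps, max_per_minute=60, min_jitter=0.1):
    """Heuristic: flag a client whose requests are too frequent or too evenly spaced.

    timestamps: ascending request times (in seconds) for a single client.
    max_per_minute and min_jitter are illustrative assumptions, not real limits.
    """
    if len(timestamps) < 3:
        return False  # not enough data to judge
    intervals = [b - a for a, b in zip(timestamps, timestamps[1:])]
    avg = mean(intervals)
    if avg <= 0:
        return True  # every request arrived at the same instant
    if 60.0 / avg > max_per_minute:
        return True  # request rate is above the allowed frequency
    # Coefficient of variation: humans click irregularly, simple bots do not.
    jitter = pstdev(intervals) / avg
    return jitter < min_jitter
```

With these assumed thresholds, a client hitting the site exactly every 2 seconds is flagged for having no jitter, while a client with irregular gaps of a few seconds to a minute is not. The crawl-path analysis mentioned above would need a separate check.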
     
  4. tpickett

    tpickett Newbie

    Joined:
    Feb 1, 2012
    Messages:
    30
    Likes Received:
    8
    Occupation:
    SEO *****
    Location:
    Kansas City
    anything can be done with enough proxies from the right location...;)
     
  5. huyvun

    huyvun Newbie

    Joined:
    May 1, 2012
    Messages:
    21
    Likes Received:
    2
    Occupation:
    coder
    The number of proxies is irrelevant to the question. You're suggesting that the only defense a site has
    against scraping is IP detection...

    anyways.....
     
  6. thejake

    thejake Jr. VIP Jr. VIP Premium Member

    Joined:
    Nov 13, 2009
    Messages:
    685
    Likes Received:
    828
    The trickiest sites to scrape are the ones with strong authentication, browser-specific script behaviors and asynchronous elements. For example, Terapeak isn't much fun to scrape without browser automation.
     
  7. meannn

    meannn Supreme Member

    Joined:
    Apr 22, 2009
    Messages:
    1,461
    Likes Received:
    1,896
    Occupation:
    Unemployed Winner
    Location:
    TR
    Nope, if the content is generated with JavaScript, you cannot scrape it with a plain HTTP request. JavaScript runs in the browser.
     
  8. huyvun

    huyvun Newbie

    Joined:
    May 1, 2012
    Messages:
    21
    Likes Received:
    2
    Occupation:
    coder
    Agreed... that's what I was thinking too:
    sites which pull content in via async JS.
    Or, going a step further,
    JS which downloads obfuscated JS which re-assembles itself (like self-modifying code) at load time,
    and which in turn pulls in the content. This would be almost impossible to scrape unless you designed a site-specific custom scraper.
    Even with that, the site could make use of all sorts of anti-debug tricks, like one-time cookie pads,
    code timing, public-key layers, etc.
    With that said, if a browser can display it, then of course there is a way around it.
     
  9. m00j99

    m00j99 Registered Member

    Joined:
    Oct 8, 2009
    Messages:
    85
    Likes Received:
    52
    so... what exactly is your question now?
     
  10. huyvun

    huyvun Newbie

    Joined:
    May 1, 2012
    Messages:
    21
    Likes Received:
    2
    Occupation:
    coder
    I want opinions on what site is considered the hardest to crack.
    For example, if the question was "what is the hardest Windows app to reverse engineer?", without a doubt it would be Skype
    (the most hardcore anti-debugging techniques ever implemented in an app).
    So I want to know what you guys consider the equivalent of this for a site.
     
  11. -ReX-

    -ReX- Power Member

    Joined:
    Apr 26, 2012
    Messages:
    707
    Likes Received:
    274
    Location:
    Manly, Australia
    Can't be that hard, someone reverse engineered it and released the code to grab the last IP of any user lol.
     
  12. -Jericho-

    -Jericho- Jr. Executive VIP Jr. VIP Premium Member

    Joined:
    Jan 10, 2010
    Messages:
    2,849
    Likes Received:
    1,704
    Location:
    Stalking My Ex-Wife
    NSA Servers. Bastards just don't want to let us in.