
Blocking/stopping scrapers

Discussion in 'White Hat SEO' started by CredibleZephyre, Oct 31, 2013.

  1. CredibleZephyre

    CredibleZephyre Registered Member

    Joined:
    Jun 10, 2013
    Messages:
    99
    Likes Received:
    27
    I have a client whose content has been stolen quite a bit. I've done some research on most of the usual ways to attempt to prevent scraping (there's no foolproof plan, but some are effective against the lazy scrapers... if that isn't redundant). But this is Black Hat World, and many of the users here are the kind that come up with creative solutions to problems, so does anyone have some fancy ideas or techniques to hinder scrapers' efforts?
     
  2. roadhamster

    roadhamster Regular Member

    Joined:
    Mar 12, 2012
    Messages:
    341
    Likes Received:
    244
    You can load the content with JavaScript, which most bots/scrapers can't or don't execute. If a bot visits your site, the JavaScript won't run and no content gets loaded.
    The drawback: visitors who have JavaScript disabled won't see the content either.
    Good luck.
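    That idea can be sketched as a page that ships with no article text at all and fetches it from a separate endpoint. This is a minimal illustration, not a real web framework; the endpoint path and names are made up for the example.

```python
# Sketch of the "load content via JavaScript" idea (illustrative names):
# the initial HTML contains no article text, only a script that fetches it.
# A scraper that doesn't execute JS only ever sees the empty shell.

ARTICLE = "The actual article text lives only behind this endpoint."

def render_stub(article_id: int) -> str:
    """Initial page: no content in the raw HTML, just a JS loader."""
    return f"""<!doctype html>
<div id="content">Loading...</div>
<script>
  fetch('/api/content/{article_id}')
    .then(r => r.text())
    .then(t => document.getElementById('content').textContent = t);
</script>"""

def api_content(article_id: int) -> str:
    """Endpoint the in-page script calls; a non-JS scraper never requests it."""
    return ARTICLE
```

    A dumb scraper that just fetches the page source gets the "Loading..." shell; only clients that actually run the script pull the text. The trade-off is exactly as described above: no-JS visitors (and possibly some crawlers) get nothing.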
     
    • Thanks Thanks x 1
  3. sags22

    sags22 BANNED BANNED

    Joined:
    Feb 14, 2013
    Messages:
    112
    Likes Received:
    4
    IP filtering and loading the content via JavaScript can help keep the scrapers at bay, since the content never appears in the raw HTML they fetch. CAPTCHAs are also widely used to hinder most of the scrapers.
     
  4. Microleaves

    Microleaves Jr. VIP Jr. VIP

    Joined:
    Feb 13, 2011
    Messages:
    1,728
    Likes Received:
    1,144
    Location:
    Europe
    Home Page:
    Think twice about the JavaScript approach. You might end up with Googlebot not being able to read your content correctly.
     
  5. Schvamp

    Schvamp Power Member

    Joined:
    Feb 13, 2012
    Messages:
    684
    Likes Received:
    549
    Location:
    Hogwarts
    Off the top of my head, I can think of:
    -Block known low-quality IPs and remove proxy access.

    About JavaScript:
    Googlebot shouldn't have any problem understanding it, but you might run into problems with visitors who have JS disabled.
    Instead, you can reverse the idea. Include a paragraph with backlinks to the original source and your brand name where it was posted,
    and then have JS remove that paragraph.

    Result?
    The user won't have any problem with the content, and if JS is active, they won't see that extra paragraph.
    If it's not active, they will see it. But that's not a big deal, is it? You can even set the font color to match the background.

    You won't be able to stop scrapers. But you can improve what content you give them.
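    The reversed idea can be sketched as a template that bakes the attribution into the HTML and then strips it client-side. The function and field names here are made up for illustration:

```python
# Sketch of the "poison the copy" reversal (illustrative names): bake an
# attribution paragraph with a backlink into the HTML, then remove it with
# JS for real visitors. Scrapers that don't run JS copy the credit along.

def render_article(body: str, source_url: str, brand: str) -> str:
    credit = (f'<p id="origin">Originally published by {brand}: '
              f'<a href="{source_url}">{source_url}</a></p>')
    return f"""<article>
{body}
{credit}
<script>document.getElementById('origin').remove();</script>
</article>"""

html = render_article("Some post...", "https://example.com/post", "ExampleBrand")
```

    A scraper doing a raw HTML copy carries the backlink and brand name into the stolen page, while JS-enabled visitors never see the extra paragraph.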
     
  6. murachi

    murachi Newbie

    Joined:
    Sep 8, 2012
    Messages:
    31
    Likes Received:
    4
    Location:
    25.0000° N, 71.0000° W
    It's impossible to stop it altogether, but there are a few techniques you can use.

    As others have said, adding extra javascript can help but it may stop search engines being able to crawl your content properly.

    Something I've done in the past: set up robots.txt, create a page called jail.html, disallow access to that page through robots.txt, then insert a link to it on one of your pages, hiding it with CSS, and record the IPs of visitors to jail.html.
    This helps you quickly identify requests from scrapers that are disregarding robots.txt.
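    A minimal sketch of that trap, assuming a toy request handler (the file names and handler shape are illustrative):

```python
# Honeypot sketch: disallow jail.html in robots.txt, hide a link to it with
# CSS, and log every IP that fetches it anyway. Well-behaved crawlers obey
# robots.txt; scrapers that ignore it identify themselves.

ROBOTS_TXT = "User-agent: *\nDisallow: /jail.html\n"
HIDDEN_LINK = '<a href="/jail.html" style="display:none">do not follow</a>'

suspect_ips = set()

def handle_request(path: str, ip: str) -> int:
    """Record visitors to the trap page; serve everything else normally."""
    if path == "/jail.html":
        suspect_ips.add(ip)   # candidate for blocking or rate limiting
        return 403
    return 200
```

    Anything that lands in `suspect_ips` either ignored robots.txt or followed a link no human can see, which makes it a strong scraper signal.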

    Another technique is teergrubing: when you identify a scraper, keep his connections open for as long as physically possible without timing them out, although this may alert them that you're on to them.
    http://en.wikipedia.org/wiki/Teergrubing
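    The tar pit can be sketched as a response generator that drips the body out a few bytes at a time. The chunk size and delay here are illustrative:

```python
import time

# Teergrubing sketch: once an IP is flagged as a scraper, serve its
# responses in tiny chunks with a sleep between each, so every request
# ties the client up for as long as possible without actually timing out.

def tarpit(body: bytes, chunk_size: int = 4, delay: float = 2.0):
    """Yield the response in tiny chunks, sleeping between each one."""
    for i in range(0, len(body), chunk_size):
        yield body[i:i + chunk_size]
        time.sleep(delay)

# With delay=0 the full body still arrives intact, just in pieces:
chunks = list(tarpit(b"hello scraper", chunk_size=4, delay=0))
```

    At a realistic delay, a 10 KB page takes the better part of an hour to download, which stalls naive single-threaded scrapers badly.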
     
    • Thanks Thanks x 1
  7. TZ2011

    TZ2011 Senior Member

    Joined:
    Jun 26, 2011
    Messages:
    833
    Likes Received:
    864
    Both things that you mentioned would be considered problematic by Google. Presenting different text to users with and without JavaScript? Font color to match the background color? Looking for trouble. Google's bots are not stupid, but they're also not smart enough to recognize your intentions and qualify methods heavily used by spammers for the last X years as acceptable behavior and a good visitor experience.
     
  8. benarata

    benarata Junior Member

    Joined:
    Oct 12, 2013
    Messages:
    186
    Likes Received:
    34
    You will never stop your site from being scraped 100% of the time. If somebody wants it, they will get it. But what you can do to protect his content is file a "real" copyright registration for his work through the US Copyright Office. I do this every month: I take all of the changes and new content on all of my sites, web 2.0s, everything, and register them as a group. That way it's only $35.00 for all of the work added that month across all of the sites and blogs. You can do this every three months, but it needs to be done monthly to keep any of your works from slipping through the cracks and missing a deadline.

    When you catch someone stealing your content, file a DMCA takedown with Google, MS and Yahoo. The takedown will happen fast with a copyright registration from the US Copyright Office. It's a slam dunk, and the infringing party will have a hard time claiming it was a bad takedown order. I usually will not file with their ISP; it fucks with them harder when they are trying to figure out why their sites have disappeared from the major search providers.
     
  9. validseo

    validseo Jr. VIP Jr. VIP Premium Member

    Joined:
    Jul 17, 2013
    Messages:
    910
    Likes Received:
    527
    Occupation:
    Professional SEO
    Location:
    Seattle, Wa
    I've written content-scraping bots for Fortune 500 companies for competitive intel purposes... The single best thing you can do is put the data into images and then visibly watermark and copyright the images... most data scrapers are trying to populate fields in a database. Make it hard for them to do that cleanly.
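    One stdlib-only way to sketch this is rendering each data point as a small image with the watermark baked in. An SVG is used here purely because it needs no imaging library; a rasterized PNG (e.g. via an imaging library) is closer to what's described, since SVG text is itself still scrapable. All names and values are illustrative:

```python
# Sketch of "put the data into images": render each value as a small SVG
# with a visible watermark drawn underneath it, so a scraper gets a picture
# instead of a clean text field to drop into a database.

def data_as_svg(value: str, watermark: str = "© example.com") -> str:
    return (
        '<svg xmlns="http://www.w3.org/2000/svg" width="240" height="60">'
        f'<text x="10" y="25" font-size="16">{value}</text>'
        f'<text x="10" y="50" font-size="10" fill="#bbb" opacity="0.6">'
        f'{watermark}</text></svg>'
    )

img = data_as_svg("$19.99")
```

    The point is not that images are unbreakable (OCR exists), but that they force the scraper into messy, lossy extraction instead of clean field population.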
     
  10. CredibleZephyre

    CredibleZephyre Registered Member

    Joined:
    Jun 10, 2013
    Messages:
    99
    Likes Received:
    27
    Yeah, images make it tough, but doesn't that make your content non-indexable?
     
  11. bartosimpsonio

    bartosimpsonio Jr. VIP Jr. VIP Premium Member

    Joined:
    Mar 21, 2013
    Messages:
    12,489
    Likes Received:
    11,190
    Occupation:
    CHEAP
    Location:
    DATASETS
    Home Page:
    Create ONE invisible link from a 1-pixel image; the robots will follow the link, the humans won't. Whoever follows that link is a scraper, so block them.
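    This variant can be sketched as blocking middleware rather than just logging: the trap URL, link markup, and handler shape below are illustrative.

```python
# Sketch of the 1-pixel-link trap as blocking middleware: the trap link is
# wrapped around a 1x1 transparent image no human will ever click. Any IP
# that requests the trap URL goes straight onto a blocklist.

TRAP_URL = "/hidden-trap"   # linked only via the invisible 1px image
TRAP_LINK = (f'<a href="{TRAP_URL}">'
             '<img src="/1px.gif" width="1" height="1" alt=""></a>')

blocked = set()

def middleware(path: str, ip: str) -> int:
    if ip in blocked:
        return 403          # already identified as a scraper
    if path == TRAP_URL:
        blocked.add(ip)     # followed the invisible link: treat as a bot
        return 403
    return 200
```

    One caveat: legitimate crawlers like Googlebot will also follow the link unless the trap URL is disallowed in robots.txt, so combine this with a robots.txt rule to avoid blocking search engines.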
     
    • Thanks Thanks x 1
  12. Repulsor

    Repulsor Power Member

    Joined:
    Jun 11, 2013
    Messages:
    770
    Likes Received:
    279
    Location:
    PHP Scripting ;)
    Well, anything except using JavaScript to load the content can be scraped even by a newbie programmer. Guaranteed. cURL is really powerful, and it's easy to overcome those tiny little blocks you may put forward: IPs, cookies, etc.

    Go with the JavaScript idea.
     
  13. validseo

    validseo Jr. VIP Jr. VIP Premium Member

    Joined:
    Jul 17, 2013
    Messages:
    910
    Likes Received:
    527
    Occupation:
    Professional SEO
    Location:
    Seattle, Wa
    But if Google can index it then anyone can scrape it... can't have it both ways. Even if you made it so only Googlebot sees the text version... people will just scrape the data from the Google cache... it is either wide open or made inconvenient with images.

    JavaScript won't help you either... it's easy to make bots that render JavaScript and perform mouseovers and clicks... it might stop a beginner, but nobody with any bot-writing experience will be stopped by those measures... at the end of the day the data is either text or images and it is all communicated via HTTP... everything else is cheap parlor tricks that attempt to make people think it is more complicated than that.
     
    • Thanks Thanks x 1
    Last edited: Dec 4, 2013
  14. Skyebug77

    Skyebug77 Jr. VIP Jr. VIP

    Joined:
    Mar 22, 2012
    Messages:
    2,017
    Likes Received:
    1,423
    Occupation:
    Marketing
    Location:
    Portland,Or
    There are only ways to keep inexperienced scrapers at bay. The best one for you would be to use JavaScript to obfuscate the content itself. But there is no stopping scraping pros unless you make the content exclusive, like password-protected. Then again, that's only good until a scraper signs up and gets access to the site's data. But then you could build a protocol which only allows signed-in users to make so many web requests on the site per username. IP blocking etc. won't work.
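    That per-account quota can be sketched as a sliding-window rate limiter keyed by username. The window length and limit below are illustrative:

```python
import time
from collections import defaultdict, deque

# Per-account quota sketch: even a logged-in scraper is capped at LIMIT
# page requests per rolling WINDOW seconds, keyed by username rather than
# by IP (which proxies make useless to block).

WINDOW = 60.0   # seconds
LIMIT = 30      # requests allowed per window per username

hits = defaultdict(deque)

def allow(username: str, now=None) -> bool:
    t = time.monotonic() if now is None else now
    q = hits[username]
    while q and t - q[0] > WINDOW:
        q.popleft()          # drop requests that fell out of the window
    if len(q) >= LIMIT:
        return False         # over quota: deny, throttle, or flag
    q.append(t)
    return True
```

    A human reading pages never hits a limit like 30/minute, but an account bulk-downloading the catalog trips it almost immediately.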
     
    Last edited: Dec 4, 2013
  15. validseo

    validseo Jr. VIP Jr. VIP Premium Member

    Joined:
    Jul 17, 2013
    Messages:
    910
    Likes Received:
    527
    Occupation:
    Professional SEO
    Location:
    Seattle, Wa
    If it is the entire collection that has the value (meaning 75% of the collection would be totally worthless on its own), then make a third of the pages always display the data in an image. That way most of your site is indexable and the scrapers can't have the whole dataset.

    If the pages have integer ids in the database, then id % 3 == 0 means the page's data is displayed in an image instead of text.
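    That partitioning rule is a one-liner; the residue class and the 1-in-3 split are illustrative and can be tuned to how much of the dataset you're willing to keep out of the index:

```python
# Sketch of the id-based partitioning: pages whose database id falls in one
# residue class always serve their data as an image, so no single crawl can
# ever yield the complete machine-readable dataset.

def serve_as_image(page_id: int) -> bool:
    """Every third page renders its data as an image instead of text."""
    return page_id % 3 == 0

rendered = ["image" if serve_as_image(i) else "text" for i in range(1, 7)]
```

    Because the split is deterministic by id, the same pages are always image-only, so a scraper can't fill the gaps by simply re-crawling.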
     
    Last edited: Dec 4, 2013
  16. Nigel Farage

    Nigel Farage BANNED BANNED

    Joined:
    Feb 8, 2012
    Messages:
    565
    Likes Received:
    1,495
    Have someone make you an AdSense-clicking autobot, get about 1000 legit proxies, and get their AdSense account banned.
     
  17. Corydoras007

    Corydoras007 Regular Member

    Joined:
    Sep 17, 2012
    Messages:
    356
    Likes Received:
    57
    This is actually a cool idea, and I think it can actually work if you're worried about your content getting scraped. There is no foolproof way of preventing scrapers, so you might as well give them incomplete data.

    My only worry is how this will affect your SEO. But I like the idea!
     
  18. Tensegrity

    Tensegrity Elite Member

    Joined:
    Apr 22, 2009
    Messages:
    1,846
    Likes Received:
    976
    As a veteran screen scraper, I can tell you that there is absolutely no way to completely stop screen scrapers aside from making your site approved members only. I do all my scraping with javascript and it's so massively easy that I've made a living doing it.

    What you need to do is tell your client to stop focusing on people scraping and focus more on selling. The internet is for sharing, that is how it works, how it runs, and how it will always be. If your client focuses more on advertising and finding customers, or using their skill for something that can't be scraped (like a service or custom job), then they will be more successful.