How would a forum AVOID being scraped?

Discussion in 'Black Hat SEO' started by azeds, Jun 12, 2015.

  1. azeds

    azeds Junior Member

    Joined:
    Jan 17, 2013
    Messages:
    157
    Likes Received:
    46
    As the title says, how would a website running forum software avoid being scraped?
     
  2. CoolAdvisor

    CoolAdvisor Senior Member

    Joined:
    Mar 24, 2008
    Messages:
    1,010
    Likes Received:
    371
    Not possible. Maybe try making your forum threads visible to registered members only.
     
  3. fatboy

    fatboy Elite Member

    Joined:
    Aug 13, 2008
    Messages:
    1,618
    Likes Received:
    3,229
    Occupation:
    Retired
    Location:
    Old Peoples Home
    May depend on how lazy the scraper is - if they use wget or HTTrack (I think) and don't change the config, you will see default user agents. Block those from your forum. However, anyone worth their salt will put real-looking UA strings in to stop that.
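    The user-agent check described above can be sketched as a simple server-side filter; a minimal Python sketch (the blocklist entries and function name are illustrative, not a complete list of default scraper signatures):

```python
# Sketch: reject requests whose User-Agent matches a known default
# scraper-tool signature. The entries below are examples only.
DEFAULT_SCRAPER_AGENTS = ("wget", "httrack", "curl", "python-requests")

def is_default_scraper(user_agent: str) -> bool:
    """Return True if the User-Agent looks like an unmodified scraper tool."""
    ua = (user_agent or "").lower()
    return any(sig in ua for sig in DEFAULT_SCRAPER_AGENTS)

# A forum front controller would then 403 such requests, e.g.:
# if is_default_scraper(request.headers.get("User-Agent", "")):
#     return error_403()
```

    As the post notes, this only catches lazy scrapers; anyone who sets a browser-like UA string sails straight through.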

    Again, you can sometimes judge if something is a scraper by the speed of it hitting the forum. How many humans do you know that can read 14 threads in 3 seconds? Put rate limiting in place and although you won't stop it happening you can piss them off by slowing them down a lot :)
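    The rate-limiting idea can be sketched as a per-IP sliding window; a minimal Python sketch (the limit and window values are illustrative, and a real forum would hook this into its request handling):

```python
import time
from collections import defaultdict, deque

# Sketch of per-IP rate limiting: allow at most `limit` requests per
# `window` seconds; anything reading faster than a human gets refused.
class RateLimiter:
    def __init__(self, limit: int = 10, window: float = 60.0):
        self.limit = limit
        self.window = window
        self.hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def allow(self, ip: str, now=None) -> bool:
        now = time.monotonic() if now is None else now
        q = self.hits[ip]
        while q and now - q[0] > self.window:  # drop hits outside the window
            q.popleft()
        if len(q) >= self.limit:
            return False  # too fast: throttle, delay, or serve a captcha
        q.append(now)
        return True
```

    Fourteen threads in three seconds trips this immediately, while a human reader never notices it.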

    Other than that, take your website offline. That will stop them :)
     
  4. Huy Phan

    Huy Phan Jr. VIP Jr. VIP

    Joined:
    Sep 3, 2012
    Messages:
    1,001
    Likes Received:
    571
    Home Page:
    Run it on localhost only. This is the only way :p
     
  5. AquaticGamer

    AquaticGamer Jr. VIP Jr. VIP

    Joined:
    Apr 13, 2013
    Messages:
    4,524
    Likes Received:
    1,626
    Gender:
    Male
    Location:
    http://InstaGrowth.AQSocials.com
    Home Page:
    I doubt you can avoid getting a forum scraped; all the content is visible to the general public and it's hard for you to keep them out. Alternatively, you could get a programmer to write you software where people can view your content but aren't able to left-click on the text to copy it.
     
  6. HoNeYBiRD

    HoNeYBiRD Jr. VIP Jr. VIP

    Joined:
    May 1, 2009
    Messages:
    6,954
    Likes Received:
    7,985
    Gender:
    Male
    Occupation:
    Geographer, Tourism Manager
    Location:
    Ghosted
    Do what BHW does: make it so only a few threads can be read without being a registered member (which CoolAdvisor recommended above), then do what fatboy said and force them to fill out a captcha every time they hit the limit. Use a difficult captcha, like Google's No CAPTCHA reCAPTCHA or a video captcha, which can't be easily solved by captcha-solving services.
    Also make it so an IP can only be used once to register, and blacklist free IPs (public proxies, free VPNs, Tor, etc.), so it's close to impossible to register a lot of accounts without spending a dime.

    That would be more than enough to piss them off.
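    The registration rules suggested above can be sketched in a few lines; a minimal Python sketch (the blacklist contents are illustrative, and a real deployment would load known proxy/VPN/Tor ranges from a maintained list):

```python
# Sketch of the registration rules: one signup per IP, plus a blacklist
# of known free proxy / VPN / Tor exit addresses (entries are examples).
class RegistrationGate:
    def __init__(self, blacklisted_ips):
        self.blacklisted = set(blacklisted_ips)
        self.used_ips = set()  # IPs that have already registered an account

    def register(self, ip: str) -> bool:
        """Allow a signup only from a clean, never-before-used IP."""
        if ip in self.blacklisted or ip in self.used_ips:
            return False
        self.used_ips.add(ip)  # this IP can never be used to sign up again
        return True
```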
     
  7. kneebox

    kneebox Newbie

    Joined:
    Jun 13, 2015
    Messages:
    3
    Likes Received:
    0
    hey u! i want to contact u to talk about Youtube. do u have skype or fb? i'm VN :D
     
  8. whiteblackseo

    whiteblackseo Jr. VIP Jr. VIP

    Joined:
    Apr 11, 2015
    Messages:
    2,461
    Likes Received:
    918
    Home Page:
    I don't think it is possible for a forum to not get scraped.
     
  9. BlackSpidey

    BlackSpidey Jr. VIP Jr. VIP Premium Member

    Joined:
    Aug 9, 2013
    Messages:
    199
    Likes Received:
    95
    Occupation:
    Waving
    Location:
    Web
    Sometimes if you change the visibility of the threads and posts in the forum to REGISTERED members only, it might work.
     
  10. myopic1

    myopic1 Regular Member

    Joined:
    Mar 24, 2014
    Messages:
    408
    Likes Received:
    402
    It involves some custom code but it can be done. Here's what I'd do to beat 99% of scrapers out there:

    1. Break your page structure randomly: have your PHP code insert random divs, on a random basis, that don't do anything. This will break most scrapers, and they won't even know why.

    2. Rotate CSS classes: if you know which parts of your forum are likely to be targeted, e.g. the comments, have your PHP rotate the occasional CSS class (of course, you'd need to duplicate your CSS rules to prevent your page layout from breaking). This will break the remaining scrapers.

    3. Rotate the IDs of your tags. This is annoying, but it depends on how determined you are.

    I'd say this will stop close to 100%, but nothing is certain.

    Just to clarify, the reason random rotation of page and tag features works is that scrapers rely upon a predetermined, reliable page structure and tag names in order to extract the desired information. Rather than playing a cat-and-mouse game of trying to block them, just make your information too difficult to extract and they'll go away of their own accord, wondering where they're going wrong.
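    Ideas 1 and 2 above (random decoy divs plus rotated CSS classes) can be sketched server-side; a minimal Python sketch, where the class names, markup, and probabilities are illustrative and every class in the rotation is assumed to be styled identically:

```python
import random

# Sketch: randomly inject a do-nothing decoy div and rotate between
# equivalent CSS class names, so positional XPaths and hard-coded
# selectors break from one page load to the next.
COMMENT_CLASSES = ["comment", "post-body", "entry-text"]  # styled identically

def render_comment(text: str, rng: random.Random) -> str:
    parts = []
    if rng.random() < 0.5:                    # decoy div on roughly half of loads
        parts.append('<div class="decoy"></div>')
    cls = rng.choice(COMMENT_CLASSES)         # rotate the CSS class per load
    parts.append(f'<div class="{cls}">{text}</div>')
    return "".join(parts)
```

    The content a human sees is identical on every load; only the structure a scraper keys on keeps shifting.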
     
    • Thanks Thanks x 2
    Last edited: Jun 13, 2015
  11. archon10

    archon10 BANNED BANNED

    Joined:
    Oct 10, 2011
    Messages:
    1,181
    Likes Received:
    1,667
    Those are good ideas, but I'm not sure about #1. Couldn't #1 affect users?
     
  12. Panther28

    Panther28 Jr. VIP Jr. VIP

    Joined:
    May 2, 2010
    Messages:
    2,548
    Likes Received:
    3,567
    Occupation:
    Internet.
    Location:
    Internet.
    Home Page:
    I managed to get myself blocked from a forum recently; they 403'd me. The amazing thing is that I can't seem to view the site anymore, not even from a proxy. I thought I had broken the site with my scraping, as I can't even get external website thumbnail generators to show anything but the 403 error, and yet when I do a Google search over the last 24 hours there are threads with new dates.
     
  13. myopic1

    myopic1 Regular Member

    Joined:
    Mar 24, 2014
    Messages:
    408
    Likes Received:
    402
    Users would be unaffected: a div tag that opens at the top and closes thereafter doesn't make a difference, and it can even be set not to display. Remember, it's just the HTML code we're trying to alter.
     
  14. acidol2

    acidol2 Supreme Member

    Joined:
    Sep 8, 2011
    Messages:
    1,322
    Likes Received:
    835
    Location:
    My Successful Future
    They probably got a watch dog spying on you.
    Trained to bark whenever you try and do something with their site.
     
  15. myopic1

    myopic1 Regular Member

    Joined:
    Mar 24, 2014
    Messages:
    408
    Likes Received:
    402
    I've never seen a live case of it, but it's possible that they've fingerprinted your browser (it's what I would do) using identifiers like plugins, fonts, etc. Switch browsers in addition to IP address and see how it goes.
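    The fingerprinting idea boils down to hashing a set of browser attributes into an identifier that survives an IP change; a minimal Python sketch (the attribute names are illustrative, and a real fingerprint would draw on many more signals):

```python
import hashlib

# Sketch: combine browser attributes (user agent, plugins, fonts, screen
# size, ...) into a stable identifier, independent of the client's IP.
def fingerprint(attrs: dict) -> str:
    # Sort keys so the same attributes always hash the same way.
    canonical = "|".join(f"{k}={attrs[k]}" for k in sorted(attrs))
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]
```

    This is why switching browsers (not just proxies) matters: a new browser changes the attributes, and with them the hash.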
     
    Last edited: Jun 14, 2015
  16. archon10

    archon10 BANNED BANNED

    Joined:
    Oct 10, 2011
    Messages:
    1,181
    Likes Received:
    1,667
    I like your ideas. I'm trying to visualize the div though. If it's a hidden div, wouldn't the scraper then have a hidden div so their site would be unaffected? I think the other ideas are good.
     
  17. myopic1

    myopic1 Regular Member

    Joined:
    Mar 24, 2014
    Messages:
    408
    Likes Received:
    402
    They would, but you're making that div appear on random page loads in different places, making any XPaths or positional scraping methods useless. They'll design their scraper, run it, only to find it doesn't work... redo the scraper, but oh no, the page has changed again!

    The coding logic for this would be something like this: on page load, generate a random number between 0 and 3; if the number is 2, display the div, else don't.

    Have 4-5 of those randomly appearing divs in the right places and you're good to go. The scraper works on the HTML code rather than the appearance of the website, so whether you see the visual effects of the div is immaterial.
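    The logic described above is only a few lines; a minimal Python sketch (the class name is illustrative):

```python
import random

# Sketch of the described logic: on page load, pick a random number
# between 0 and 3; if it is 2, emit the decoy div, otherwise emit nothing.
def maybe_decoy_div(rng: random.Random) -> str:
    n = rng.randint(0, 3)  # uniform over 0, 1, 2, 3
    return '<div class="filler"></div>' if n == 2 else ""
```

    Sprinkling 4-5 independent calls like this through the template means each page load has a different structure, even though every load renders identically in the browser.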
     
    Last edited: Jun 14, 2015
  18. myopic1

    myopic1 Regular Member

    Joined:
    Mar 24, 2014
    Messages:
    408
    Likes Received:
    402
    Why didn't I think of this before? Hell, this doesn't just work against scrapers; it could be used to generate 'fresh' content.

    Your homepage could change every time it's loaded: a sentence here, an image there, a few div tags that keep Google wanting to visit your page and discover new content. It would need to be done server side, though; I don't think the same 'importance' is given to JavaScript-generated content... but I could be wrong.

    PBN sites could change daily.