Discussion in 'Black Hat SEO' started by azeds, Jun 12, 2015.
As the title says, how would a website running forum software avoid being scraped?
Not possible. Maybe try making your forum threads visible to registered members only.
May depend on how lazy the scraper is - if they use wget or HTTrack (I think) and don't change the config, you will see default user agents. Block those from your forum. However, anyone worth their salt will put real-looking UA strings in to stop that.
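As a rough illustration of the user-agent blocking idea above, here is a minimal Python sketch. The substring list and the `is_blocked_ua()` helper are made up for this example, not part of any specific forum software:

```python
# Reject requests whose User-Agent matches known default scraper strings.
# The substring list is illustrative; real-looking UAs will slip through,
# exactly as noted above.
BLOCKED_UA_SUBSTRINGS = ["wget", "httrack", "curl", "python-requests", "scrapy"]

def is_blocked_ua(user_agent: str) -> bool:
    ua = (user_agent or "").lower()
    return any(s in ua for s in BLOCKED_UA_SUBSTRINGS)

print(is_blocked_ua("Wget/1.21.2"))  # default wget UA gets caught
print(is_blocked_ua("Mozilla/5.0 (Windows NT 10.0; Win64; x64)"))  # spoofed UA passes
```

In practice you'd hook a check like this into your web server or forum middleware and return a 403 on a match.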
Again, you can sometimes tell something is a scraper by the speed it hits the forum. How many humans do you know who can read 14 threads in 3 seconds? Put rate limiting in place - you won't stop it happening, but you can piss them off by slowing them down a lot.
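The rate-limiting idea can be sketched as a per-IP sliding window. The window size and hit limit here are illustrative numbers, not recommendations:

```python
import time
from collections import defaultdict, deque

# Cap page views per IP in a sliding time window. No human reads 14
# threads in 3 seconds, so a scraper trips this almost immediately.
WINDOW_SECONDS = 3.0
MAX_HITS = 5

_hits = defaultdict(deque)

def allow(ip: str, now: float = None) -> bool:
    now = time.monotonic() if now is None else now
    q = _hits[ip]
    while q and now - q[0] > WINDOW_SECONDS:
        q.popleft()                # drop hits that fell out of the window
    if len(q) >= MAX_HITS:
        return False               # throttle: serve a 429 or just delay
    q.append(now)
    return True

# A scraper requesting 14 pages in under 3 seconds gets cut off after 5:
results = [allow("1.2.3.4", now=t * 0.2) for t in range(14)]
print(results.count(True))  # 5
```

A real deployment would do this at the web server or reverse proxy rather than in application code, but the logic is the same.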
Other than that, take your website offline. That will stop them
Run it on localhost only. This is the only way
I doubt you can avoid getting a forum scraped - all the content is visible to the general public, and it's hard to keep them away. Alternatively, you could get a programmer to build you something where people can view your content but aren't able to select the text to copy it.
Do what BHW does - you can only read a few threads without being a registered member, which was recommended by CoolAdvisor above. Then do what fatboy said and force them to fill out a captcha every time they hit the limit. Use a difficult captcha, like Google's No CAPTCHA reCAPTCHA or a video captcha, which can't be easily solved by captcha-solving services.
Also make it so an IP can only be used once to register, and blacklist free IPs (public proxies, free VPNs, Tor, etc.), so it's close to impossible to register a lot of accounts without spending a dime.
That would be more than enough to piss them off.
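The "few free views, then captcha" idea above can be sketched like this. The names (`needs_captcha`, `FREE_VIEWS`) and the in-memory dict are made up for illustration; a real forum would track this server-side per session or IP:

```python
# After FREE_VIEWS guest page views, every further view requires a captcha.
FREE_VIEWS = 3
_guest_views = {}

def needs_captcha(session_id: str) -> bool:
    count = _guest_views.get(session_id, 0) + 1
    _guest_views[session_id] = count
    return count > FREE_VIEWS

hits = [needs_captcha("guest-42") for _ in range(5)]
print(hits)  # [False, False, False, True, True]
```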
I don't think it is possible for a forum to not get scraped.
Sometimes changing the visibility of the threads and posts in the forum to registered members only might work.
It involves some custom code, but it can be done. Here's what I'd do to beat 99% of scrapers out there:
1. Break your page structure randomly: have your PHP code insert random divs on a random basis that don't do anything. This will break most scrapers, and they won't even know why.
2. Rotate CSS classes: if you know which parts of your forum are likely to be targeted, e.g. the comments, have your PHP rotate the occasional CSS class (of course you'd need to duplicate your CSS rules to prevent your page layout from breaking). This will break the remaining scrapers.
3. Rotate the IDs of your tags. This is annoying, but it depends on how determined you are.
I'd say this will stop close to 100% of them, but nothing is certain.
Just to clarify, the reason random rotation of page and tag features works is because scrapers rely upon predetermined, reliable page structure and tag names in order to extract the desired information. Rather than playing a cat and mouse game of trying to block them, just make your information too difficult to extract and they'll go away of their own accord wondering where they're going wrong.
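To make the class-rotation idea (point 2 above) concrete, here is a minimal Python sketch. The class names are invented for the example; your stylesheet would need identical rules for each alias, as the post notes:

```python
import random

# Rotate the CSS class on a target element per page load so class-based
# and positional selectors stop matching reliably between loads.
COMMENT_CLASS_ALIASES = ["comment", "post-body", "entry-text"]

def render_comment(text: str) -> str:
    cls = random.choice(COMMENT_CLASS_ALIASES)
    return f'<div class="{cls}">{text}</div>'

html = render_comment("hello")
print(html)  # the class varies from load to load, the content doesn't
```

A scraper keyed on `div.comment` now only matches a third of page loads, and the author's point is that the scraper operator won't immediately know why.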
Those are good ideas, but I'm not sure about #1. Couldn't it affect users?
I managed to get myself blocked from a forum recently - they 403'd me - but the amazing thing is that I can't seem to view the site anymore, not even from a proxy. I thought I had broken the site through scraping, as I can't even get external website thumbnail generators to show anything but the 403 error, and yet when I do a Google search for the last 24 hours, there are threads with new dates.
Users would be unaffected: a div tag that opens at the top and closes right after doesn't make a difference, and it can even be set not to display. Remember, it's just the HTML code we're trying to alter.
They probably have a watchdog spying on you.
Trained to bark whenever you try to do something with their site.
I've never seen a live case of it, but it's possible that they've fingerprinted your browser (it's what I would do) using identifiers like plugins, fonts, etc. Switch browsers in addition to your IP address and see how it goes.
I like your ideas. I'm trying to visualize the div though. If it's a hidden div, wouldn't the scraper's copy just include the hidden div, so their site would be unaffected? I think the other ideas are good.
They would, but you're making that div appear on random page loads in different places, making any XPaths or positional scraping methods useless. They'll design their scraper, run it, only to find it doesn't work... redo the scraper, but oh no, the page has changed again!
The coding logic would be something like this: on page load, generate a random number between 0 and 3; if the number is 2, display the div, else don't.
Have 4-5 of those randomly appearing divs in the right places and you're good to go. The scraper works on the HTML code rather than the appearance of the website, so whether you see the visual effects of the div is immaterial.
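The "roll 0-3, show the div on a 2" logic above can be sketched in a few lines of Python. The function names and the `display:none` decoy markup are illustrative:

```python
import random

# On each page load, roll 0-3 and inject an invisible decoy div only when
# the roll is 2. Readers see nothing, but the DOM shifts between loads,
# so fixed XPaths and positional selectors break.
def maybe_decoy() -> str:
    return '<div style="display:none"></div>' if random.randint(0, 3) == 2 else ""

def render_page(body: str) -> str:
    return f"<html><body>{maybe_decoy()}{body}{maybe_decoy()}</body></html>"

print(render_page("<p>thread content</p>"))
```

Each decoy appears on roughly a quarter of loads; with 4-5 of them scattered through the template, almost every load has a different structure.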
Why didn't I think of this before? Hell, this doesn't just work against scrapers; it could be used to generate 'fresh' content.
PBN sites could change daily.