How would a forum AVOID being scraped?

azeds · Jun 12, 2015

As the title says, how would a website running forum software avoid being scraped?

CoolAdvisor · Jun 12, 2015

Not possible. Maybe try making your forum threads visible to registered members only.

fatboy · Jun 12, 2015

May depend on how lazy the scraper is. If they use wget or HTTrack (I think) and don't change the config, you will see the default user agents. Block those from your forum. However, anyone worth their salt will put in real-looking UA strings to stop that.
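A minimal sketch of that user-agent check, in Python for brevity. The substrings in `BLOCKED_UA_SUBSTRINGS` and the function name are assumptions for illustration, and treating an empty User-Agent as suspicious is an extra choice not mentioned in the post.

```python
# Hypothetical blocklist: substrings assumed to appear in the default
# User-Agent strings of off-the-shelf mirroring tools.
BLOCKED_UA_SUBSTRINGS = ("wget", "httrack")


def is_default_scraper_ua(user_agent):
    """Return True if the User-Agent looks like an unmodified scraping tool."""
    # Assumption: a missing or empty User-Agent is treated as suspicious too.
    if not user_agent:
        return True
    ua = user_agent.lower()
    return any(token in ua for token in BLOCKED_UA_SUBSTRINGS)
```

A request handler could return a 403 whenever this function returns True. As the post says, it only catches scrapers that never changed their defaults.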

Again, you can sometimes judge whether something is a scraper by the speed at which it hits the forum. How many humans do you know who can read 14 threads in 3 seconds? Put rate limiting in place; you won't stop it happening, but you can piss them off by slowing them down a lot.
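One way the rate limiting could look, as a rough sliding-window sketch in Python. The class name, the default of 10 requests per 3 seconds, and keying on a client id such as an IP address are all assumptions, not anything the forum software provides.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per client within window_seconds."""

    def __init__(self, max_requests=10, window_seconds=3.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client id -> timestamps of allowed requests

    def allow(self, client_id, now=None):
        """Return True and record the request, or False if the client is over the limit."""
        now = time.monotonic() if now is None else now
        hits = self._hits[client_id]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

When `allow` returns False the server could delay the response or answer with HTTP 429, which slows a scraper down without affecting anyone reading at human speed.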

Other than that, take your website offline. That will stop them

Huy Phan · Jun 12, 2015

Run it on localhost only. This is the only way

AquaticGamer · Jun 12, 2015

I doubt you can avoid getting a forum scraped. All the content is visible to the general public, and it's hard for you to keep them out. Alternatively, you could get a programmer to write you some software that lets people view your content but doesn't let them select the text with the mouse to copy it.

HoNeYBiRD · Jun 12, 2015

Do what BHW does and let people read only a few threads without being a registered member, which is what CoolAdvisor recommended above. Then do what fatboy said and force them to fill out a captcha every time they hit the limit. Use a difficult captcha, like Google's No CAPTCHA reCAPTCHA or a video captcha, which can't easily be solved by captcha-solving services.
Also make it so that an IP can be used only once to register, and blacklist free IPs (public proxies, free VPNs, Tor etc.), so that it's close to impossible to register a lot of accounts without spending a dime.

that would be more than enough to piss them off
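The guest limit plus captcha could be tracked roughly like this (a Python sketch). `GuestViewGate`, the limit of 5 views, and identifying guests by IP or cookie are hypothetical; the captcha itself would be verified elsewhere.

```python
from collections import Counter

GUEST_FREE_VIEWS = 5  # hypothetical limit; the post only says "a few threads"


class GuestViewGate:
    """Count thread views per guest and demand a captcha once the quota is used."""

    def __init__(self, free_views=GUEST_FREE_VIEWS):
        self.free_views = free_views
        self._views = Counter()  # guest id (e.g. IP or cookie) -> views since last captcha

    def needs_captcha(self, guest_id):
        """Return True if this guest must solve a captcha before the next view."""
        return self._views[guest_id] >= self.free_views

    def record_view(self, guest_id):
        """Count one thread view for this guest."""
        self._views[guest_id] += 1

    def captcha_solved(self, guest_id):
        """Reset the quota after a successful captcha check (done elsewhere)."""
        self._views[guest_id] = 0
```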

kneebox · Jun 13, 2015

Huy Phan said:
Run it on localhost only. This is the only way

hey u! i want to contact u to talk about Youtube. do u have skype or fb? i'm VN

whiteblackseo · Jun 13, 2015

I don't think it is possible for a forum to not get scraped.

BlackSpidey · Jun 13, 2015

Sometimes, if you change the visibility of the threads and posts in the forum to registered members only, it might work.

myopic1 · Jun 13, 2015

It involves some custom code but it can be done. Here's what I'd do to beat 99% of scrapers out there:

1. Break your page structure randomly: Have your PHP code randomly insert divs that don't do anything. This will break most scrapers and they won't even know why.

2. Rotate CSS classes: If you know which parts of your forum are likely to be targeted, e.g. the comments, have your PHP rotate the occasional CSS class (of course you'd need to duplicate your CSS rules under each class name to prevent your page layout from breaking). This will break the remaining scrapers.

3. Rotate the ids of your tags. This is annoying but it depends on how determined you are.

I'd say this will stop 100% of them, but nothing is certain.

Just to clarify, the reason random rotation of page and tag features works is that scrapers rely upon a predetermined, reliable page structure and tag names in order to extract the desired information. Rather than playing a cat-and-mouse game of trying to block them, just make your information too difficult to extract and they'll go away of their own accord, wondering where they're going wrong.
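The class rotation in point 2 might be sketched like this. It is written in Python rather than PHP for brevity, and every class name in `CLASS_ALIASES` is made up; the stylesheet is assumed to define identical rules for each alias.

```python
import random
from html import escape

# Hypothetical alias table: each logical class maps to several equivalent names.
# The stylesheet is assumed to define identical rules for every alias.
CLASS_ALIASES = {
    "post-body": ["post-body", "pb-a1", "msg-x7"],
    "post-author": ["post-author", "pa-k3", "usr-q9"],
}


def pick_class(logical_name, rng=random):
    """Return one of the interchangeable class names for a logical element."""
    return rng.choice(CLASS_ALIASES[logical_name])


def render_post(author, body, rng=random):
    """Render a post whose class names can differ from one page load to the next."""
    return (
        f'<div class="{pick_class("post-body", rng)}">'
        f'<span class="{pick_class("post-author", rng)}">{escape(author)}</span>'
        f"{escape(body)}</div>"
    )
```

A selector like `div.post-body` then matches only some of the time, while visitors see the same layout on every load.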

archon10 · Jun 14, 2015

Those are good ideas, but I'm not sure about #1. #1 could affect users?

Panther28 · Jun 14, 2015

I managed to get myself blocked from a forum recently. They 403'd me, but the amazing thing is that I can't seem to view the site anymore, not even through a proxy. I thought I had broken the site by scraping it, as I can't even get external website thumbnail generators to show anything but the 403 error, and yet when I search Google for the last 24 hours there are threads with new dates.

myopic1 · Jun 14, 2015

archon10 said:
Those are good ideas, but I'm not sure about #1. #1 could affect users?

Users would be unaffected. A div tag that opens at the top and closes thereafter doesn't make a difference, and it can even be set to not display. Remember, it's just the HTML code we're trying to alter.

acidol2 · Jun 14, 2015

Panther28 said:
I managed to get myself blocked from a forum recently. They 403'd me, but the amazing thing is that I can't seem to view the site anymore, not even through a proxy. I thought I had broken the site by scraping it, as I can't even get external website thumbnail generators to show anything but the 403 error, and yet when I search Google for the last 24 hours there are threads with new dates.

They probably got a watch dog spying on you.
Trained to bark whenever you try and do something with their site.

myopic1 · Jun 14, 2015

Panther28 said:
I managed to get myself blocked from a forum recently. They 403'd me, but the amazing thing is that I can't seem to view the site anymore, not even through a proxy. I thought I had broken the site by scraping it, as I can't even get external website thumbnail generators to show anything but the 403 error, and yet when I search Google for the last 24 hours there are threads with new dates.

I've never seen a live case of it, but it's possible that they've fingerprinted your browser (it's what I would do) using identifiers like plugins, fonts etc. Switch browsers in addition to changing your IP address and see how it goes.
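For illustration only, a fingerprint of that kind could be as simple as a hash over whatever traits the site collects. This Python sketch assumes the traits arrive as a plain dict; the function name and the 16-character truncation are arbitrary choices, and nothing here is known about what that forum actually does.

```python
import hashlib


def browser_fingerprint(attributes):
    """Hash a dict of browser traits into a short, stable identifier."""
    # Sort keys so the same traits always produce the same digest.
    canonical = "|".join(f"{key}={attributes[key]}" for key in sorted(attributes))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
```

Because the IP address is not part of the hash, a visitor who changes proxy but keeps the same browser produces the same identifier, which would explain a block that follows you across proxies.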

archon10 · Jun 14, 2015

myopic1 said:
Users would be unaffected. A div tag that opens at the top and closes thereafter doesn't make a difference, and it can even be set to not display. Remember, it's just the HTML code we're trying to alter.

I like your ideas. I'm trying to visualize the div though. If it's a hidden div, wouldn't the scraper then have a hidden div so their site would be unaffected? I think the other ideas are good.

myopic1 · Jun 14, 2015

archon10 said:
I like your ideas. I'm trying to visualize the div though. If it's a hidden div, wouldn't the scraper then have a hidden div so their site would be unaffected? I think the other ideas are good.

They would but you're making that div appear on random page loads in different places, making any xpaths or positional scraping methods useless. They'll design their scraper, run it, only to find it doesn't work...redo the scraper but, oh no the page has changed again!

The coding logic for this would be something like this: on page load, generate random number between 0 and 3, if number is 2 then display div, else don't.

Have 4-5 of those randomly appearing divs in the right places and you're good to go. The scraper works on the HTML code rather than the appearance of the website, so whether you see the visual effects of the div is immaterial.
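The logic above can be sketched as follows, in Python rather than PHP for brevity. The function names are made up, and this version inserts empty hidden divs between posts instead of wrapping content, which is an assumption that still shifts positional XPaths.

```python
import random


def maybe_decoy_div(rng=random):
    """Return an empty hidden div on roughly one call in four, else ''."""
    # randint(0, 3) is inclusive, so it yields 0, 1, 2 or 3; only 2 emits the div.
    if rng.randint(0, 3) == 2:
        return '<div style="display:none"></div>'
    return ""


def render_thread(posts, rng=random):
    """Join post fragments, giving the slot before each post its own decoy roll."""
    parts = []
    for post_html in posts:
        parts.append(maybe_decoy_div(rng))
        parts.append(post_html)
    return "".join(parts)
```

Each slot is rolled independently on every page load, so an expression such as "the third div on the page" points at a different element from one request to the next.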

myopic1 · Jun 14, 2015

Why didn't I think of this before? Hell, this doesn't just work against scrapers, it could be used to generate 'fresh' content.

Your homepage could change every time it's loaded: a sentence here, an image there, a few div tags that keep Google wanting to visit your page and discover new content. It would need to be done server side though. I don't think the same 'importance' is given to JavaScript-generated content...but I could be wrong.

PBN sites could change daily.
