
[GET] .htaccess code to block backlink crawlers

Discussion in 'Black Hat SEO' started by dee_emm_tee, Dec 4, 2016.

  1. dee_emm_tee

    dee_emm_tee Junior Member

    Joined:
    Oct 6, 2016
    Messages:
    130
    Likes Received:
    49
    Gender:
    Male
    When blocking crawlers from PBNs, most people blacklist specific crawlers in the .htaccess file. This is a perfectly valid way of blocking unwanted crawlers from your sites, but it can leave you vulnerable. Here’s why:
    There are many small, lesser-known backlink tools out there, and unless you can determine their user agent and block it, there is a chance they will crawl your PBN. On top of that, every time a new backlink tool appears with an unknown user agent, you risk your competitors discovering your PBN.
    Rather than allowing all UAs by default and only blacklisting certain ones, my solution is to disallow all UAs and whitelist only the ones I want to be able to access my site. I haven’t seen this type of .htaccess code shared here before, so I figured I’d post it:

    Code:
    # SetEnvIfNoCase matches the pattern anywhere in the UA, case-insensitively,
    # so one token covers desktop and mobile variants alike.
    SetEnvIfNoCase User-Agent "Googlebot" good_agent
    SetEnvIfNoCase User-Agent "bingbot" good_agent
    SetEnvIfNoCase User-Agent "Yahoo" good_agent
    SetEnvIfNoCase User-Agent "DuckDuckBot" good_agent
    SetEnvIfNoCase User-Agent "DuckDuckGo Mobile" good_agent
    SetEnvIfNoCase User-Agent "MSIE" good_agent
    SetEnvIfNoCase User-Agent "Firefox" good_agent
    SetEnvIfNoCase User-Agent "Safari" good_agent
    SetEnvIfNoCase User-Agent "Chrome" good_agent
    SetEnvIfNoCase User-Agent "Opera" good_agent
    SetEnvIfNoCase User-Agent "UC Browser" good_agent
    SetEnvIfNoCase User-Agent "Iron" good_agent
    SetEnvIfNoCase User-Agent "Android" good_agent
    # Firefox pre-release builds identify by codename instead of "Firefox"
    SetEnvIfNoCase User-Agent "BonEcho" good_agent
    SetEnvIfNoCase User-Agent "GranParadiso" good_agent
    SetEnvIfNoCase User-Agent "Lorentz" good_agent
    SetEnvIfNoCase User-Agent "Minefield" good_agent
    SetEnvIfNoCase User-Agent "Namoroka" good_agent
    SetEnvIfNoCase User-Agent "Shiretoko" good_agent
    
    <Limit GET POST HEAD>
    # Note: no space after the comma, or Apache throws a 500
    Order Deny,Allow
    Deny from all
    Allow from env=good_agent
    </Limit>
    
    Putting this code in your .htaccess file will block every user agent from your site by default, and only allow those designated as “good_agent”.
    This means that all existing backlink crawlers with unknown user agents, as well as any new ones that appear in the future, will get a 403 error when they try to crawl your PBN.
    Of course, user agents are easy to spoof, so it is still possible that some backlink crawlers could crawl your site while masquerading as Googlebot, but whatcha gonna do.
    If the code above causes a 500 internal server error, check that the directive reads exactly “Order Deny,Allow” (no space after the comma), or remove that line. On Apache 2.4+, the Order/Deny/Allow directives only work with mod_access_compat enabled; the modern equivalent uses Require directives.
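    If you want to sanity-check the whitelist before pushing it to a live site, here's a rough Python sketch of the same matching SetEnvIfNoCase does (an unanchored, case-insensitive match against the UA string). The example UA strings below are just illustrations, not an authoritative list:

```python
import re

# Tokens from the .htaccess whitelist above. SetEnvIfNoCase matches the
# pattern anywhere in the User-Agent, case-insensitively, so one token
# covers desktop and mobile variants alike.
WHITELIST = [
    "Googlebot", "bingbot", "Yahoo", "DuckDuckBot", "MSIE",
    "Firefox", "Safari", "Chrome", "Opera", "UC Browser",
    "Iron", "Android",
]

def is_good_agent(user_agent: str) -> bool:
    """Return True if this UA would be tagged good_agent (i.e. allowed)."""
    return any(re.search(re.escape(token), user_agent, re.IGNORECASE)
               for token in WHITELIST)

# A normal browser gets through; an unknown backlink bot would see a 403.
print(is_good_agent("Mozilla/5.0 (Windows NT 10.0; rv:50.0) Gecko/20100101 Firefox/50.0"))  # True
print(is_good_agent("SomeBacklinkBot/1.0 (+http://example.com/bot)"))                       # False
```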

    If anyone has any suggestions/contributions to this code I’d love to hear them!
     
    • Thanks Thanks x 1
    Last edited: Dec 4, 2016
  2. The Data Scientist

    The Data Scientist Jr. VIP Jr. VIP

    Joined:
    Nov 3, 2016
    Messages:
    214
    Likes Received:
    54
    Gender:
    Male
    Occupation:
    Data Scientist, SEO & Entrepreneur
    Location:
    WWW
    I wouldn't block IE8, because every UA can be and will be spoofed by crawlers. Apart from that, this is a good approach.
     
  3. bambi

    bambi Junior Member

    Joined:
    Aug 9, 2008
    Messages:
    179
    Likes Received:
    63
    Gender:
    Female
    What line should be added to keep IE?

    Does this affect traffic coming from these sources too, or is it strictly for bots?

    Thank you. :)
     
  4. dee_emm_tee

    dee_emm_tee Junior Member

    Joined:
    Oct 6, 2016
    Messages:
    130
    Likes Received:
    49
    Gender:
    Male
    Good point - I added IE to the whitelist
     
  5. dee_emm_tee

    dee_emm_tee Junior Member

    Joined:
    Oct 6, 2016
    Messages:
    130
    Likes Received:
    49
    Gender:
    Male
    This configuration will block all traffic, EXCEPT traffic that is using one of the user agents designated as "good_agent". Non-bot traffic will still be blocked if its user agent is not specified in the whitelist.

    As per The Data Scientist's suggestion I have updated the whitelist to include Internet Explorer. Copy the updated code and you'll be good to go:)
     
  6. MuayThai

    MuayThai Jr. VIP Jr. VIP

    Joined:
    Aug 25, 2015
    Messages:
    637
    Likes Received:
    190
    Unfortunately, on my website I have noticed a lot of bots that pretend to be Googlebot.
     
  7. dee_emm_tee

    dee_emm_tee Junior Member

    Joined:
    Oct 6, 2016
    Messages:
    130
    Likes Received:
    49
    Gender:
    Male
    Yes, it sucks. Afaik there's no way around it aside from knowing the IPs of the bots you want to block.
    Neither whitelisting nor blacklisting crawlers is bulletproof, but in my opinion whitelisting is more effective.
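    For what it's worth, Google does document a way to separate real Googlebot from impostors: forward-confirmed reverse DNS. A rough Python sketch (the resolver arguments default to the real socket calls; they're parameters here only so the logic can be tested offline):

```python
import socket

def is_real_googlebot(ip, reverse=socket.gethostbyaddr,
                      forward=socket.gethostbyname_ex):
    """Forward-confirmed reverse DNS check, per Google's documented method:
    the IP must reverse-resolve to a googlebot.com/google.com host, and that
    host must resolve back to the same IP. Spoofed UAs fail this check."""
    try:
        host = reverse(ip)[0]            # reverse DNS lookup
    except OSError:
        return False
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        forward_ips = forward(host)[2]   # forward DNS lookup
    except OSError:
        return False
    return ip in forward_ips
```

    You could run something like this against your access logs periodically and add the IPs that fail the check to a Deny list.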