Blocking Link Crawler Bots Via HTAccess with Apache Webserver is Mostly Futile

Discussion in 'Black Hat SEO' started by cottonwolf, Oct 9, 2015.

  1. cottonwolf

    cottonwolf Regular Member

    Joined:
    Jan 20, 2015
    Messages:
    469
    Likes Received:
    239
    You've read all the recommendations and confusing .htaccess block rules claiming that blocking link crawlers like ahrefs, majestic and OSE is an effective way to stop your competitors and the link crawlers from learning about your backlinks and from storing your links in their databases.

    From my own observations over 3 months on a new amazon affiliate site, these crawlers rarely send their true user agent, and when they do, apache blocks them if you've got the proper deny rules in place.
    Look at the log excerpts below.
    The log file I'm pulling these from has been there since the site's beginning and is 495MB, so don't think I'm pulling my assumptions out of thin air.
    Code:
    -rw-r--r-- 1 www-data    root        495M Oct  9 10:04 wordpress.access.log
    
    The linux commands I'm using to filter out these lines are "cat /path/to/wordpress.access.log | grep -v '1.2.3.4\|5.6.7.8\|2.2.2.2' | grep majestic", "cat /path/to/wordpress.access.log | grep -v 'certain ips of mine to avoid these appearing' | grep ahrefs" and "cat /path/to/wordpress.access.log | grep -v 'certain ips of mine to avoid these appearing' | grep roger". The grep -v part drops requests from my own IPs. There is no rogerbot, rogerBot, opensiteexplorer, seomoz or moz.com in my log file. I based the rogerbot queries on http://www.botopedia.org/user-agent-list/crawlers/item/369-rogerbot-seomoz.
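    The same filtering can be sketched in Python. This is only an illustration, assuming the log keeps the layout shown in the excerpts in this post (client IP first, user agent as the third quoted field). The function name filter_bot_lines and the second sample line are made up for the example; the first sample line is copied from the log excerpt.

```python
import re

# Assumed layout: ip, two dash fields, [timestamp], "request", status, size,
# "referrer", "user agent". Extra trailing fields are ignored.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def filter_bot_lines(lines, bot_name, own_ips=()):
    """Yield parsed entries whose user agent contains bot_name, skipping own_ips."""
    needle = bot_name.lower()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # line is not in the assumed format
        if match["ip"] in own_ips:
            continue  # exact IP comparison, unlike an unescaped grep pattern
        if needle in match["agent"].lower():
            yield match.groupdict()


sample = [
    # Copied from the log excerpt in this post.
    '188.165.15.239 - - [07/Aug/2015:14:45:36 +0100] "GET /robots.txt HTTP/1.1" 403 199 "-" '
    '"Mozilla/5.0 (compatible; AhrefsBot/5.0; +http://ahrefs.com/robot/)" "-" 174 0.002 "219" "0.002"',
    # Hypothetical line standing in for a request from one of the site owner's IPs.
    '1.2.3.4 - - [07/Aug/2015:14:50:00 +0100] "GET / HTTP/1.1" 200 5120 "-" '
    '"Mozilla/5.0 (compatible; AhrefsBot/5.0; +http://ahrefs.com/robot/)"',
]

hits = list(filter_bot_lines(sample, "ahrefs", own_ips={"1.2.3.4"}))
for hit in hits:
    print(hit["ip"], hit["status"], hit["request"])
```

    Comparing whole IP strings avoids a side effect of the grep version, where an unescaped dot matches any character.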
    Ahrefs: the 403 codes mean the requests were blocked by apache. You only see the ahrefs user agent when it asks for robots.txt. I think that when it gets a 403 after requesting a domain's robots.txt, it makes a note of this in the ahrefs.com database and marks that domain to be crawled with a fake user agent to avoid being 403'd. That can't be difficult for a company with ahrefs' resources. This is speculation, of course.
    Code:
    188.165.15.239 - - [07/Aug/2015:14:45:36 +0100] "GET /robots.txt HTTP/1.1" 403 199 "-" "Mozilla/5.0 (compatible; AhrefsBot/5.0; +http://ahrefs.com/robot/)" "-" 174 0.002 "219" "0.002"
    151.80.31.130 - - [09/Aug/2015:15:22:40 +0100] "GET /robots.txt HTTP/1.1" 403 199 "-" "Mozilla/5.0 (compatible; AhrefsBot/5.0; +http://ahrefs.com/robot/)" "-" 174 0.002 "219" "0.002"
    151.80.31.114 - - [15/Aug/2015:04:06:10 +0100] "GET /robots.txt HTTP/1.1" 403 199 "-" "Mozilla/5.0 (compatible; AhrefsBot/5.0; +http://ahrefs.com/robot/)" "-" 174 0.002 "219" "0.002"
    188.165.15.32 - - [18/Aug/2015:18:40:21 +0100] "GET /robots.txt HTTP/1.1" 403 199 "-" "Mozilla/5.0 (compatible; AhrefsBot/5.0; +http://ahrefs.com/robot/)" "-" 174 0.003 "219" "0.003"
    188.165.15.132 - - [27/Aug/2015:05:38:54 +0100] "GET /robots.txt HTTP/1.1" 403 199 "-" "Mozilla/5.0 (compatible; AhrefsBot/5.0; +http://ahrefs.com/robot/)" "-" 174 0.001 "219" "0.001"
    188.165.15.10 - - [30/Aug/2015:20:18:55 +0100] "GET /robots.txt HTTP/1.1" 403 199 "-" "Mozilla/5.0 (compatible; AhrefsBot/5.0; +http://ahrefs.com/robot/)" "-" 174 0.003 "219" "0.003"
    188.165.15.195 - - [06/Sep/2015:14:41:59 +0100] "GET /robots.txt HTTP/1.1" 403 199 "-" "Mozilla/5.0 (compatible; AhrefsBot/5.0; +http://ahrefs.com/robot/)" "-" 174 0.003 "219" "0.003"
    188.165.15.230 - - [10/Sep/2015:00:22:28 +0100] "GET /robots.txt HTTP/1.1" 403 199 "-" "Mozilla/5.0 (compatible; AhrefsBot/5.0; +http://ahrefs.com/robot/)" "-" 174 0.002 "219" "0.002"
    188.165.15.108 - - [16/Sep/2015:15:27:33 +0100] "GET /robots.txt HTTP/1.1" 403 199 "-" "Mozilla/5.0 (compatible; AhrefsBot/5.0; +http://ahrefs.com/robot/)" "-" 174 0.002 "219" "0.002"
    151.80.31.138 - - [21/Sep/2015:09:48:32 +0100] "GET /robots.txt HTTP/1.1" 403 199 "-" "Mozilla/5.0 (compatible; AhrefsBot/5.0; +http://ahrefs.com/robot/)" "-" 174 0.001 "219" "0.001"
    151.80.31.134 - - [26/Sep/2015:22:39:35 +0100] "GET /robots.txt HTTP/1.1" 403 199 "-" "Mozilla/5.0 (compatible; AhrefsBot/5.0; +http://ahrefs.com/robot/)" "-" 174 0.002 "219" "0.002"
    151.80.31.141 - - [05/Oct/2015:19:56:35 +0100] "GET /robots.txt HTTP/1.1" 403 199 "-" "Mozilla/5.0 (compatible; AhrefsBot/5.0; +http://ahrefs.com/robot/)" "-" 174 0.003 "219" "0.003"
    
    Look at majestic's requests. Majestic seems to pull off a similar stunt to avoid being blocked.
    Code:
    216.107.155.114 - - [07/Aug/2015:22:08:56 +0100] "GET /robots.txt HTTP/1.0" 403 219 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)" "-" 179 0.004 "219" "0.004"
    89.163.148.58 - - [08/Aug/2015:09:13:55 +0100] "GET /robots.txt HTTP/1.0" 403 219 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)" "-" 179 0.003 "219" "0.003"
    195.154.187.115 - - [14/Aug/2015:18:12:59 +0100] "GET /robots.txt HTTP/1.0" 403 219 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)" "-" 179 0.002 "219" "0.002"
    198.27.82.152 - - [16/Aug/2015:22:57:10 +0100] "GET /robots.txt HTTP/1.0" 403 219 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)" "-" 179 0.002 "219" "0.002"
    108.59.8.80 - - [20/Aug/2015:18:31:47 +0100] "GET /robots.txt HTTP/1.0" 403 219 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)" "-" 179 0.002 "219" "0.002"
    46.165.197.142 - - [31/Aug/2015:19:35:06 +0100] "GET /robots.txt HTTP/1.0" 403 219 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)" "-" 179 0.002 "219" "0.002"
    62.212.73.211 - - [02/Sep/2015:20:52:23 +0100] "GET /robots.txt HTTP/1.0" 403 219 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)" "-" 179 0.002 "219" "0.002"
    93.63.88.184 - - [04/Sep/2015:10:02:21 +0100] "GET /robots.txt HTTP/1.0" 403 219 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)" "-" 179 0.002 "219" "0.002"
    209.126.119.146 - - [04/Sep/2015:10:15:16 +0100] "GET /robots.txt HTTP/1.0" 403 219 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)" "-" 179 0.002 "219" "0.002"
    136.243.5.87 - - [05/Sep/2015:03:17:38 +0100] "GET /robots.txt HTTP/1.0" 403 219 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)" "-" 179 0.002 "219" "0.002"
    198.27.66.185 - - [05/Sep/2015:11:25:00 +0100] "GET /robots.txt HTTP/1.0" 403 219 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)" "-" 179 0.004 "219" "0.004"
    198.27.66.185 - - [06/Sep/2015:05:13:10 +0100] "GET /robots.txt HTTP/1.0" 403 219 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)" "-" 179 0.002 "219" "0.002"
    31.204.152.243 - - [06/Sep/2015:14:00:41 +0100] "GET /robots.txt HTTP/1.0" 403 219 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)" "-" 179 0.002 "219" "0.002"
    77.248.252.113 - - [07/Sep/2015:10:23:40 +0100] "GET /robots.txt HTTP/1.0" 403 219 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)" "-" 179 0.002 "219" "0.002"
    162.210.196.97 - - [07/Sep/2015:19:04:57 +0100] "GET /robots.txt HTTP/1.0" 403 219 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)" "-" 183 0.004 "219" "0.004"
    178.202.133.84 - - [08/Sep/2015:03:27:49 +0100] "GET /robots.txt HTTP/1.0" 403 219 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)" "-" 179 0.002 "219" "0.002"
    199.58.86.209 - - [08/Sep/2015:14:26:08 +0100] "GET /robots.txt HTTP/1.0" 403 219 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)" "-" 179 0.003 "219" "0.003"
    144.76.8.132 - - [09/Sep/2015:12:44:34 +0100] "GET /robots.txt HTTP/1.0" 403 219 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)" "-" 179 0.002 "219" "0.002"
    199.58.86.206 - - [09/Sep/2015:21:58:49 +0100] "GET /robots.txt HTTP/1.0" 403 219 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)" "-" 183 0.008 "219" "0.008"
    130.180.77.74 - - [10/Sep/2015:12:40:15 +0100] "GET /robots.txt HTTP/1.0" 403 219 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)" "-" 179 0.002 "219" "0.002"
    162.210.196.130 - - [13/Sep/2015:06:09:03 +0100] "GET /robots.txt HTTP/1.0" 403 219 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)" "-" 183 0.003 "219" "0.003"
    69.30.231.66 - - [13/Sep/2015:08:49:19 +0100] "GET /robots.txt HTTP/1.0" 403 219 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)" "-" 179 0.004 "219" "0.004"
    46.4.116.197 - - [13/Sep/2015:19:09:26 +0100] "GET /robots.txt HTTP/1.0" 403 219 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)" "-" 179 0.002 "219" "0.002"
    199.58.86.209 - - [14/Sep/2015:08:54:45 +0100] "GET /robots.txt HTTP/1.0" 403 219 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)" "-" 183 0.002 "219" "0.002"
    199.58.86.209 - - [14/Sep/2015:08:54:47 +0100] "GET /robots.txt HTTP/1.0" 403 219 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)" "-" 179 0.002 "219" "0.002"
    89.163.148.58 - - [14/Sep/2015:10:01:56 +0100] "GET /robots.txt HTTP/1.0" 403 219 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)" "-" 179 0.003 "219" "0.003"
    62.210.107.201 - - [14/Sep/2015:17:56:31 +0100] "GET /robots.txt HTTP/1.0" 403 219 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)" "-" 179 0.002 "219" "0.002"
    85.178.72.55 - - [14/Sep/2015:20:19:08 +0100] "GET /robots.txt HTTP/1.0" 403 219 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)" "-" 179 0.003 "219" "0.003"
    5.9.111.70 - - [14/Sep/2015:22:01:10 +0100] "GET /robots.txt HTTP/1.0" 403 219 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)" "-" 179 0.003 "219" "0.003"
    199.217.112.248 - - [16/Sep/2015:06:44:38 +0100] "GET /robots.txt HTTP/1.0" 403 219 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)" "-" 179 0.006 "219" "0.006"
    62.210.169.2 - - [16/Sep/2015:13:45:32 +0100] "GET /robots.txt HTTP/1.0" 403 219 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)" "-" 179 0.003 "219" "0.003"
    136.243.5.87 - - [16/Sep/2015:14:18:59 +0100] "GET /robots.txt HTTP/1.0" 403 219 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)" "-" 179 0.002 "219" "0.002"
    89.163.148.58 - - [17/Sep/2015:06:34:09 +0100] "GET /robots.txt HTTP/1.0" 403 219 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)" "-" 179 0.002 "219" "0.002"
    142.4.214.124 - - [17/Sep/2015:11:42:28 +0100] "GET /robots.txt HTTP/1.0" 403 219 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)" "-" 179 0.002 "219" "0.002"
    46.4.120.3 - - [17/Sep/2015:22:29:28 +0100] "GET /robots.txt HTTP/1.0" 403 219 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)" "-" 179 0.003 "219" "0.003"
    195.154.187.115 - - [18/Sep/2015:07:52:48 +0100] "GET /robots.txt HTTP/1.0" 403 219 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)" "-" 179 0.002 "219" "0.002"
    199.58.86.206 - - [18/Sep/2015:09:35:09 +0100] "GET /robots.txt HTTP/1.0" 403 219 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)" "-" 179 0.004 "219" "0.004"
    83.149.126.98 - - [18/Sep/2015:15:02:57 +0100] "GET /robots.txt HTTP/1.0" 403 219 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)" "-" 179 0.004 "219" "0.004"
    162.210.196.130 - - [19/Sep/2015:00:16:50 +0100] "GET /robots.txt HTTP/1.0" 403 219 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)" "-" 179 0.004 "219" "0.004"
    83.142.233.105 - - [19/Sep/2015:21:46:30 +0100] "GET /robots.txt HTTP/1.0" 403 219 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)" "-" 179 0.002 "219" "0.002"
    209.126.119.20 - - [19/Sep/2015:21:56:39 +0100] "GET /robots.txt HTTP/1.0" 403 219 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)" "-" 179 0.004 "219" "0.004"
    192.99.8.112 - - [20/Sep/2015:01:08:40 +0100] "GET /robots.txt HTTP/1.0" 403 219 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)" "-" 179 0.003 "219" "0.003"
    212.83.177.193 - - [20/Sep/2015:16:47:03 +0100] "GET /robots.txt HTTP/1.0" 403 219 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)" "-" 179 0.003 "219" "0.003"
    85.25.100.194 - - [20/Sep/2015:23:50:46 +0100] "GET /robots.txt HTTP/1.0" 403 219 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)" "-" 179 0.003 "219" "0.003"
    199.58.86.209 - - [21/Sep/2015:03:49:52 +0100] "GET /robots.txt HTTP/1.0" 403 219 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)" "-" 179 0.003 "219" "0.003"
    69.30.210.242 - - [23/Sep/2015:03:16:39 +0100] "GET /robots.txt HTTP/1.0" 403 219 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)" "-" 179 0.002 "219" "0.002"
    199.58.86.206 - - [23/Sep/2015:05:58:08 +0100] "GET /robots.txt HTTP/1.0" 403 219 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)" "-" 179 0.002 "219" "0.002"
    199.58.86.209 - - [24/Sep/2015:03:12:43 +0100] "GET /robots.txt HTTP/1.0" 403 219 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)" "-" 179 0.002 "219" "0.002"
    209.126.119.20 - - [27/Sep/2015:03:41:35 +0100] "GET /robots.txt HTTP/1.0" 403 219 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)" "-" 179 0.003 "219" "0.003"
    162.210.196.129 - - [27/Sep/2015:07:14:40 +0100] "GET /robots.txt HTTP/1.0" 403 219 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)" "-" 179 0.004 "219" "0.004"
    69.30.210.242 - - [28/Sep/2015:00:24:03 +0100] "GET /robots.txt HTTP/1.0" 403 219 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)" "-" 179 0.002 "219" "0.002"
    178.202.133.84 - - [28/Sep/2015:17:57:32 +0100] "GET /robots.txt HTTP/1.0" 403 219 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)" "-" 179 0.013 "219" "0.013"
    69.30.205.218 - - [30/Sep/2015:06:31:18 +0100] "GET /robots.txt HTTP/1.0" 403 219 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)" "-" 179 0.009 "219" "0.009"
    69.30.221.250 - - [02/Oct/2015:05:35:25 +0100] "GET /robots.txt HTTP/1.0" 403 219 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)" "-" 179 0.003 "219" "0.003"
    198.245.62.10 - - [04/Oct/2015:06:32:09 +0100] "GET /robots.txt HTTP/1.0" 403 219 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)" "-" 179 0.004 "219" "0.004"
    
    No signs of rogerbot/opensiteexplorer/moz.com. They still list 200 of my links as recently crawled, lol.
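    One way to check the pattern in the excerpts above (a crawler showing its own user agent only on /robots.txt, always with a 403) is to tally path and status per bot. A minimal Python sketch, assuming the same log layout; the function name tally is made up, and the three sample lines are copied from the excerpts.

```python
import re
from collections import Counter

# Pull the request path and the status code out of a log line.
REQUEST = re.compile(r'"(?:GET|POST|HEAD) (?P<path>\S+) [^"]*" (?P<status>\d{3}) ')


def tally(lines, bot_name):
    """Count (path, status) pairs for lines mentioning bot_name, ignoring case."""
    counts = Counter()
    needle = bot_name.lower()
    for line in lines:
        if needle not in line.lower():
            continue
        match = REQUEST.search(line)
        if match:
            counts[(match["path"], match["status"])] += 1
    return counts


sample = [
    '216.107.155.114 - - [07/Aug/2015:22:08:56 +0100] "GET /robots.txt HTTP/1.0" 403 219 "-" '
    '"Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)" "-" 179 0.004 "219" "0.004"',
    '89.163.148.58 - - [08/Aug/2015:09:13:55 +0100] "GET /robots.txt HTTP/1.0" 403 219 "-" '
    '"Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)" "-" 179 0.003 "219" "0.003"',
    '188.165.15.239 - - [07/Aug/2015:14:45:36 +0100] "GET /robots.txt HTTP/1.1" 403 199 "-" '
    '"Mozilla/5.0 (compatible; AhrefsBot/5.0; +http://ahrefs.com/robot/)" "-" 174 0.002 "219" "0.002"',
]

mj12_counts = tally(sample, "mj12bot")
for (path, status), count in mj12_counts.items():
    print(path, status, count)
```

    If the tally for a bot has a single key, ("/robots.txt", "403"), that bot never requested a content page under its own name. That fits the guess about fake user agents but does not prove it.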

    Ahrefs and majestic had info on my links just a few weeks after the site was registered and built, even though they were blocked from around day 1 or week 1.

    I think blocking these link crawlers doesn't work anymore, but it's still nice to block them and many other scrapers off your site, just in case. Draw whatever conclusions you want, I'm not here to argue with anybody.

    Just to be somewhat transparent here in a limited way, ahrefs and majestic have thousands of my links, even if some of these are sitewide recent comment links.

    You can also try to look up their ips and block those ip ranges from your site, but be careful, as this may backfire on you. I had server errors when I made syntax mistakes in .htaccess.
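    Before blocking ranges it helps to see how the observed IPs cluster. A rough Python sketch using the standard ipaddress module; the /24 prefix is an arbitrary assumption for illustration, the function name group_by_network is made up, and the IPs are the first five AhrefsBot entries from the log excerpt.

```python
import ipaddress
from collections import Counter


def group_by_network(ips, prefix=24):
    """Count how many of the given IPs fall into each network of the given prefix length."""
    counts = Counter()
    for ip in ips:
        # strict=False lets us pass a host address and get its enclosing network.
        network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
        counts[str(network)] += 1
    return counts


ahrefs_ips = [
    "188.165.15.239",
    "151.80.31.130",
    "151.80.31.114",
    "188.165.15.32",
    "188.165.15.132",
]

networks = group_by_network(ahrefs_ips)
for network, count in networks.most_common():
    # "Deny from <CIDR>" matches the old-style access rules used in this post.
    print(f"Deny from {network}  # seen {count} times")
```

    A range this wide may also contain unrelated hosts, which is one way the block can backfire.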

    My current htaccess is the following for blocking badbots with user agents:
    Code:
    SetEnvIfNoCase User-Agent .*rogerbot.* bad_bot
    SetEnvIfNoCase User-Agent .*exabot.* bad_bot
    SetEnvIfNoCase User-Agent .*mj12bot.* bad_bot
    SetEnvIfNoCase User-Agent .*dotbot.* bad_bot
    SetEnvIfNoCase User-Agent .*gigabot.* bad_bot
    SetEnvIfNoCase User-Agent .*ahrefsbot.* bad_bot
    SetEnvIfNoCase User-Agent .*sitebot.* bad_bot
    SetEnvIfNoCase User-Agent .*semrushbot.* bad_bot
    SetEnvIfNoCase User-Agent .*ia_archiver.* bad_bot
    SetEnvIfNoCase User-Agent .*searchmetricsbot.* bad_bot
    SetEnvIfNoCase User-Agent .*seokicks-robot.* bad_bot
    SetEnvIfNoCase User-Agent .*sistrix.* bad_bot
    SetEnvIfNoCase User-Agent ".*lipperhey spider.*" bad_bot
    SetEnvIfNoCase User-Agent .*ncbot.* bad_bot
    SetEnvIfNoCase User-Agent .*backlinkcrawler.* bad_bot
    SetEnvIfNoCase User-Agent .*archive.org_bot.* bad_bot
    SetEnvIfNoCase User-Agent .*meanpathbot.* bad_bot
    SetEnvIfNoCase User-Agent .*pagesinventory.* bad_bot
    SetEnvIfNoCase User-Agent .*aboundexbot.* bad_bot
    SetEnvIfNoCase User-Agent .*spbot.* bad_bot
    SetEnvIfNoCase User-Agent .*linkdexbot.* bad_bot
    SetEnvIfNoCase User-Agent .*nutch.* bad_bot
    SetEnvIfNoCase User-Agent .*blexbot.* bad_bot
    SetEnvIfNoCase User-Agent .*ezooms.* bad_bot
    SetEnvIfNoCase User-Agent .*scoutjet.* bad_bot
    SetEnvIfNoCase User-Agent .*majestic-12.* bad_bot
    SetEnvIfNoCase User-Agent .*majestic-seo.* bad_bot
    SetEnvIfNoCase User-Agent .*dsearch.* bad_bot
    SetEnvIfNoCase User-Agent .*blekkobo.* bad_bot
    SetEnvIfNoCase User-Agent ".*screaming frog seo spider.*" bad_bot
    SetEnvIfNoCase User-Agent .*PHPCrawl.* bad_bot
    SetEnvIfNoCase User-Agent .*gocrawl.* bad_bot
    SetEnvIfNoCase User-Agent .*DigExt.* bad_bot
    SetEnvIfNoCase User-Agent .*DomainSONOCrawler.* bad_bot
    SetEnvIfNoCase User-Agent .*TweetmemeBot.* bad_bot
    SetEnvIfNoCase User-Agent .*OpenHoseBot/2.1.* bad_bot
    SetEnvIfNoCase User-Agent .*Kraken/0.1.* bad_bot
    SetEnvIfNoCase User-Agent .*-Java-.* bad_bot
    SetEnvIfNoCase User-Agent .*ubermetrics.* bad_bot
    SetEnvIfNoCase User-Agent .*best-seo.* bad_bot
    SetEnvIfNoCase User-Agent .*Synapse.* bad_bot
    # SetEnvIfNoCase ignores case and matches anywhere, so one pattern also covers Harvester
    SetEnvIfNoCase User-Agent .*harvest.* bad_bot
    
    <Limit GET POST HEAD>
    Order Allow,Deny
    Allow from all
    Deny from env=bad_bot
    </Limit>
    
    I'm not recommending you paste these into your htaccess file and I'm not offering any pm/forum support. Use google for bhw's sake.
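    Since a typo in .htaccess can take the site down, the patterns can be tried offline first. A small Python sketch, assuming Python's re behaves like Apache's regex engine for patterns this simple (an assumption, not a guarantee). PATTERNS is a short excerpt of the list above and is_bad_bot is a made-up helper name.

```python
import re

# Excerpt of the block list above, without the redundant ".*" wrappers:
# an unanchored search already matches anywhere in the string.
PATTERNS = [
    "mj12bot",
    "ahrefsbot",
    "rogerbot",
    "screaming frog seo spider",
    "harvest",
]


def is_bad_bot(user_agent, patterns=PATTERNS):
    """Return True if any pattern matches the user agent, ignoring case."""
    return any(re.search(p, user_agent, re.IGNORECASE) for p in patterns)


checks = {
    "Mozilla/5.0 (compatible; AhrefsBot/5.0; +http://ahrefs.com/robot/)": True,
    "Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)": True,
    "Harvester/1.0": True,  # hypothetical UA, covered by the single "harvest" pattern
    "Mozilla/5.0 (X11; Linux x86_64; rv:41.0) Gecko/20100101 Firefox/41.0": False,
}

for agent, expected in checks.items():
    print(is_bad_bot(agent) == expected, agent)
```

    The last entry shows the limit of this approach: a crawler that sends a browser user agent passes the check.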
    
    
    The rewrite rule below should protect your site from some comment spammers posting directly to /wp-comments-post.php without a referrer. It doesn't block the fast poster in scrapebox; I've tried. It seems scrapebox passes some sort of referrer, which only they could confirm anyway. You could always try to block scrapebox's default user agents to prevent every Bob and Abhay with a copy of scrapebox from spamming you.

    If you get firefox's web dev tools, block referrers under the disable tab, and make a comment on your site, you'll be redirected to meatspin.com. I've tried that a couple of times for testing. The video doesn't play without flash, so you can try it out. Swap yourdomain.com for your own domain and meatspin.com for whatever you want.
    Code:
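    RewriteEngine On
    # Only act on POSTs to the WordPress comment handler
    RewriteCond %{REQUEST_METHOD} POST
    RewriteCond %{REQUEST_URI} /wp-comments-post\.php$
    # Redirect when the referrer is not this site, or the user agent is empty
    RewriteCond %{HTTP_REFERER} !yourdomain\.com [OR]
    RewriteCond %{HTTP_USER_AGENT} ^$
    RewriteRule (.*) http://meatspin.com/ [R=301,L]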
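    The way these conditions combine is easy to misread: [OR] only joins the condition it sits on with the next one, so the rule reads as "POST and comment handler and (foreign referrer or empty user agent)". A Python sketch of that intended logic; should_redirect is a made-up name, and the substring test on the referrer only approximates the regex.

```python
import re


def should_redirect(method, uri, referer, user_agent, own_domain="yourdomain.com"):
    """Mirror the rule: A and B and (C or D)."""
    is_post = method == "POST"                                        # condition A
    is_comment_handler = re.search(r"/wp-comments-post\.php$", uri) is not None  # B
    foreign_referer = own_domain not in referer                       # condition C
    empty_agent = user_agent == ""                                    # condition D
    return is_post and is_comment_handler and (foreign_referer or empty_agent)


cases = [
    # (method, uri, referer, user agent) -> expected
    (("POST", "/wp-comments-post.php", "", "Mozilla/5.0"), True),
    (("POST", "/wp-comments-post.php", "http://yourdomain.com/post/", "Mozilla/5.0"), False),
    (("POST", "/wp-comments-post.php", "http://yourdomain.com/post/", ""), True),
    (("GET", "/wp-comments-post.php", "", ""), False),
    (("POST", "/index.php", "", ""), False),
]

for args, expected in cases:
    print(should_redirect(*args) == expected, args)
```

    The second case is why a spam tool that sends a referrer containing your domain gets through.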
    
     
  2. sashilover

    sashilover Junior Member

    Joined:
    Feb 4, 2015
    Messages:
    105
    Likes Received:
    12
    Great information, even if I don't 100% understand it.
    Thank you for sharing this.
     
  3. cottonwolf

    I don't think they need to visit your site anyway to store your backlinks anymore. I mean, why would they do that? Yes, they follow links, but they can also store a link on somebody else's site pointing to yours without visiting your site and getting blocked.
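    To illustrate the point: a crawler that parses someone else's page learns about a link to your site without ever requesting your site. A minimal Python sketch with the standard html.parser module; the class name, the hostnames and the HTML snippet are all invented for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class OutboundLinkParser(HTMLParser):
    """Collect links that point away from the page being parsed."""

    def __init__(self, page_host):
        super().__init__()
        self.page_host = page_host
        self.outbound = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        host = urlparse(href).hostname  # None for relative links
        if host and host != self.page_host:
            self.outbound.append((host, href))


# Hypothetical page on a third-party blog that links to a money site.
html = (
    '<p><a href="/about">About</a> '
    '<a href="http://money-site.example/review">great review</a></p>'
)

parser = OutboundLinkParser("blog.example")
parser.feed(html)
print(parser.outbound)
```

    Nothing in this code touches money-site.example, so no rule in that site's .htaccess ever gets a chance to run.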

     
  4. samrox

    samrox Registered Member

    Joined:
    Jun 20, 2015
    Messages:
    73
    Likes Received:
    7
    Gender:
    Male
    Blocking all of those backlink checking tools is a great idea. I have had the same experience: we work hard to get quality links to our site, and after a while competitors track our new links and try to get them too. That's the reason you need to block all of them :D
     
  5. LuRk™

    LuRk™ Registered Member

    Joined:
    Jan 4, 2013
    Messages:
    59
    Likes Received:
    3
    Are you saying the .htaccess file is not blocking the links to your site or it is not blocking the external links on your site?

    The .htaccess trick is not meant to block your backlinks. It is meant to block the crawlers from finding your external links. This means it is for your PBN not your money site.
     
  6. rivered

    rivered Jr. VIP Jr. VIP

    Joined:
    Mar 6, 2015
    Messages:
    130
    Likes Received:
    35
    As mentioned in the post above, it is meant to block the bots from crawling sites that have links pointing to your money site.

    This is obviously important for PBNs, especially PBNs that point to multiple money sites in different niches. This keeps competitors from copying your niches or reporting your links to Google.

    That being said, I never had success with the .htaccess trick. My links were still being crawled somehow and showing up.

    The only thing I've had success with is using either the Link Privacy or Spyder Spanker WP plugins.
     
    Last edited: Oct 9, 2015
  7. cottonwolf

    I think they just crawl your site with fake user agents after they first discover that they got a 403 from a domain's robots.txt, so the block is not so effective anymore.

    I assume the result is they find your backlinks and can crawl your site for your external links. Even wget can fake its user agent, so something like ahrefs, majestic and moz should be able to do this as well.

    I assume they are faking their user agents, because otherwise I'd see them requesting sub-urls under their real user agents if they were crawling mydomain.com/some-sub-url for my site's external links.
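    For reference, this is how little it takes for any client to send a different user agent. A Python sketch with the standard urllib module that builds a request without sending it; the URL and the browser string are placeholders, and nothing here shows what ahrefs, majestic or moz actually do.

```python
from urllib.request import Request

# Placeholder browser-like user agent string, chosen only for illustration.
BROWSER_UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:41.0) Gecko/20100101 Firefox/41.0"


def build_request(url, user_agent=BROWSER_UA):
    """Build (but do not send) a request carrying an arbitrary User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


req = build_request("http://example.com/some-sub-url")
# urllib stores header names capitalized as "User-agent".
print(req.get_header("User-agent"))
```

    A request like this contains none of the bot names in the deny list, so a user agent rule has nothing to match.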

     
  8. judaculla

    judaculla Jr. VIP Jr. VIP

    Joined:
    Oct 11, 2014
    Messages:
    337
    Likes Received:
    124
    Location:
    USA
    Let me pose an argument on premise:

    Crawlers like ahrefs, moz, semrush, etc. are useful in that they catalog external links to your site. You can do nothing to influence whether they find links to your site. The only influence you have is over what they see once they are on your site, which won't make any difference if your entire goal is to protect links to your site.

    Your mistake, if I may be so bold, is your use of this type of code on your money site. You should be using it on your PBN sites which link to your money site. That way, crawlers like ahrefs aren't given permission to crawl your PBN sites, which will (still a big maybe) omit them from the backlink profile for your money site.

    I tend to agree with your impression that services such as ahrefs, and probably gxxgle as well, all have secondary measures to use for sites which give 4XX responses when crawling.

    The one advantage I'd see in adding this type of blocking to your money site would be in blocking bots such as the wayback machine, or similar bots that catalog the presence of your site. Of course, this would probably be more practical for PBN sites as well, so you could effectively sell sites after you've burned through their respectability.
     
  9. hentaiixxx

    hentaiixxx Registered Member

    Joined:
    Jan 2, 2014
    Messages:
    92
    Likes Received:
    9
    There should be a thread or service to prevent spammer bots and upgrade your site's security to prevent people from copying your site or hacking it.
    I'd definitely buy the service!