
How to stop backlink checkers and competitor research tools, and hence your competitors, from sniffing your blog

Discussion in 'Black Hat SEO' started by neteater, Jan 27, 2017.

  1. neteater

    neteater (Jr. VIP)

    Joined:
    Feb 14, 2009
    Messages:
    566
    Likes Received:
    335
    Location:
    somewhere between CPU and heat sink
    As we all know, when we fire up Ahrefs, Majestic SEO, and the like, we can easily find the backlinks pointing to our competitors' sites, which means we can spot their private blog networks. So how do we protect our own private blog networks from being detected by our competitors? There are a couple of common methods:

    + Using a .htaccess file to block requests from certain user agents and IP addresses
    + Using robots.txt and hoping the crawlers respect it.
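
    For reference, a minimal sketch of both files might look like the following. The bot names are common backlink-checker user agents, and the IP range in the .htaccess part is only a placeholder, so adjust both to whatever actually shows up in your logs.

    robots.txt:

    User-agent: AhrefsBot
    Disallow: /

    User-agent: MJ12bot
    Disallow: /

    User-agent: SemrushBot
    Disallow: /

    .htaccess (Apache with mod_rewrite enabled):

    RewriteEngine On
    # block common backlink-checker user agents (example names)
    RewriteCond %{HTTP_USER_AGENT} (AhrefsBot|MJ12bot|SemrushBot) [NC]
    RewriteRule .* - [F,L]
    # block a crawler IP range you have identified (placeholder range)
    Deny from 192.0.2.0/24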

    There are several problems with those two approaches.

    The first one, .htaccess: the bots can easily change their user agents. IP addresses are harder to change, but not impossible. These backlink-checking services can fire up a bunch of servers with different IP addresses in a very short time, so constantly updating your IP database is a losing battle.

    What about robots.txt? Who knows whether the bots will actually respect the file. These services are called backlink checkers and competitor research services for a reason: if we couldn't find our competitors' backlinks with them, what would be the point of using such services?

    Here comes our savior: reverse and forward DNS lookup. This guide is only concerned with Google. Google doesn't maintain a public list of its crawler IP addresses (the reason is obvious: Google has hundreds of thousands of servers and the IP ranges can change, so it is impossible for you to maintain a database of Google IP addresses). Luckily, we can verify whether an IP address really belongs to Google using the host command in Linux.

    1. Obtain the visitor's IP address, let's say a.b.c.d
    2. Do a reverse DNS lookup using the host command: host a.b.c.d
    3. The result of this command will be a string containing a host name. If that host name ends in googlebot.com or google.com, we can proceed to the next step.
    4. Do a forward DNS lookup of the host name you obtained: host hostname. If the result is the same IP address you started with, you can safely say the visitor is Google.
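
    For example, here is roughly what the two lookups look like for a hypothetical Googlebot visit (the IP address and host name below are only illustrative, use whatever shows up in your access logs):

    $ host 66.249.66.1
    1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.

    $ host crawl-66-249-66-1.googlebot.com
    crawl-66-249-66-1.googlebot.com has address 66.249.66.1

    The reverse lookup returns a host name ending in googlebot.com, and the forward lookup of that host name maps back to the original IP, so this visitor really is Google. A scraper that merely spoofs the Googlebot user agent will fail one of the two checks.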

    Now, how can we apply this to our own sites? Luckily, PHP gives us the equivalent of the host command through gethostbyaddr() and gethostbyname(). Here is the code:

    <?php
    // reverse DNS lookup: IP -> host name
    $ip = $_SERVER['REMOTE_ADDR'];
    $hostname = gethostbyaddr($ip);

    // the host name must end in googlebot.com or google.com
    if (preg_match('/\.(googlebot|google)\.com$/i', $hostname)) {
        // forward DNS lookup: the host name must resolve back to the same IP
        if (gethostbyname($hostname) === $ip) {
            echo "Matched"; // verified Googlebot
        }
    }
    ?>


    How can we integrate this with an existing web framework? Let's say we have a WordPress site. There are two choices: the first is to modify the index.php file and add that code so unwanted requests get rejected before WordPress loads; the second is to set up a reverse proxy in front of the site and put the check there. A rough sketch of the first option is below.
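
    Here is a minimal sketch of the first option. It is written as a standalone file you could save as an mu-plugin (the file name and placement are up to you); if you paste it into index.php instead, drop the opening <?php tag. The rule it enforces is just one possibility: any request whose user agent claims to be Googlebot has to pass the reverse/forward DNS check, otherwise it gets a 403.

    <?php
    // reject fake "Googlebot" requests before the page is served

    function is_verified_googlebot($ip) {
        // reverse lookup: IP -> host name
        $hostname = gethostbyaddr($ip);
        if (!preg_match('/\.(googlebot|google)\.com$/i', $hostname)) {
            return false;
        }
        // forward lookup: the host name must resolve back to the same IP
        return gethostbyname($hostname) === $ip;
    }

    $ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
    $ip = $_SERVER['REMOTE_ADDR'];

    if (stripos($ua, 'googlebot') !== false && !is_verified_googlebot($ip)) {
        header('HTTP/1.1 403 Forbidden');
        exit;
    }
    // normal WordPress code continues after this point

    The same reverse/forward check can be reused on a reverse proxy if you prefer the second option, so the bot traffic never reaches WordPress at all.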

    We can definitely combine all three approaches above to harden the server against being crawled by these bots:

    1. Set up the robots.txt file
    2. Set up the .htaccess file
    3. If a request still gets through the previous two steps, it faces the final barrier: the DNS verification code.

    You could rely on the DNS lookup alone, but layering all three steps means fewer requests ever reach the PHP check, which is better for your site's performance.