
Backlink search engine

Discussion in 'Black Hat SEO Tools' started by Subsonic, Apr 26, 2013.

    Hi all,

    I just saw this thread talking about a dedicated search engine for quality backlinking opportunities. I was going to reply in that thread, but my post became so lengthy that I decided to create a new one. Just to let you know, this is by no means my idea. All respect to cooooookies for coming up with it.

    So the basic idea was to have a search engine which could be used to search for blog and forum posts that allow commenting, have high enough PR, have relevant content, don't have too many spam links, etc.

    To make that kind of search engine worth implementing, I think it could index/analyze some additional SEO-related information about the websites. At least something like language, activity score (how actively is the site updated?), link amount (is the site already spammed to death by low-quality links or is it still fresh and clean?), follow/nofollow, etc. Creating something like that is certainly possible, but it requires knowledge, server(s) and money. I might actually be interested in taking this idea further. I have experience in developing and hosting web crawlers, indexers and search engines on my own servers (nothing SEO-related, but search engines are search engines, so it's relevant experience).
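
    Just to make that concrete, here's a rough sketch (Python, purely illustrative, every field name here is my own assumption, not a final schema) of the kind of per-page record the index could store:

        # Rough sketch of one indexed record; all field names are assumptions.
        page_record = {
            "url": "http://example.com/some-post/",
            "language": "en",
            "pagerank": 3,                 # would need a separate PR lookup
            "outbound_links": 120,         # total links found on the page
            "nofollow_ratio": 0.85,        # share of links marked rel="nofollow"
            "comments_open": True,         # does the page accept new comments?
            "last_updated": "2013-04-20",  # feeds the activity score
        }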

    I actually have a pretty powerful dedicated server (1Gbps uplink, 30TB bandwidth) sitting idle as we speak. It's a good start. But how would I actually implement a full blown search engine service like that? Read on!


    How can I crawl the websites?
    We're talking about lots and lots of requests running non-stop, so interfacing with Google or any other third-party service is not possible. We have to use our own crawler. Apache Nutch is a perfect open-source web crawler that can be used directly for this. It'll do the crawling for me (respecting robots.txt; there's no point getting blacklisted for crawling stuff that shouldn't be crawled).
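
    Nutch does all of that out of the box, but just to illustrate the robots.txt part, a bare-bones polite fetcher could look something like this in Python (a sketch of the idea only, not what Nutch actually runs):

        import urllib.request
        import urllib.robotparser
        from urllib.parse import urlparse

        def polite_fetch(url, user_agent="BacklinkBot"):
            """Fetch a page only if its robots.txt allows it (simplified sketch)."""
            parts = urlparse(url)
            rp = urllib.robotparser.RobotFileParser()
            rp.set_url(parts.scheme + "://" + parts.netloc + "/robots.txt")
            rp.read()
            if not rp.can_fetch(user_agent, url):
                return None  # disallowed -> skip it instead of risking a blacklist
            req = urllib.request.Request(url, headers={"User-Agent": user_agent})
            with urllib.request.urlopen(req, timeout=10) as resp:
                return resp.read()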


    How can I index the websites crawled by my spider?
    Okay, now that I have something crawling the web non-stop, I need to build a database from the crawl results so there's something to search from. For this I would use ElasticSearch, an open-source search engine with really good indexing and querying features. I just feed all the data from Nutch directly into ElasticSearch and it takes care of the indexing. Building something like this from the ground up would be impossible, so why reinvent the wheel?
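
    For the curious: pushing a crawled page into ElasticSearch is just an HTTP call with a JSON body. A minimal Python sketch, assuming a local node and an index I'm calling "pages" here:

        import json
        import urllib.request

        def index_page(doc, doc_id, host="http://localhost:9200"):
            """Index one crawled page into ElasticSearch over its REST API (sketch)."""
            req = urllib.request.Request(
                host + "/pages/_doc/" + str(doc_id),
                data=json.dumps(doc).encode("utf-8"),
                headers={"Content-Type": "application/json"},
                method="PUT",
            )
            with urllib.request.urlopen(req) as resp:
                return json.load(resp)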


    Okay, but how can I effectively implement the actual search engine?
    The funny thing is that by using Nutch and ElasticSearch we're actually almost done (well, not technically, but at the idea level). ElasticSearch also provides tools for running searches (querying the indexed data) and it supports very complex queries, so the search engine itself is just a web application that routes the search queries to ElasticSearch and shows the results in a sortable, paginated list. If you have used search engines before you know what I mean, hehe.
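
    As an example, a search like "dog training pages with decent PR and mostly dofollow, open comments" would roughly translate to a query DSL body like this behind the scenes (the field names are the same assumed ones as in the record sketch above):

        search_body = {
            "query": {
                "bool": {
                    "must": [{"match": {"content": "dog training"}}],
                    "filter": [
                        {"range": {"pagerank": {"gte": 3}}},
                        {"range": {"nofollow_ratio": {"lte": 0.5}}},
                        {"term": {"comments_open": True}},
                    ],
                }
            },
            "sort": [{"pagerank": "desc"}],
            "from": 0,
            "size": 20,  # one page of the paginated result list
        }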


    One last thing: how do I know which websites to crawl?
    This question is related to the first one: if we have a crawler, it has to know where to crawl! I would start with blog ping services. They provide a list of freshly updated blogs, so I could easily grab the updated list periodically and crawl those URLs. At first there's no point crawling all of them because it would require shitloads of power; for example, http://weblogs.com/ just listed almost 3 MILLION updated blogs from the last hour. For the prototype I would take a subset of that data and send the URLs to the crawler. This way the index would slowly but surely get bigger and more useful.
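
    To show what the seed feed part could look like, here's a rough Python sketch of grabbing the latest ping list and taking a small slice of it (the exact feed URL is from memory, so treat it as an assumption):

        import urllib.request
        import xml.etree.ElementTree as ET

        def fetch_updated_blogs(limit=1000):
            """Grab a slice of recently pinged blog URLs from a ping service (sketch)."""
            feed_url = "http://rpc.weblogs.com/shortChanges.xml"  # assumed endpoint
            with urllib.request.urlopen(feed_url, timeout=30) as resp:
                root = ET.fromstring(resp.read())
            urls = [w.get("url") for w in root.findall("weblog")]
            return urls[:limit]  # only a subset goes to the crawler at first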


    The whole idea and implementation might sound really simple now, but I'm aware that there are quite a few problems still left to be considered. For example, the crawler in its basic form wouldn't check the PR of the pages, follow/nofollow and other stuff, so that's something that needs custom programming work. Not a problem really, but it means more development time before even a prototype is out.
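
    The follow/nofollow part at least is easy custom work. A quick sketch of counting dofollow vs. nofollow links from raw HTML with Python's standard HTML parser (PR checking would still need a separate lookup, so it's not covered here):

        from html.parser import HTMLParser

        class LinkCounter(HTMLParser):
            """Count dofollow vs. nofollow links in raw HTML (rough sketch)."""
            def __init__(self):
                super().__init__()
                self.follow = 0
                self.nofollow = 0

            def handle_starttag(self, tag, attrs):
                if tag != "a":
                    return
                attrs = dict(attrs)
                if "href" not in attrs:
                    return
                if "nofollow" in (attrs.get("rel") or ""):
                    self.nofollow += 1
                else:
                    self.follow += 1

        counter = LinkCounter()
        counter.feed("<a href='/a' rel='nofollow'>x</a> <a href='/b'>y</a>")
        print(counter.follow, counter.nofollow)  # -> 1 1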

    That's it for now. I would love to hear your feedback/ideas on this. Would there be a market for an SEO-related search engine, and if so, what else should the engine do? If you are not interested in sharing your ideas publicly, feel free to PM me. I'm not saying that I will even start to make plans for this kind of service, but right now I'm strongly considering it. Oh, and feel free to steal my tips for setting up a search engine if you feel like doing this yourself. Just remember to give credit to cooooookies.
     
    Last edited: Apr 26, 2013