Good day to you all, the gloom and doom in this section ("Hit by penguin 2.1", "Hummingbird killing my sites!", "Hitchcock was right. The bloody birds are fucking up my shit!", etc.) is rather depressing. So why not do something fun (and possibly very unproductive) instead? Looking at various search algorithms, their implementations, and their relevancy factors sounds like a fun pastime (and, once again, possibly a very unproductive one). But hell, for all I know we might learn something here, and that can't be a bad thing, can it?

In very simple terms, what I'll be doing here is building various search algorithms (starting from the very naive) and then comparing my results to what our friends at Google and Bing deem to be state of the art. More engines (I'm particularly interested in Yandex) might be added later.

Disclaimer for the very obtuse: this is mostly all fun and giggles. Pretending that I (or anyone) could somehow replicate Google's or Bing's vast search architecture would be pretty ludicrous. However, it's not all as useless as it sounds right now. Building a toy search engine from scratch can be a very enlightening process, as most of the techniques involved are actually alive and well. Now that this is out of the way, we can start!

1. The Problem

The problem is all too well known: what is relevancy (with regard to search), and how does one extract relevant data from a common dataset? Various solutions and implementations, all successful to some degree, exist and are commonly referred to and used.

One topic we should touch upon immediately is the representation of data. In order to assess relevancy in a comfortable manner, both the document (here a website or a webpage) and the larger dataset (Google's crawl index, for example) need to be represented by one or more (preferably mathematical) models. Why numbers, if we are mostly working with text? Well, in simple terms, numbers are easier to work with.
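To make "numbers are easier to work with" concrete, here's a minimal sketch of one such model, the classic bag-of-words representation. Everything here (the function names, the sample documents) is my own illustration, not anything from an actual engine; the point is just that once documents are term-count vectors, a fuzzy question like "how similar are these two pages?" collapses into plain arithmetic.

```python
from collections import Counter
import math

def to_vector(text):
    # Toy bag-of-words model: map each lowercased term to its count.
    return Counter(text.lower().split())

def cosine(a, b):
    # With numeric vectors, document similarity is just arithmetic:
    # the cosine of the angle between the two term-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    norm = lambda v: math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm(a) * norm(b))

d1 = to_vector("penguin update hit my site")
d2 = to_vector("the penguin update hit many sites")
print(cosine(d1, d1))  # identical documents score 1.0
print(cosine(d1, d2))  # overlapping documents land somewhere in (0, 1)
```

Real models pile refinements on top of this (stemming so "site"/"sites" match, term weighting, and so on), but the underlying move, text in, numbers out, is the same.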
Thus the problem can be restated as: what is relevancy (this stays the same for now), what is the optimal collection of models for data representation, and which tools are most applicable to these models? Of course, this is mostly a huge generalization. If it were all about models, replicating Google's algorithms would be a feasible (albeit somewhat mundane) procedure. Sadly, when it comes to search engines, many other factors are involved (which we will hopefully touch upon in the near future). It all boils down to numbers in the end, though. Numbers which, realistically, are out of our reach.

Anyhow, back to the problem at hand. We can further simplify it by remembering that we are building a practical implementation here. Namely, we aren't interested in the general problem of search relevancy; rather, we are interested in the problem of search relevancy as applied by Google or Bing. So we can once again restate the problem as: what is relevancy, and what combination of model collections and tools is used by the search giants?

2. The Approach

I personally prefer a practical approach here; in essence, I'll be building various little search engines. These will be supported by various external tools, like scrapers (to compare results), crawlers (to get our own little index together), stat tools, etc. All code will be publicly available, and I'll most probably set up a repo later on. I do not claim any copyright whatsoever for any code written in the process. Do whatever you want with it: use it all, or just pick some parts for your own projects.

Besides the practical implementation, background knowledge of various algorithms and models will be required. I will use and discuss the algorithms I'm personally most familiar with. My knowledge is of course limited, so if you have suggestions, corrections, etc., feel free to post them here. Or just drop me a PM. All code will be modular in design, as we will most probably need to refactor a ton of it in the process.
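Just to show how small the "very naive" end of the spectrum is, here's a rough sketch of what the first of those little engines could look like: an inverted index mapping each term to the documents containing it, with raw summed term counts standing in for a relevancy score. All names and sample documents here are hypothetical placeholders of mine, not the actual implementation to come.

```python
from collections import defaultdict

class ToySearchEngine:
    def __init__(self):
        # Inverted index: term -> {doc_id: count of that term in the doc}.
        self.index = defaultdict(dict)

    def add(self, doc_id, text):
        for term in text.lower().split():
            self.index[term][doc_id] = self.index[term].get(doc_id, 0) + 1

    def search(self, query):
        # Naive "relevancy": sum the raw counts of matched query terms.
        scores = defaultdict(int)
        for term in query.lower().split():
            for doc_id, count in self.index.get(term, {}).items():
                scores[doc_id] += count
        return sorted(scores, key=scores.get, reverse=True)

engine = ToySearchEngine()
engine.add("a", "penguin update recovery guide")
engine.add("b", "hummingbird and penguin penguin updates")
print(engine.search("penguin"))  # ['b', 'a'] - "b" mentions it twice
```

Twenty-odd lines, and it already exposes the interesting questions: "updates" doesn't match "update" (stemming), longer documents rack up higher counts for free (normalization), and every term counts the same (weighting). Those are exactly the kinds of factors we'll dig into.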
The tools mentioned above should be as stand-alone as possible. Again, feel free to reuse them if you like. Language-wise, I feel like Lisp (and by Lisp I mean CL: http://en.wikipedia.org/wiki/Common_Lisp) is a good contender here. I'll most probably use Python for some side tasks as well. This could obviously be implemented in the Algol family of languages too, or for that matter in any Turing-complete language. However, Lisp lends itself particularly well to exploratory problems which don't present an obvious, clear-cut solution. Plus, I'll be able to avoid a bunch of boilerplate code, as I'm comfortable in the language.

Anyhow, that's the introduction. We'll start exploring some of the algorithms/factors in my next post. Cheers, everybody.