The hitchhiker's guide to relevancy

Discussion in 'Black Hat SEO' started by Gophering, Oct 9, 2013.

  1. Gophering

    Gophering Junior Member

    Mar 21, 2013
    Good day to you all,

    the gloom and doom in this section ("Hit by penguin 2.1", "Hummingbird killing my sites!", "Hitchcock was right. The bloody birds are fucking up my shit!", etc) is rather depressing. So why not do something fun (and possibly very unproductive) instead? Looking at various search algorithms, their implementations and relevancy factors sounds like a fun pastime (and possibly, once again, a very unproductive pastime). But hell, for all I know, we might learn something here and that can't be a bad thing, can it?

    So what I'll be doing here in very simple terms is basically building various search algorithms (starting from the very naive) and then comparing my results to what our friends at Google and Bing deem to be state of the art. More engines (I'm very interested in Yandex in particular) might be added later.

    Disclaimer for the very obtuse: This is mostly all fun and giggles. Pretending that I (or anyone) could somehow replicate Google's or Bing's vast search architecture would be pretty ludicrous. However, it's not all as useless as it sounds right now. Building a toy search engine from scratch can be a very enlightening process as most of the techniques used are actually alive and well.

    Now that this is out of the way, we can start out!

    1. The Problem

    The problem is all too well known: what is relevancy (with regard to search) and how does one extract relevant data from a common dataset? Various solutions and implementations, all successful to some degree, exist and are commonly referred to/used. Now one topic we should touch upon immediately is the representation of data. In order to assess relevancy in a comfortable manner, both the document (here a website or a webpage) as well as the larger dataset (Google's crawl index, for example) need to be represented by one or more (preferably mathematical) models. Why numbers, if we are mostly working with text? Well, in simple terms, numbers are easier to work with.
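    To make the "text as numbers" idea concrete, here's about the most naive representation there is, sketched in Python (which I'll be using for side tasks anyway): a document becomes a vector of term counts over a fixed vocabulary. The documents and vocabulary here are made up purely for illustration:

```python
from collections import Counter

def bow(text, vocab):
    """Represent a document as a vector of term counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocab]  # Counter returns 0 for absent terms

vocab = ["cheap", "flights", "hotels", "paris"]
print(bow("Cheap flights to Paris cheap hotels", vocab))  # [2, 1, 1, 1]
```

    Once every document is a vector like this, "how relevant is doc X to query Q" turns into plain arithmetic on vectors, which is exactly why the models below all start from some variation of this trick.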

    Thus the problem can be restated as: What is relevancy (this stays the same for now), what is the optimal collection of models for data representation, which tools are most applicable to these models?

    Of course this is mostly a huge generalization. If it was all about models, replicating Google's algorithms would be a feasible (albeit somewhat mundane) procedure. Sadly, when it comes to search engines, many other factors are involved (which we will hopefully touch upon in the near future). However, it all boils down to numbers in the end. Numbers which realistically are out of our reach.

    Anyhow, back to the problem at hand. We can further simplify the problem by remembering that we are building a practical implementation here. Namely, we aren't interested in the general problem of search relevancy; rather, we are interested in the problem of search relevancy as applied by Google or Bing. So we can once again restate the problem as:

    What is relevancy, what combination of model collections and tools is used by the search giants?

    2. The Approach

    I personally prefer a practical approach here; in essence, I'll be building various little search engines. These will be supported by various external tools, like scrapers (to compare results), crawlers (to get our own little index together), stat tools, etc. All code will be publicly available and I'll most probably set up a repo later on. I do not claim any copyright whatsoever for any code written in the process. Do whatever you want with it. Use it all or just pick some parts for your projects.

    Besides the practical implementation, background knowledge of various algorithms and models will be required. I will personally use and discuss the algorithms I'm most familiar with myself. My knowledge is of course limited, so if you have suggestions, corrections, etc feel free to post them here. Or just drop me a PM.

    Now, all code will be modular in design as we will most probably need to refactor a ton of it in the process. The tools mentioned above should be as stand-alone as possible. Again, feel free to reuse them if you like.

    Language wise, I feel like lisp (and by lisp I mean CL) is a good contender here. I'll most probably use Python for some side tasks as well. This could obviously be implemented in the algol family of languages as well, or for that matter in any Turing complete language. However, lisp lends itself particularly well to exploratory problems which don't present an obvious/clear solution. Plus, I'll be able to avoid a bunch of boilerplate code as I'm comfortable in the language.

    Anyhow, that's the introduction. We'll start exploring some of the algorithms/factors in my next post.

    Cheers everybody.
    • Thanks x 2
  2. Hinkys

    Hinkys Jr. VIP Jr. VIP

    Mar 3, 2012
    Great idea, sounds like a fun thing to do and not as unproductive as one might think, as playing with your own search engine should give you valuable insight into how Google works. If you plan a career in SEO, this is a great idea!

    Implementation, on the other hand, sounds like a huge mountain of work, but if you're willing to put in the effort then by all means go right ahead.

    Maybe you could try building one of them to find out the best way to implement social signals as a ranking factor. This knowledge would no doubt help you build more effective SEO campaigns in the future when Google decides to pay even more attention to social signals.

    Anyway, good luck, I'll follow this thread on a regular basis. ;)
  3. 0scarmik3

    0scarmik3 Newbie

    Aug 13, 2013
    U mad bro! But we're with you on this! Just make sure the variables are well defined and closely monitored, so we get accurate results to compare across each improved version. :)
  4. Gophering

    Gophering Junior Member

    Mar 21, 2013
    3. Search Models & Factors

    Complexity has a strong tendency to increase as time goes by. Search algorithms are no exception to this. Seomoz counts over 80 ranking factors for Google (as evident here). There are most probably more. Bing and friends aren't far behind either. This, from my point of view, is only natural. Considering the sheer mass of the current web and the velocity at which new forms of content tend to appear, finding relevant content turns into a very complex procedure.

    Getting back to the problem at hand, further simplification is needed. Let's follow that seomoz chart and try to break down all these search factors into relevant categories. I have come up with the following list (feel free to post your corrections/suggestions):

    • Textual relevance (mainly how relevant is the search query to the textual content of the document)
    • Technical factors (how "well" is the site put together, is the content broken or is it functioning well, how fast does the site load, etc)
    • Backlink structure (internal as well as external. This can include regular backlinks as well as mentions and social signals)
    • Behavioral/UX factors (is the site liked by the user? Does it meet their expectations?)
    • Miscellaneous factors (Geolocation, locale, etc)

    This, to my knowledge, is a fairly complete breakdown. Again, feel free to correct me.
    From the chart linked above, backlink structure is one of the most important metrics used by Google. At least according to seomoz. However, I'd like to focus on the second most important metric first: textual relevance.

    As mentioned in my previous post, several algorithms exist to assess the textual relevance of a given document. While seomoz mentions things like "Keyword Usage in Title Tag (tf-idf)" and "Keyword Usage in H1 (tf-idf)", I believe this is just the tip of the iceberg. While we will most definitely consider these, we will have to start from the ground up. We will have to figure out how to model our data in order to make it accessible in the most sensible and accurate manner.
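    Since seomoz keeps name-dropping tf-idf, here's the idea in miniature (Python for brevity; the toy corpus is made up for illustration). A term scores high in a document when it's frequent in that document but rare across the corpus as a whole:

```python
import math

def tf_idf(term, doc, corpus):
    """Term frequency in doc, weighted by log inverse document frequency."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)  # docs containing the term
    return tf * math.log(len(corpus) / df) if df else 0.0

corpus = [
    ["cheap", "flights", "cheap"],  # our document
    ["cheap", "hotels"],
    ["flight", "reviews"],
]
# "cheap" appears in most of the corpus, so each occurrence is worth less
# than the rarer "flights", even though "cheap" shows up twice in the doc.
print(tf_idf("cheap", corpus[0], corpus))
print(tf_idf("flights", corpus[0], corpus))
```

    The same weighting can then be computed per field (title tag, H1, body), which is presumably what the seomoz factors above are getting at.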

    A detailed explanation and implementation will follow, but generally, we can represent our data using the following models:
    • Analytical models (boolean, extended boolean, fuzzy sets theory)
    • Algebraic models (latent semantic search, vector search, generalized vector search, neural network search)
    • Probabilistic models

    Don't worry if this doesn't make much sense right now. It will in due time. All of these models are very easy to implement and assess.
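    Just to show how easy, here's the most naive analytical model of the bunch, boolean AND retrieval, sketched in Python with made-up toy documents: a document matches the query iff it contains every query term, no ranking involved.

```python
def boolean_and(query, docs):
    """Naive boolean AND retrieval: return indices of docs containing every query term."""
    q = set(query.lower().split())
    return [i for i, d in enumerate(docs) if q <= set(d.lower().split())]

docs = [
    "cheap flights to paris",
    "paris hotels and flights",
    "cheap london hotels",
]
print(boolean_and("cheap flights", docs))  # [0]
print(boolean_and("paris", docs))          # [0, 1]
```

    The obvious weakness, and the reason the algebraic models exist, is that every matching document is equally "relevant" here; there's no ordering whatsoever.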
    But first, let's finally get coding and build some support tools!

    4. Scraping Google

    The very first thing we need to build is a generalized scraper. Initially for Google, but something we can extend later on. I've coded something real quick in CL. Here's the code; as mentioned previously, I'll host all of this on GitHub in due time (binaries for win/mac/linux will be available too).

    ;;;; goog-scraper.lisp
    (in-package #:goog-scraper)

    ;; NOTE: the search URL string was mangled in posting; the format string
    ;; below is a plausible reconstruction (~a = query, ~d = start offset).
    (defparameter *searchuri* "http://www.google.com/search?q=~a&start=~d&num=100")
    (defparameter *agent* "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8a4) Gecko/20040927")

    (defclass result ()
      ((title :initarg :title :accessor result-title)
       (link  :initarg :link  :accessor result-link)))

    (defun construct-search-q (term start)
      "Constructs a search query URL from term and start position."
      (format nil *searchuri* (drakma:url-encode term :utf-8) start))

    (defun clean-link (link)
      "Strips Google's redirect wrapper and tracking params from a result href."
      (cl-ppcre:regex-replace-all "^.*?=|&sa.*" link ""))

    (defun parse-results (body)
      "Parses results returned by Google."
      (let ((doc (chtml:parse body (cxml-stp:make-builder)))
            (results '()))
        (stp:do-recursively (h3 doc)
          (when (and (typep h3 'stp:element)
                     (equal (stp:local-name h3) "h3")
                     (equal (stp:attribute-value h3 "class") "r"))
            (let* ((a     (stp:first-child h3))
                   (title (stp:string-value a))
                   (link  (clean-link (stp:attribute-value a "href"))))
              (push (make-instance 'result
                                   :title title
                                   :link  link)
                    results))))
        (nreverse results)))

    (defun scrape-pages (term &optional (start 0))
      "Scrapes a page of Google results. Returns a list of result objects."
      (multiple-value-bind (body status)
          (drakma:http-request (construct-search-q term start)
                               :method :get
                               :user-agent *agent*
                               :connection-timeout 5)
        (when (= 200 status)
          (parse-results body))))
    If you are wondering "What's up with all these parens???", I don't blame you haha. For now just remember that there's a method to this madness. Lisp's s-expressions (the stuff in the parens) allow for very powerful language constructs, which will become very apparent later on.
    Anyhow, the scraper is very dumb. The scrape-pages function takes a keyword and a starting page. Another function generates the search URL and passes it on to the HTTP request. The request should return the first page of Google results (100 at once), at which point the results are parsed and returned as an easily accessible list (think array in other languages). Here's a quick video of the thing in action:
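    For those who don't read regexes, the clean-link step in the scraper just strips Google's redirect wrapper from each result href. Here's the same regex in Python (the sample href is a made-up example of the wrapper format):

```python
import re

def clean_link(link):
    """Strip the leading /url?q= redirect wrapper and trailing &sa... tracking bits."""
    return re.sub(r"^.*?=|&sa.*", "", link)

href = "/url?q=http://example.com/page&sa=U&ei=abc123"
print(clean_link(href))  # http://example.com/page
```

    The non-greedy `^.*?=` eats everything up to the first `=` at the start of the string, and `&sa.*` drops the tracking tail, leaving just the target URL.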

    This should give you an idea of where I'm going with this. We'll be building some more support tools in the next post and start implementing some analytical models.

    Cheers everyone
    Last edited by a moderator: May 18, 2016
  5. technetium

    technetium Registered Member

    Jul 16, 2012
    You sir, may have lost your mind!

    I feel I would be remiss should I fail to encourage you however I am able LOL!
    Here's some more brain food, though some items will be more relevant than others. Of course, that's the whole point anyway.
  6. Gophering

    Gophering Junior Member

    Mar 21, 2013
    My good sir, I very well might have.
    I do appreciate your input very much indeed. I shall take a look and draw the relevant conclusions.

    • Thanks x 1