Reverse Engineering Google's Algos

Discussion in 'Black Hat SEO' started by Gophering, Apr 11, 2013.

  1. Gophering

    Gophering Junior Member Premium Member

    Joined:
    Mar 21, 2013
    Messages:
    115
    Likes Received:
    279
    Occupation:
    Unemployed
    Location:
    EU
    Got your attention? Nice.

    Ok, I admit the title should probably be adjusted. "Reverse Engineering Google Search Appliance Algos from 2009" is probably more appropriate. A lot less glamorous, but hey, it's still fun (+ we might learn a couple of things)! Stick with it, this is gonna be a loooooooooong post (you might want to grab a tea or coffee or something a little heavier if you prefer). Also, I'll probably have to break this down a little...

    WTF is Google Search Appliance

    For those of you not familiar with GSA (Google Search Appliance), it looks something like this:
    View attachment 28120

    And here's the homepage: http://www.google.com/enterprise/search/campaigns/gsa7.html
    You can purchase this magical box of goodness from Google so that you too can experience a little bit of Google's magic in your company! More seriously though, this is a standalone box that somewhat replicates Google's core algos so that you can run your own company crawler. You might use this on external URLs, images, internal URLs, databases and whatever else you can imagine. Pretty much your own custom Google brain right out of the box (quite literally).

    So why is this interesting? Theoretically speaking if one could look under the hood, one could potentially discover some tasty secrets.

    The Quest For a GSA Box

    Since GSA's initial launch I've been constantly monitoring eBay, several torrent sites as well as several Chinese sites, hoping to get my hands on either the device or an image of the brains. Alas, this didn't happen, so I kind of forgot about the whole thing for a couple of years. Until very recently, that is, when I decided to have another look for a virtual machine file/ISO (getting the hardware right now is pretty much impossible; I believe the project has been discontinued).

    Long story short, I did my regular search on Yahoo, Bing and Google and didn't discover anything. However, a search on Baidu brought up a download link to a 2009 version of the brains!

    Very nice. Let's get this installed.

    (Warning: Before we go on, I wanna make sure that everybody understands that this is a version from 2009, so it's not applicable to actual SEO today. We might learn something new here, but consider all information presented here to be speculative in nature.)

    Installing
    After downloading the file (1.3 GB) and extracting it (34 GB or so) we are basically presented with a standard .vmdk collection. I fired up a basic Red Hat Linux VM and loaded up the .vmdk image as the standard HD. For those interested, I use VirtualBox as my virtualization environment, however I believe that it would work equally well on VMware or similar.

    Let's hit start and see what happens.
    q2ibdN4.png

    Your regular CentOS startup.

    Running
    Alright, so far so good. After waiting for a couple of minutes I was finally presented with a screen that basically told me "systems are running fine, navigate to %IP% for search or %IP:PORT% for the admin panel". Said and done, here's GSA's standard search screen (searching for "blackhatworld" doesn't return any results since we first need to crawl a couple of pages):
    22n9gcx.jpg

    Looks pretty cool. The pre-Penguin/Panda interface with the additional "Appliance" menu. Alright, let's get some pages indexed. I was prompted for a password when accessing the admin interface. Fortunately enough, the download included a "Read Me First" file which contained the password.
    DZn7ebY.jpg

    The control panel. A bunch of stuff in there (which I haven't really explored yet). The whole thing is fairly intelligent: various crawler settings (for example, bypassing submit forms and so on), pattern matching options and much more. I imagine that even today (almost 5 years on) this could be fairly useful for data indexing and further processing.
    9cvbeNE.jpg
    va0motO.jpg

    Adding our patterns and running the crawler. I added blackhatworld here and started "continuous crawling". After a few minutes I shut down the crawler and:
    5iLx0yl.jpg

    Cool, it appears to be working! I must note that I really liked the interface of this whole thing. Matched with the underlying functionality (+ various API feeding options) this might turn out to be a very useful little system.
    Anyhow. Let's dig some more.

    Under The Hood
    Since we have the image file, why not dig a little deeper? The problem is, the system starts up with the default "Access blabla at IP blabla" screen and locks you out of bash... So no commands can be executed (except for API calls... but this is rather boring). Calling up the LILO bootloader upon system load and then trying to boot with the init or single command causes the following:
    ZZF2YwE.png

    The LILO password was unfortunately missing from the readme...
    I ended up examining the flat .vmdk image with parted and noting down the partitions (swap as well as two ext3 partitions: a massive one, most probably holding the brains, and a smaller one holding the root filesystem). After that, I noted down the sectors of the root partition and mounted it into my local filesystem, which gave me access to the core GSA archive (minus the all-important extended partition which houses those Google search scripts). Examining /etc/lilo.conf gives us the following:
    cA64J9K.png
    We got the password! Alright, after restarting the machine and booting with "single rw" we finally get our bash session going! Success!
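
    For anyone following along, the "note down the partitions and mount the root one" step can be sketched roughly like this in Python. This is not what was run in the post (parted was used there); it is a minimal sketch that assumes the flat image is a raw disk with a classic MBR partition table and 512-byte sectors. The function name and the returned field names are made up for illustration.

```python
import struct

SECTOR_SIZE = 512  # assumption: classic 512-byte sectors


def read_mbr_partitions(image_path):
    """Return the non-empty primary partition entries of a raw disk image."""
    with open(image_path, "rb") as f:
        mbr = f.read(SECTOR_SIZE)
    # A valid MBR ends with the boot signature 0x55 0xAA.
    if len(mbr) < SECTOR_SIZE or mbr[510:512] != b"\x55\xaa":
        raise ValueError("no MBR boot signature found")

    partitions = []
    for index in range(4):
        # The four 16-byte partition entries start at byte 446.
        entry = mbr[446 + 16 * index: 446 + 16 * (index + 1)]
        part_type = entry[4]  # e.g. 0x83 = Linux, 0x82 = Linux swap
        # Bytes 8-11: first sector (LBA), bytes 12-15: sector count.
        start_lba, num_sectors = struct.unpack_from("<II", entry, 8)
        if part_type == 0 or num_sectors == 0:
            continue  # unused slot
        partitions.append({
            "index": index + 1,
            "type": part_type,
            "offset_bytes": start_lba * SECTOR_SIZE,
            "size_bytes": num_sectors * SECTOR_SIZE,
        })
    return partitions
```

    The "offset_bytes" value is the number you would hand to a read-only loop mount (for example the offset= option of mount -o loop) to get at a single partition inside the image. A sparse or split .vmdk would need converting to a flat image first, so treat this as a starting point only.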
    (Bear with me please as I create a second post. Exceeded attachment limit...)
     
    • Thanks x 91
  2. crazyflx

    crazyflx Elite Member

    Joined:
    Nov 9, 2009
    Messages:
    1,674
    Likes Received:
    4,825
    Location:
    http://CRAZYFLX.COM
    I'll be tuning in for this one. This looks like it could be incredibly juicy. +rep & thanks for the effort and for sharing this, regardless of where it goes.

    I encourage you to continue to flesh this out, OP, even if it doesn't go anywhere. It's threads like this that make this forum; even if it's just theory, you're putting in the effort to actually do something while simultaneously sharing something most people don't even know exists.

    I'd rather take 100 threads like this that possibly fizzle out and go nowhere than even one single "follow me on my journey" thread.

    Keep it up.
     
    • Thanks x 9
    Last edited: Apr 11, 2013
  3. LakeForest

    LakeForest Supreme Member

    Joined:
    Nov 11, 2009
    Messages:
    1,269
    Likes Received:
    1,802
    Location:
    Location Location
    Yeah, it's almost like Google should hire OP.
     
    • Thanks x 2
  4. m888e

    m888e Newbie

    Joined:
    Jan 2, 2013
    Messages:
    16
    Likes Received:
    1
    Hey, that is wicked cool. I remember asking Google how much it would cost to buy one of these many years ago, and it was something really high like $20k or $30k or something crazy.
     
  5. AR!ZONA

    AR!ZONA Regular Member

    Joined:
    Mar 20, 2012
    Messages:
    400
    Likes Received:
    391
    Location:
    Cactus Island
    MOAR!!!
    Thanks & +REP, OP
     
  6. BabyMonster

    BabyMonster Power Member

    Joined:
    Feb 4, 2007
    Messages:
    711
    Likes Received:
    221
    Location:
    Street
    Woah!! I'll keep track of your thread. This is something new to me. Subscribed.
     
  7. xpleet

    xpleet Regular Member

    Joined:
    Jan 18, 2010
    Messages:
    377
    Likes Received:
    327
    Location:
    Morocco
    I had a similar idea to reverse-engineer "Google Desktop" and dig inside; maybe I can find something interesting.
    Subscribed and waiting for the 2nd part.
    +Thanks & +Rep added.
     
  8. zenlagor

    zenlagor Regular Member

    Joined:
    Apr 4, 2013
    Messages:
    357
    Likes Received:
    184
    Occupation:
    Virtual Pimp
    Location:
    Colombia
    Many moons ago at my old work I set up a Google Mini. It basically just indexes an intranet to be as searchable as Google is on the internet. I remember one time the Google Mini lost its connection to the internet and that caused the customer loads of problems. From that you could probably assume a lot of things, but one of them would be that it's just playing piggy in the middle with the data. Not sure what the differences are between the product you've linked to and the Google Mini.

    If you are in the UK there's a Google Mini search appliance on eBay for 400 quid right now. Just search "google mini" and sort by price highest.
     
  9. futurestunner

    futurestunner BANNED

    Joined:
    Dec 26, 2009
    Messages:
    1,532
    Likes Received:
    1,036
    Good work... waiting for the end result. :)

    Good luck OP
     
  10. Gophering

    Gophering Junior Member Premium Member

    Digging Some More

    Alright, so we got access. I must note that most of the important bits are held in compiled C, however there's some pretty interesting, uncompiled Python stuff lying around, too. Crawler, indexing and ranking scripts rely on bits and pieces of Python code. So, let's start fucking digging! Firstly, the all-important external partition:
    9PWWxhH.png

    Now keep in mind that I don't really know where the juicy parts might be, as there are a ton of files to explore. So most of the stuff here will probably be rather mild and nothing new until I have some more time to explore things. Anyhow, after browsing around a little, I ended up in the "quality" folder. Let's check that one out. After some cd-ing and ls-ing we get a certain Python file in the "rankboost/indexing" directory. What's your take on this?
    s56huH5.png
    A certain directory called "spelling" contains a huge set of well... dictionary files (strangely formatted too), with various filters, stop words, etc. etc.
    GROfyxx.png
    Hehe "sex", "essex". I think they should associate those.
    In the proto spam file:
    V28K2vC.png
    In the crawler files:
    fFlnshL.png
    Some more snippets from rankboost.
    91Oy62i.png
    Googlebot stuff:
    GyDQA4W.png
    Uhh... Attachment limit again...

    So where's this thread heading?
    I'm not quite sure yet. Firstly, I want to explore the Python codebase some more. There are a bunch of files over here, as well as compiled binaries, libraries and a ton of other things. I also want to play around with the interface itself, do some test crawls, record my findings and so on.
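
    With that many files, a quick survey of what is readable Python source and what is a compiled binary could help decide where to look first. Here is a minimal sketch, assuming the partition is mounted read-only somewhere on the local filesystem. The function names are made up, and the check for compiled code is just the 4-byte ELF magic number, nothing smarter.

```python
import os
from collections import Counter

ELF_MAGIC = b"\x7fELF"  # first four bytes of any ELF binary or shared library


def classify_file(path):
    """Very rough classification: Python source, ELF binary, or other."""
    if path.endswith(".py"):
        return "python"
    try:
        with open(path, "rb") as f:
            if f.read(4) == ELF_MAGIC:
                return "elf"
    except OSError:
        return "unreadable"
    return "other"


def survey_tree(root):
    """Walk root and return (counts per kind, sorted relative .py paths)."""
    counts = Counter()
    python_files = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and special files (devices, sockets, ...).
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            kind = classify_file(path)
            counts[kind] += 1
            if kind == "python":
                python_files.append(os.path.relpath(path, root))
    return counts, sorted(python_files)
```

    Calling survey_tree() on the mount point returns the totals plus a sorted list of .py paths, which is a reasonable reading list to start from. It says nothing about which files actually matter.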

    This is just an experiment. I've wanted to see what's under the hood for a long time, so I got my peek inside. The plan is to revisit this thread every other day or so and update with my findings. If that's of interest I'll be glad to be of service.

    Cheers guys and stay tuned.
     
    • Thanks x 25
  11. Gophering

    Gophering Junior Member Premium Member

    Appreciate the interest, man. I really don't know where this will go. On a first look around, some stuff seems rather interesting. I'll probably post the files up somewhere later on. No point in me going through them alone when a bunch of brains/eyes could be working together. I'll definitely keep at it though, there's stuff to be explored. The interface itself seems very interesting.

    Hmm, you are right. This is not a Google Mini though, and I don't think GSAs are sold anymore (however I might be wrong)... Parts of it might be acting as a dumb client, however certain parts are most definitely standalone pieces of code. More exploration will follow.
     
    • Thanks x 3
  12. sire243

    sire243 Regular Member

    Joined:
    Jun 23, 2010
    Messages:
    255
    Likes Received:
    112
    Subbed and rep coming your way. I'm downloading it right now, and I'll see if I can follow what you did. :)
     
    • Thanks x 1
  13. Gophering

    Gophering Junior Member Premium Member

    Nice! Have fun with this. You got the password anyhow so just boot with init or single and you should be good to go. Feel free to post your findings over here.
     
    • Thanks x 4
  14. manny2513

    manny2513 Junior Member

    Joined:
    Apr 4, 2011
    Messages:
    106
    Likes Received:
    42
    Long read, but good content. Thanks a lot.
     
    • Thanks x 1
  15. youtalk

    youtalk Regular Member

    Joined:
    Jul 5, 2012
    Messages:
    337
    Likes Received:
    6
    Occupation:
    Owner
    Location:
    I don't even know anymore
    Always great work Gophering...
     
    • Thanks x 1
  16. shurk

    shurk Junior Member

    Joined:
    Feb 2, 2011
    Messages:
    122
    Likes Received:
    45
    This is going to be an interesting thread! :D
     
    • Thanks x 1
  17. bobred

    bobred Registered Member

    Joined:
    Dec 21, 2011
    Messages:
    98
    Likes Received:
    63
    A fascinating insight. Even if their algo has changed vastly since this, there is still a wealth of information to be gleaned just from those screenies alone.

    Thanked, +rep and I offer you my first-born.
     
    • Thanks x 1
  18. dgruergerugerhiye

    dgruergerugerhiye BANNED Jr. VIP Premium Member

    Joined:
    Nov 4, 2010
    Messages:
    305
    Likes Received:
    450
    You must spread some Reputation around before giving it to Gophering again.


    Shame.
     
    • Thanks x 2
  19. system0102

    system0102 Regular Member

    Joined:
    Nov 26, 2012
    Messages:
    339
    Likes Received:
    426
    That's awesome! Will follow the thread to see what you guys can find as I'm not really a tech person :) Thanks for doing it OP!
     
    • Thanks x 1
  20. marishal

    marishal Registered Member

    Joined:
    Jan 5, 2012
    Messages:
    80
    Likes Received:
    14
    I'm not sure why we're assuming that the Google Search Appliance is the same as the Google internal algo (even back in 2009). The basic functions such as crawling are probably very similar, but I doubt that there are any additional parallels, especially with the index. And when it comes to the index, I think that there are two things that we're interested in: ranking and penalty factors.

    There might be some ranking info bits in this appliance (we'll never be sure).
    The penalty factors, from what I understand, are applied retroactively against the existing index based on some pattern or footprint, which means that they will not be part of the "index"; otherwise the system would automatically "penalize" websites as it "indexes" them.

    Anyways, I'm not trying to take away from the OP's work, and there might be some value after all; it's just that we will never know :)
     
    • Thanks x 1