Reverse Engineering Google's Algos

Gophering · Apr 11, 2013

Got your attention? Nice.

Ok, I admit the title should probably be adjusted. "Reverse Engineering Google Search Appliance Algos from 2009" is probably more appropriate. A lot less glamorous, but hey, it's still fun (+ we might learn a couple of things)! Stick with it, this is gonna be a loooooooooong post (you might want to grab a tea or coffee, or something a little heavier if you prefer). Also, I'll probably have to break this down a little...

WTF is Google Search Appliance

For those of you not familiar with GSA (Google Search Appliance), it looks something like this:
View attachment 28120

And here's the homepage: http://www.google.com/enterprise/search/campaigns/gsa7.html
You can purchase this magical box of goodness from Google so that you too can experience a little bit of Google's magic in your company! More seriously though, this is a standalone box that somewhat replicates Google's core algos so that you can run your own company crawler. You might use this on external URLs, images, internal URLs, databases and whatever else you can imagine. Pretty much your own custom Google brain right out of the box (quite literally).

So why is this interesting? Theoretically speaking if one could look under the hood, one could potentially discover some tasty secrets.

The Quest For a GSA Box

Since GSA's initial launch I've been constantly monitoring eBay, several torrent sites and several Chinese sites, trying to get my hands on either the device or an image of the brains. Alas, this didn't happen. So I kind of forgot about the whole thing for a couple of years. Until very recently, when I decided to have another look for a virtual machine file/ISO (getting the hardware right now is pretty much impossible; I believe the project has been discontinued).

Long story short, I did my regular search on Yahoo, Bing and Google and didn't discover anything. However, a search on Baidu brought up a download link to a 2009 version of the brains!

Very nice. Let's get this installed.

(Warning: Before we go on, I wanna make sure that everybody understands that this is a version from 2009, so it's not applicable to actual SEO. We might learn something new here, but consider all information presented here to be speculative in nature.)

Installing
After downloading the file (1.3 GB) and extracting it (34 GB or so) we are basically presented with a standard .vmdk collection. I fired up a basic RedHat Linux VM and loaded up the .vmdk image as the standard HD. For those interested, I use VirtualBox as my virtualization environment, however I believe that it would work equally well on VMware or similar.
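
For anyone who'd rather script that step than click through the GUI, here is a rough sketch of the equivalent VBoxManage calls, built from Python. This is not exactly what I ran: the VM name, the controller name "IDE" and the file name gsa.vmdk are made-up placeholders, and I'm assuming the "RedHat" guest OS type matches the "basic RedHat Linux VM" mentioned above. The function only builds the command lines, it doesn't execute anything.

```python
import shlex


def virtualbox_commands(vm_name, vmdk_path):
    """Build the VBoxManage calls to create a VM and attach an existing .vmdk.

    vm_name and vmdk_path are placeholders; nothing is executed here.
    """
    controller = "IDE"  # assumed controller name, any label works
    return [
        # Create and register an empty VM with a RedHat guest profile.
        ["VBoxManage", "createvm", "--name", vm_name,
         "--ostype", "RedHat", "--register"],
        # Add an IDE storage controller to the VM.
        ["VBoxManage", "storagectl", vm_name,
         "--name", controller, "--add", "ide"],
        # Attach the extracted .vmdk as the primary hard disk.
        ["VBoxManage", "storageattach", vm_name,
         "--storagectl", controller, "--port", "0", "--device", "0",
         "--type", "hdd", "--medium", vmdk_path],
    ]


# Print the commands so they can be reviewed before running them by hand.
for command in virtualbox_commands("gsa-2009", "gsa.vmdk"):
    print(shlex.join(command))
```

Printing instead of executing is deliberate: you can eyeball the commands first, then paste them into a shell or hand each list to subprocess.run.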

Let's hit start and see what happens.

Your regular CentOS startup.

Running
Alright, so far so good. After waiting for a couple of minutes I was finally presented with a screen that basically told me "systems are running fine, navigate to %IP% for search or %IP:PORT% for the admin panel". Said and done, here's GSA's standard search screen (searching for "blackhatworld" doesn't return any results since we first need to crawl a couple of pages):

Looks pretty cool. The pre-Penguin/Panda interface with the additional "Appliance" menu. Alright, let's get some pages indexed. I was prompted for a password when accessing the admin interface. Fortunately enough, the download included a "Read Me First" file which contained the password.

The control panel. A bunch of stuff in there (which I haven't really explored yet). The whole thing is fairly intelligent: various crawler settings (for example, bypassing submit forms and so on), pattern matching options and much more. I imagine that even today (almost 5 years on) this could be fairly useful for data indexing and further processing.

Adding our patterns and running the crawler. I've added blackhatworld here and started "continuous crawling". After a few minutes I shut down the crawler and:

Cool, it appears to be working! I must note that I really liked the interface of this whole thing. Matched with the underlying functionality (+ various API feeding options) this might turn out to be a very useful little system.
Anyhow. Let's dig some more.

Under The Hood
Since we have the image file, why not dig a little deeper? The problem is, the system starts up with the default "Access blabla at IP blabla" screen and locks you out of bash... So no commands can be executed (except for API calls... but this is rather boring). Calling up the lilo bootloader upon system load and then trying to boot with the init or single command causes the following:

The lilo password was unfortunately missing from the readme...
I ended up examining the flat .vmdk image with parted and noting down the partitions: swap plus two ext3 partitions, a massive one most probably holding the brains and a smaller one holding the root filesystem. After that, I noted down the offsets of the root partition and mounted it into my local filesystem. That gave me access to the core GSA files (minus the all-important extended partition which houses those Google search scripts). Examining /etc/lilo.conf gives us the following:

We got the password! Alright, after restarting the machine and booting with "single rw" we finally get our bash session going! Success!
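
For anyone repeating the partition step, here is a minimal sketch of how the mount offset can be worked out without parted. It assumes the flat .vmdk is a raw disk image with a classic MBR and 512-byte sectors, and it only reads the four primary entries (logical partitions inside an extended partition are not followed). The file name and mount point in the usage lines are made up.

```python
import struct

SECTOR_SIZE = 512  # assumed logical sector size of the flat image


def read_mbr_partitions(mbr):
    """Parse the four primary partition entries from the first 512 bytes."""
    if len(mbr) < 512 or mbr[510:512] != b"\x55\xaa":
        raise ValueError("no MBR boot signature found")
    partitions = []
    for index in range(4):
        start = 446 + 16 * index  # the partition table begins at byte 446
        entry = mbr[start:start + 16]
        # Layout: status, CHS start, type, CHS end, LBA start, sector count.
        _status, _chs_first, ptype, _chs_last, lba_start, sectors = \
            struct.unpack("<B3sB3sII", entry)
        if ptype == 0:  # type 0 marks an unused slot
            continue
        partitions.append({
            "number": index + 1,
            "type": ptype,  # 0x83 = Linux filesystem, 0x82 = Linux swap
            "offset_bytes": lba_start * SECTOR_SIZE,
            "size_bytes": sectors * SECTOR_SIZE,
        })
    return partitions


def mount_command(image_path, partition, mount_point):
    """Build a read-only loop mount command for one partition of the image."""
    options = "ro,loop,offset={}".format(partition["offset_bytes"])
    return ["mount", "-o", options, image_path, mount_point]
```

Usage would be something like reading the first 512 bytes of the image with open("gsa-flat.vmdk", "rb").read(512), passing them to read_mbr_partitions, picking the small ext3 entry and running the resulting mount command as root. Mounting read-only keeps the image untouched while you poke around.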
(Bear with me please as I create a second post. Exceeded attachment limit...)

crazyflx · Apr 11, 2013

I'll be tuning in for this one. This looks like it could be incredibly juicy. +rep & thanks for the effort and for sharing this, regardless of where it goes.

I encourage you to continue to flesh this out, even if it doesn't go anywhere OP. It's threads like this that make this forum, even if it's just theory, you're putting the effort forward to actually do something while simultaneously sharing something most people don't even know exists.

I'd take 100 threads like this that possibly fizzle out and go nowhere, than even one single "follow me on my journey" thread.

Keep it up.

LakeForest · Apr 11, 2013

Yeah, it's almost like Google should hire OP.

m888e · Apr 11, 2013

Hey, that is wicked cool. I remember asking Google how much it cost to buy one of these many years ago and it was something really high like $20k or $30k or something crazy.

AR!ZONA · Apr 11, 2013

MOAR!!!
Thanks & +REP, OP

BabyMonster · Apr 11, 2013

Woah!! I'll keep track of your thread. This is something new to me. Subscribed.

xpleet · Apr 11, 2013

I had a similar idea to reverse-engineer "Google Desktop" and dig inside; maybe I can find something interesting.
Subscribed and waiting for the 2nd part.
+Thanks & +Rep added.

zenlagor · Apr 11, 2013

Many moons ago at my old work I set up a Google Mini. It basically just indexes an intranet to be as searchable as Google is on the internet. I remember one time the Google Mini lost its connection to the internet and that caused the customer loads of problems. From that you could probably assume a lot of things, but one of them would be that it's just playing piggy in the middle with the data. Not sure of the differences between the product you've linked and the Google Mini.

If you are in the UK there's a Google Mini search appliance on eBay for 400 quid right now. Just search "google mini" and sort by price highest.

futurestunner · Apr 11, 2013

Good work... waiting for the end result.

Good luck OP

Gophering · Apr 11, 2013

Digging Some More

Alright, so we got access. I must note that most of the important bits are held in compiled C, however there's some pretty interesting, uncompiled Python stuff lying around, too. Crawler, indexing and ranking scripts rely on bits and pieces of Python code. So, let's start fucking digging! Firstly, the all important external partition:

Now keep in mind that I don't really know where the juicy parts might be as there are a ton of files to explore. So most of the stuff here will probably be rather mild and nothing new, until I have some more time to explore things. Anyhow, after browsing around a little, I ended up in the "quality" folder. Let's check that one out. cding and lsing, we get a certain Python file in the "rankboost/indexing" directory. What's your take on this?

A certain directory called "spelling" contains a huge set of well... dictionary files (strangely formatted too), with various filters, stop words, etc. etc.

Hehe "sex", "essex". I think they should associate those.
In the proto spam file:

In the crawler files:

Some more snippets from rankboost.

Googlebot stuff:

Uhh... Attachment limit again...

So where's this thread heading?
I'm not quite sure yet. Firstly, I want to explore the Python base some more. There are a bunch of files over here, as well as compiled binaries, libraries and a ton of other things. I also want to play around with the interface itself, do some test crawls, record my findings and so on.

This is just an experiment. I've wanted to see what's under the hood for a long time, so I got my peek inside. The plan is to revisit this thread every other day or so and update with my findings. If that's of interest I'll be glad to be of service.

Cheers guys and stay tuned.

Gophering · Apr 11, 2013

crazyflx said:
I'll be tuning in for this one. This looks like it could be incredibly juicy. +rep & thanks for the effort and for sharing this, regardless of where it goes.

I encourage you to continue to flesh this out, even if it doesn't go anywhere OP. It's threads like this that make this forum, even if it's just theory, you're putting the effort forward to actually do something while simultaneously sharing something most people don't even know exists.

I'd take 100 threads like this that possibly fizzle out and go nowhere, than even one single "follow me on my journey" thread.

Keep it up.

Appreciate the interest, man. I really don't know where this will go. On first look around, some stuff seems rather interesting. I'll probably post the files up somewhere later on. No point in me going through them alone when a bunch of brains/eyes could be working together. I'll definitely keep at it though, there's stuff to be explored. The interface itself seems very interesting.

zenlagor said:
Many moons ago at my old work I set up a Google Mini. It basically just indexes an intranet to be as searchable as Google is on the internet. I remember one time the Google Mini lost its connection to the internet and that caused the customer loads of problems. From that you could probably assume a lot of things, but one of them would be that it's just playing piggy in the middle with the data. Not sure of the differences between the product you've linked and the Google Mini.

If you are in the UK there's a Google Mini search appliance on eBay for 400 quid right now. Just search "google mini" and sort by price highest.

Hmm, you are right. This is not a Google Mini though, and I don't think GSAs are sold anymore (however I might be wrong)... Parts of it might be acting as a stupid client, however certain parts are most definitely standalone pieces of code. More exploration will follow.

sire243 · Apr 11, 2013

Subbed and rep coming your way. I'm downloading it right now, and I'll see if I can follow what you did.

Gophering · Apr 11, 2013

sire243 said:
Subbed and rep coming your way. I'm downloading it right now, and I'll see if I can follow what you did.

Nice! Have fun with this. You got the password anyhow so just boot with init or single and you should be good to go. Feel free to post your findings over here.

manny2513 · Apr 11, 2013

Long read but good content, thanks a lot.

youtalk · Apr 11, 2013

Always great work Gophering...

shurk · Apr 11, 2013

This is going to be an interesting thread!

bobred · Apr 11, 2013

A fascinating insight. Even if their algo has changed vastly since then, there is still a wealth of information to be gleaned just from those screenies alone.

Thanked, +rep and I offer you my first-born.

dgruergerugerhiye · Apr 11, 2013

You must spread some Reputation around before giving it to Gophering again.

Shame.

system0102 · Apr 11, 2013

That's awesome! Will follow the thread to see what you guys can find, as I'm not really a tech person.

Thanks for doing it OP!

marishal · Apr 11, 2013

I'm not sure why we're assuming that the Google Search Appliance is the same as the Google internal algo (even back in 2009). The basic functions such as crawling are probably very similar, but I doubt that there are any additional parallels, especially with the index. And when it comes to the index I think that there are two things that we're interested in: ranking and penalty factors.

There might be some ranking info bits in this appliance (we'll never be sure).
The penalty factors, from what I understand, are applied retroactively against the existing index based on some pattern or footprint, which means that they will not be part of the "index", otherwise the system would automatically "penalize" websites as it "indexes" them.

Anyways, I'm not trying to take away from the OP's work and there might be some value after all, it's just that we will never know.
