Gophering
Junior Member
- Mar 21, 2013
Got your attention? Nice.
Ok, I admit the title should probably be adjusted. "Reverse Engineering Google Search Appliance Algos from 2009" is probably more appropriate. A lot less glamorous, but hey, it's still fun (+ we might learn a couple of things)! Stick with it, this is gonna be a loooooooooong post (you might want to grab a tea or coffee, or something a little heavier if you prefer). Also, I'll probably have to break this down a little...
WTF is Google Search Appliance
For those of you not familiar with GSA (Google Search Appliance), it looks something like this:
View attachment 28120
And here's the homepage: http://www.google.com/enterprise/search/campaigns/gsa7.html
You can purchase this magical box of goodness from Google so you too can experience a little bit of Google's magic in your company! More seriously though, this is a standalone box that somewhat replicates Google's core algos so that you can run your own company crawler. You might use this on external URLs, images, internal URLs, databases and whatever else you can imagine. Pretty much your own custom Google brain right out of the box (quite literally).
So why is this interesting? Theoretically speaking if one could look under the hood, one could potentially discover some tasty secrets.
The Quest For a GSA Box
Since GSA's initial launch I've been constantly monitoring eBay, several torrent sites and several Chinese sites, trying to get my hands on either the device or an image of the brains. Alas, this didn't happen, so I kind of forgot about the whole thing for a couple of years. Until very recently, that is, when I decided to have another look for a virtual machine file/ISO (getting the hardware right now is pretty much impossible; I believe the project has been discontinued).
Long story short, I did my regular search on Yahoo, Bing and Google and didn't discover anything. However, a search on Baidu brought up a download link to a 2009 version of the brains!
Very nice. Let's get this installed.
(Warning: Before we go on, I wanna make sure that everybody understands that this is a version from 2009, so it's not applicable for actual SEO. We might learn something new here, but consider all information presented here to be speculative in nature.)
Installing
After downloading the file (1.3 GB) and extracting it (34 GB or so) we are basically presented with a standard .vmdk collection. I set up a basic Red Hat Linux VM and attached the .vmdk image as its primary hard disk. For those interested, I use VirtualBox as my virtualization environment, but I believe it would work equally well on VMware or similar.
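For anyone who prefers the command line over the VirtualBox GUI, here is a rough sketch of the same setup as VBoxManage calls. The VM name "gsa-2009" and the file name "gsa.vmdk" are made up for illustration (the download's actual file names aren't given here), and the IDE controller is just a guess at a sensible default. The script only builds and prints the commands, it doesn't run anything.

```python
import shlex
from typing import List


def vbox_commands(vm_name: str, vmdk_path: str) -> List[List[str]]:
    """Build (but do not run) VBoxManage commands that create a VM
    and attach an existing .vmdk image as its primary hard disk."""
    controller = "IDE"  # assumed controller type; SATA may work as well
    return [
        # Register a new, empty VM with a Red Hat guest profile.
        ["VBoxManage", "createvm", "--name", vm_name,
         "--ostype", "RedHat", "--register"],
        # Add a storage controller to hang the disk on.
        ["VBoxManage", "storagectl", vm_name,
         "--name", controller, "--add", "ide"],
        # Attach the existing image as the first disk on that controller.
        ["VBoxManage", "storageattach", vm_name,
         "--storagectl", controller, "--port", "0", "--device", "0",
         "--type", "hdd", "--medium", vmdk_path],
        # Boot the VM.
        ["VBoxManage", "startvm", vm_name],
    ]


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    for cmd in vbox_commands("gsa-2009", "gsa.vmdk"):
        print(shlex.join(cmd))
```

Printing instead of executing means you can eyeball the commands and adjust the names before pasting them into a shell.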
Let's hit start and see what happens.

Your regular CentOS startup.
Running
Alright, so far so good. After waiting for a couple of minutes I was finally presented with a screen that basically told me "systems are running fine, navigate to %IP% for search or %IP:PORT% for the admin panel". Said and done, here's GSA's standard search screen (searching for "blackhatworld" doesn't return any results since we first need to crawl a couple of pages):

Looks pretty cool: the pre-Penguin/Panda interface with the additional "Appliance" menu. Alright, let's get some pages indexed. I was prompted for a password when accessing the admin interface. Fortunately enough, the download included a "Read Me First" file which contained the password.

The control panel. There's a bunch of stuff in there (which I haven't really explored yet). The whole thing is fairly intelligent: various crawler settings (for example, bypassing submit forms and so on), pattern matching options and much more. I imagine that even today (almost 5 years on) this could be fairly useful for data indexing and further processing.

Adding our patterns and running the crawler. I added blackhatworld here and started "continuous crawling". After a few minutes I shut down the crawler and:

Cool, it appears to be working! I must note that I really liked the interface of this whole thing. Matched with the underlying functionality (+ various API feeding options) this might turn out to be a very useful little system.
Anyhow. Let's dig some more.
Under The Hood
Since we have the image file, why not dig a little deeper? The problem is, the system starts up with the default "Access blabla at IP blabla" screen and locks you out of bash... So no commands can be executed (except for API calls... but this is rather boring). Calling up the LILO bootloader on system load and then trying to boot with the init or single command causes the following:

The LILO password was unfortunately missing from the readme...
I ended up examining the flat .vmdk image with parted and noting down the partitions: swap plus two ext3 partitions, a massive one most probably holding the brains and a smaller one holding the root filesystem. After that, I noted down the sector offsets of the root partition and mounted it into my local filesystem. This gave me access to the core GSA archive (minus the all-important extended partition which houses those Google search scripts). Examining /etc/lilo.conf gives us the following:

We got the password! Alright, after restarting the machine and booting with "single rw" we finally get our bash session going! Success!
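For those who want to repeat the mount trick, here is a minimal sketch of the arithmetic involved. It assumes you ran something like "parted <image> unit s print" so the start column is in sectors, and that sectors are 512 bytes. The sample partition table, the image name "gsa-flat.vmdk", the mount point and the password value are all made up for illustration; they are not the real values from the GSA image.

```python
import re
from typing import Dict, List, Optional

SECTOR_SIZE = 512  # assumed logical sector size in bytes

# Hypothetical parted output (unit s); NOT the real GSA layout.
SAMPLE_PARTED = """\
Number  Start     End       Size      Type     File system     Flags
 1      63s       4192964s  4192902s  primary  linux-swap(v1)
 2      4192965s  8385929s  4192965s  primary  ext3            boot
"""

# Hypothetical lilo.conf fragment; the password is a placeholder.
SAMPLE_LILO = """\
boot=/dev/sda
# global password protecting extra boot parameters
password=example-secret
restricted
"""


def parse_start_sectors(parted_output: str) -> Dict[int, int]:
    """Map partition number -> start sector from 'unit s' parted output."""
    starts: Dict[int, int] = {}
    for line in parted_output.splitlines():
        # Partition rows begin with a number followed by a start like '63s'.
        m = re.match(r"\s*(\d+)\s+(\d+)s\s", line)
        if m:
            starts[int(m.group(1))] = int(m.group(2))
    return starts


def mount_command(image: str, start_sector: int, mountpoint: str) -> List[str]:
    """Build a read-only loop mount command for one partition of a flat image."""
    offset = start_sector * SECTOR_SIZE  # byte offset of the partition
    return ["mount", "-o", f"loop,ro,offset={offset}", image, mountpoint]


def lilo_password(conf_text: str) -> Optional[str]:
    """Return the value of the first 'password=' line, or None if absent."""
    for raw in conf_text.splitlines():
        line = raw.strip()
        if line.startswith("#"):
            continue  # skip comment lines
        key, sep, value = line.partition("=")
        if sep and key.strip() == "password":
            return value.strip().strip('"')
    return None


if __name__ == "__main__":
    starts = parse_start_sectors(SAMPLE_PARTED)
    print(" ".join(mount_command("gsa-flat.vmdk", starts[2], "/mnt/gsa")))
    print(lilo_password(SAMPLE_LILO))
```

Mounting read-only is a deliberate choice here: you only need to read /etc/lilo.conf, and it avoids accidentally modifying the image before you boot it again.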
(Bear with me please as I create a second post. Exceeded attachment limit...)