Hehe, as mentioned before, Google provides software that makes surfing web statistics (& more) easy for any purpose.

GCHQ goes Google: Net spies turn to MapReduce
Source: http://www.theregister.co.uk/2010/11/08/gchq_google/

Britain's digital spies have turned to Google for help making sense of the floods of data now inundating their powerful computing resources. GCHQ, the Cheltenham-based signals intelligence agency, is recruiting an expert on MapReduce, the patented number-crunching technique previously behind the dominant web search engine. The agency's new lead researcher on data mining will be responsible for "developing MapReduce analytics on parallel computing clusters", a job advertisement reveals.

MapReduce was developed by Google to index billions of web pages across its cluster of hundreds of thousands of commodity servers. It breaks up complicated tasks into smaller, easier computing problems that cheap hardware can solve quickly. Google patented the technique earlier this year, but it remains free for other organisations to adopt via Hadoop, an open source project. Originally described in a 2004 research paper, MapReduce has allowed Google's algorithms to index a rapidly expanding web while keeping costs down.

GCHQ faces a similar challenge as it gathers more and more raw data from internet communications, including email, social networks and VoIP. "Successful data-driven organisations must be able to process, interpret and rapidly respond to indicators derived from unprecedented volumes of data from disparate information sources," its recruitment advertisement says.

The Register understands that GCHQ now has a cluster of more than 250,000 commodity servers under its Cheltenham "doughnut" building. In recent years it has developed this Google-style infrastructure instead of the very expensive, bespoke supercomputers it used to analyse microwave intercepts during the Cold War.
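The split-then-combine idea the article describes can be sketched in a few lines of Python. This is a single-process toy simulation of the MapReduce pattern (word count), not a real Hadoop job — the function and variable names are mine, and the "shuffle" step is just an in-memory dictionary:

```python
# Minimal single-machine sketch of the MapReduce pattern (word count).
# On a real cluster, mappers and reducers run on many commodity servers
# and the framework handles the shuffle; here a dict stands in for it.
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple


def map_phase(doc: str) -> Iterator[Tuple[str, int]]:
    # Like a Hadoop mapper: emit a (word, 1) pair for every word.
    for word in doc.lower().split():
        yield word, 1


def reduce_phase(word: str, counts: List[int]) -> Tuple[str, int]:
    # Like a Hadoop reducer: combine all partial counts for one key.
    return word, sum(counts)


def map_reduce(docs: List[str]) -> Dict[str, int]:
    shuffled: Dict[str, List[int]] = defaultdict(list)
    for doc in docs:                      # "map" over every input split
        for word, count in map_phase(doc):
            shuffled[word].append(count)  # the "shuffle" groups by key
    return dict(reduce_phase(w, c) for w, c in shuffled.items())


print(map_reduce(["the web the spies", "the cluster"]))
# → {'the': 3, 'web': 1, 'spies': 1, 'cluster': 1}
```

The point of the pattern is that each small map or reduce task is cheap and independent, so it can be scattered across thousands of unreliable commodity boxes — exactly the infrastructure the article says GCHQ is building.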
While the spies are still planning their MapReduce research, Google has already moved on to BigTable, its distributed database.

No paranoia, guys! Just to clarify: you should know that the spider's net is growing, so think about when to use it, and in which cases your robots.txt, noarchive tag & .htaccess should be fixed and when not. (As I remember, Matt Cutts says nocache & noarchive always look suspicious to Google.)

Time for some reverse engineering at the ground level. I don't know how the Freedom of Information Act works in GB, but there are some questions that should be answered!
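For reference, here is a minimal sketch of the three crawler controls mentioned above. The paths, user-agent strings and rules are hypothetical placeholders, not recommendations — adjust them to your own site:

```text
# robots.txt (served at the site root) — ask crawlers to skip a path
User-agent: *
Disallow: /private/

<!-- per-page HTML meta tag — allow indexing but forbid a cached copy -->
<meta name="robots" content="noarchive">

# .htaccess (Apache, mod_rewrite enabled) — refuse a specific bot outright
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ExampleBot [NC]
RewriteRule .* - [F]
```

Note the difference in enforcement: robots.txt and the meta tag are polite requests that well-behaved crawlers honour, while the .htaccess rule is enforced by the server itself with a 403 response.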