With the new Panda update, webmasters are getting nervous about what will happen to their websites because of Google's stringent stance on demoting low quality pages, especially duplicate content. Google's recent change is more of a static page quality change than a fundamental algorithm change. One might ask: with billions of pages in its index, how can a search engine detect duplicate content? This is a very difficult problem, and I'm sure Google has a team of engineers working on it. Comparing each page to every other page in the index would take eons, even with an inverted index! So how does Google do it?

We can cast the problem of finding duplicate content as a nearest neighbor search problem. A nearest neighbor search algorithm (a staple of machine learning) takes an array of values and finds its closest neighbors among a large collection of arrays. These arrays typically contain feature values. A feature is a signal that has some (ideally large) positive or negative correlation with the output class, say the risk level of an insurance customer or the risk of cancer in a patient. Nearest neighbor search finds the few arrays that are closest to the array in question out of a large database of arrays. Once the nearest neighbors are found, the class of the array in question is determined from its neighbors (the highest, lowest, mean, or median of the neighbors could all be used, depending on the application).

For duplicate document detection, the array would contain a "1" if a word is present in the document and a "0" if it is not. If there are 100 thousand words in the dictionary (after removing stopwords and very high frequency words), each document becomes an array of 100 thousand binary values. For each document, the problem now is to find its close neighbors! This is also a difficult problem, both because there are billions of documents and because of the curse of dimensionality. There are two ways of handling it: one using trees and the other using hashing. Trees do well for arrays with a small number of values, but hashing is the preferred technique for arrays with a very large number of values.

A hash function converts a string into a number, and that number identifies a bucket. By converting a string to a bucket, searching for a string among a huge number of strings becomes a very fast operation: we just convert the string to a number and check whether that bucket is empty or full. But hash functions are not perfect, and sometimes several strings hash to the same bucket. This is called a collision.

Now, what if we could write a hash function that deliberately assigns strings to the same bucket when they are in some way similar? Then we could find similar strings very fast among billions of strings! This principle is applied to the binary arrays we talked about earlier: locality sensitive hashing, a kind of hashing designed so that similar arrays tend to land in the same bucket, is applied to the billions of arrays. Now, given a document, we just hash it and look for similar documents in the same bucket (collisions). The hash function in locality sensitive hashing is chosen so that the probability of a collision is much higher when the documents are similar.

The problem of billions of documents still remains, but I'm sure Google has a farm of thousands of machines and can manage it easily (that is, if it can sort a petabyte in 33 minutes). So hashing ensures that duplicates and near duplicates can be found in a matter of milliseconds per document!
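To make this concrete, here is a minimal sketch of one classic locality sensitive hashing family for exactly this kind of binary "which words are present" array: bit sampling for Hamming distance. Nothing above says which family Google actually uses, so the function names (`make_samplers`, `lsh_key`, `build_index`) and the parameter values below are illustrative assumptions, not a description of Google's system.

```python
import random
from collections import defaultdict

VOCAB_SIZE = 100_000   # dictionary size after dropping stopwords, as in the example above
BITS_PER_KEY = 1024    # positions of the binary array sampled per hash key (toy value)
NUM_TABLES = 10        # more tables -> better chance that near duplicates collide somewhere

def make_samplers(seed=42):
    """Choose, for each hash table, the random bit positions it will inspect."""
    rng = random.Random(seed)
    return [tuple(rng.randrange(VOCAB_SIZE) for _ in range(BITS_PER_KEY))
            for _ in range(NUM_TABLES)]

def lsh_key(word_ids, positions):
    """Hash a document by reading a handful of bits from its binary array.
    `word_ids` is the set of indices where the array holds a 1."""
    return tuple(1 if p in word_ids else 0 for p in positions)

def build_index(docs, samplers):
    """Drop every document into one bucket per hash table."""
    tables = [defaultdict(list) for _ in samplers]
    for doc_id, word_ids in docs.items():
        for table, positions in zip(tables, samplers):
            table[lsh_key(word_ids, positions)].append(doc_id)
    return tables

def near_duplicate_candidates(word_ids, tables, samplers):
    """Everything that collides with the query document in at least one table."""
    found = set()
    for table, positions in zip(tables, samplers):
        found.update(table.get(lsh_key(word_ids, positions), []))
    return found

if __name__ == "__main__":
    docs = {
        "A": set(range(500)),            # words 0..499 present
        "B": set(range(500)) - {3, 7},   # two words removed: a near duplicate of A
        "C": set(range(50_000, 50_500)), # a completely different vocabulary
    }
    samplers = make_samplers()
    tables = build_index(docs, samplers)
    # With these toy numbers this almost always prints {'A', 'B'} and leaves 'C' out.
    print(near_duplicate_candidates(docs["A"], tables, samplers))
```

Documents that differ in only a couple of words share a bucket in at least one table with very high probability, while unrelated documents almost never do, so each document only needs to be compared against the handful of candidates it collides with. In practice, text systems often prefer families such as MinHash (for Jaccard similarity on word sets) or SimHash, which cope better with very sparse arrays, but the principle is the same: design the hash so that similar inputs collide.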