How does Google detect duplicate content? [Answer Inside]

sagarpatil · Apr 4, 2012

With the new Panda update, webmasters are getting nervous about what will happen to their websites because of Google's stringent stance on demoting low-quality pages, especially duplicate content. Google's recent change is more of a static page quality change than a fundamental algorithm change. One might ask: with billions of pages in its index, how can a search engine detect duplicate content? This is a very difficult problem, and I'm sure Google has a team of engineers working on it. Comparing each page to every other page in the index would take eons, even with an inverted index! So how does Google do it?

We can cast the problem of finding duplicate content as a nearest neighbor search problem. A nearest neighbor search algorithm (a staple of machine learning) takes an array of values and finds the arrays closest to it in a large database of arrays. These arrays typically contain feature values. A feature is a signal that has some (ideally large) positive or negative correlation with the output class, say the risk level of an insurance customer or the risk of cancer in a patient. Once the nearest neighbors are found, the class of the array in question is determined from the neighbors (the highest, lowest, mean, or median of the neighbors could all be used, depending on the application).
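To make the idea concrete, here is a minimal brute-force sketch of nearest neighbor classification. The Hamming distance, the majority vote, and the toy `database` are all assumptions chosen for illustration; the post itself does not say which distance or voting rule is used.

```python
from collections import Counter

def hamming(a, b):
    """Number of positions where two equal-length arrays differ."""
    return sum(x != y for x, y in zip(a, b))

def nearest_neighbors(query, database, k=3):
    """Return the k (distance, label) pairs closest to the query array."""
    scored = sorted((hamming(query, vec), label) for vec, label in database)
    return scored[:k]

def classify(query, database, k=3):
    """Assign the query the most common label among its k nearest neighbors."""
    neighbors = nearest_neighbors(query, database, k)
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical feature arrays with made-up labels.
database = [
    ([1, 0, 1, 1], "high risk"),
    ([1, 1, 1, 1], "high risk"),
    ([0, 0, 0, 1], "low risk"),
    ([0, 0, 0, 0], "low risk"),
    ([0, 1, 0, 0], "low risk"),
]

print(classify([1, 0, 1, 0], database))  # prints "high risk"
```

Note that this compares the query against every stored array, which is exactly the cost that does not scale to billions of pages.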

For duplicate document detection, the array would contain a "1" if a word is present in the document and a "0" if it is not. If there are 100 thousand words in the dictionary (after removing stopwords and very high-frequency words), each document becomes an array of 100 thousand binary values. For each document, the problem is now to find its close neighbors! This is also a difficult problem, both because there are billions of documents and because of the curse of dimensionality. There are two ways of handling this, one using trees and the other using hashing. Trees do well for arrays with a small number of values, but hashing is the preferred technique for arrays with a very large number of values.
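A rough sketch of that representation, assuming a tiny made-up stopword list and a vocabulary built from the documents themselves. The Jaccard similarity at the end is one common way to compare such binary arrays; it is my choice for the example, not something the post specifies.

```python
import re

# Tiny illustrative stopword list (an assumption, not a real dictionary).
STOPWORDS = {"the", "a", "of", "and", "is", "on", "in"}

def tokenize(text):
    """Lowercase the text and return its set of non-stopword words."""
    return {w for w in re.findall(r"[a-z0-9']+", text.lower()) if w not in STOPWORDS}

def build_vocabulary(docs):
    """Sorted list of every distinct word seen across the documents."""
    return sorted(set().union(*(tokenize(d) for d in docs)))

def to_binary_vector(text, vocabulary):
    """1 if the vocabulary word occurs in the text, otherwise 0."""
    words = tokenize(text)
    return [1 if w in words else 0 for w in vocabulary]

def jaccard(u, v):
    """Shared 1s divided by positions where either array has a 1."""
    both = sum(1 for x, y in zip(u, v) if x and y)
    either = sum(1 for x, y in zip(u, v) if x or y)
    return both / either if either else 1.0

docs = ["the cat sat on the mat", "the cat sat on a mat", "dogs chase the ball"]
vocab = build_vocabulary(docs)
vectors = [to_binary_vector(d, vocab) for d in docs]

print(jaccard(vectors[0], vectors[1]))  # 1.0, same words once stopwords are removed
print(jaccard(vectors[0], vectors[2]))  # 0.0, no words in common
```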

A hash function converts a string into a number that represents a bucket. By converting a string to a bucket, searching for a string among a large number of strings becomes a very fast operation, since we just have to convert the string to a number and then check whether that bucket is empty or full. But hash functions are not perfect, and sometimes several different strings hash to the same bucket. This is called a collision. Now what if we could write a hash function that assigns strings to the same bucket if they are in some way similar? Then we could find similar strings very quickly among billions of strings!
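Here is a small sketch of ordinary bucket hashing, before any similarity trick is added. `NUM_BUCKETS`, `bucket_of`, and `BucketIndex` are hypothetical names for this example, and SHA-256 is just a convenient stable hash from the standard library.

```python
import hashlib

NUM_BUCKETS = 16  # deliberately small so collisions are possible

def bucket_of(s):
    """Map a string to a bucket number in the range 0..NUM_BUCKETS-1."""
    digest = hashlib.sha256(s.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_BUCKETS

class BucketIndex:
    """Stores strings in buckets so a lookup only scans one bucket."""

    def __init__(self):
        self.buckets = [[] for _ in range(NUM_BUCKETS)]

    def add(self, s):
        self.buckets[bucket_of(s)].append(s)

    def contains(self, s):
        # Only the strings that collided into this bucket are compared.
        return s in self.buckets[bucket_of(s)]

index = BucketIndex()
index.add("how does google detect duplicate content")
print(index.contains("how does google detect duplicate content"))  # True
print(index.contains("how does google detect duplicate contents"))  # False
```

An ordinary hash like this finds only exact matches: a one-character change gives an unrelated digest, so a near duplicate will usually end up in a different bucket.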

This principle is applied to the binary arrays we talked about earlier: locality sensitive hashing (a family of hash functions designed so that similar inputs tend to land in the same bucket) is applied to the billions of arrays. Now, given a document, we just hash it and look for similar documents in the same bucket (collisions). The hash function in locality sensitive hashing is such that the probability of a collision is much higher if the documents are similar. The problem of billions of documents still remains, but I'm sure Google has a farm of thousands of machines and can manage it easily (that is, if it can sort a petabyte in 33 minutes). So hashing makes it possible to find duplicates and near duplicates in a matter of milliseconds per document!
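As an illustration only, here is a minimal sketch of one well-known locality sensitive hashing scheme: MinHash signatures split into bands. This is an assumption picked for the example; the post does not say which scheme Google uses, and `NUM_HASHES`, `BANDS`, and `LSHIndex` are hypothetical names and values.

```python
import hashlib
from collections import defaultdict

NUM_HASHES = 32              # length of each MinHash signature
BANDS = 8                    # signature is split into this many bands
ROWS = NUM_HASHES // BANDS   # hash values per band

def _h(seed, word):
    """One member of a family of hash functions, selected by seed."""
    data = f"{seed}:{word}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

def minhash_signature(words):
    """For each seed, keep the smallest hash over the (non-empty) word set."""
    return [min(_h(seed, w) for w in words) for seed in range(NUM_HASHES)]

def lsh_buckets(signature):
    """Yield one bucket key per band; similar sets tend to share some keys."""
    for band in range(BANDS):
        rows = tuple(signature[band * ROWS:(band + 1) * ROWS])
        yield (band, rows)

class LSHIndex:
    def __init__(self):
        self.table = defaultdict(set)

    def add(self, doc_id, words):
        for key in lsh_buckets(minhash_signature(words)):
            self.table[key].add(doc_id)

    def candidates(self, words):
        """Ids of documents sharing at least one bucket with the query."""
        found = set()
        for key in lsh_buckets(minhash_signature(words)):
            found |= self.table.get(key, set())
        return found

index = LSHIndex()
index.add("page-a", {"google", "detects", "duplicate", "content", "hashing"})
index.add("page-b", {"lyrics", "websites", "rank", "high"})

print(index.candidates({"google", "detects", "duplicate", "content", "hashing"}))
```

Two sets agree on a given MinHash value with probability equal to their Jaccard similarity, so near duplicates are likely (not guaranteed) to match on every row of at least one band, while unrelated documents almost never do. Candidates found this way would still be checked with an exact comparison.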

nimmivh · Apr 4, 2012

Didn't understand anything.

Can you say it in simple language? :/

healzer · Apr 4, 2012

Good stuff Thanks!
+rep

SEO20 · Apr 4, 2012

Are you sure about this?
How would you explain that eCommerce websites selling items with duplicate content are not getting slapped? Special rules?

I'm sure they use some sort of hashing too, but I am also pretty sure that it's not as simple as stated here. Just saying.

bigwhite · Apr 4, 2012

Lyrics websites as well. How would Lyric websites ever be ranking so high if they all provide duplicate content?

TZ2011 · Apr 5, 2012

It's good to put the source of the article

Code:

http://www.prodigalwebmaster.com/2011/10/how-does-google-detect-duplicate-content/

dewaz · Apr 5, 2012

bigwhite said:
Lyrics websites as well. How would Lyric websites ever be ranking so high if they all provide duplicate content?

I would like to know this too.

What OP told us is one of the basic concepts behind Google's duplicate content checking. I believe this 'duplicate content checker' is part of the huge Google search system, with some human intervention.

imperial444 · Apr 5, 2012

Come on guys, they just do a manual check for duplicate content. that's it

Union · Apr 5, 2012

Mate, I know about these changes... But your title promises an ANSWER, so where is the answer on how to cheat Panda? Nobody cares about their algorithm, because not many understand what it is about... they need to know how to cheat and save their $$$

Union · Apr 5, 2012

imperial444 said:
Come on guys, they just do a manual check for duplicate content. that's it

Ohhh, here is the answer )))

And of course in this case, Google hired up to 10 people to check only my auto blogs ))

Come on, tell us this was a joke

livapetr · Apr 5, 2012

Yea, have read that on other website.

This is not all. They also employ n-gram theory to detect..... ready? ..... SPUN CONTENT!!! Ta-da.

Won't quote it here as it's as complicated as OP's post. The only thing we need to know is that the algo is very rough and imprecise, and you need to have nearly 80% of a site's content spun flat to get on the radar. That's where you'll be taken care of by a human.

Also, some guy here wrote that he knows such a reviewer, and that this person has a whole LOTTA pages to review, barely has time to review in detail, and therefore checks roughly as well.

PS. Oh yeah, I forgot. If you think that your {realize|actually realize} or {and|&} or {a lot of|a whole lot of} makes your spintax to some extent unique, you're wrong.

Words like "actually", "this", "that", "of" etc don't count.
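A small sketch of what that claim would look like in code, assuming word n-grams (shingles) are compared after filler words are dropped. The `FILLER_WORDS` set only contains the four words named above, and nothing here is a confirmed description of what Google does.

```python
import re

# Only the filler words named in the post; a real list would be longer.
FILLER_WORDS = {"actually", "this", "that", "of"}

def shingles(text, n=2):
    """Set of word n-grams left after dropping filler words."""
    words = [w for w in re.findall(r"[a-z0-9']+", text.lower())
             if w not in FILLER_WORDS]
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap(a, b, n=2):
    """Jaccard overlap between the shingle sets of two texts."""
    sa, sb = shingles(a, n), shingles(b, n)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0

original = "You realize that this works"
spun = "You actually realize that this works"

print(overlap(original, spun))                # 1.0, the spin changed nothing
print(overlap(original, "Dogs chase balls"))  # 0.0, genuinely different text
```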

Dan Da Man · Apr 5, 2012

Why do people just copy other people's content? So freakin annoying!!

You should get neg repped for trying to act like it's your own

walandio · Apr 5, 2012

wow what an info!

KraftyKyle · Apr 5, 2012

I remember the first time I copied an article...

Code:

http://www.prodigalwebmaster.com/2011/10/how-does-google-detect-duplicate-content/

nbseo · Apr 5, 2012

imperial444 said:
Come on guys, they just do a manual check for duplicate content. that's it

If so, then can you imagine how many people they would have to hire to do this work?

hypertoxic · Apr 5, 2012

It went straight over my head in a blink

kaif0346 · Apr 5, 2012

Old school'd Mathematica problem..

ghedman · Apr 5, 2012

I still can't find the answer, but thanks for the information!

sagarpatil · Apr 5, 2012

TZ2011 said:
It's good to put the source of the article

Code:

http://www.prodigalwebmaster.com/2011/10/how-does-google-detect-duplicate-content/

Nobody likes a smart ass.

sagarpatil · Apr 5, 2012

Dan Da Man said:
Why do people just copy other people's content? So freakin annoying!!

You should get neg repped for trying to act like it's your own

I just wanted to share the article; it didn't occur to me that I should have linked to the source.

You could have also said this politely. There is no need to be so rude or to neg rep.
