
How does Google detect duplicate content? [Answer Inside]

Discussion in 'White Hat SEO' started by sagarpatil, Apr 4, 2012.

  1. sagarpatil

    sagarpatil Regular Member

    Joined:
    Mar 2, 2009
    Messages:
    458
    Likes Received:
    278
    With the new Panda update, webmasters are getting nervous about what will happen to their websites because of Google's stringent stance on demoting low-quality pages, especially duplicate content. Google's recent change is more of a static page quality change than a fundamental algorithm change. One might ask: with billions of pages in its index, how can a search engine detect duplicate content? This is a very difficult problem, and I'm sure Google has a team of engineers working on it. Comparing each page to every other page in the index would take eons even with an inverted index! So how does Google do it?

    We can cast the problem of finding duplicate content as a problem of nearest neighbor search. A nearest neighbor search algorithm (originally from machine learning) takes an array of values and finds its closest neighbors among a large number of arrays. These arrays typically contain feature values. A feature is a signal that has some amount of (ideally large) positive or negative correlation with the output class, say the risk level of an insurance customer or the risk of cancer in a patient. Nearest neighbor search finds the few arrays in a large database that are closest to the array in question. Once the nearest neighbors are found, the class of the array in question is determined from the neighbors (the highest, lowest, mean, or median of the neighbors could all be used, depending on the application).
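    To make that concrete, here is a minimal brute-force nearest neighbor sketch in Python. It is my own toy illustration (made-up customer feature arrays, plain Euclidean distance), not anything Google has published; it just shows what "find the closest arrays to a query array" means.

    Code:
    # Minimal brute-force k-nearest-neighbor sketch (illustrative only).
    # Each "array" is a list of numeric feature values.

    def euclidean(a, b):
        # Straight-line distance between two feature arrays.
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    def nearest_neighbors(query, database, k=3):
        # Compare the query against every stored array and keep the k closest.
        return sorted(database, key=lambda item: euclidean(query, item))[:k]

    # Toy example: [age, past claims, currently insured] for insurance customers.
    customers = [[25, 0, 1], [60, 3, 0], [27, 1, 1], [58, 2, 0]]
    print(nearest_neighbors([26, 0, 1], customers, k=2))
    # -> [[25, 0, 1], [27, 1, 1]]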

    For duplicate document detection, the array would contain a "1" if a word is present in the document and a "0" if it is not. If there are 100 thousand words in the dictionary (after removing stopwords and very-high-frequency words), each document becomes an array of 100 thousand binary values. For each document, the problem now is to find its close neighbors! This is also a difficult problem, both because there are billions of documents and because of the curse of dimensionality. There are two ways of handling this: one using trees and the other using hashing. Trees do well for arrays with a small number of dimensions, but hashing is the preferred technique for arrays with a very large number of dimensions.
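    A rough sketch of that representation, using a made-up six-word dictionary instead of 100 thousand words (again my own illustration, not Google's pipeline):

    Code:
    # Turn a document into a binary presence array over a fixed dictionary.
    # Illustrative only: a real dictionary would hold on the order of 100k words.

    dictionary = ["panda", "update", "duplicate", "content", "hashing", "cats"]
    stopwords = {"the", "a", "of", "and", "is"}

    def to_binary_array(text):
        # 1 if the dictionary word appears in the document, 0 otherwise.
        words = {w.lower().strip(".,!?") for w in text.split()} - stopwords
        return [1 if w in words else 0 for w in dictionary]

    doc = "The Panda update targets duplicate content."
    print(to_binary_array(doc))   # [1, 1, 1, 1, 0, 0]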

    A hash function converts a string into a number, a number that represents a bucket. By converting a string to a bucket, searching for a string among a large number of strings becomes a very fast operation: we just convert the string to a number and check whether that bucket is empty or full. But hash functions are not perfect, and sometimes several strings hash to the same bucket. This is called a collision. Now, what if we could write a hash function that assigns strings to the same bucket if they are in some way similar? Then we could find similar strings very fast among billions of strings!
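    Here is the bucket idea in a few lines of Python, using nothing more than the language's built-in hash function (an illustration of ordinary hashing, not of anything Google-specific):

    Code:
    # Hashing strings into a fixed number of buckets (illustrative only).

    NUM_BUCKETS = 8
    buckets = {}   # bucket number -> list of strings stored there

    def bucket_of(s):
        # Python's built-in hash, folded into a small range of buckets.
        return hash(s) % NUM_BUCKETS

    def insert(s):
        # Different strings landing in the same bucket is a collision.
        buckets.setdefault(bucket_of(s), []).append(s)

    def lookup(s):
        # One hash, then we only inspect the strings in that single bucket.
        return s in buckets.get(bucket_of(s), [])

    for title in ["panda update", "duplicate content", "inverted index"]:
        insert(title)
    print(lookup("duplicate content"))   # True
    print(lookup("penguin update"))      # False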

    This principle is applied to the binary arrays we talked about earlier, and locality-sensitive hashing (the same idea applied to arrays instead of strings) is applied to the billions of arrays. Now, given a document, we just hash it and look for similar documents that land in the same bucket (collisions). The hash function in locality-sensitive hashing is designed so that the probability of a collision is much higher if the documents are similar. The problem of billions of documents still remains, but I'm sure Google has a farm of thousands of machines and can manage it easily (that is, if it can sort a petabyte in 33 minutes :) ). So hashing ensures that duplicates and near-duplicates can be found in a matter of milliseconds per document!
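    To show what "similar arrays collide" can look like, here is a toy version of one well-known locality-sensitive hashing scheme: random-hyperplane signatures (SimHash-style). The dictionary size, documents, and signature length are all made up, and whatever Google actually runs is certainly far more elaborate; the point is only that a near-duplicate tends to get the same (or almost the same) signature as the original, while an unrelated document does not.

    Code:
    # Toy locality-sensitive hashing via random hyperplanes (SimHash-style).
    # Similar binary arrays get matching signatures with high probability.
    import random

    DIM = 12          # dictionary size in this toy example
    NUM_BITS = 4      # signature length; more bits -> fewer false collisions

    # Each "hyperplane" is a random +/-1 vector; it contributes one signature bit.
    planes = [[random.choice([-1, 1]) for _ in range(DIM)] for _ in range(NUM_BITS)]

    def signature(vec):
        # Bit i is 1 if the document vector lies on the positive side of plane i.
        return tuple(1 if sum(p * v for p, v in zip(plane, vec)) > 0 else 0
                     for plane in planes)

    doc_a = [1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1]   # original document
    doc_b = [1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1]   # near-duplicate (one word added)
    doc_c = [0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0]   # unrelated document

    print(signature(doc_a))
    print(signature(doc_b))   # near-duplicate: usually differs in few or no bits
    print(signature(doc_c))   # unrelated: agrees only about as often as chance

    In practice an LSH system keeps several short signatures (hash tables) per document, so a near-duplicate collides with the original in at least one of them.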
     
    • Thanks Thanks x 2
  2. nimmivh

    nimmivh Regular Member

    Joined:
    Jul 21, 2011
    Messages:
    313
    Likes Received:
    61
    Occupation:
    Merchant and Affiliate
    Location:
    Apple Store
    Didn't understand anything :(
    Can you say it in simple language? :/
     
  3. healzer

    healzer Jr. Executive VIP Jr. VIP Premium Member

    Joined:
    Jun 26, 2011
    Messages:
    2,366
    Likes Received:
    1,968
    Gender:
    Male
    Occupation:
    Marketing automation tools
    Location:
    Somewhere in Europe
    Home Page:
    Good stuff, thanks!
    +rep
     
  4. SEO20

    SEO20 Elite Member

    Joined:
    Mar 25, 2009
    Messages:
    2,017
    Likes Received:
    2,259
    Are you sure about this?
    How would you explain eCommerce websites selling duplicate content items not getting slapped? Special rules?

    I'm sure they use some sort of hashing too - but I'm also pretty sure it's not as simple as stated here - just saying.
     
  5. bigwhite

    bigwhite Regular Member

    Joined:
    Sep 27, 2011
    Messages:
    473
    Likes Received:
    54
    Lyrics websites as well. How would lyrics websites ever rank so high if they all provide duplicate content?
     
  6. TZ2011

    TZ2011 Senior Member

    Joined:
    Jun 26, 2011
    Messages:
    832
    Likes Received:
    864
    Occupation:
    Cleaning servers
    It's good to put the source of the article:
    Code:
    http://www.prodigalwebmaster.com/2011/10/how-does-google-detect-duplicate-content/
     
    • Thanks Thanks x 6
  7. dewaz

    dewaz Regular Member

    Joined:
    Nov 27, 2011
    Messages:
    399
    Likes Received:
    47
    Home Page:
    I would like to know this too.

    What the OP told us is one of the basic concepts behind Google's duplicate content checking. I believe this 'duplicate content checker' is part of a huge system inside Google's search machinery, with some human intervention.
     
  8. imperial444

    imperial444 Elite Member

    Joined:
    Jan 13, 2011
    Messages:
    1,771
    Likes Received:
    414
    Occupation:
    Full-time IM hero
    Come on guys, they just do a manual check for duplicate content. That's it.
     
  9. Union

    Union Power Member

    Joined:
    Sep 24, 2011
    Messages:
    531
    Likes Received:
    210
    Location:
    USA
    Mate, I know about these changes... But in your title you tell us about an ANSWER, so where is the answer on how to cheat Panda? Nobody cares about their algorithm, because not many understand what it is about... they need to know how to cheat and save their $$$ :)
     
  10. Union

    Union Power Member

    Joined:
    Sep 24, 2011
    Messages:
    531
    Likes Received:
    210
    Location:
    USA
    Ohhh, here is the answer :))))

    And of course in this case, Google hired up to 10 people just to check my auto blogs :)))

    Come on, tell us this was a joke :)
     
  11. livapetr

    livapetr Junior Member

    Joined:
    Oct 17, 2009
    Messages:
    196
    Likes Received:
    29
    Occupation:
    Seeking for bucks on the internet.
    Location:
    Oil-rich country
    Yeah, I have read that on another website.

    This is not all. They also employ n-gram theory to detect..... ready? ..... SPUN CONTENT!!! Ta-da.

    Won't quote it here as it's as complicated as the OP's post. The only thing we need to know is that the algo is very rough and imprecise, and you need to have nearly 80% of a site's content spun flat to get on the radar. That's when you'll be taken care of by a human.

    Also, some guy here wrote that he knows such a reviewer, and that this person has a whole LOTTA pages to review, barely has time to review them in detail, and therefore checks roughly as well.

    PS. Oh, yeah, forgot. If you think that your {realize|actually realize} or {and|&} or {a lot of|a whole lot of} makes your spintax unique to some extent, you're wrong :) Words like "actually", "this", "that", "of", etc. don't count.
     
    • Thanks Thanks x 1
    Last edited: Apr 5, 2012
  12. Dan Da Man

    Dan Da Man Elite Member Premium Member

    Joined:
    May 31, 2011
    Messages:
    1,850
    Likes Received:
    937
    Occupation:
    Duh
    Location:
    San Diego
    Home Page:
    Why do people just copy other people's content? So freakin annoying!!

    You should get neg repped for trying to act like it's your own.
     
    • Thanks Thanks x 1
  13. walandio

    walandio Senior Member

    Joined:
    Jun 27, 2008
    Messages:
    1,198
    Likes Received:
    684
    Location:
    Pilipinas
    Wow, what info!

     
  14. KraftyKyle

    KraftyKyle Jr. Executive VIP Jr. VIP Premium Member

    Joined:
    Aug 13, 2008
    Messages:
    1,942
    Likes Received:
    4,610
    Gender:
    Male
    Location:
    Unknown
    I remember the first time I copied an article...

    Code:
    http://www.prodigalwebmaster.com/2011/10/how-does-google-detect-duplicate-content/
     
    • Thanks Thanks x 1
  15. nbseo

    nbseo Junior Member

    Joined:
    Nov 18, 2010
    Messages:
    127
    Likes Received:
    11
    Occupation:
    SEO... my Food of life.. ;)
    If that's so... then have you imagined how many people they'd have to hire to do this work????
     
  16. hypertoxic

    hypertoxic Registered Member

    Joined:
    Dec 28, 2011
    Messages:
    73
    Likes Received:
    25
    Occupation:
    Self-Employed
    Location:
    Orion System

    It went straight over my head in a blink :)
     
  17. kaif0346

    kaif0346 Power Member

    Joined:
    Jul 13, 2011
    Messages:
    734
    Likes Received:
    93
    Occupation:
    free lancer, SEO, VA
    Location:
    battle field
    Home Page:
    Old-school mathematical problem.
     
  18. ghedman

    ghedman Registered Member

    Joined:
    Jun 14, 2010
    Messages:
    50
    Likes Received:
    2
    I still can't find the answer :p, but thanks for the information!
     
  19. sagarpatil

    sagarpatil Regular Member

    Joined:
    Mar 2, 2009
    Messages:
    458
    Likes Received:
    278
    Nobody likes a smart ass.
     
  20. sagarpatil

    sagarpatil Regular Member

    Joined:
    Mar 2, 2009
    Messages:
    458
    Likes Received:
    278
    I just wanted to share the article; it didn't come to my mind that I should have given a link to the source article.

    You could have said this politely. There was no need to be so rude and to neg-rep me.