
[ANSWER] So What Is Google's Sorting Method And Why Does It Take So Long

Discussion in 'Black Hat SEO' started by Scritty, Oct 5, 2013.

  1. Scritty

    Scritty Elite Member Premium Member

    Joined:
    May 1, 2010
    Messages:
    2,807
    Likes Received:
    4,496
    Occupation:
    Affiliate Marketer
    Location:
    UK
    Home Page:
    And why are many of the interim states such a frikking mess... (with a link to an animation which shows why it looks so bad while it's in progress)

    Here goes my explanation. Nothing below is "fact" and I'm not claiming it to be. It's a summary based on years of watching the process, listening to what Google say, and reading what brainier people who "know this stuff" have worked out.

    -----------------------------------------------------------------------

    I've been asked a couple of times on my forum, via PMs and emails, why I think Google's sort process is iterative.
    What IS an iterative sort? And why don't Google just "set the order" in one go?

    Here's what an iterative sort is, my best shot at explaining why something like this is the ONLY effective way of doing it... and also a link to an animation which shows how and why some of the interim states these methods cause are, for some data nodes, temporarily such a mess and so seemingly out of place.

    Penguin is links based, but there are many criteria within links: authority, relevance, age, anchor text, platform, context, follow attribution.
    So it is not just a sort - it is a multi-criteria sort, or a weighted criteria sort.
    In other words, Google has to assess each URL against each possible keyword on several or many different criteria.
    Now it may give each URL a single weighted score and do a one-pass sort.
    Or it may sort on the most heavily influential criterion first, then refine a URL's position within a given scope with further iterations of the sort based on increasingly less important criteria.

    Either way, the sort process itself is long-winded and NOT an instant fix.
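
    To make that concrete, here's a rough sketch in Python of the difference between the two approaches - a single pass on one combined weighted score versus several passes that refine the order one criterion at a time. The criteria names, weights and URLs are entirely made up for the illustration; this is NOT Google's code or their real signals.

    Code:
    # Hypothetical sketch only - criteria, weights and URLs are invented.
    # Each URL gets a score per link criterion; Google's real signals are unknown.

    CRITERIA_WEIGHTS = {"authority": 0.4, "relevance": 0.3, "anchor_text": 0.2, "age": 0.1}

    urls = {
        "example.com/a": {"authority": 0.9, "relevance": 0.4, "anchor_text": 0.7, "age": 0.8},
        "example.com/b": {"authority": 0.5, "relevance": 0.9, "anchor_text": 0.6, "age": 0.3},
        "example.com/c": {"authority": 0.9, "relevance": 0.7, "anchor_text": 0.2, "age": 0.9},
    }

    def weighted_score(signals):
        """Collapse the per-criterion scores into one weighted number."""
        return sum(signals[c] * w for c, w in CRITERIA_WEIGHTS.items())

    # One-pass version: rank once on the combined weighted score.
    one_pass = sorted(urls, key=lambda u: weighted_score(urls[u]), reverse=True)

    # Iterative version: multiple passes, one criterion per pass, least important
    # first. Python's sort is stable, so each later (more important) pass keeps
    # the earlier ordering as a tie-breaker - the ranking is refined pass by pass.
    iterative = list(urls)
    for criterion in ("age", "anchor_text", "relevance", "authority"):
        iterative.sort(key=lambda u: urls[u][criterion], reverse=True)

    print("one pass :", one_pass)
    print("iterative:", iterative)

    Either way you slice it, a multi-criteria sort means more work per URL and more passes over the data than a simple single-key sort.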

    The likely process used is something along the lines of a partition exchange sort (quicksort). I say this because large chunks are moved out and back in. We know (for instance) that Google treats the top 10 results as different (one "heap" or partition) and also the rest of the top 20. This lends itself to doing a partition-based sort.
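
    For anyone who hasn't looked at a partition exchange sort since college, here's the bare-bones textbook version (purely illustrative - nothing to do with whatever Google actually runs). The idea is simply: pick a pivot, split the data into partitions either side of it, and sort each partition separately.

    Code:
    # Textbook partition exchange sort (quicksort) - illustrative only.
    def quicksort(items):
        if len(items) <= 1:
            return items
        pivot = items[len(items) // 2]
        smaller = [x for x in items if x < pivot]
        equal = [x for x in items if x == pivot]
        larger = [x for x in items if x > pivot]
        # Each partition is sorted independently - which is what makes the
        # method easy to split into chunks or spread across machines.
        return quicksort(larger) + equal + quicksort(smaller)

    # Example: "scores" sorted highest first, the way a results page wants them.
    print(quicksort([37, 91, 12, 91, 4, 66, 28]))   # [91, 91, 66, 37, 28, 12, 4]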

    Without access to a carbon-copy, static version of the internet to do a "total data set swap" with, Google HAS to do an iterative sort of some kind. Here are some of the reasons why:

    a) The internet is not static. New URLs are added at a rate of tens of thousands per minute, and others are changed. It's not possible to have a carbon copy of the internet to "practice" on and then replace the real one with your sorted version at the push of a button.
    b) The internet is dispersed, as is Google's view of it in its directory listings. It isn't all in one place, making the idea of a worldwide "single swap" at the least a logistical nightmare, but more likely just not possible.
    c) The internet is FUCKING HUGE. Google's directory listings are also FUCKING HUGE.
    d) Google are not allowed to "stop" the internet, and they are not allowed to download it onto their own servers for analysis and sorting (much of it is copyright - and of course the minute it's downloaded... it's obsolete).

    Neither the internet nor their index is fixed, can be fixed, or would work if it was fixed.
    "Fixed" and "static" are not states that the internet, or Google's view of it, ever are... or ever COULD BE.
    A push-button replacement of an entire data set that actually works, and would still be worth having 10 seconds later, is just not possible.

    So they sort - and they have to sort the live data "on the fly" - a lot of it - trillions of URLs.


    • Set the criteria
    • Weight it
    • Start the sorting process off

    Sit down - take a deep breath. Consider that there is no choice other than an iterative sort that is realistically possible given the constraints I've outlined above.
    Now I don't know what method they use or what the real criteria are but "partition exchange sort" seems most likely.

    http://en.wikipedia.org/wiki/Quicksort

    It won't be exactly this. It won't be exactly anything you can find on a WIKI page. It will be a proprietary Google method BASED on the sound logic of a standard sort process, and likely altered all the time itself as required.

    If you look at the animation on this WIKI page you can see just how messed up some of the interim stages are, and how outliers and "sore thumb" results stick out for many passes.
    Similar anomalies occur with most effective sorting methods (bubble, merge... etc.).

    The take away?

    Sit back.. Chillax. If anyone tries to sell you a Penguin 2.1 beater in the next 4-5 days they are talking outta their ASS. No-one outside Google's head office will know what P2.1 really is, and even many of those inside won't understand it. In fact I doubt any one person has the IQ capacity to know all the possible ramifications. The algo is too complex, the data set too big and the situation too fluid.

    Scritty
     
    • Thanks Thanks x 13
  2. indianbill007

    indianbill007 Jr. VIP Jr. VIP

    Joined:
    Jan 8, 2010
    Messages:
    4,813
    Likes Received:
    4,051
    Occupation:
    Making Money when the world is sleeping
    Location:
    Menlo Park - Next to Zuck
    Golden piece of info there Scritty, everything makes sense except the sorting algorithm, which I believe will be merge sort, considering Google uses its own BigTable implementation (its own take on big data, like modern-day Hadoop file storage).

    Merge sort is meant for sorting high-volume data in a parallel processing environment, which is exactly the way Google's index is stored internally.

    Also, merge sort is much more stable than quicksort in parallel environments.

    http://en.wikipedia.org/wiki/Sorting_algorithm#Comparison_of_algorithms

    Although I am sure it will be Google's custom implementation of merge sort, as they have to run it in distributed and parallel environments.
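
    Just to show the shape I mean, here's a plain textbook merge sort (an illustration only, not Google's implementation). The point is that the two halves are sorted completely independently, which is why the method maps so naturally onto a distributed setup.

    Code:
    # Plain merge sort - illustrative only. The two recursive calls are
    # independent, so in a distributed setup each half (or each shard) can be
    # sorted on a different machine and the results merged afterwards.
    def merge_sort(items):
        if len(items) <= 1:
            return items
        mid = len(items) // 2
        left = merge_sort(items[:mid])    # could run on machine A
        right = merge_sort(items[mid:])   # could run on machine B
        return merge(left, right)

    def merge(left, right):
        out, i, j = [], 0, 0
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                out.append(left[i]); i += 1
            else:
                out.append(right[j]); j += 1
        return out + left[i:] + right[j:]

    print(merge_sort([37, 91, 12, 4, 66, 28]))  # [4, 12, 28, 37, 66, 91]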
     
    • Thanks Thanks x 1
  3. hip_hop_x

    hip_hop_x Jr. VIP Jr. VIP Premium Member

    Joined:
    Aug 27, 2009
    Messages:
    299
    Likes Received:
    61
    Occupation:
    Developer
    Home Page:
    I think you are wrong in your statement that they are using quicksort. Like most databases, I think they keep and search their data in multiple binary trees, mostly because of how efficiently you can search a binary tree: complexity is O(log n) and the worst case is O(n), while quicksort's worst case is O(n * n).

    So explained in a more "natural" fashion for everyone to understand:
    O(n) from 1000 results = 1000 things to do
    O(log n) from 1000 results ≈ 10 things to do for a binary tree (2^10 ≈ 1000)

    Plus there's another advantage to a b-tree: it's much easier to cluster the data in there.

    For those who want to read more about binary search trees:
    http://en.wikipedia.org/wiki/Binary_search_tree
    And btw, quicksort is as easy to parallelize as merge sort.
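
    To show what I mean about lookup cost, here's a toy binary search tree (illustrative only - not how Google actually stores anything). Each lookup walks a single path from the root, which is where the O(log n) comes from on a reasonably balanced tree.

    Code:
    # Toy binary search tree - illustrative only. Each lookup follows one
    # root-to-leaf path, so a reasonably balanced tree of n items needs about
    # log2(n) comparisons (roughly 10 for 1000 items).
    class Node:
        def __init__(self, key, value):
            self.key, self.value = key, value
            self.left = self.right = None

    def insert(root, key, value):
        if root is None:
            return Node(key, value)
        if key < root.key:
            root.left = insert(root.left, key, value)
        elif key > root.key:
            root.right = insert(root.right, key, value)
        else:
            root.value = value
        return root

    def search(root, key):
        while root is not None:
            if key == root.key:
                return root.value
            root = root.left if key < root.key else root.right
        return None

    root = None
    for k, v in [(50, "a"), (30, "b"), (70, "c"), (60, "d")]:
        root = insert(root, k, v)
    print(search(root, 60))  # "d"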
     
    • Thanks Thanks x 1
  4. Scritty

    Scritty Elite Member Premium Member

    Joined:
    May 1, 2010
    Messages:
    2,807
    Likes Received:
    4,496
    Occupation:
    Affiliate Marketer
    Location:
    UK
    Home Page:
    Yes - this looks a likely candidate. It still has interim states that look a complete mess - and it still takes time.
    I'm keen for people to just "get" those two things.

    Interim states for some can look terrible - don't sweat it till it's over
    It takes time - at LEAST a few days (P1 took weeks to settle)

    Scritty
     
  5. hip_hop_x

    hip_hop_x Jr. VIP Jr. VIP Premium Member

    Joined:
    Aug 27, 2009
    Messages:
    299
    Likes Received:
    61
    Occupation:
    Developer
    Home Page:
    I think that they compare results and make some changes to variables, like in the old update when random social pages with high DA started owning the top spots :) without any major SEO work.
     
  6. alternatesword

    alternatesword Jr. VIP Jr. VIP

    Joined:
    Aug 25, 2012
    Messages:
    2,320
    Likes Received:
    484
    Location:
    scabbard
    Home Page:
    I hated sorting algorithms during my studies. Now I am searching my college notes for definitions :D
     
  7. Scritty

    Scritty Elite Member Premium Member

    Joined:
    May 1, 2010
    Messages:
    2,807
    Likes Received:
    4,496
    Occupation:
    Affiliate Marketer
    Location:
    UK
    Home Page:
    One criterion we have to take into account, whatever sort they use, is that - although the interim states can be pretty bad for some nodes - the process can never cause the index it's working on to be unavailable. The live data always has to be in a state where it can be presented to the end user (even if, for a short period of time, some of it is not in a good order).

    This does rule out some of the more powerful methods that I've had PMs about.
    I'm not sure Google's primary aim is to reduce either the number of swaps or the number of data set iterations. Their primary aim is to always be "live", which - with the sort going on in the background while you are presenting the data - is not as easy. And probably, despite our noticing the fallout of the sort, not to look as if they have completely lost the plot at any point. Ranking a XXX site for a riding school search term because it contains the word "whip" a lot (not a good example - XXX sites are already pigeonholed... but you get the idea).
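
    As a rough picture of what "always live while the sort runs" means, here's a toy example (made-up code, nothing like Google's real infrastructure) where a ranking is re-sorted one small pass at a time. Between passes the list is still being served to users - it's just temporarily in one of those messy interim orders.

    Code:
    # Toy illustration of sorting live data in place, one pass at a time.
    # Between passes the list is still served to users; it's just partially sorted.
    def one_bubble_pass(ranking):
        """Do a single pass of adjacent swaps; return True if anything moved."""
        moved = False
        for i in range(len(ranking) - 1):
            if ranking[i][1] < ranking[i + 1][1]:   # compare scores, highest first
                ranking[i], ranking[i + 1] = ranking[i + 1], ranking[i]
                moved = True
        return moved

    live_results = [("site-c", 12), ("site-a", 91), ("site-e", 4), ("site-b", 66)]

    pass_no = 0
    while one_bubble_pass(live_results):
        pass_no += 1
        # The index never goes offline: users querying right now get this
        # interim, partially sorted state - the "mess" you see mid-update.
        print(f"after pass {pass_no}: {live_results}")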

    I'm pretty certain people are being re-directed to other data centres at critical times (meaning search results in different data centres WILL vary)

    My takeaway isn't a discussion of the exact sorting method. It's that the sort is a process that occurs over time - and, like making an omelette, it often involves making a mess before it's over.

    Great input so far folks :)

    Scritty
     
  8. chumpss

    chumpss Regular Member

    Joined:
    Sep 6, 2010
    Messages:
    356
    Likes Received:
    84
    Thanks for the post!
     
  9. subster

    subster Elite Member

    Joined:
    Apr 5, 2008
    Messages:
    1,864
    Likes Received:
    1,448
    Location:
    Krauthausen
    Perfect post, with a good shot in the right direction I think.
     
  10. ibmethatswhoib

    ibmethatswhoib Jr. VIP Jr. VIP Premium Member

    Joined:
    Feb 17, 2011
    Messages:
    1,560
    Likes Received:
    1,156
    Occupation:
    Staying Informed
    Location:
    Bay Area, Ca
    Home Page:
    I'll come back to this when I wake up a bit more - not sure what all this O(n), O(log n), iterative sort stuff is about.
     
  11. ficfroc

    ficfroc Regular Member

    Joined:
    Feb 14, 2010
    Messages:
    475
    Likes Received:
    267
    Location:
    Sous Les Etoiles
    How about if all this sorting was already done in a sandbox, two or three weeks ago? Then the results show up in the sandbox.

    If they are satisfied with them, they just release it, tweet about it and let it propagate across all the datacenters.

    So now they are just propagating it through the whole set of datacenters, and the shuffling we are seeing is only a result of the data not having propagated totally?
     
    • Thanks Thanks x 1
  12. indianbill007

    indianbill007 Jr. VIP Jr. VIP

    Joined:
    Jan 8, 2010
    Messages:
    4,813
    Likes Received:
    4,051
    Occupation:
    Making Money when the world is sleeping
    Location:
    Menlo Park - Next to Zuck
    That is obviously done - call it a sandbox or dev environment where they have access to almost all the data that is in the LIVE index and are able to see the results of the tweaks they make to the sorting algo. Matt Cutts in many of his talks says "while debugging a query" - where do you think that debugging is done? Not on the live index, obviously.
     
    • Thanks Thanks x 1
  13. Scritty

    Scritty Elite Member Premium Member

    Joined:
    May 1, 2010
    Messages:
    2,807
    Likes Received:
    4,496
    Occupation:
    Affiliate Marketer
    Location:
    UK
    Home Page:
    There is latency as datacentres shuffle, but the idea isn't possible to do in a sandbox and then transfer (for the reasons I list in the OP).
    The data set would be hugely out of date. Google would have to cease indexing data, or "roll back" the sort, as tens of thousands of URLs are added AND INDEXED every MINUTE.

    Taking a snapshot, sorting it and replacing it just a day later would mean having to double index or roll back millions of URLs (and they don't do that).
    Also the sort is multi-dimensional.

    Each URL is rarely only represented once.
    It's represented for every term it might rank for... (so think of the whole thing as a 3D data cube and not a list).
    It's the TERMS that are the key fields, and every term will have a list of sites ranking for it. Some terms have MILLIONS of sites listed for them (as you'll see estimated at the top of the Google search results).
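
    If it helps, picture the structure something like this - a deliberately over-simplified, invented example, with the search TERM as the key and each term holding its own ranked list of URLs, so one URL shows up under many terms at different positions:

    Code:
    # Deliberately over-simplified picture of a term-keyed index - invented data.
    # Each search term holds its own ranked list of URLs; the same URL can sit
    # at a different position under every term it ranks for.
    index = {
        "riding school": ["ponyclub.example", "stables.example", "tack.example"],
        "riding boots":  ["tack.example", "boots.example", "ponyclub.example"],
        "horse feed":    ["feed.example", "stables.example"],
    }

    # Re-ranking one term touches only that term's list...
    index["riding boots"].sort()   # stand-in for "re-run the sort for this term"

    # ...but a single URL's fortunes are spread across every term it appears under.
    url = "ponyclub.example"
    positions = {term: ranked.index(url) + 1
                 for term, ranked in index.items() if url in ranked}
    print(positions)   # e.g. {'riding school': 1, 'riding boots': 2}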

    So yes - Google does practice on its own archived or dummy data to get a macro view of possible outcomes and make sure nothing really stupid happens... well, it tries. But the sort is live.
    In fact the sort never ends; P2.1 is just a speed bump.

    The fact of Google's iterative approach to sorting is not really up for debate. They openly say that this is how it's done, and the evidence of the shifts - in both scale and type - if interrogated, shows that this is how it's done.

    This post isn't trying to prove the iterative sort. It's a fact.

    It's an attempt to get those people panicking and worrying too much to sit back and see what happens, and to explain in layman's terms what's going on - the process will likely take a few days to settle. :)

    Scritty
     
    • Thanks Thanks x 1
  14. c0bbz

    c0bbz Registered Member

    Joined:
    Jul 13, 2013
    Messages:
    66
    Likes Received:
    26
    Location:
    UK
    Reading this post is just what I needed. One of my sites got hit down a few pages for each of my keywords, so I'll just sit tight and wait it out.
     
  15. UNCLEBUCK

    UNCLEBUCK BANNED BANNED

    Joined:
    Jan 28, 2013
    Messages:
    605
    Likes Received:
    254
    I understand all this.. BUT.. What is the logical reasoning behind some websites just staying static, with no movement at all during or after the update..?? And others bounce around?
     
  16. Adam718

    Adam718 Senior Member

    Joined:
    Jun 19, 2012
    Messages:
    1,015
    Likes Received:
    433
    Occupation:
    Freelance writer and SEO Specialist
    Location:
    NYC
    I'm fairly sure the reason for that is that Google is not a perfect entity. There are always going to be websites and other variables that slip through the cracks.

    Good post, Scritty.

     
  17. Scritty

    Scritty Elite Member Premium Member

    Joined:
    May 1, 2010
    Messages:
    2,807
    Likes Received:
    4,496
    Occupation:
    Affiliate Marketer
    Location:
    UK
    Home Page:
    They are not using websites as the key field.
    The search term is the key field

    They are only re-ranking a small percentage of search terms.

    Sites hang off a search term.. Search terms don't "flow from" a site

    Many URLs rank for dozens of search terms at once. It is possible that some of your URLs HAVE changed position - just not for any search term you'd be interested in, or at a level in the SERPs high enough to see referrer stats for it. Some weird phrase you used on a page that you used to rank 456 for, you now rank 212 for. You'd never get traffic - or care.

    The 1 or 2% that Google always seem to pick on are commercial terms.

    Scritty
     
    • Thanks Thanks x 1
  18. UNCLEBUCK

    UNCLEBUCK BANNED BANNED

    Joined:
    Jan 28, 2013
    Messages:
    605
    Likes Received:
    254
    But in my circumstances, my niches range from fishing to finance, from SEO to plumbing...
    and they have all been affected in some way.
     
  19. ficfroc

    ficfroc Regular Member

    Joined:
    Feb 14, 2010
    Messages:
    475
    Likes Received:
    267
    Location:
    Sous Les Etoiles
    The best answer I could find: it does not matter to Google as long as it is related to AdWords.
     
  20. RosuC

    RosuC Regular Member

    Joined:
    Oct 9, 2012
    Messages:
    263
    Likes Received:
    112
    Location:
    Spain
    Home Page:
    I appreciate your thread Scritty - I may add that Penguin 2.1 was implemented at 18:00 GMT on 03 Oct 2013, based on my observation of how Google started to act. I don't have "inside" information, but something was definitely going on with how Google started behaving - even YouTube servers were affected by the new re-indexing.
    As for the Penguin 2.1 algorithm, it's almost impossible to know its behaviour this quickly, but after some study you should see patterns in how it treats different situations.
    I'm sorry for all of you who got penalized, but it was bound to happen - I've said it in other threads as well - you should know that when you sow the storm you get the hurricane.

    All the best, and don't stop promoting your websites.