[GUIDE] How Google Works - Reverse Engineered & The Core Updates - PART 1

splishsplash (Elite Member · Executive VIP · Jr. VIP · Joined: Oct 9, 2013 · Messages: 3,117 · Reaction score: 12,596)
(Note: This is a VERY long guide with a lot of explanations and research to back up a lot of the claims made. It's not light reading, and many of you won't read it, but those who do take the time to digest it will gain some really unique insights that you won't find anywhere else on the web.)

In this guide I'm going to explain how Google actually works, take you on a journey through the evolution of Google and give you insight into the core updates.

The first thing I will say is that in 2023 Google are HEAVILY employing machine learning.

This means that today we can say:

"Google updates are black boxes and what happens after them is 90% luck and 10% skill"

However, with this guide and understanding how it works you'll be able to push it to 30% luck and 70% skill :)

The internet is full of agencies talking about EAT, EEAT and EEEAT. Oh wait, we're only on 2 E's just now aren't we? My bad.


**The reality is, no one has a clue why some pages go up and some down with updates.**

Even Google. Me included. I don't know. There is not a list of ranking factors that you can check off.

It's machine learning.

It's like our brain. When you look at a monkey, you know it's a monkey. You don't know why you know. Sure, you can list out some attributes about the monkey, but ultimately, you just "know".

This is SEO in 2023. Machine learning is looking at a ton of data points and making decisions. We can never know for sure why page A ranks and page B doesn't.

BUT. What we can do with a deeper understanding of Google's current algorithm (and it's not really an algorithm, it's more of a massive orchestration system, but we'll call it an algo for simplicity)...

What we can do is make significantly better decisions at every stage of our website's campaign to maximize the chances of getting a favorable result.

Read on to learn more..



It wasn't always like this


Let's take a walk through the history of search engines and how they ranked pages to understand what changed.


The 90's - Before Google


Originally, search engines used just on-page factors.


It was just keywords. Keyword repetition. Title, h1, h2, h3, bold, italic, first paragraph + repeat it more.


SEOs ranked pages with keyword stuffing. It was simple back then. There just weren't many opportunities for monetization :) It was easy to rank, hard to monetize. The polar opposite of 2023.


Google is Founded - 1998


What made Google into the giant they became was that they came along with this "PageRank" algorithm and started to rank web pages according to their PR, which was calculated from links.


This is sort of why people still think you rank pages with links. But even back then you still didn't rank pages with links technically. You just sent PR into a page, and it flowed to connected pages.


Results were much better in Google, which is why it became the de facto standard search engine.


They still decided what pages to rank for what keyword based on the simple on-page factors.


Imagine you had 25 pages, all competing for the same keyword "buy toasters". Instead of ranking the ones that seemed more relevant (which, in the case of the original search engines, meant they had more "buy toasters" words in them), Google ranked those 25 pages based on their PageRank, calculated from their inbound links.
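To make that concrete, here's a toy version of the original PageRank iteration in Python. The link graph, damping factor and iteration count are made up for illustration (the original paper used a damping factor of 0.85); it just shows how rank flows through links:

```python
# Minimal PageRank power iteration over an invented 3-page link graph.
links = {
    "page_a": ["page_b", "page_c"],  # page_a links out to b and c
    "page_b": ["page_c"],
    "page_c": ["page_a"],
}

def pagerank(links, damping=0.85, iterations=50):
    pages = list(links)
    n = len(pages)
    pr = {p: 1 / n for p in pages}  # start with uniform rank
    for _ in range(iterations):
        new_pr = {}
        for p in pages:
            # Rank flowing into p from every page that links to it,
            # split across each linking page's outbound links.
            inbound = sum(pr[q] / len(links[q]) for q in pages if p in links[q])
            new_pr[p] = (1 - damping) / n + damping * inbound
        pr = new_pr
    return pr

print(pagerank(links))  # page_c collects the most rank in this toy graph
```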


Link Spam is Born


As you can imagine, this started the industry of link spamming, and the war of SEOs vs Google began.



Florida Update - Google's first strike - 2003


Google's first ever major update and attempt to hinder spammers.


It was rolled out in November 2003 before Christmas, and there were many civilian casualties.


Many poor small innocent retailers were wiped out..


Even back then Google couldn't get this right.


Officially this was an update against keyword spam, hidden text and other seemingly easy-to-spot, blatant manipulation attempts.


However, unofficially, SEOs knew this was a link spam update.


In 2005 at a Pubcon event in New Orleans Google engineers actually admitted they were using statistical link analysis to detect spam sites.


This is EXACTLY why I keep telling people to look at the statistically most natural anchors, pages that are linked to, link patterns, articles that contain links etc. Very few people listen.


The difference today is, the statistically natural data points are discovered by machine learning.


Here's a paper from 2004 - https://www.microsoft.com/en-us/research/wp-content/uploads/2004/06/webdb2004.pdf
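And just to show the shape of the idea, here's a minimal sketch of that kind of statistical analysis. To be clear, this is not Google's actual method; the "natural" baseline anchor mix and the site's anchors below are invented. The point is that a site whose anchor-text distribution drifts too far from the natural profile sticks out statistically:

```python
import math
from collections import Counter

# Invented baseline: the anchor-text mix you'd observe across "natural" sites.
baseline = {"brand": 0.55, "naked_url": 0.25, "generic": 0.15, "exact_match": 0.05}

def anchor_distribution(anchors):
    counts = Counter(anchors)
    total = sum(counts.values())
    return {k: counts.get(k, 0) / total for k in baseline}

def kl_divergence(p, q, eps=1e-9):
    # How far the site's anchor distribution p drifts from the baseline q.
    return sum(p[k] * math.log((p[k] + eps) / (q[k] + eps)) for k in p if p[k] > 0)

# A site that hammered exact-match anchors:
site_anchors = ["exact_match"] * 60 + ["brand"] * 30 + ["naked_url"] * 10
score = kl_divergence(anchor_distribution(site_anchors), baseline)
print(f"divergence from natural profile: {score:.3f}")  # higher = more suspicious
```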



Now, what makes anyone think Google suddenly stopped using this approach? It's the most sane approach to attempt to discover spam. It's just not easy to do and you always get false positives and false negatives, which is why good sites drop and bad sites don't in many cases.

Google actually promised not to release any core updates before Christmas again because of the aftermath of this. These days, they LOVE their Nov/Dec updates just to stir shit up before Christmas. December is the single worst month for core updates and has been since 2020.

They actually kept their promise until they did a Panda refresh in Nov 2011.


Jagger & BigDaddy - 2005

More link updates came in 2005..

Between 2003 and Feb 23rd 2011 there wasn't much that changed.

These were the glory days of SEO. Ahh, to go back in time to 2005-2007. Before that, not so much, as there weren't as many offers and there were fewer people online. At least for easy money, that was the time.

Ranking back then was simply links, keyword density and EMDs. Yeah, they released some updates to combat link spam, but it was much easier to rank. You just needed to do a few simple things to stay under the radar and then you were free to spam with XRumer.

Panda Update - 2011

This was the beginning of things to come. Not a nice update and the beginning of Google getting aggressive and consistent with the frequency of updates.

Panda was a site quality/content update. It was the end of content farms. At least the content farms of the 00's.

Penguin Update - 2012

Things went from bad to worse in 2012 when we got the Penguin update. This IMO was the official end of the SEO glory days. The final nail in the coffin. Yeah, 2012-2013 was still a cakewalk compared with today, but compared to what it was before that it was much harder.

Penguin was a modifier to the core algorithm that aimed to penalize sites that were employing spammy link building tactics.

Today, Penguin doesn't exist btw. It's been rolled into the core algorithm.

This was the beginning of exact anchors being dangerous.

Back then people would talk about finding acceptable anchor text ratios and padding out with brand/naked.

Yeah, that worked then. It doesn't work now because we have machine learning.

As you can see as we go through update after update, Google employ more and more techniques to identify websites that are spamming.

In my opinion, it all started with the idea of "hey, let's try to identify statistically what quality websites do, and then base everything off that" -- This is a powerful idea, and it's likely at the core of every major update right through to the machine learning updates. It is the best way to do it, bar none.

The problem with statistics is, I can say, statistically a baby born in a 3rd world country is less likely to become a millionaire than a baby born in NYC. This is statistically true, but logically it doesn't mean "every person born in a 3rd world country will never become a millionaire". This is why sites that aren't doing anything wrong get hit by updates.

Even Panda. The job of Panda was to identify low quality/thin content. How do you do that? Especially in 2011, long before modern transformer machine learning models.

They likely chose a bunch of data points for assessing quality and statistically compared manually flagged high quality articles with manually flagged low quality, then rolled that out.
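Here's roughly what that approach looks like as code, using scikit-learn. Every feature and label below is invented; the point is just the shape of it: hand-rated pages go in, and a statistical decision boundary comes out that can then score unseen pages:

```python
from sklearn.linear_model import LogisticRegression

# Invented feature vectors per page:
# [word_count, unique_word_ratio, ad_density, reading_level]
X = [
    [1800, 0.62, 0.05, 11.0],  # manually flagged high quality
    [2400, 0.58, 0.03, 10.5],  # manually flagged high quality
    [300,  0.31, 0.40,  6.0],  # manually flagged thin/low quality
    [250,  0.28, 0.55,  5.5],  # manually flagged thin/low quality
]
y = [1, 1, 0, 0]  # 1 = high quality, 0 = low quality

model = LogisticRegression(max_iter=1000).fit(X, y)

# Score a new, unrated page the same way an update would at scale.
new_page = [[400, 0.35, 0.30, 7.0]]
print(model.predict_proba(new_page))  # [P(low quality), P(high quality)]
```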

Google is ultimately a data/statistics company. They just moved to using machine learning to analyse the data. They aren't really a search company. Their entire premise is working with data points and data to get the most people to keep coming back to the search engine, and get the most eyeballs on ads.


Hummingbird Update - 2013

Before Hummingbird you would need separate pages for every longtail query.

Even "best toasters", "toaster reviews" and "top toasters" you would have been better off creating separate pages for.

Google was simply matching keywords to pages and looking for all the keywords on the page. It was rudimentary and simple, and as the web grew, more regular people started searching, which meant there were way more natural language searches.

In the past, web users were the early adopters and had learned to search in a way that computers would understand, like:

toasters list buy

That's what you'd type if you wanted a list of toasters to buy. You would never, in the 90's, have searched for "what's the best toaster for students". It just wouldn't have worked at all.

If you wanted to try and find that you'd do

toasters list buy +students

Hummingbird was basically a rewrite of Google's core algorithm and completely changed how it works.

From this point on, Google would try to match the searcher intent with pages. Yeah, it wasn't anywhere near as good as it is today, but this was the beginning of user intent matching.

If you searched for "how can i clean an old toaster" it would look for the keywords, like "clean" and "toaster", and try to match a page that was talking about cleaning a toaster instead of a page that had the words "How can I clean an old toaster" in the title.
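As a toy illustration of that shift, here's the difference between matching the raw string and pulling out the content terms. Real query understanding is obviously far more involved than a stopword list; this just shows the direction Hummingbird moved in:

```python
# Invented, minimal stopword list for the example.
STOPWORDS = {"how", "can", "i", "an", "a", "the", "what", "is", "my"}

def content_terms(query):
    return [w for w in query.lower().split() if w not in STOPWORDS]

print(content_terms("how can i clean an old toaster"))
# ['clean', 'old', 'toaster'] -- match pages about cleaning toasters,
# not pages that happen to contain the full sentence in the title.
```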

This was the beginning of the changes in how we do SEO that are more in line with modern SEO.

RankBrain Update - 2015

Google's first machine learning update.

This was created to help Google work out the best pages to rank for a given search query. This is where Google started moving beyond just keywords.

Hummingbird was more about extracting the important keywords from search queries, but it was still keyword based. RankBrain was a machine learning update where Google would use machine learning to work out what the searcher actually meant.

Up until RankBrain the entire Google algorithm could be boiled down to "How many times does the search keyword appear on the page and in the anchors".

THIS is why anchors used to be so important.

THIS is why Penguin was created.

And THIS is why anchors are no longer important in 95% of cases other than as a signal to detect how natural your link is. Because Google started using machine learning to understand search queries instead of matching keywords to pages.

Google has a graph database full of entities and facts.

As of 2023 it contains 8 billion entities and 800 billion facts.

When you search for

"how can I do content marketing as a career"

It isn't looking for keywords on a page.

It uses machine learning and entities to understand what page to give you.

Look.

Click on this - https://www.google.com/search?kgmid=/m/03qj473

Wow, look at what pops up. The SERPs for "content marketing". :)

That's the knowledge graph machine id for content marketing.

Now click - https://www.google.com/search?kgmid=/m/03ml62y

That's the kg machine id for career.

It understands these concepts/topics. It has information/connections on them.

It doesn't simply match pages with the title "How can I do content marketing as a career"

This page is #1 https://www.reliablesoft.net/get-into-content-marketing/


Not because it contains the word 'career' a bunch of times.

But because Google knows one of the entities for the search query is /m/03ml62y, and this entity, "career", is connected (that's what a graph database is: a bunch of connections/relations) to other entities and facts...

It uses the knowledge of those connected entities and facts to work out how relevant the page is.

Let's look at this page

It contains entities such as "industry", "company", "income", "job".

Check this - https://share.getcloudapp.com/Z4uG7Azj

You see how it's highlighted "job", "content marketing", "degree". It's highlighting entities it considers as related.
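To picture what "related" means here, this is a toy slice of a graph built with networkx. Every edge below is invented for illustration (the real graph has hundreds of billions of facts), but it shows why a page mentioning "job", "income" and "degree" looks relevant to the "career" entity:

```python
import networkx as nx

# Invented entity relations for illustration only.
kg = nx.Graph()
kg.add_edges_from([
    ("career", "job"), ("career", "income"), ("career", "industry"),
    ("content marketing", "marketing"), ("marketing", "industry"),
    ("job", "company"), ("job", "degree"),
])

# Entities a short hop from "career" are strongly related to it.
for entity in ["job", "income", "degree"]:
    print(entity, "-> hops from career:", nx.shortest_path_length(kg, "career", entity))
```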

Now, back when RankBrain was released this was kind of as far as it got.

It was the Knowledge Graph with RankBrain. The basic machine learning would try to figure out what the searcher wanted and use the Knowledge Graph entities to help with that.

Today it's much more advanced and the Knowledge Graph contains a huge amount of data. 800 billion+ facts and growing all the time.

This btw, is how Google is able to measure "topical authority".

I discussed this in my other guide to what topical authority really is - https://www.blackhatworld.com/seo/a...e-question-what-is-topical-authority.1450324/

This is EXACTLY why PAA sites dominated until Google released some updates to hamper them.

Because the Knowledge Graph is pure facts. It's pure Q&A. So when you have a site that's PAA (pure Q&A) it matches up very closely with the knowledge graph, so you end up with astronomical topical authority.

You don't rank PAAs because they're low comp. Try creating a site with 1,000 niches, 10 articles per niche, 10k articles total. It won't do particularly well.

It's also around this time that you absolutely could no longer have separate pages for 'best toaster', 'top toasters' and 'toaster reviews', because RankBrain's machine learning with the Knowledge Graph can figure out the entity is toaster and it understands reviews/best/top are very similar.

Can you prove what you're saying here?

There's actually not much information available on the knowledge graph. Google gives us access to entities via the API, but this is just a fraction of the complete thing.

Almost all articles are just re-hashed info about knowledge graphs, saying the same things.

Let's do a little research here.

If we look on https://en.wikipedia.org/wiki/Google_Knowledge_Graph it says

"The information covered by Google's Knowledge Graph grew quickly after launch, tripling its data size within seven months (covering 570 million entities and 18 billion facts"

This leads to a CNET article citation, which has a link to a blog post from Amit Singhal, who was the head of the search team.

https://blog.google/products/search/introducing-knowledge-graph-things-not/
He says

"It currently contains more than 500 million objects, as well as more than 3.5 billion facts about and relationships between these different objects. And it’s tuned based on what people search for, and what we find out on the web"

Of course today we know it's 800+ billion facts, but the key thing here isn't the number, it's the other things he says.

So we can have confirmation here that it doesn't just contain "entities", but also facts about the entities, and furthermore relationships between those facts.

He says

"The Knowledge Graph also helps us understand the relationships between things. Marie Curie is a person in the Knowledge Graph, and she had two children, one of whom also won a Nobel Prize, as well as a husband, Pierre Curie, who claimed a third Nobel Prize for the family. All of these are linked in our graph. It’s not just a catalog of objects; it also models all these inter-relationships. It’s the intelligence between these different entities that’s the key."

"intelligence between these different entities" - Confirmation that they're focused on the intelligence between the entities, which means the machine learning and the knowledge graph are HIGHLY interconnected.

We also learn

"For example, the information we show for Tom Cruise answers 37 percent of next queries that people ask about him"

This confirms Google is actively looking at the chain of searches to discover user intent.

This means that if people search for "beginners guide to seo", then search for "what is anchor text", Google will learn that a "beginners guide to seo" should contain a section answering "what is anchor text". The more of the user intent you can capture, the higher your rankings.

We could train our own machine learning algorithm to look at search queries, and the questions answered in the top 3 results and train the model on that to help it understand what a user might want for a new, unknown search query.
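Here's a minimal sketch of that idea using sentence-transformers: embed the queries we already have data for, match a new query to its nearest neighbour, and return the questions the top results for that neighbour answered. The "training data" below is invented, and a real system would train a proper model rather than do nearest-neighbour lookup:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Invented examples: query -> questions answered by its top 3 results.
known = {
    "beginners guide to seo": ["what is anchor text", "what is a backlink"],
    "how to bake bread": ["what flour should i use", "how long should i knead"],
}
queries = list(known)
query_embs = model.encode(queries, convert_to_tensor=True)

# A new, unknown query gets mapped to the closest known intent.
new_query = "seo tutorial for newbies"
new_emb = model.encode(new_query, convert_to_tensor=True)
best = int(util.cos_sim(new_emb, query_embs).argmax())
print(known[queries[best]])  # questions this page should probably answer
```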

What about questions? Where do they fit in? So far in that blog post by Amit Singhal he hasn't mentioned this specifically.

I found this patent here - https://patents.google.com/patent/US10108700B2/ - "Question answering to populate knowledge base"

Here's one of the images - https://patentimages.storage.googleapis.com/5b/e8/e3/480c86196c5660/US10108700-20181023-D00000.png

As you can see, it's trying to fill in missing information. In the image it has a missing bit of data, "architect", which it generates a question for, then passes that question to "query processing", which can only be one thing: searching the documents it's crawled and indexed from the web.

It then gets an answer and fills in the entry in the Knowledge Graph.

This shows us how Google is in fact using question generation and question answering to learn.

It learns from the web. It's highly likely based on what we've learned here that PAAs are in fact those questions it's generated from its own knowledge graph.
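Here's a toy sketch of the patent's loop: find an entity with a missing attribute, turn the gap into a question, and fill the slot with whatever the "query processing" step returns. The entities, templates and the stubbed-out answer below are invented; in reality that step is a search over the crawled index:

```python
# Invented mini knowledge base with one missing fact.
knowledge_base = {
    "Empire State Building": {"height": "443 m", "architect": None},
}

QUESTION_TEMPLATES = {"architect": "who was the architect of {entity}"}

def query_processing(question):
    # Stand-in for searching the crawled/indexed web for an answer.
    canned = {"who was the architect of Empire State Building": "Shreve, Lamb & Harmon"}
    return canned.get(question)

for entity, facts in knowledge_base.items():
    for attribute, value in list(facts.items()):
        if value is None:
            question = QUESTION_TEMPLATES[attribute].format(entity=entity)
            answer = query_processing(question)  # generated question -> answer
            if answer:
                facts[attribute] = answer  # populate the knowledge base

print(knowledge_base)
```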

There's also another connected patent called "Question answering using entity references in unstructured data". This is the EXACT patent detailing featured snippets.

I used AI to summarize and explain this patent. Here's what it came up with.

"This patent describes a method and system for enhancing search results by adding entity references, which are visually distinct and can be located above the top-ranked search results.

The process involves:

1. Receiving a natural language search query and obtaining search results based on this query. These search results are ranked based on their relevance to the query.

2. Identifying a type of entity associated with the query. This type of entity, which can be a person, a location, or a date, defines a broad categorization that includes multiple specific entities.

3. Selecting one or more top-ranked search results.

4. Selecting an entity reference from the content of the top-ranked search results. This entity reference is a specific text that refers to a specific entity, and its selection is determined by the type of entity identified from the query.

5. Displaying this entity reference alongside the top-ranked search results, but visually distinguished from them, for example, positioned above the top-ranked search results."


And wow. Look at that. Sounds EXACTLY like what a featured snippet is.

This also shows us how they are answering questions in unstructured data, which will be part of the "query processing" step from the previous patent, "Question answering to populate knowledge base".
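To make those five steps concrete, here's a toy version: the query implies an answer type (a date), so we pull a matching entity reference out of the top-ranked result's text and display it above the results. The pattern table and texts are invented, and a real system would use far richer entity recognition:

```python
import re

query = "when was google founded"
top_result_text = (
    "Google was founded on September 4, 1998, by Larry Page and Sergey Brin."
)

# Invented mapping from answer type to a crude extraction pattern.
ANSWER_TYPE_PATTERNS = {"date": re.compile(r"[A-Z][a-z]+ \d{1,2}, \d{4}")}

def answer_type(q):
    return "date" if q.startswith("when") else None  # step 2: entity type

atype = answer_type(query)
match = ANSWER_TYPE_PATTERNS[atype].search(top_result_text)  # step 4
if match:
    print("Featured snippet:", match.group())  # step 5: shown above the results
```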

BERT - First update - October 25th, 2019

This is when it all changed. The first transformer model.

Google said it impacts both search queries and featured snippets. What do we know about featured snippets? They generate questions and they create answers.

So they were using BERT to understand searcher intent, detect entities, generate questions and answer questions. They were then storing all this in the Knowledge Graph.
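If you want a feel for BERT-style extractive question answering, here's a sketch using a public Hugging Face checkpoint. This is obviously not whatever Google runs internally, just the same family of technique:

```python
from transformers import pipeline

# A public BERT-family model fine-tuned for extractive QA.
qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

result = qa(
    question="How do I clean an old toaster?",
    context="Unplug the toaster, shake out the crumbs, then wipe the slots "
            "with a dry brush before cleaning the exterior with a damp cloth.",
)
print(result["answer"], result["score"])  # extracted answer span + confidence
```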


BERT - Core Update - May 4, 2020

I can tell you 100% what this update was about.

Topical authority.

Given what we know about the Knowledge Graph and question answering, and having moved from their original, more elementary machine learning in RankBrain to BERT in 2019, they now made full use of BERT to expand the Knowledge Graph, add more intelligence to it, and use that to really put "topical authority" into full play. This is when topical authority became the fucking king.

Here's another patent. It's an earlier one called "Clustering of search results".

This was filed in..

26th November 2019.

Haha...

A few months before the May 2020 update that was about topical authority from BERT?

One month after they'd done the first update to roll BERT out on the live index?

Now they have BERT, a technology that does exactly this? A technology that did not exist until then.

https://towardsdatascience.com/bert-for-measuring-text-similarity-eec91c6bf9e1
Now look at the first paragraph of the AI summary.

Here's the AI summary of the patent

"This patent application describes a method performed by a search engine to cluster search results based on their semantic relationships and similarity, and present these results in a structured, organized manner. The following is a simplified interpretation of each claim:

1. The main claim describes a method in which a search engine processes a query, identifies entities associated with the search results (items), obtains embeddings for these items, creates first-level clusters based on the identified entities, refines these clusters by merging them according to their ontological relationships and embedding similarities, and presents the final clusters.

2. This claim states that smaller first-level clusters are merged first during the clustering process.

3. This claim further elaborates on the previous claim by specifying how smaller clusters are selected and merged.

4. This claim states that the most similar first-level clusters are merged first.

5. This claim further elaborates on the previous claim by specifying how the most similar clusters are selected and merged.

6. This claim suggests that hierarchical clustering is applied to the merged first-level clusters when creating final clusters.

7. This claim adds that the ontological relationships between entities are used to adjust the similarity metric during the clustering process.

8. This claim further details how the similarity metric is adjusted, stating that the metric is boosted for clusters with ontologically related entities to favor higher similarity.

9. This claim states that the entities associated with an item can be identified before a query is received.

10. This claim explains that the entities associated with an item can be identified in the text associated with the item.

11. This claim clarifies that at least one entity associated with an item in the results is identified from the item's associated text.

12. This claim specifies that the items are mobile applications and the process of associating them with an entity is based on an application annotation service.

13. This claim provides a detailed series of steps for creating final clusters, involving generating intermediate clusters and cluster candidates before selecting the final clusters.

14. This claim further elaborates on the previous claim by introducing a third stage of generating cluster candidates based on a boosted similarity metric when the clusters have ontologically related entities.

15. This claim repeats the main claim but adds that each first-level cluster represents an entity in a knowledge base and includes items mapped to that entity.

16. This claim provides more detail about how the final clusters are created in claim 15.

17. This claim details what happens during the merging process in claim 16.

18. This claim provides an elaborate process for creating final clusters by merging first-level clusters and generating intermediate clusters and cluster candidates.

19. This claim specifies that in the context of claim 15, the items are mobile applications and the process of mapping them to entities is based on an application annotation service.

20. This claim recasts the main method claim (Claim 1) as a set of instructions stored on a non-transitory computer-readable medium, which when executed by a processor, causes a search engine to perform the stated operations. This claim targets a software product that implements the methods of the patent."


Quote from within the patent

"The embedding similarity between two search items can be represented as the cosine similarity within the embedding space."


Quote from https://towardsdatascience.com/bert-for-measuring-text-similarity-eec91c6bf9e1

"Find sentences that have the smallest distance (Euclidean) or smallest angle (cosine similarity) between them"

So what do they have now?

A Knowledge Graph full of questions and answers.

BERT, with the ability to determine the similarity between sentences (questions and answers).

So all they have to do at this stage is use BERT to cluster.

We couldn't do this without BERT. Yes we had cosine similarity, but you can't create the dense vector representation of the sentences to actually do the cosine similarity until you have a sentence transformer model.

https://techblog.assignar.com/how-to-use-bert-sentence-embedding-for-clustering-text/
They either just created sentence-based summaries of the documents and clustered those, or they did something more advanced, using more of the questions and answers they have within the Knowledge Graph to cluster.
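For the simpler of those two options, here's roughly what clustering sentence embeddings hierarchically looks like, in the same shape as the patent's claims. The sentences and the distance threshold are illustrative:

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "what is anchor text",
    "how do backlinks work",
    "best toaster for students",
    "top rated toasters 2023",
]
embeddings = model.encode(sentences)

# Hierarchical (agglomerative) clustering on cosine distance between embeddings.
clusterer = AgglomerativeClustering(
    n_clusters=None, distance_threshold=0.6, metric="cosine", linkage="average"
)
labels = clusterer.fit_predict(embeddings)
for sentence, label in zip(sentences, labels):
    print(label, sentence)  # the SEO questions and the toaster queries separate
```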

Whether they did this in May 2020, or more recently, it's definitely more sophisticated now, but this was the first major topical authority update.


OK, enough for now. In part 2 we'll continue on with the Dec 2020 update and the BERT killer, MUM, which rolled out in the June 2021 update.


Feel free to comment, give feedback, ask questions and chat about this. Share anything you know or if you think I'm talking shit you can go ahead and tell me. :)
 
I'm not into SEO, nor a programmer, & I suck at math & AI stuff seems like Chinese to me, but I'm impressed by the time you invest into this & the insights you share with the community. Thanks for that man. :)
 
and where is the guide?

One thing is that Google understands entities, the knowledge graph and all of that stuff, but that doesn't mean or prove that's what they use for ranking.

Google's main parts are: Indexing, Relevance and Ranking.

A lot of that NLP thing is what gurus use to make money selling courses; for me it's just fluff.

And as a personal opinion, I don't like to ask AI to make a summary of a patent, you really miss so many golden nuggets.
 
and where is the guide?

One thing is that Google understands entities, the knowledge graph and all of that stuff, but that doesn't mean or prove that's what they use for ranking.

Google's main parts are: Indexing, Relevance and Ranking.

A lot of that NLP thing is what gurus use to make money selling courses; for me it's just fluff.

And as a personal opinion, I don't like to ask AI to make a summary of a patent, you really miss so many golden nuggets.

Thanks for correcting me on how NLP is just guru-fluff and Google's main parts are just indexing, relevance and ranking.

Perhaps you should write a guide?
 
In the past, I liked to associate G with a very knowledgeable bookseller, who read and memorized every book in his entire library, and was ready to give an eloquent answer to the questions he received.

Nowadays, new skills have been added to this bookseller, including a better knowledge of the human being (putting together the information gathered about John Doe through his internet activity: who he is, where he is from, what his interests are, etc), associating John Doe (through his attributes) with other people in the same group (location, industry, etc), in order to anticipate the best answer, and offering a kind of creative curiosity (answering questions through other questions, looking for similarities in the results provided).

You shared a truly informative stuff here, @splishsplash . Many thanks!
 
Nice research, breaking down each update... patiently waiting for Part 2.

The biggest question, as with any update:
What would you say is the current most important way to work the system (the 70% skill part) to get a site ranking?
 
Thanks for the knowledge as always!

Yes, I agree SEO is heavily impacted by luck and randomness these days, because more random stuff, such as ML, has been added to the ranking factors, and it's completely unpredictable.
 
Nice research, breaking down each update... patiently waiting for Part 2.

The biggest question, as with any update:
What would you say is the current most important way to work the system (the 70% skill part) to get a site ranking?

That comes more at the end of the guide, where there are conclusions. I wasn't intending to do 2 parts, but that took me almost 8 hours to produce so I had to split it up.
 
Thank you Splishsplash for a great, very well-written and informative post, as always. You always take the time to write fantastic posts that we can all benefit from. I personally am finding SEO a struggle these days, so any type of help or insight into where things are going is always welcomed... Thanks again, looking forward to part 2.
 
This is one of the most interesting things I've read in a very long time.
I've been in the internet marketing space since 2013, and even though I didn't get into SEO until 2021, I heard a lot of these terms before. I remember the days after updates when every SEO guy was in a state of constant panic attack.
Really looking forward to the next part.
 
Very nice! Glad I'm not the only one who spends a stupid amount of time thinking about these things.

The biggest questions have to be "What comes next?" and "How to future proof on and off page strategies?"

Machine learning is moving so fast, it's inherently unpredictable. (Great write up!)
 