Need Help Understanding Google's Crawling Behaviour and Managing Broken Links

nichexposure · Elite Member
Joined: Jul 4, 2013 · Messages: 2,401 · Reaction score: 965
Hello Everyone,

I am seeking advice about a confusing issue with Google's crawling behaviour and broken links on our website.

Recently, we've noticed a significant increase in the number of broken links (from 6k to 30k in one week) reported by Google Search Console. Strangely, these broken links are URLs with parameters that do not correspond to any existing pages on our website. For instance:

  1. https://www.example.com/stores/melbourne/offers/summer-shop?base_color[0]=Light+Green&base_color[1]=Neon+Green&base_color[3]=Yellow&size_legacy=XXL
  2. https://www.example.com/stores/sydney/offers/summer-shop?base_color[0]=Neon+Green&base_color[1]=Light+Pink&base_color[2]=Light+Green&base_color[3]=Multi&base_color[4]=Light+Purple&base_color[5]=Turquoise&base_color[6]=Yellow&size_legacy[0]=XS&size_legacy[1]=5XL&size_legacy[2]=11/12&size_legacy[3]=9/10&size_legacy[4]=3/4&size_legacy[5]=M
These URLs carry parameters such as "base_color" and "size_legacy", which are not supposed to be crawled according to our robots.txt directives:

Disallow: /*?size_legacy=
Disallow: /*?base_color=

Despite these directives, Google still crawls the parameterised URLs, and they end up reported as broken links.

My main concern is that the base URL without these parameters (e.g., https://www.example.com/stores/melbourne/offers/summer-shop) is not even a valid URL on our website. I am confused about why Google is trying to crawl these variations.

I am currently searching for a solution to tackle this issue and would greatly appreciate any insights or suggestions from the community. Has anyone faced a similar problem before? What measures can we take to stop Google from crawling these non-existent URLs and generating broken link reports?

Thank you in advance for your help.
 
It seems Google is crawling non-existent parameterised URLs on your site, leading to broken links. You need to check the website's URL setup and make sure the robots.txt file is set up correctly to stop this.
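For what it's worth, robots.txt wildcard matching is literal: a rule like Disallow: /*?base_color= only blocks URLs that contain the exact string "?base_color=", so it will not match the bracketed form "?base_color[0]=" or a parameter that follows "&" rather than "?". A minimal sketch of broader rules, assuming the goal is to block any URL whose query string mentions these parameters at all:

Disallow: /*?*base_color
Disallow: /*?*size_legacy

Bear in mind that disallowing crawling does not remove already-discovered URLs from Search Console reports, and blocked URLs can still be indexed if external pages link to them.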
 
Check your website's links to make sure they're correct, and look into how Google is discovering these URLs so you can stop it from trying to visit pages that don't actually exist and reporting them as broken links.
 
It seems Google is crawling non-existent parameterised URLs on your site, leading to broken links. You need to check the website's URL setup and make sure the robots.txt file is set up correctly to stop this.

The website is built in Magento. The URL settings are correct, and the sitemap is up to date; no such URLs appear in it. I believe the robots.txt file has been configured appropriately: whatever parameters I discovered, I added to the robots.txt file.


Check your website's links to make sure they're correct, and look into how Google is discovering these URLs so you can stop it from trying to visit pages that don't actually exist and reporting them as broken links.

What do you mean by "website links": the URL structure, or incoming links? The URL structure is fine; there are no issues. I see no incoming links for these parameters, either. And these are only two examples; Google crawls a variety of parameters that do not exist on the backend.

Do a full site audit with Screaming Frog, export the data, and sort it by status code; see the sketch below. You will learn a lot from that data.
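As a rough illustration of that sorting step, here is a minimal Python sketch, assuming a CSV export (for example Screaming Frog's internal-URLs export) with "Address" and "Status Code" columns; the file name and column names are assumptions, so adjust them to match your export:

import pandas as pd

# Load the crawl export; the file name and column names below are
# assumptions, so adjust them to match your actual export.
df = pd.read_csv("internal_all.csv")

# Count URLs per status code to spot error spikes at a glance.
print(df["Status Code"].value_counts().sort_index())

# List the broken URLs (4xx/5xx), sorted by status code.
broken = df[df["Status Code"] >= 400].sort_values("Status Code")
print(broken[["Address", "Status Code"]].to_string(index=False))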

I run audits numerous times every week with Screaming Frog, Netpeak Spider, Ahrefs, and Semrush. No crawler has ever turned up such a URL; they appear only in Google Search Console.
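One way to confirm whether Googlebot is actually fetching these URLs, rather than Search Console merely reporting links it discovered elsewhere, is to check the raw server access logs. A minimal sketch, assuming an Nginx-style access log at an assumed path:

import re

# The log path is an assumption; adjust it for your server setup.
LOG_PATH = "/var/log/nginx/access.log"

# Flag requests that mention the suspicious parameters.
params = re.compile(r"base_color|size_legacy")

with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        # Note: user-agent strings can be spoofed; verify genuine
        # Googlebot via reverse DNS if the results matter.
        if "Googlebot" in line and params.search(line):
            print(line.rstrip())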
 
Your robots.txt is infected with malicious code. You need to back up the whole site and then move it to a different server, and the problem should be gone for good.


Check that your robots.txt is spot-on; as noted above, wildcard rules match literally, so a pattern like /*?base_color= will miss the bracketed "base_color[0]=" form. Google can also discover URLs from external sources before checking whether they are blocked, so externally generated links can still lead to unintended crawl reports. The URL Parameters tool in Google Search Console used to let you tell Google how to handle specific parameters, but Google has since retired it. Ensure no internal links or sitemaps are inadvertently generating these URLs, and if external links are adding these parameters, keeping an eye out and requesting removal where possible is wise. A quick way to sanity-check your patterns against the reported URLs is sketched below.
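To make that sanity check concrete, here is a minimal Python sketch of Google-style robots.txt wildcard matching (a simplified re-implementation for testing purposes, not a real parser), run against one of the example URLs:

import re

def robots_match(pattern: str, url_path: str) -> bool:
    """Simplified Google-style robots.txt matching: '*' matches any
    character sequence and '$' anchors the end of the URL; rules
    otherwise match as prefixes from the start of the path."""
    body = pattern[:-1] if pattern.endswith("$") else pattern
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in body)
    if pattern.endswith("$"):
        regex += "$"
    return re.match(regex, url_path) is not None

path = ("/stores/melbourne/offers/summer-shop"
        "?base_color[0]=Light+Green&size_legacy=XXL")

# The original rule misses the bracketed form: '[0]' breaks the
# literal '?base_color=' match.
print(robots_match("/*?base_color=", path))  # False

# A broader wildcard rule catches it.
print(robots_match("/*?*base_color", path))  # True

If both of the original rules come back False for the URLs Search Console reports, that would explain why the Disallow directives never fire.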
 