indexing issues

A robots.txt file doesn't actually block crawlers. It's only a set of guidelines asking crawlers to skip certain pages, and crawlers can ignore those guidelines and crawl the pages anyway.
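To illustrate why robots.txt is only advisory, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `ExampleBot` user agent and the example.com URLs are made up for illustration. The point is that the check happens on the crawler's side: a crawler that never runs it can fetch the page regardless.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
"""


def polite_crawler_may_fetch(url: str, user_agent: str = "ExampleBot") -> bool:
    """Return True if a crawler that honors robots.txt would fetch this URL."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    # The server enforces nothing here; the crawler chooses to ask.
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("https://example.com/about", "https://example.com/search?q=shoes"):
        print(url, polite_crawler_may_fetch(url))
```

Note that this standard-library parser only does simple prefix matching, so it is not a model of how Googlebot interprets wildcard rules.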

The ideal practice is to allow crawling of the entire site that's accessible to users, and then use canonical tags, noindex tags, etc. to tell search engines what to show in the SERPs.
Using canonical tags or not won't fix it. And by the way, WP doesn't do this automatically; you need another plugin, which creates even more of these messy URLs...

Use a CMS that doesn't produce these messy URLs, and you're already ahead of 90% of all these free self-hosted website owners.
 
Try adding a parameter handling rule in GSC under the URL Parameters section to tell Google that those parameters don't change the page content (note that Google retired this tool in 2022, so it may no longer be available in your account). Alternatively, if your CMS allows conditional logic, you can apply noindex only when a query string is present in the URL, without affecting the clean URLs.
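The conditional noindex idea could look roughly like the sketch below. It is written in plain Python rather than any particular CMS's template language, and the function name `robots_meta_tag` is hypothetical; the logic is simply "emit the tag only when the requested URL has a query string".

```python
from urllib.parse import urlsplit


def robots_meta_tag(requested_url: str) -> str:
    """Return a robots meta tag for parameter URLs, or "" for clean URLs."""
    has_query = bool(urlsplit(requested_url).query)
    return '<meta name="robots" content="noindex, follow">' if has_query else ""


if __name__ == "__main__":
    print(repr(robots_meta_tag("https://example.com/shoes")))
    print(repr(robots_meta_tag("https://example.com/shoes?category=boots")))
```

The returned string would be placed in the page `<head>` by whatever templating hook the CMS offers.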
 
Which CMS do you suggest as an alternative to WordPress?
 
If your canonical tags already point to the main pages, first check whether those parameter URLs are being linked internally or included in your XML sitemap. Removing those links and excluding parameter URLs from the sitemap usually helps Google consolidate indexing over time, even if it doesn't happen immediately.
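For the sitemap part of that check, a small script can list every sitemap entry that carries a query string. This is a sketch that assumes a standard sitemaps.org `<urlset>` file; the sample XML and the example.com URLs are invented for illustration.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

# Namespace used by standard sitemaps.org sitemap files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Made-up sample sitemap, for illustration only.
SAMPLE_SITEMAP = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/shoes</loc></url>
  <url><loc>https://example.com/shoes?category=boots</loc></url>
</urlset>"""


def parameter_urls_in_sitemap(sitemap_xml: str) -> list[str]:
    """Return all <loc> URLs in the sitemap that contain a query string."""
    root = ET.fromstring(sitemap_xml)
    locs = (el.text.strip() for el in root.findall(".//sm:loc", SITEMAP_NS) if el.text)
    return [url for url in locs if urlsplit(url).query]


if __name__ == "__main__":
    for url in parameter_urls_in_sitemap(SAMPLE_SITEMAP):
        print("parameter URL in sitemap:", url)
```

Any URL this reports is a candidate for exclusion from the sitemap.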
 
Yep, you can remove them in GSC, but after a short time they're back again!
 
I got some interesting insights from my special AI prompt. It explains why these big media news sites, which have more specific parameter URLs, get crawled and indexed by Googlebot while our small sites often don't:

High Crawl Demand (The Power of Popularity)
Google's main goal is to keep its search index fresh for its users. Big news sites enjoy an enormous advantage here:
  • Constant Freshness: They publish breaking news every few minutes. Google must crawl them continuously to avoid showing outdated info.
  • Massive Link Equity: Millions of external links from all over the internet point to major news sites. In Google's algorithm, high popularity directly signals a high need for frequent crawling.
  • Even if a news site has messy parameter URLs, Google's "want" to crawl the site is so exceptionally high that the bot is willing to wade through a lot of technical mess to find the gold.

High Crawl Capacity (The Power of Infrastructure)
Crawl budget is heavily limited by how much traffic a website's server can handle without crashing.
  • Enterprise Servers: Major media outlets run on premium, lightning-fast content delivery networks (CDNs) and dedicated server architectures.
  • Parallel Processing: Because their servers can process thousands of requests per second without slowing down, Googlebot increases its crawl rate limit. It can crawl their messy parameter URLs and their actual articles simultaneously without hurting the user experience.
  • Smaller sites often use shared hosting. If Googlebot hits a smaller site too hard, the server slows down or errors out, forcing Google to immediately back off and lower the crawl budget.
Interesting, isn't it? So Google first crawls "them", its friends in the same lobby club (WEF), and then, much later, our small sites. And take note of that insight: cheap, slow web hosting plus a Cloudflare CDN (which isn't bad, but isn't a real first-class CDN) is out of the race here. (Slow, cheap web hosting will still work if you only want to rank locally.)

So if, despite all the warnings, you want to stick with your free WP and try to rank globally, first build a quality site structure with topical authority (look at HubSpot's structure as an example) and get a damn fast server. OK, dedicated servers cost a lot. If you don't have that money, check your competitors on Google page 1 with the free BuiltWith tool to see what slow web hosting or old-tech VPS they use, then get something much faster to expand your crawl budget. But build authority first! Then your indexing and crawling on free WP should improve again.

These big media news sites have their own kind of specific parameter URLs from their expensive CMS systems. For those of us with much smaller sites, it's better to use a CMS with a clean site structure that doesn't produce that mess, like Framer, Webstudio, Squarespace, Ghost...

I hope it helps!
 
The small-site disadvantage is huge, especially on a shared hosting package. My take: if you want to invest in SEO, even for a small site, get at least a VPS :).
 
My friend, as alternatives to WordPress: Webstudio, Framer, Ghost, Squarespace... What they all have in common is a bit more Google trust compared to free WP, because they don't have bad, spammy neighbours.
 
I would say (my intuition): get a fast, modern-tech VPS, e.g. from Hostkey or Liquid Web, with 4 cores, 8-16 GB RAM and plenty of bandwidth, not those old-tech, slow-clocked VPSes from IONOS, Contabo and the like. Pair it with a fast CDN, not Cloudflare. That stack will cost you around $38-55 a month. Yes, I know it's sad: it costs more money for a modern tech stack to get more crawl budget with your free self-hosted WP, and even that's no guarantee! Avoid the cheap traps that are bad for SEO, like CF, the cheapest web hosting and the GoDaddys of this world. A fast, modern VPS that can handle 100k visitors a month would also be my strategic decision.

And don't forget that with a VPS and self-hosted WP you must do all the maintenance, updates and backups yourself. It's better to run WP with only one plugin (Rank Math), add all other functionality via code in the mu-plugins folder through your cPanel or the free CloudPanel, and use an FSE theme. I know the workflow with their templates is messy and awkward, but it's the only way to design your pages without page-builder plugins, which create more of these messy URLs.
 
Think twice about buying cheap web hosting (unless you only want to rank locally). Imagine you have 2,000 visitors a day, which is already a heavy burden for cheap web hosting. Then Googlebot comes to crawl your site and finds tons of messy parameter URLs on a slow, cheap host. The bot stops crawling rather than waste crawl budget; it can't crawl many of your pages and posts quickly because your hosting is slow, and using Cloudflare doesn't make it any better.

Check your competitors on Google page 1 with the free BuiltWith tool so you can see their server environment and what type of CMS they use: free self-hosted WP, WordPress.com, or something else.

Friends, think carefully about having a clean, high-quality site structure and a fast server + CDN environment (I mean a real CDN here, not a proxy).
 
If Google is indexing parameter/filter URLs that your site allows it to crawl, you need to stop Google from treating them as distinct pages. A robots.txt disallow alone often doesn't remove already-known URLs, and Google may still index some of them based on other signals.
 
My website has started indexing multiple filtered pages, such as "?from" or "?category".

As I understand it, robots.txt only prevents Google from crawling pages, but it does not prevent them from being indexed.

For some reason, Google still decided to index these URLs, so using disallow in robots.txt no longer seems effective in this case.

Using a noindex, follow tag is also not an ideal solution because, in my CMS, I cannot apply it only to those specific filtered pages. If I add it globally, the main pages would become noindexed as well.

The canonical tags are already set correctly and point to the main page instead of being self-canonicalized.

How could I solve this issue?
Internal links may be the real culprit.
 
You've diagnosed it right, and that's the hard part. robots.txt blocks crawling, not indexing, and canonical is only a hint to Google, not a directive, so parameterized URLs still slip into the index. The fix you actually want is a targeted noindex on just those URLs, and since your CMS won't let you scope it, do it at the HTTP layer instead of in the meta tag.

Send an X-Robots-Tag: noindex header on any URL that contains a query string like ?from or ?category. You can match by URL pattern at the server (nginx/Apache rule) or at the edge with a Cloudflare Worker if you're on CF. That noindexes exactly the filtered pages and never touches your clean main URLs, which is the scoping problem your CMS can't solve. Make sure those URLs are not disallowed in robots.txt while you do this, otherwise Googlebot never fetches them and never sees the header.
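If the site happens to sit behind a Python WSGI stack rather than a plain nginx/Apache setup, the same header rule could be sketched as a small middleware. This is an illustration under assumptions, not a drop-in config: the name `noindex_filtered_urls` is hypothetical, and `FILTER_PARAMS` just lists the two parameters mentioned in the question.

```python
from urllib.parse import parse_qs

# Assumption: the filter parameters named in the question.
FILTER_PARAMS = {"from", "category"}


def noindex_filtered_urls(app):
    """Wrap a WSGI app so filtered URLs get an X-Robots-Tag: noindex header."""

    def middleware(environ, start_response):
        # keep_blank_values=True so a bare "?from" still counts as a filter.
        params = parse_qs(environ.get("QUERY_STRING", ""), keep_blank_values=True)
        is_filtered = any(name in params for name in FILTER_PARAMS)

        def patched_start_response(status, headers, exc_info=None):
            if is_filtered:
                headers = list(headers) + [("X-Robots-Tag", "noindex")]
            return start_response(status, headers, exc_info)

        return app(environ, patched_start_response)

    return middleware
```

Clean URLs pass through unchanged, so only requests carrying one of the listed parameters receive the header.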

One more thing: kill the internal links pointing at the filtered versions so you stop feeding them to the crawler in the first place. Header noindex plus no internal links, and they drop out over a few crawls.
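To find the internal links that still point at filtered versions, a rough audit script like the one below can help. It is a sketch using only the standard library; `FILTER_PARAMS`, the helper names and the example.com page are assumptions for illustration, and a real audit would run it over every crawled page.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qs, urljoin, urlsplit

# Assumption: the filter parameters named in the question.
FILTER_PARAMS = {"from", "category"}


class FilteredLinkFinder(HTMLParser):
    """Collect same-host <a href> targets that carry a filter parameter."""

    def __init__(self, page_url: str):
        super().__init__()
        self.page_url = page_url
        self.site_host = urlsplit(page_url).netloc
        self.filtered_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page they were found on.
        target = urlsplit(urljoin(self.page_url, href))
        params = parse_qs(target.query, keep_blank_values=True)
        is_internal = target.netloc == self.site_host
        if is_internal and any(name in params for name in FILTER_PARAMS):
            self.filtered_links.append(target.geturl())


def find_filtered_internal_links(page_url: str, html: str) -> list[str]:
    finder = FilteredLinkFinder(page_url)
    finder.feed(html)
    return finder.filtered_links
```

Each URL it returns is a link to replace with the clean version of the page.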
 