[Easy Trick] How to Bulk Check if URLs are Indexed in Google Sheets

MatthewGraham

BANNED
Joined
Oct 6, 2015
Messages
1,759
Reaction score
2,753
Website
www.youtube.com
You can scrape a lot of things with Google Sheets -- here's one of many things you can do with this. Script searches for "site:domain.com" in Google to check if there are indexed results for the input URL.

No need to run proxies since Google is scraping on your behalf and then returning the data to you.

This base code can be modified to scrape many other things as well. Working well for Reddit, as well as a number of other websites.

Step #1: Open the Script Editor in Google Sheets
DwFAvbP

Go to Tools --> Script Editor... to open the script editor.

Step #2: Copy-Paste Code into Script Editor
Make sure to save the file with Ctrl+S. The code to copy-paste is below.

Code:
function checkIfPageIsIndexed(url)
{
   url = "https://www.google.com/search?q=site:"+url;
   var options = {
     'muteHttpExceptions': true,
     'followRedirects': false
   };
   var response = UrlFetchApp.fetch(url, options);
   var html     = response.getContentText();

   if ( html.match(/Your search -.*- did not match any documents./) )
     return "URL is Not Indexed";
   return "URL is Indexed";
}

nKb7Ugz

The file should look like the above once the code is inserted.

Step #3: Run the Function in Google Sheets

ShxlBtM

The function that was created in step #2 (checkIfPageIsIndexed(url)) can be run the same as any other spreadsheet function.

Step #4: Use the Output
Bonus: Use conditional formatting to add red/green highlight to cells depending on whether or not the input URL is indexed.

PYY4yFh

Your output should look something like this.
 

Attachments

  • step-1-script-editor.png
    step-1-script-editor.png
    56.6 KB · Views: 4,673
  • step-2-insert-code.png
    step-2-insert-code.png
    47.4 KB · Views: 4,584
  • step-3-input.png
    step-3-input.png
    12.2 KB · Views: 4,429
  • step-4-output.png
    step-4-output.png
    37.1 KB · Views: 4,507
Last edited by a moderator:
You can scrape a lot of things with Google Sheets -- here's one of many things you can do with this. Script searches for "site:domain.com" in Google to check if there are indexed results for the input URL.

No need to run proxies since Google is scraping on your behalf and then returning the data to you.

This base code can be modified to scrape many other things as well. Working well for Reddit, as well as a number of other websites.

Step #1: Open the Script Editor in Google Sheets
View attachment 100448
Go to Tools --> Script Editor... to open the script editor.

Step #2: Copy-Paste Code into Script Editor
Make sure to save the file with Ctrl+S. The code to copy-paste is below.

Code:
function checkIfPageIsIndexed(url)
{
   url = "https://www.google.com/search?q=site:"+url;
   var options = {
     'muteHttpExceptions': true,
     'followRedirects': false
   };
   var response = UrlFetchApp.fetch(url, options);
   var html     = response.getContentText();

   if ( html.match(/Your search -.*- did not match any documents./) )
     return "URL is Not Indexed";
   return "URL is Indexed";
}

View attachment 100449
The file should look like the above once the code is inserted.

Step #3: Run the Function in Google Sheets

View attachment 100450
The function that was created in step #2 (checkIfPageIsIndexed(url)) can be run the same as any other spreadsheet function.


Step #4: Use the Output
Bonus: Use conditional formatting to add red/green highlight to cells depending on whether or not the input URL is indexed.

View attachment 100451
Your output should look something like this.


Cool shit bro, could you do me the favor and explain the code to me?
Like I understand what it does but I want to know every little part contributes to the whole, you know what I mean

Especially this part

var options = {
'muteHttpExceptions': true,
'followRedirects': false
};
var response = UrlFetchApp.fetch(url, options);
var html = response.getContentText();

if ( html.match(/Your search -.*- did not match any documents./) )


And one more thing, how would anyone go about starting to learn stuff like this, any books/articles you would personally recommend?
 
Damn I am wondering how much your brain can contain a lot of such weird but useful magics?!
 
Cool shit bro, could you do me the favor and explain the code to me?
Like I understand what it does but I want to know every little part contributes to the whole, you know what I mean

Especially this part

var options = {
'muteHttpExceptions': true,
'followRedirects': false
};
var response = UrlFetchApp.fetch(url, options);
var html = response.getContentText();

if ( html.match(/Your search -.*- did not match any documents./) )


And one more thing, how would anyone go about starting to learn stuff like this, any books/articles you would personally recommend?

Added a few comments explaining the details below. You might also check out the official Google Apps Script documention here:
https://developers.google.com/apps-script/reference/url-fetch/url-fetch-app

Also, Google Apps scripts use Javascript, so any JS documentation applies to Google Apps scripts as well. This script is a bastardization of a different script that I used to bulk check page codes / bulk check whether pages return 404. And that script is a bastardization of some other person's similar script, which was a bastardization of the official Google Apps documentation, which was a bastardization of some JS documentation, which was a bastardization of... probably some Stack Exchange threads, which were in and of themselves bastardizations of... something, which was a bastardization of...

To answer the question about recommended reading / ways to learn programming/scripting, the best way is to decide what you want to do and then Google search your issue until you find some code that you can hack together into something usable. Start by building scripts and then figure out how to actually make them work as you go.

Code:
function checkIfPageIsIndexed(url)
{
   url = "https://www.google.com/search?q=site:"+url;

   // The options parameter for the function UrlFetchApp.fetch() is an optional parameter.
   // This whole section could probably be removed now that I think about it. Left it in since it was needed for a different similar script to bulk check page code.
   var options = {
   // if this is set to true, the fetch will not throw an exception if the response code
   // indicates failure, and will instead return the HTTPResponse (default: false).
   // This could probably be omitted from this script. IIRC, I originally put this in for a
   // similar script to bulk check page code (404/200/403/other/etc), which would
   // throw an error if the page failed to load with a 200 code.
    'muteHttpExceptions': true,
   // if this is set to false, the fetch not automatically follow HTTP redirects;
   // it will return the original HTTP response. The default is true. E.g., if you load an
   // http page, but the page 301s to an https version, this determines whether to follow
   // the 301 or to return the original page's data.
    'followRedirects': false
   };

   // Makes a request to fetch a URL.
   var response = UrlFetchApp.fetch(url, options);

   // From the HTML that has been retreived, gets the content of the HTTP response encoded as a string.
   // https://developers.google.com/apps-script/reference/url-fetch/http-response#getContentText()
   var html     = response.getContentText();

   // Now that the HTML is encoded as a string, it can be searched using a regex
   // https://en.wikipedia.org/wiki/Regular_expression
   // html.match() searches for all sections of the the html string that match the pattern inserted
   // If a site:url.com search returns no results, Google will instead return an error that looks like the image below this code
   // It should be something like "Your search -" + [INSERT YOUR SEARCH HERE] + "- did not match any documents"
   // the "." indicates any character in a regex.
   // A regular expression followed by an asterisk (*) matches zero or more occurrences of the regular expression. If there is any choice, the first matching string in a line is used.
   // http://users.cs.cf.ac.uk/Dave.Marshall/Internet/NEWS/regexp.html
   // .* therefore means up to unlimited of any character
   // So this regex returns matches for any page that includse "Your search -" followed by up to unlimited arbitrary characters followed by "- did not match any documents."
   // Which will evaluate as being true for Google searches that did not yield any results.
   if ( html.match(/Your search -.*- did not match any documents./) )
    return "URL is Not Indexed";
   return "URL is Indexed";
}

google-domain-search-failure.png


Bonus: The Script to Bulk Check URL Status Code / Bulk Check 404 Errors that this Script was Bastardized From
Code:
// https://medium.com/@the.benhawy/how-to-use-google-spreadsheets-to-check-for-broken-links-1bb0b35c8525
function getStatusCode(url){
   var options = {
     'muteHttpExceptions': true,
     'followRedirects': false
   };
   var response = UrlFetchApp.fetch(url, options);
   return response.getResponseCode();
}
 
Last edited:
Added a few comments explaining the details below. You might also check out the official Google Apps Script documention here:
https://developers.google.com/apps-script/reference/url-fetch/url-fetch-app

Also, Google Apps scripts use Javascript, so any JS documentation applies to Google Apps scripts as well. This script is a bastardization of a different script that I used to bulk check page codes / bulk check whether pages return 404. And that script is a bastardization of some other person's similar script, which was a bastardization of the official Google Apps documentation, which was a bastardization of some JS documentation, which was a bastardization of... probably some Stack Exchange threads, which were in and of themselves bastardizations of... something, which was a bastardization of...

To answer the question about recommended reading / ways to learn programming/scripting, the best way is to decide what you want to do and then Google search your issue until you find some code that you can hack together into something usable. Start by building scripts and then figure out how to actually make them work as you go.

Code:
function checkIfPageIsIndexed(url)
{
   url = "https://www.google.com/search?q=site:"+url;

   // The options parameter for the function UrlFetchApp.fetch() is an optional parameter.
   // This whole section could probably be removed now that I think about it. Left it in since it was needed for a different similar script to bulk check page code.
   var options = {
   // if this is set to true, the fetch will not throw an exception if the response code
   // indicates failure, and will instead return the HTTPResponse (default: false).
   // This could probably be omitted from this script. IIRC, I originally put this in for a
   // similar script to bulk check page code (404/200/403/other/etc), which would
   // throw an error if the page failed to load with a 200 code.
    'muteHttpExceptions': true,
   // if this is set to false, the fetch not automatically follow HTTP redirects;
   // it will return the original HTTP response. The default is true. E.g., if you load an
   // http page, but the page 301s to an https version, this determines whether to follow
   // the 301 or to return the original page's data.
    'followRedirects': false
   };

   // Makes a request to fetch a URL.
   var response = UrlFetchApp.fetch(url, options);

   // From the HTML that has been retreived, gets the content of the HTTP response encoded as a string.
   // https://developers.google.com/apps-script/reference/url-fetch/http-response#getContentText()
   var html     = response.getContentText();

   // Now that the HTML is encoded as a string, it can be searched using a regex
   // https://en.wikipedia.org/wiki/Regular_expression
   // html.match() searches for all sections of the the html string that match the pattern inserted
   // If a site:url.com search returns no results, Google will instead return an error that looks like the image below this code
   // It should be something like "Your search -" + [INSERT YOUR SEARCH HERE] + "- did not match any documents"
   // the "." indicates any character in a regex.
   // A regular expression followed by an asterisk (*) matches zero or more occurrences of the regular expression. If there is any choice, the first matching string in a line is used.
   // http://users.cs.cf.ac.uk/Dave.Marshall/Internet/NEWS/regexp.html
   // .* therefore means up to unlimited of any character
   // So this regex returns matches for any page that includse "Your search -" followed by up to unlimited arbitrary characters followed by "- did not match any documents."
   // Which will evaluate as being true for Google searches that did not yield any results.
   if ( html.match(/Your search -.*- did not match any documents./) )
    return "URL is Not Indexed";
   return "URL is Indexed";
}

View attachment 100470

Bonus: The Script to Bulk Check URL Status Code / Bulk Check 404 Errors that this Script was Bastardized From
Code:
// https://medium.com/@the.benhawy/how-to-use-google-spreadsheets-to-check-for-broken-links-1bb0b35c8525
function getStatusCode(url){
   var options = {
     'muteHttpExceptions': true,
     'followRedirects': false
   };
   var response = UrlFetchApp.fetch(url, options);
   return response.getResponseCode();
}

wow man this is all so very cool, i had to read it like 10 times to understand anything

''Start by building scripts and then figure out how to actually make them work as you go.''

there's one script I would like to build and it could be quite useful, could you help me out with it?
I mean the moment I share the code I already have, you'll know what's up so I guess you'll get a method for free?
I'll pm you if you're fine with it.
 
there's one script I would like to build and it could be quite useful, could you help me out with it?
I mean the moment I share the code I already have, you'll know what's up so I guess you'll get a method for free?
I'll pm you if you're fine with it.

Sounds good to me. Feel free to send a PM with details!
 
Mind doing a little video how to run this script for the not so savvy guys in here.

Thanks
 
Mind doing a little video how to run this script for the not so savvy guys in here.

Thanks
You can also use these very simple solution using importxml, just paste this into B1, in a1 write your domain .

=IMPORTXML(CONCAT("https://www.google.com/search?q=site:",A1),"//div[@id='resultStats']")

A1 = yoursite.com
B1 , paste formula above
C1 will show you either blank cell or X results, to actually show you how m any URLs are being indexed

simple and no scripts needed
 
You can scrape a lot of things with Google Sheets -- here's one of many things you can do with this. Script searches for "site:domain.com" in Google to check if there are indexed results for the input URL.

No need to run proxies since Google is scraping on your behalf and then returning the data to you.

This base code can be modified to scrape many other things as well. Working well for Reddit, as well as a number of other websites.

Step #1: Open the Script Editor in Google Sheets
View attachment 100448
Go to Tools --> Script Editor... to open the script editor.

Step #2: Copy-Paste Code into Script Editor
Make sure to save the file with Ctrl+S. The code to copy-paste is below.

Code:
function checkIfPageIsIndexed(url)
{
   url = "https://www.google.com/search?q=site:"+url;
   var options = {
     'muteHttpExceptions': true,
     'followRedirects': false
   };
   var response = UrlFetchApp.fetch(url, options);
   var html     = response.getContentText();

   if ( html.match(/Your search -.*- did not match any documents./) )
     return "URL is Not Indexed";
   return "URL is Indexed";
}

View attachment 100449
The file should look like the above once the code is inserted.

Step #3: Run the Function in Google Sheets

View attachment 100450
The function that was created in step #2 (checkIfPageIsIndexed(url)) can be run the same as any other spreadsheet function.


Step #4: Use the Output
Bonus: Use conditional formatting to add red/green highlight to cells depending on whether or not the input URL is indexed.

View attachment 100451
Your output should look something like this.


Great one! I never bothered checking what is indexed, I just blasted links to it and assumed it is. But this is great and quick tool to do it.

@MatthewGraham do you have an idea how to make the script check how many results are returned?

Like:

jPSAOb6


And put the 261,000,000 into excel? Would save my life.
 
Great one! I never bothered checking what is indexed, I just blasted links to it and assumed it is. But this is great and quick tool to do it.

@MatthewGraham do you have an idea how to make the script check how many results are returned?

Like:

jPSAOb6


And put the 261,000,000 into excel? Would save my life.

I messed with it some and came up with the code below. It mostly works, but is pretty buggy.

If you can fix the bugs and make a few adjustments, it should work.

Code:
function checkIfPageIsIndexed(url)
{
   if (!url)
     return "N/A";

   url = "https://www.google.com/search?q=site:"+url;
   var options = {
     'muteHttpExceptions': true,
     'followRedirects': false
   };
   var response = UrlFetchApp.fetch(url, options);
   var html     = response.getContentText();

   if ( html.match(/Your search -.*- did not match any documents./) )
     return "URL is Not Indexed";

   // <div id="resultStats">1 result<nobr> (0.21 seconds)&nbsp;</nobr></div>
   // <div id="resultStats">About 112,000,000 results<nobr> (0.22 seconds)&nbsp;</nobr></div>
   var num_pages_indexed = false;
   if ( num_pages_indexed = html.match(/id="resultStats">.*result/) ) 
   {
     num_pages_indexed = num_pages_indexed[0] + "(s)";
     num_pages_indexed = num_pages_indexed.replace("id=\"resultStats\">","");
     // num_pages_indexed = num_pages_indexed.replace("<","");
   }
   else
     num_pages_indexed = "ERROR";
   return "URL is Indexed and returned "+num_pages_indexed;
}

Example Output
page-index-buggy-code.png
 
Back
Top
AdBlock Detected

We get it, advertisements are annoying!

Sure, ad-blocking software does a great job at blocking ads, but it also blocks useful features and essential functions on BlackHatWorld and other forums. These functions are unrelated to ads, such as internal links and images. For the best site experience please disable your AdBlocker.

I've Disabled AdBlock