
Scraping Google for GSA Site Lists

Discussion in 'Black Hat SEO Tools' started by kijix, Mar 30, 2016.

  1. kijix

    kijix Registered Member

    Joined:
    Sep 17, 2014
    Messages:
    92
    Likes Received:
    25
    Been trying to scrape using GSA footprints in Scrapebox to generate a site list, but am having a ton of issues with IP bans. Private proxies would be too expensive for the thread count I'd need, and with public proxies it's impossible to find enough Google-passed ones. The ones you do find end up not working an hour later.

    How have you been scraping for your lists? Do you use Gscraper and their proxy service, or Scrapebox and another proxy provider?

    I hear the Gscraper proxy service is good, but I just paid for Scrapebox and really don't want to pay another $68 + $66/mo for Gscraper and their proxy service on top of it.
     
  2. Devil Rider

    Devil Rider BANNED

    Joined:
    Jul 24, 2015
    Messages:
    554
    Likes Received:
    59
    Adjusting the maximum connections setting lets you tell Scrapebox how many threads you want to use for each search engine.
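
    In code terms, that "maximum connections" setting is just a per-engine concurrency cap. A minimal Python sketch of the idea; the engines, caps, and URL are illustrative, not Scrapebox internals:

    Code:
    import concurrent.futures
    import threading

    import requests

    # Made-up per-engine caps standing in for "maximum connections."
    MAX_CONNECTIONS = {"google": 2, "bing": 10}
    SEMAPHORES = {e: threading.Semaphore(n) for e, n in MAX_CONNECTIONS.items()}

    def fetch(engine, url):
        with SEMAPHORES[engine]:  # at most N requests in flight per engine
            return requests.get(url, timeout=15).status_code

    with concurrent.futures.ThreadPoolExecutor(max_workers=20) as pool:
        jobs = [pool.submit(fetch, "bing", f"https://www.bing.com/search?q=test+{i}")
                for i in range(5)]
        print([j.result() for j in jobs])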
     
  3. loopline

    loopline Jr. VIP

    Joined:
    Jan 25, 2009
    Messages:
    3,724
    Likes Received:
    1,993
    Gender:
    Male
    Home Page:
    I know, I just posted about this in a different thread, answering this same question of yours there.

    Anyway, just get 10 shared proxies for $10 and set a delay. In the past 24 hours since you have been asking about this you could have already scraped tens of thousands of results.
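
    To make the arithmetic concrete, here is a minimal Python sketch of the "few proxies plus a delay" idea; the proxy addresses, delay, and queries are placeholders, not anyone's actual setup:

    Code:
    import itertools
    import time

    import requests

    # Placeholder shared proxies; substitute your own user:pass@ip:port.
    PROXIES = [f"http://user:pass@203.0.113.{i}:8080" for i in range(1, 11)]
    DELAY = 20.0  # seconds before the same proxy hits Google again

    last_used = {p: 0.0 for p in PROXIES}

    def search(query, proxy):
        r = requests.get("https://www.google.com/search",
                         params={"q": query}, timeout=15,
                         proxies={"http": proxy, "https": proxy})
        return r.status_code

    for query, proxy in zip(["inurl:guestbook", "inurl:blog"],
                            itertools.cycle(PROXIES)):
        wait = DELAY - (time.time() - last_used[proxy])
        if wait > 0:
            time.sleep(wait)  # honor the per-proxy delay
        last_used[proxy] = time.time()
        print(query, search(query, proxy))

    Even at a conservative 20-second per-proxy delay, 10 proxies work out to one query every 2 seconds, roughly 43,000 queries in 24 hours, which is where "tens of thousands of results" comes from.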
     
  4. JustUs

    JustUs Power Member

    Joined:
    May 6, 2012
    Messages:
    626
    Likes Received:
    582
    I use a combination of public scraped proxies and port scanned proxies along with the Custom Harvester.

     
  5. Herectioul66

    Herectioul66 Registered Member

    Joined:
    Feb 7, 2016
    Messages:
    75
    Likes Received:
    4
    Seems like a good strategy; I will try it in the future too.
     
  6. kijix

    kijix Registered Member

    Joined:
    Sep 17, 2014
    Messages:
    92
    Likes Received:
    25
    Have you actually -tried- what you are suggesting since Google started banning whole IP ranges? 10 proxies aren't going to do a thing; I have 10 private proxies. I have tried your method, setting a delay, and the number of results I get is worse than if I just hammered it with 100 threads, let the ban ride out on the IPs, and started over. Even then it's not many, like 1500 results. I am using complex queries like inurl:.

     
  7. kijix

    kijix Registered Member

    Joined:
    Sep 17, 2014
    Messages:
    92
    Likes Received:
    25
    Yeah, I use GSA Proxy Scraper too but am still having issues. Just PM'd you; thanks for your reply.

     
  8. JustUs

    JustUs Power Member

    Joined:
    May 6, 2012
    Messages:
    626
    Likes Received:
    582
    I have been getting a few PM's, so I will answer them here. I will only speak of the combination of GSA Proxy Scraper and ScrapeBox. If you are using some other combination, your mileage will probably vary.

    Proxy Scraper for Google: set the scraper to export files under Settings -> Automatic Export. Click 'Add' and select 'Save to file.' Select a file name and an export time interval. Under 'Export format' select Scrapebox and click 'OK.' A new window opens; select 'Web, Socks4, and Socks5' under 'Proxy types.' For scraping I do not care whether a proxy is anonymous or not, so I select 'Transparent, Anonymous, and Elite.' Leave 'Acceptable port' empty. Select 'Skip duplicate IPs.' Select 'Public and Google' under 'Include tags.' You can leave everything in 'Exclude tags' blank. Under 'Acceptable regions,' deselect any regions you want to exclude.
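
    If you want to sanity-check the exported file outside of Scrapebox, here is a minimal Python sketch; the file name is hypothetical and the export is assumed to be plain ip:port lines (the Scrapebox format):

    Code:
    from pathlib import Path

    import requests

    def still_google_passed(proxy, timeout=10):
        # Spot-check whether a proxy still answers a Google query.
        try:
            r = requests.get("https://www.google.com/search?q=test",
                             proxies={"http": f"http://{proxy}",
                                      "https": f"http://{proxy}"},
                             timeout=timeout)
            return r.status_code == 200
        except requests.RequestException:
            return False

    proxies = [ln.strip() for ln in
               Path("gsa_google_proxies.txt").read_text().splitlines()
               if ":" in ln]
    alive = [p for p in proxies if still_google_passed(p)]
    print(f"{len(alive)}/{len(proxies)} still pass Google")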

    Scrapebox:
    Enter a keyword, or just a letter, and select the Custom Harvester. Open the proxy dropdown at the bottom, select 'Autoload proxy file,' and enter the file you will be exporting from GSA. Open the proxy dropdown again and click the bar to the right of 'Enable autoload proxies from file'; a check will appear. Select 'Load after X minutes' and enter a time in minutes, or leave it blank to let Scrapebox determine when it needs new proxies. I use 60 - 120 minutes.
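
    The autoload behavior amounts to re-reading the proxy file on an interval. A minimal Python sketch of the same idea, with a placeholder path and the 60 - 120 minute window from above:

    Code:
    import time
    from pathlib import Path

    PROXY_FILE = "gsa_google_proxies.txt"  # hypothetical export path
    RELOAD_SECONDS = 60 * 60               # 60 minutes; use up to 120

    _proxies, _loaded_at = [], 0.0

    def current_proxies():
        # Re-read the exported file once the reload interval has elapsed.
        global _proxies, _loaded_at
        if not _proxies or time.time() - _loaded_at > RELOAD_SECONDS:
            text = Path(PROXY_FILE).read_text()
            _proxies = [ln.strip() for ln in text.splitlines() if ":" in ln]
            _loaded_at = time.time()
        return _proxies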

    Scrapebox is now set up to use a proxy file, and GSA is set up to export one.

    For your keywords, find a good merge list. Loopline has a decent one somewhere; you will have to hunt it up. Use the keyword scraper along with your targeted keywords (click Scrape in the keyword box after entering a FEW targeted keywords). Set the sources to the ones you want to use, select the level (how deep) you want to search keywords at, or select them from the domain. Choose whether to use proxies, and start. If you have a good merge list, such as Loopline's, then be careful how many keywords you scrape. A few keywords will create a large list!
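
    If you are unsure what a merge list actually does: it crosses every footprint with every keyword, which is why the output explodes. A tiny Python sketch with made-up footprints and keywords:

    Code:
    # Illustrative footprints and keywords only.
    footprints = ['"powered by wordpress" inurl:blog',
                  'inurl:guestbook.php']
    keywords = ["fishing", "gardening", "seo"]

    queries = [f'{fp} "{kw}"' for fp in footprints for kw in keywords]
    print(len(queries), "queries")  # 2 footprints x 3 keywords = 6

    At 1,000 footprints and 1,000 keywords you are already at a million queries, so a few keywords really do create a large list.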

    I break the above file down into sets of 10,000 keyterms and scrape those individually if scraping Google. If scraping Bing, I combine several 10,000-keyterm files and replace inurl: with instreamset:url:, which is Bing's equivalent. I replace the other operators as appropriate.
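
    The splitting and the operator swap are easy to script. A minimal Python sketch, with placeholder file names:

    Code:
    from pathlib import Path

    lines = Path("queries_google.txt").read_text().splitlines()

    for n, start in enumerate(range(0, len(lines), 10000)):
        chunk = lines[start:start + 10000]
        Path(f"google_{n:03d}.txt").write_text("\n".join(chunk))
        # Bing run: swap in Bing's equivalent operator.
        bing = [q.replace("inurl:", "instreamset:url:") for q in chunk]
        Path(f"bing_{n:03d}.txt").write_text("\n".join(bing))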

    Scraping:
    If scraping Google, I use the Google API rather than Google itself. Rarely, I will use a country-specific Google. The Google API is less sensitive to proxies than the other Google settings.

    I scrape Bing and Google separately, and ditto for the other search engines. I seldom scrape Yahoo because they get their results from Bing.

    Using these methods, the fastest I have achieved scraping Google with public proxies is 173 URLs/sec; the average is 37 - 39. I am scraping Bing right now and averaging 700 URLs/sec. I had to turn the threads down because I kept overrunning the buffer in my modem.

    How fast you scrape actually depends on how many results there are for a keyterm rather than how many threads you run!

    Do I consider GSA Proxy Scraper worth the money? Yes, otherwise I would not show it! GSA has the option of port scanning in addition to proxy scraping. You can highlight proxies (or a range of proxies), say the Google-tagged ones, right click, and send them to the port scanner. You can easily enter ports other than the ones GSA selects, set the threads you want to devote to the scanning, and start the scanner.
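
    The port-scanning idea itself is simple: probe extra ports on IPs where one working proxy already turned up. A minimal Python sketch; the ports are examples, and only scan hosts you are allowed to probe:

    Code:
    import socket

    CANDIDATE_PORTS = [80, 3128, 8080, 8888]

    def open_ports(ip, ports=CANDIDATE_PORTS, timeout=2.0):
        found = []
        for port in ports:
            with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
                s.settimeout(timeout)
                if s.connect_ex((ip, port)) == 0:  # 0 means it connected
                    found.append(port)
        return found

    print(open_ports("127.0.0.1"))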

    You now have a few tricks.
     
    • Thanks x 3
  9. kijix

    kijix Registered Member

    Joined:
    Sep 17, 2014
    Messages:
    92
    Likes Received:
    25
    Thank you for this detailed reply. It is very helpful and I will try this today. I notice you are not using 'connect' proxies; maybe that is part of my problem.
     
  10. JustUs

    JustUs Power Member

    Joined:
    May 6, 2012
    Messages:
    626
    Likes Received:
    582
    The only pieces of software I am aware of that use 'Connect' proxies are GSA products. In every other piece of software I have attempted to use them in, they either act flaky or do not work at all.
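
    For reference, a 'connect' proxy is one that honors the HTTP CONNECT verb to tunnel traffic (e.g. to HTTPS sites). A minimal Python check; the proxy address is a placeholder:

    Code:
    import socket

    def supports_connect(ip, port, timeout=5.0):
        try:
            with socket.create_connection((ip, port), timeout=timeout) as s:
                s.sendall(b"CONNECT www.google.com:443 HTTP/1.1\r\n"
                          b"Host: www.google.com:443\r\n\r\n")
                reply = s.recv(1024).decode("latin-1", "replace")
        except OSError:
            return False
        first = reply.splitlines()[0] if reply else ""
        return " 200 " in first  # e.g. "HTTP/1.1 200 Connection established"

    print(supports_connect("203.0.113.7", 8080))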

    Many (most) public proxies work for Scrapebox, even if they do not test and tag for Google.

    Sven messed up the port scanner with the latest (1.60) update. Do not update from 1.59 if you are going to port scan - wait until 1.61 when he fixes it.
     
  11. loopline

    loopline Jr. VIP

    Joined:
    Jan 25, 2009
    Messages:
    3,724
    Likes Received:
    1,993
    Gender:
    Male
    Home Page:
    Yes. I use this method all the time, although with groups of 50 proxies rather than 10; same concept. I have runs that go for weeks straight. I use this with over 300 proxies, in groups of 50 proxies per server, and I've yet to have any issues on any server. I also know of many other people that do this.

    If you're getting banned that fast, then either you're still going too fast or you just had bad proxies to start. Google does permanently ban IP ranges, but I have not had this issue, and all 300+ of my proxies rotate out monthly, so across thousands of IPs over the past many months I have had no issues. Like I said, I know a lot of other people doing it without issues. It's not an end-all; it's just one option that, done correctly, can work just fine.
     
  12. Bahmer

    Bahmer Regular Member

    Joined:
    Jul 8, 2015
    Messages:
    261
    Likes Received:
    60
    Or you could just pick up a subscription to proxygo's SB service. He delivers daily proxies for life for a one-time fee.
     
  13. kijix

    kijix Registered Member

    Joined:
    Sep 17, 2014
    Messages:
    92
    Likes Received:
    25
    Your method is working great, now that I finally found where they put the 'autoload proxies' button. I can't thank you enough. I'm also at about the 40 LPS average you get.
     


    Last edited: Apr 1, 2016
  14. kijix

    kijix Registered Member

    Joined:
    Sep 17, 2014
    Messages:
    92
    Likes Received:
    25
    I believe you, it just has not given me good results, though I've only had 10 private proxies to test with.

    What is interesting is this video here: https://www.youtube.com/watch?v=iAC1WLG2aU8

    I have captcha solving configured in Scrapebox, but I don't think it's actually using it when scraping Google.
     
  15. loopline

    loopline Jr. VIP

    Joined:
    Jan 25, 2009
    Messages:
    3,724
    Likes Received:
    1,993
    Gender:
    Male
    Home Page:
    No, Scrapebox does not solve captchas when scraping Google. It did years ago, but Google rebans more or less instantly, so you're just churning through captchas; further, it's a slow process and it causes other issues. At the end of the day they took the feature out and have never added it back in.

    Also, I saw that video and I think it's kind of a joke. Who wants to sit and manually solve captchas? I mean, if you have all the time in the world, I guess, but I can scrape hundreds of thousands of results, if not millions, without lifting a finger. I wasn't impressed with Paigham Bot when I got it. The primary footprint they have built in is for scraping WordPress contact forms, which Scrapebox can already post to, and Scrapebox is faster. Paigham was able to post to a small percentage of forms out of the box that Scrapebox couldn't, but all in all, for the price, Scrapebox blows it out of the water, IMHO. I just built the largest of those forms into Scrapebox, eliminated 95% of the hassle Paigham was causing me, got similar results (actually a lot more, because Scrapebox is far faster and more memory efficient), and nuked the cost of Paigham.

    Coming back on topic, since we were talking about scraping: if you want a load of results, try the Google API and deeper web; they are both getting their results from Google and have more lax IP bans. You can also get a lot of great results from Bing. Sometimes the solution isn't to take the hardest road. :) I don't exclusively use Google.

    In fact, I did extensive testing of footprints when I was building a contact form I'm using in Scrapebox, and some footprints performed better in Google, while others actually gave better results in Bing.

    By 'performed better' I mean I would scrape and then run the results through Page Scanner to see which platforms were actually returned for the footprint. There were queries where I was looking for a given CMS platform and Bing would return 75% of pages from that platform while Google would only return 25% for the same query. Of course, there were queries where the reverse was true, but the point is you can do a lot of "damage" in getting what you want without Google, and if you're going to invest the work to do this over and over, it's worth a little time to figure out whether you can get what you want with much less resistance.
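
    That check is easy to approximate in Python if you want to verify a scrape without Page Scanner; the platform marker here is just an example:

    Code:
    import requests

    def platform_hit_rate(urls, marker="wp-content"):
        # Fraction of scraped URLs whose pages show the platform marker.
        hits = 0
        for url in urls:
            try:
                if marker in requests.get(url, timeout=10).text:
                    hits += 1
            except requests.RequestException:
                pass
        return hits / len(urls) if urls else 0.0

    google_urls = ["https://example.com/blog"]  # your Google results
    bing_urls = ["https://example.org/news"]    # your Bing results
    print(f"google: {platform_hit_rate(google_urls):.0%} on-platform")
    print(f"bing:   {platform_hit_rate(bing_urls):.0%} on-platform")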

    Just my 2 cents.
     
    Last edited: Apr 1, 2016
  16. kijix

    kijix Registered Member

    Joined:
    Sep 17, 2014
    Messages:
    92
    Likes Received:
    25
    Well, what I found more impressive was the LPM he was able to harvest with only a handful of private proxies. I think the point is to use services like 2captcha, not to sit there and fill them in yourself.

    Anyway, I was able to get Scrapebox going with private proxies from a different provider, so now I'm able to scrape with both public and private proxies at reasonable speed. The only issue is Scrapebox freezing toward the end of a scrape, or showing 'Please Wait' forever when I force a scrape to stop, until I force it to quit.
     
  17. 710fla

    710fla Jr. VIP Jr. VIP

    Joined:
    Aug 25, 2015
    Messages:
    649
    Likes Received:
    172
    How long of a delay do you set when scraping Google? Is 120 seconds enough?
     
  18. redrays

    redrays Regular Member

    Joined:
    Apr 3, 2015
    Messages:
    344
    Likes Received:
    115
    I know this is an old post, but I saw it because this thread got bumped and it's extremely helpful. Thank you for taking the time to share :)
     
  19. loopline

    loopline Jr. VIP

    Joined:
    Jan 25, 2009
    Messages:
    3,724
    Likes Received:
    1,993
    Gender:
    Male
    Home Page:
    That depends on how many proxies you have. It's probably enough though, given that you could probably even skip proxies and use your own IP with a delay of 120 seconds.
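
    The back-of-the-envelope math, as a quick Python sketch (the numbers are just the ones from this thread):

    Code:
    def queries_per_hour(num_proxies, delay_seconds):
        # Each proxy fires once per delay window, so throughput
        # scales linearly with the proxy count.
        return int(3600 / delay_seconds * num_proxies)

    print(queries_per_hour(1, 120))   # own IP only: 30 queries/hour
    print(queries_per_hour(10, 120))  # 10 proxies: 300 queries/hour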