
C# + Google Searches

Discussion in 'C, C++, C#' started by Seuss, Aug 8, 2011.

  1. Seuss

    Seuss BANNED

    Joined:
    Jun 13, 2009
    Messages:
    56
    Likes Received:
    22
    Ok, so Google's systems appear to be pretty decent at knowing when a bot is trying to take advantage of them.

    I'm trying to finish up a bot of mine, but Google is now giving it a 503 Service Unavailable. If I just want to run a Google search and get back the number of results, should I be using something Google makes readily available?

    Or do I need to randomize the headers?

    Thanks for any help.
     
  2. seo-madness

    seo-madness Newbie

    Joined:
    Mar 13, 2011
    Messages:
    7
    Likes Received:
    1
    Use the Google API for searching. If you want more control you'll have to alter "user-agent" in the header of each request. Rotating proxies is a good idea too.
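
    For example, with WebClient you can set the header before each request. A rough sketch only; the UA string and query below are placeholder values:
    Code:
      // Minimal sketch: send a custom User-Agent with each WebClient request.
      // Needs: using System; using System.Net;
      using (var client = new WebClient())
      {
          client.Headers[HttpRequestHeader.UserAgent] =
              "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 " +
              "(KHTML, like Gecko) Chrome/13.0.782.112 Safari/535.1";
          string html = client.DownloadString(
              "http://www.google.com/search?q=" + Uri.EscapeDataString("example query"));
      }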
     
  3. Seuss

    Seuss BANNED

    Joined:
    Jun 13, 2009
    Messages:
    56
    Likes Received:
    22
    OK, thanks for the info... I'm hoping the Google API offers the total number of results...
     
  4. seo-madness

    seo-madness Newbie

    Joined:
    Mar 13, 2011
    Messages:
    7
    Likes Received:
    1
    Code:
      // Needs: using System; using System.Collections.Generic; using System.Linq;
      //        using System.Net; using Newtonsoft.Json; using Newtonsoft.Json.Linq;
      public class SearchResult
      {
        public string url;
        public string title;
        public string content;
        public FindingEngine engine;

        public enum FindingEngine { google, bing, google_and_bing };

        public SearchResult(string url, string title, string content, FindingEngine engine)
        {
          this.url = url;
          this.title = title;
          this.content = content;
          this.engine = engine;
        }
      }

      public static List<SearchResult> GoogleSearch(string search_expression,
        Dictionary<string, object> stats_dict)
      {
        // Old Google AJAX Search API endpoint; rsz=large returns 8 results per page,
        // which is why the start offsets below step by 8.
        var url_template = "http://ajax.googleapis.com/ajax/services/search/web?v=1.0&rsz=large&safe=active&q={0}&start={1}";
        var results_list = new List<SearchResult>();
        int[] offsets = { 0, 8, 16, 24, 32, 40, 48 };
        foreach (var offset in offsets)
        {
          // Encode the query so spaces and special characters survive the URL.
          var search_url = new Uri(string.Format(url_template,
            Uri.EscapeDataString(search_expression), offset));

          var page = new WebClient().DownloadString(search_url);

          JObject o = (JObject)JsonConvert.DeserializeObject(page);

          var results_query =
            from result in o["responseData"]["results"].Children()
            select new SearchResult(
                url: result.Value<string>("url"),
                title: result.Value<string>("title"),
                content: result.Value<string>("content"),
                engine: SearchResult.FindingEngine.google
                );

          foreach (var result in results_query)
            results_list.Add(result);
        }

        return results_list;
      }
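
    If you also want the total number of results the OP asked about, the same JSON response should expose an estimate under responseData.cursor (the estimatedResultCount field, going from the old AJAX Search API docs; treat the field name as an assumption). Inside the loop above, something like:
    Code:
          // Assumed field from the AJAX Search API response: responseData.cursor.estimatedResultCount
          JToken cursor = o["responseData"]["cursor"];
          string estimated_total = cursor != null
            ? cursor.Value<string>("estimatedResultCount")
            : null;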
    
     
  5. Baybo.it

    Baybo.it Registered Member

    Joined:
    Aug 9, 2011
    Messages:
    72
    Likes Received:
    39
    Occupation:
    Founder of Baybo.it
    Location:
    San Francisco
    Home Page:
    Just a note, Google's search API limits your searches to 100 per day (I think that's the number).

    I am not a huge fan of Microsoft, but you may consider using Bing's search API, as I don't believe it has a comparable query limit.

    Hope this helps.
     
  6. verdox

    verdox Regular Member

    Joined:
    Jun 5, 2011
    Messages:
    205
    Likes Received:
    76
    I have been playing with the WebClient class found in the System.Net namespace. It works well for me; here's a simple example method that accepts a keyword and returns a list of URLs.

    Code:
            // Needs: using System; using System.Collections.Generic; using System.Net;
            //        using System.Text.RegularExpressions;
            private List<string> BuildCompetitorList(string _Keyword)
            {
                string _Result;
                using (WebClient _Client = new WebClient())
                {
                    _Result = _Client.DownloadString(new Uri(String.Format(
                        "http://www.google.com/search?q={0}", Uri.EscapeDataString(_Keyword))));
                }
                List<string> _Competitors = new List<string>();
                // Grab every anchor tag, then keep only the organic results (class=l).
                MatchCollection links = Regex.Matches(_Result, "<a.*?href=\"(.*?)\".*?>(.*?)</a>");
                foreach (Match match in links)
                {
                    if (match.Value.Contains("class=l"))
                    {
                        string[] _Url = match.Value.Split('"');
                        _Competitors.Add(_Url[1]);
                    }
                }
                return _Competitors;
            }


    obviously needs work, but you get the idea :)
     
  7. zhaff

    zhaff Newbie

    Joined:
    Jan 5, 2010
    Messages:
    3
    Likes Received:
    0
    Location:
    Malaysia
    Home Page:
    Just to share my experience: previously I used WebClient, but Google kept asking for CAPTCHA input. Maybe my headers weren't good enough. Now I've moved to using the WebBrowser control; so far so good, just a bit slow.
     
    Last edited: Aug 21, 2011
  8. verdox

    verdox Regular Member

    Joined:
    Jun 5, 2011
    Messages:
    205
    Likes Received:
    76
    Write some proxy support in, and add a delay between requests. To be honest, the WebBrowser control or WebClient will do the same job.
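
    Something along these lines, for example; the proxy address, UA string and delay range here are just placeholders:
    Code:
        // Sketch: route each request through a proxy and sleep a random interval afterwards.
        // Needs: using System; using System.Net; using System.Threading;
        private static readonly Random _rng = new Random();

        private static string DownloadViaProxy(string url, string proxyHost, int proxyPort)
        {
            using (var client = new WebClient())
            {
                client.Proxy = new WebProxy(proxyHost, proxyPort);
                client.Headers[HttpRequestHeader.UserAgent] =
                    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.112 Safari/535.1";
                string html = client.DownloadString(url);
                Thread.Sleep(_rng.Next(500, 5000)); // random 0.5-5 second wait between requests
                return html;
            }
        }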
     
  9. dr_0x

    dr_0x Junior Member

    Joined:
    May 9, 2010
    Messages:
    141
    Likes Received:
    169
    Home Page:
    You can easily perform searches without running into the 503 error. The four key things to do are:

    1. Make sure your client uses cookies. (I forget if this is really that important but it doesn't hurt)
    2. Make sure you pass a valid User-agent like:

    Mozilla/5.0 (X11; U; Linux x86_64; en-US) AppleWebKit/533.4 (KHTML, like Gecko) Chrome/5.0.376.0 Safari/533.4

    3. Set a *random* wait of a few hundred milliseconds in between queries. This is super important!

    4. Supply all the required parameters with your query. Here they are: q, hl, sa, num, lr, ft, cr, safe, start, btnG

    my python param builder
    Code:
    # DataDict below is an assumed stand-in for the poster's helper:
    # a dict that also allows attribute-style access (params.q = ...).
    class DataDict(dict):
        def __getattr__(self, name):
            return self[name]
        def __setattr__(self, name, value):
            self[name] = value

    def params(qry, num=100, start=0):
        params = DataDict(
            q     = '',
            hl    = 'en',
            num   = '10',
            lr    = '',
            ft    = 'i',
            cr    = '',
            safe  = 'images',
            start = '',
            btnG  = 'Search',
            )
        params.q = qry
        if num == 10:
            params.pop('num')
        else:
            params.num = num
            params.pop('btnG')
        if start == 0:
            params.pop('start')
        else:
            params.start = start
            params['sa'] = 'N'
        return params
    
    
    That's it. Using this method I can and do query G all freaking day long without any problems. The only issue you might run into is that some queries will raise a flag and make you input a captcha before you can continue. In that case you will receive an HTTP error (I forget which code right now) and be redirected to a captcha page, "hxxp://sorry.gxxgle.com/sorry" if I remember correctly. This only happens on queries G thinks are related to hacking, so you won't run into it on general keyword queries.

    Finally, IMO it is way better than using the API because, as stated before, the API is limited to 100 queries a day, and API queries are limited to 10 results per query (that is horse sh*t IMO).
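
    For the C# folks, here is a rough equivalent of the recipe above: cookie jar, the Chrome UA from point 2, a simplified version of the parameters from point 4, and a random wait. Just a sketch, untested:
    Code:
    // Needs: using System; using System.IO; using System.Net; using System.Threading;
    static readonly CookieContainer Cookies = new CookieContainer();
    static readonly Random Rng = new Random();

    static string GoogleQuery(string query, int num = 100, int start = 0)
    {
        // Point 4: q, hl, num, lr, ft, cr, safe plus either start/sa or btnG.
        string url = "http://www.google.com/search?q=" + Uri.EscapeDataString(query) +
                     "&hl=en&num=" + num + "&lr=&ft=i&cr=&safe=images" +
                     (start > 0 ? "&start=" + start + "&sa=N" : "&btnG=Search");

        var request = (HttpWebRequest)WebRequest.Create(url);
        request.CookieContainer = Cookies;  // point 1: keep cookies across queries
        request.UserAgent =                 // point 2: pass a valid User-Agent
            "Mozilla/5.0 (X11; U; Linux x86_64; en-US) AppleWebKit/533.4 " +
            "(KHTML, like Gecko) Chrome/5.0.376.0 Safari/533.4";

        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            string html = reader.ReadToEnd();
            Thread.Sleep(Rng.Next(200, 900)); // point 3: random wait of a few hundred ms
            return html;
        }
    }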
     
  10. dr_0x

    dr_0x Junior Member

    Joined:
    May 9, 2010
    Messages:
    141
    Likes Received:
    169
    Home Page:
    Yep, I agree 100%. I just went and looked at a few of my scripts. To give you an example, one of them had a random wait of anywhere from 0.2 to 20 seconds. I have never run into any problems with it. Granted, 20 sec might seem like a long time to wait, but if you hit the 503 error I think it's more like 20 min you have to wait before you can search again. At 100 results per query it's not that bad. I can go through the entire result set for 15 queries in just a few minutes.
     
  11. Seuss

    Seuss BANNED

    Joined:
    Jun 13, 2009
    Messages:
    56
    Likes Received:
    22
    Thank you for this fantastic information. I was thinking something like this is what needed to be done...now I can go on to complete my 2nd SEO tool...and then hopefully fix up my 3rd tool (site account creation tool with proxy rotation)
     
  12. Seuss

    Seuss BANNED

    Joined:
    Jun 13, 2009
    Messages:
    56
    Likes Received:
    22
    Thanks for the help again guys. I just finished my mini Google KW research tool, and it works flawlessly now. I'm now working on some auto-submit software for some of the auto approve article sites....
     
  13. imperial109

    imperial109 Regular Member

    Joined:
    Jan 19, 2009
    Messages:
    499
    Likes Received:
    361
    Integrate proxy support. Those 20 sec will turn into 1 (a 20x increase in productivity).

    Also, if you have a site, record the user agents of visitors, collect as many as possible, and rotate those in conjunction with the proxies.
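
    In code that's just two lists you index into per request. Both lists below are placeholder values:
    Code:
    // Sketch: rotate user agents together with proxies, one pair per request.
    // Needs: using System.Collections.Generic; using System.Net;
    static readonly List<string> UserAgents = new List<string>
    {
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.112 Safari/535.1",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
    };
    static readonly List<WebProxy> Proxies = new List<WebProxy>
    {
        new WebProxy("127.0.0.1", 8080),  // replace with real proxies
        new WebProxy("127.0.0.2", 8080),
    };
    static int _counter;

    static string Fetch(string url)
    {
        int i = _counter++;
        using (var client = new WebClient())
        {
            client.Proxy = Proxies[i % Proxies.Count];
            client.Headers[HttpRequestHeader.UserAgent] = UserAgents[i % UserAgents.Count];
            return client.DownloadString(url);
        }
    }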
     
    • Thanks x 1
  14. Xyz01

    Xyz01 Regular Member Premium Member

    Joined:
    Aug 8, 2011
    Messages:
    300
    Likes Received:
    126
    Heh. I scrape the search results, faking my user agent on each request and using multiple proxies, and I scrape 10,000-100,000+ keywords a day with no issues. :)
     
  15. Seuss

    Seuss BANNED

    Joined:
    Jun 13, 2009
    Messages:
    56
    Likes Received:
    22
    Yeah, I'm going to implement this tonight, as Google has some pretty tricky systems and catches on FAST! So off to proxy integration and user-agent rotation...
     
  16. Seuss

    Seuss BANNED

    Joined:
    Jun 13, 2009
    Messages:
    56
    Likes Received:
    22
    Question: Do I need special proxies to scrape from Google? Even with the proxies, I get a 503 error... I have proxy rotation set up and built in; I just need to implement the user agent rotation and figure this proxy 503 problem out so that I can be done with this program...
     
  17. lwelch45

    lwelch45 Junior Member

    Joined:
    Mar 24, 2010
    Messages:
    135
    Likes Received:
    38
    Home Page:
    Yes, you do; the proxies you use must not be blocked by Google. I actually wrote a proxy checker that checks whether a proxy is blocked by Google, among other things.
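
    A rough sketch of that kind of check: fire a test query through the proxy and see whether a normal 200 comes back or a 503/captcha redirect. The timeout and test query below are arbitrary:
    Code:
    // Sketch of an "is this proxy blocked by Google?" check.
    // Needs: using System; using System.Net;
    static bool ProxyLooksUnblocked(string host, int port)
    {
        var request = (HttpWebRequest)WebRequest.Create(
            "http://www.google.com/search?q=" + Uri.EscapeDataString("test"));
        request.Proxy = new WebProxy(host, port);
        request.UserAgent = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.112 Safari/535.1";
        request.AllowAutoRedirect = false;  // a redirect (e.g. to the sorry/captcha page) is not followed
        request.Timeout = 10000;

        try
        {
            using (var response = (HttpWebResponse)request.GetResponse())
            {
                // Anything other than a plain 200 counts as blocked.
                return response.StatusCode == HttpStatusCode.OK;
            }
        }
        catch (WebException)
        {
            // 503, connection refused, timeout, etc.: treat the proxy as blocked/unusable.
            return false;
        }
    }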
     
  18. Seuss

    Seuss BANNED

    Joined:
    Jun 13, 2009
    Messages:
    56
    Likes Received:
    22
    Well, I guess that means that probably 98% of the proxies listed in the free section are not usable on Google, I'd assume?

    Do you use Scrapebox to find new proxies?
     
  19. haylander

    haylander Registered Member

    Joined:
    May 24, 2009
    Messages:
    54
    Likes Received:
    20
    From experience, Google will ban you fast if you keep downloading all 10 pages when there are fewer than 1,000 results, so operations need to be synchronized.
    Rotating headers and waiting between pages are useless on their own; you can check this by searching for the same phrase with Firefox after getting a 503 with the bot.
    If you're querying a lot, you can't go without proxies.
     
  20. xenon2010

    xenon2010 Regular Member

    Joined:
    Apr 27, 2010
    Messages:
    231
    Likes Received:
    48
    Occupation:
    web and desktop apps programmer
    Location:
    prison
    Home Page:
    Simply use user agents, and put a delay between every two searches.
    In my case I use a 5-second delay between every two searches.