
C# + Google Searches

Discussion in 'C, C++, C#' started by Seuss, Aug 8, 2011.

  1. Seuss

    Seuss BANNED

    Joined:
    Jun 13, 2009
    Messages:
    56
    Likes Received:
    22
    Ok, so Google's systems appear to be pretty decent at knowing when a bot is trying to take advantage of them.

    I'm trying to finish up a bot of mine, but Google is now giving it a 503 Service Unavailable. If I just want to run a Google search and get back the number of results, should I be using something Google makes readily available?

    Or do I need to randomize the headers?

    Thanks for any help.
     
  2. seo-madness

    seo-madness Newbie

    Joined:
    Mar 13, 2011
    Messages:
    7
    Likes Received:
    1
    Use the Google API for searching. If you want more control you'll have to alter "user-agent" in the header of each request. Rotating proxies is a good idea too.
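
    For example, with WebClient you can set the header before each request. A rough sketch only; the UA string and query below are placeholder values:
    Code:
      // Minimal sketch: send a custom User-Agent with each WebClient request.
      // Needs: using System; using System.Net;
      using (var client = new WebClient())
      {
          client.Headers[HttpRequestHeader.UserAgent] =
              "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 " +
              "(KHTML, like Gecko) Chrome/13.0.782.112 Safari/535.1";
          string html = client.DownloadString(
              "http://www.google.com/search?q=" + Uri.EscapeDataString("example query"));
      }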
     
  3. Seuss

    Seuss BANNED

    Joined:
    Jun 13, 2009
    Messages:
    56
    Likes Received:
    22
    OK, thanks for the info... I'm hoping the Google API offers the total number of results...
     
  4. seo-madness

    seo-madness Newbie

    Joined:
    Mar 13, 2011
    Messages:
    7
    Likes Received:
    1
    Code:
      // Needs: using System; using System.Collections.Generic; using System.Linq;
      //        using System.Net; using Newtonsoft.Json; using Newtonsoft.Json.Linq;
      public class SearchResult
      {
        public string url;
        public string title;
        public string content;
        public FindingEngine engine;

        public enum FindingEngine { google, bing, google_and_bing };

        public SearchResult(string url, string title, string content, FindingEngine engine)
        {
          this.url = url;
          this.title = title;
          this.content = content;
          this.engine = engine;
        }
      }

      public static List<SearchResult> GoogleSearch(string search_expression,
        Dictionary<string, object> stats_dict)
      {
        // Old Google AJAX Search API endpoint; rsz=large returns 8 results per page,
        // which is why the start offsets below step by 8.
        var url_template = "http://ajax.googleapis.com/ajax/services/search/web?v=1.0&rsz=large&safe=active&q={0}&start={1}";
        var results_list = new List<SearchResult>();
        int[] offsets = { 0, 8, 16, 24, 32, 40, 48 };
        foreach (var offset in offsets)
        {
          // Encode the query so spaces and special characters survive the URL.
          var search_url = new Uri(string.Format(url_template,
            Uri.EscapeDataString(search_expression), offset));

          var page = new WebClient().DownloadString(search_url);

          JObject o = (JObject)JsonConvert.DeserializeObject(page);

          var results_query =
            from result in o["responseData"]["results"].Children()
            select new SearchResult(
                url: result.Value<string>("url"),
                title: result.Value<string>("title"),
                content: result.Value<string>("content"),
                engine: SearchResult.FindingEngine.google
                );

          foreach (var result in results_query)
            results_list.Add(result);
        }

        return results_list;
      }
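
    If you also want the total number of results the OP asked about, the same JSON response should expose an estimate under responseData.cursor (the estimatedResultCount field, going from the old AJAX Search API docs; treat the field name as an assumption). Inside the loop above, something like:
    Code:
          // Assumed field from the AJAX Search API response: responseData.cursor.estimatedResultCount
          JToken cursor = o["responseData"]["cursor"];
          string estimated_total = cursor != null
            ? cursor.Value<string>("estimatedResultCount")
            : null;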
    
     
  5. Baybo.it

    Baybo.it Registered Member

    Joined:
    Aug 9, 2011
    Messages:
    72
    Likes Received:
    39
    Occupation:
    Founder of Baybo.it
    Location:
    San Francisco
    Home Page:
    Just a note, Google's search API limits your searches to 100 per day (I think that's the number).

    I am not a huge fan of Microsoft, but you may consider using Bing's search API, as I don't believe it has a comparable query limit.

    Hope this helps.
     
  6. verdox

    verdox Regular Member

    Joined:
    Jun 5, 2011
    Messages:
    205
    Likes Received:
    76
    I have been playing with the WebClient class found in the System.Net namespace. It works well for me; here's a simple example method that accepts a keyword and returns a list of URLs.

    Code:
            // Needs: using System; using System.Collections.Generic; using System.Net;
            //        using System.Text.RegularExpressions;
            private List<string> BuildCompetitorList(string _Keyword)
            {
                string _Result;
                using (WebClient _Client = new WebClient())
                {
                    _Result = _Client.DownloadString(new Uri(String.Format(
                        "http://www.google.com/search?q={0}", Uri.EscapeDataString(_Keyword))));
                }
                List<string> _Competitors = new List<string>();
                // Grab every anchor tag, then keep only the organic results (class=l).
                MatchCollection links = Regex.Matches(_Result, "<a.*?href=\"(.*?)\".*?>(.*?)</a>");
                foreach (Match match in links)
                {
                    if (match.Value.Contains("class=l"))
                    {
                        string[] _Url = match.Value.Split('"');
                        _Competitors.Add(_Url[1]);
                    }
                }
                return _Competitors;
            }


    obviously needs work, but you get the idea :)
     
  7. zhaff

    zhaff Newbie

    Joined:
    Jan 5, 2010
    Messages:
    3
    Likes Received:
    0
    Location:
    Malaysia
    Home Page:
    Just to share my experience: previously I used WebClient, but Google kept asking for CAPTCHA input. Maybe my headers weren't good enough. Now I've moved to using the WebBrowser control; so far so good, just a bit slow.
     
    Last edited: Aug 21, 2011
  8. verdox

    verdox Regular Member

    Joined:
    Jun 5, 2011
    Messages:
    205
    Likes Received:
    76
    Write some proxy support in, and add a delay between requests. To be honest, the WebBrowser control or WebClient will do the same job.
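
    Something along these lines, for example; the proxy address, UA string and delay range here are just placeholders:
    Code:
        // Sketch: route each request through a proxy and sleep a random interval afterwards.
        // Needs: using System; using System.Net; using System.Threading;
        private static readonly Random _rng = new Random();

        private static string DownloadViaProxy(string url, string proxyHost, int proxyPort)
        {
            using (var client = new WebClient())
            {
                client.Proxy = new WebProxy(proxyHost, proxyPort);
                client.Headers[HttpRequestHeader.UserAgent] =
                    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.112 Safari/535.1";
                string html = client.DownloadString(url);
                Thread.Sleep(_rng.Next(500, 5000)); // random 0.5-5 second wait between requests
                return html;
            }
        }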
     
  9. dr_0x

    dr_0x Junior Member

    Joined:
    May 9, 2010
    Messages:
    141
    Likes Received:
    169
    Home Page:
    You can easily perform searches without running into the 503 error. The four key things to do are:

    1. Make sure your client uses cookies. (I forget if this is really that important but it doesn't hurt)
    2. Make sure you pass a valid User-agent like:

    Mozilla/5.0 (X11; U; Linux x86_64; en-US) AppleWebKit/533.4 (KHTML, like Gecko) Chrome/5.0.376.0 Safari/533.4

    3. Set a *random* wait of a few hundred milliseconds in between queries. This is super important!

    4. Supply all the required parameters with your query. Here they are: q, hl, sa, num, lr, ft, cr, safe, start, btnG

    my python param builder
    Code:
    # DataDict below is an assumed stand-in for the poster's helper:
    # a dict that also allows attribute-style access (params.q = ...).
    class DataDict(dict):
        def __getattr__(self, name):
            return self[name]
        def __setattr__(self, name, value):
            self[name] = value

    def params(qry, num=100, start=0):
        params = DataDict(
            q     = '',
            hl    = 'en',
            num   = '10',
            lr    = '',
            ft    = 'i',
            cr    = '',
            safe  = 'images',
            start = '',
            btnG  = 'Search',
            )
        params.q = qry
        if num == 10:
            params.pop('num')
        else:
            params.num = num
            params.pop('btnG')
        if start == 0:
            params.pop('start')
        else:
            params.start = start
            params['sa'] = 'N'
        return params
    
    
    That's it. Using this method I can and do query G all freaking day long without any problems. The only issue you might run into is that some queries will raise a flag and make you input a captcha before you can continue. In that case you will receive an HTTP error (I forget which code right now) and be redirected to a captcha page, "hxxp://sorry.gxxgle.com/sorry" if I remember correctly. This only happens on queries G thinks are related to hacking, so you won't run into it on general keyword queries.

    Finally, IMO it is way better than using the API because, as stated before, the API is limited to 100 queries a day, and API queries are limited to 10 results per query (that is horse sh*t IMO).
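
    For the C# folks, here is a rough equivalent of the recipe above: cookie jar, the Chrome UA from point 2, a simplified version of the parameters from point 4, and a random wait. Just a sketch, untested:
    Code:
    // Needs: using System; using System.IO; using System.Net; using System.Threading;
    static readonly CookieContainer Cookies = new CookieContainer();
    static readonly Random Rng = new Random();

    static string GoogleQuery(string query, int num = 100, int start = 0)
    {
        // Point 4: q, hl, num, lr, ft, cr, safe plus either start/sa or btnG.
        string url = "http://www.google.com/search?q=" + Uri.EscapeDataString(query) +
                     "&hl=en&num=" + num + "&lr=&ft=i&cr=&safe=images" +
                     (start > 0 ? "&start=" + start + "&sa=N" : "&btnG=Search");

        var request = (HttpWebRequest)WebRequest.Create(url);
        request.CookieContainer = Cookies;  // point 1: keep cookies across queries
        request.UserAgent =                 // point 2: pass a valid User-Agent
            "Mozilla/5.0 (X11; U; Linux x86_64; en-US) AppleWebKit/533.4 " +
            "(KHTML, like Gecko) Chrome/5.0.376.0 Safari/533.4";

        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            string html = reader.ReadToEnd();
            Thread.Sleep(Rng.Next(200, 900)); // point 3: random wait of a few hundred ms
            return html;
        }
    }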
     
  10. dr_0x

    dr_0x Junior Member

    Joined:
    May 9, 2010
    Messages:
    141
    Likes Received:
    169
    Home Page:
    Yep, I agree 100%. I just went and looked at a few of my scripts. To give you an example, one of them had a random wait of anywhere from 0.2 to 20 seconds. I have never run into any problems with it. Granted, 20 sec might seem like a long time to wait, but if you hit the 503 error I think it's more like 20 min you have to wait before you can search again. At 100 results per query it's not that bad. I can go through the entire result set for 15 queries in just a few minutes.
     
  11. Seuss

    Seuss BANNED

    Joined:
    Jun 13, 2009
    Messages:
    56
    Likes Received:
    22
    Thank you for this fantastic information. I was thinking something like this is what needed to be done...now I can go on to complete my 2nd SEO tool...and then hopefully fix up my 3rd tool (site account creation tool with proxy rotation)
     
  12. Seuss

    Seuss BANNED

    Joined:
    Jun 13, 2009
    Messages:
    56
    Likes Received:
    22
    Thanks for the help again guys. I just finished my mini Google KW research tool, and it works flawlessly now. I'm now working on some auto-submit software for some of the auto approve article sites....
     
  13. imperial109

    imperial109 Regular Member

    Joined:
    Jan 19, 2009
    Messages:
    499
    Likes Received:
    361
    Integrate proxy support. Those 20 sec will turn into 1 (a 20x increase in productivity).

    Also, if you have a site, record the user agents of visitors, collect as many as possible, and rotate those in conjunction with the proxies.
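
    In code that's just two lists you index into per request. Both lists below are placeholder values:
    Code:
    // Sketch: rotate user agents together with proxies, one pair per request.
    // Needs: using System.Collections.Generic; using System.Net;
    static readonly List<string> UserAgents = new List<string>
    {
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.112 Safari/535.1",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
    };
    static readonly List<WebProxy> Proxies = new List<WebProxy>
    {
        new WebProxy("127.0.0.1", 8080),  // replace with real proxies
        new WebProxy("127.0.0.2", 8080),
    };
    static int _counter;

    static string Fetch(string url)
    {
        int i = _counter++;
        using (var client = new WebClient())
        {
            client.Proxy = Proxies[i % Proxies.Count];
            client.Headers[HttpRequestHeader.UserAgent] = UserAgents[i % UserAgents.Count];
            return client.DownloadString(url);
        }
    }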
     
    • Thanks x 1
  14. Xyz01

    Xyz01 Regular Member Premium Member

    Joined:
    Aug 8, 2011
    Messages:
    300
    Likes Received:
    126
    Heh. I scrape the search results, faking my user agent on each request and using multiple proxies, and I scrape 10,000-100,000+ keywords a day with no issues. :)
     
  15. Seuss

    Seuss BANNED

    Joined:
    Jun 13, 2009
    Messages:
    56
    Likes Received:
    22
    Yeah, I'm going to implement this tonight, as Google has some pretty tricky systems and catches on FAST! So off to proxy integration and user-agent rotation...
     
  16. Seuss

    Seuss BANNED

    Joined:
    Jun 13, 2009
    Messages:
    56
    Likes Received:
    22
    Question: Do I need special proxies to scrape from Google? Even with the proxies, I get a 503 error... I have proxy rotation set up and built in; I just need to implement the user agent rotation and figure this proxy 503 problem out so that I can be done with this program...
     
  17. lwelch45

    lwelch45 Junior Member

    Joined:
    Mar 24, 2010
    Messages:
    135
    Likes Received:
    38
    Home Page:
    Yes, you do; the proxies you use must not be blocked by Google. I actually wrote a proxy checker that checks whether a proxy is blocked by Google, among other things.
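
    A rough sketch of that kind of check: fire a test query through the proxy and see whether a normal 200 comes back or a 503/captcha redirect. The timeout and test query below are arbitrary:
    Code:
    // Sketch of an "is this proxy blocked by Google?" check.
    // Needs: using System; using System.Net;
    static bool ProxyLooksUnblocked(string host, int port)
    {
        var request = (HttpWebRequest)WebRequest.Create(
            "http://www.google.com/search?q=" + Uri.EscapeDataString("test"));
        request.Proxy = new WebProxy(host, port);
        request.UserAgent = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.112 Safari/535.1";
        request.AllowAutoRedirect = false;  // a redirect (e.g. to the sorry/captcha page) is not followed
        request.Timeout = 10000;

        try
        {
            using (var response = (HttpWebResponse)request.GetResponse())
            {
                // Anything other than a plain 200 counts as blocked.
                return response.StatusCode == HttpStatusCode.OK;
            }
        }
        catch (WebException)
        {
            // 503, connection refused, timeout, etc.: treat the proxy as blocked/unusable.
            return false;
        }
    }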
     
  18. Seuss

    Seuss BANNED

    Joined:
    Jun 13, 2009
    Messages:
    56
    Likes Received:
    22
    Well, I guess that means that probably 98% of the proxies listed in the free section are not usable on Google, I'd assume?

    Do you use Scrapebox to find new proxies?
     
  19. haylander

    haylander Registered Member

    Joined:
    May 24, 2009
    Messages:
    54
    Likes Received:
    20
    From experience, Google will ban you fast if you keep downloading all 10 pages when there are fewer than 1,000 results, so operations need to be synchronized.
    Rotating headers and waiting between pages are useless on their own; you can check this by searching for the same phrase with Firefox after getting a 503 with the bot.
    If you're querying a lot, you can't go without proxies.
     
  20. xenon2010

    xenon2010 Regular Member

    Joined:
    Apr 27, 2010
    Messages:
    231
    Likes Received:
    48
    Occupation:
    web and desktop apps programmer
    Location:
    prison
    Home Page:
    Simply use user agents, and put a delay between every two searches.
    In my case I use a 5-second delay between every two searches.