C# Google Scraper - almost there

Discussion in 'C, C++, C#' started by DrSeuss, Feb 18, 2012.

  1. DrSeuss (Newbie; joined Oct 17, 2011; 23 messages; 3 likes received)

    Ok, so I'm almost at the point of having this thing ready to run for the most part without any user input.

    I've got the program rotating through proxies. It reads the keywords to scrape from a .csv file and builds the search query for each one. Between queries it sleeps the thread for 2 to 20 seconds.
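
    The rotation and the randomized pause described above might look roughly like the following. This is only a minimal sketch: `ProxyRotator`, `NextProxy`, and `NextDelay` are hypothetical names, not the actual code from this program, and any proxy addresses passed in are placeholders.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not from the original program): hands out proxies
// in round-robin order and picks a random pause between queries.
public sealed class ProxyRotator
{
    private readonly IList<string> _proxies;
    private readonly Random _random = new Random();
    private int _next;

    public ProxyRotator(IList<string> proxies)
    {
        if (proxies == null || proxies.Count == 0)
            throw new ArgumentException("At least one proxy is required.", "proxies");
        _proxies = proxies;
    }

    // Returns the next proxy, wrapping around to the first after the last.
    public string NextProxy()
    {
        string proxy = _proxies[_next];
        _next = (_next + 1) % _proxies.Count;
        return proxy;
    }

    // Random delay between 2 and 20 seconds.
    // Random.Next excludes the upper bound, hence 20001.
    public TimeSpan NextDelay()
    {
        return TimeSpan.FromMilliseconds(_random.Next(2000, 20001));
    }
}
```

    The caller would then do something like `Thread.Sleep(rotator.NextDelay())` before each query and pass `rotator.NextProxy()` to whatever builds the request.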

    On proxies that return a 503, it brings up the Google captcha so I can solve it, then hands the resulting set of cookies back to the program. The cookies are stored with the proxy they were issued through, so that later requests through that proxy pull down the data I need without getting a 503 error.
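
    Keeping one cookie jar per proxy, as described above, could be sketched like this. `ProxyCookieStore`, `GetJar`, and `CreateRequest` are hypothetical names used for illustration; the 2 minute timeout is the value mentioned further down in this post, and `HttpWebRequest` is an assumption about how the requests are made.

```csharp
using System;
using System.Collections.Generic;
using System.Net;

// Hypothetical store (not from the original program): one CookieContainer
// per proxy address, so cookies earned by solving a captcha through a
// proxy are only ever sent back through that same proxy.
public sealed class ProxyCookieStore
{
    private readonly Dictionary<string, CookieContainer> _jars =
        new Dictionary<string, CookieContainer>(StringComparer.OrdinalIgnoreCase);

    // Returns the jar for this proxy, creating an empty one on first use.
    public CookieContainer GetJar(string proxyAddress)
    {
        CookieContainer jar;
        if (!_jars.TryGetValue(proxyAddress, out jar))
        {
            jar = new CookieContainer();
            _jars[proxyAddress] = jar;
        }
        return jar;
    }

    // Builds (but does not send) a request routed through the proxy,
    // carrying that proxy's cookies.
    public HttpWebRequest CreateRequest(string url, string proxyAddress)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Proxy = new WebProxy(proxyAddress);
        request.CookieContainer = GetJar(proxyAddress);
        request.Timeout = 120000; // 2 minutes, in milliseconds
        return request;
    }
}
```

    Because the response writes any new cookies into the same `CookieContainer`, the jar stays up to date without extra bookkeeping.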

    I also have it capture the additional parameters that sometimes get added to the end of the URL. Here's an example: "&gbv=1&sei=GXg-T5adE9PU4QTSv5CbCA"

    I have it append that to the URLs of all subsequent requests...
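
    For reference, pulling those parameters out of a URL could be done along these lines. This is a rough sketch, not the program's actual code: `QueryParams.Extract` is a hypothetical helper, and only the `gbv` and `sei` names come from the example above.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not from the original program).
public static class QueryParams
{
    // Returns the named parameters that appear in the URL's query string.
    public static Dictionary<string, string> Extract(string url, params string[] names)
    {
        var wanted = new HashSet<string>(names, StringComparer.Ordinal);
        var found = new Dictionary<string, string>();

        // Uri.Query includes the leading '?', so strip it before splitting.
        string query = new Uri(url).Query.TrimStart('?');

        foreach (string pair in query.Split(new[] { '&' }, StringSplitOptions.RemoveEmptyEntries))
        {
            // Split on the first '=' only, so values containing '=' survive.
            string[] parts = pair.Split(new[] { '=' }, 2);
            string name = Uri.UnescapeDataString(parts[0]);
            if (!wanted.Contains(name)) continue;
            found[name] = parts.Length > 1 ? Uri.UnescapeDataString(parts[1]) : "";
        }
        return found;
    }
}
```

    Calling `QueryParams.Extract(redirectUrl, "gbv", "sei")` on a URL ending in the string above would give back both values, where `redirectUrl` stands for whatever URL the response pointed to.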

    However, now I'm at a point where I can't seem to get ANY data. It will do maybe 6-7 requests in total (about 2 per proxy) before just timing out; I've set the timeout to 2 minutes. I've compared the HTTP headers to make sure they're identical to my browser's.

    Should I not be appending that string more than once?

    Any information that can help would be appreciated.
     
  2. Chris22 (Regular Member; joined Sep 29, 2010; 400 messages; 1,059 likes received)

    Don't append that string to more than one URL. It's probably some kind of access token or CSRF token, and Google is probably rejecting requests where it's used twice.
     
  3. DrSeuss (Newbie; joined Oct 17, 2011; 23 messages; 3 likes received)

    OK, yeah, I was thinking it was possibly that. I just have to append it the one time, after I verify the captcha and it hands me a redirect on the following call. After this I'm hoping it's good to go. Thanks!
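
    If the string really is single-use, as suggested above, one way to guarantee it only goes out once is to have the holder forget it after the first request. A minimal sketch under that assumption; `OneTimeSuffix`, `Set`, and `Apply` are hypothetical names:

```csharp
// Hypothetical holder (not from the original program) for a URL suffix
// that should be sent on exactly one request.
public sealed class OneTimeSuffix
{
    private string _suffix;

    // Stores the suffix captured after the captcha is verified.
    public void Set(string suffix)
    {
        _suffix = suffix;
    }

    // Appends the stored suffix to the URL, then forgets it so that
    // later calls return the URL unchanged.
    public string Apply(string url)
    {
        if (string.IsNullOrEmpty(_suffix)) return url;
        string result = url + _suffix;
        _suffix = null;
        return result;
    }
}
```

    Every request URL can then be passed through `Apply` without having to track whether the suffix has already been used.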
     
  4. hiderightnow (Junior Member; joined Jul 19, 2010; 104 messages; 22 likes received)

    Yeah, it could be a session token that shouldn't be used more than once. If it is reused, the web server will most probably discard the subsequent requests. Check in a browser whether the string changes just before the captcha.