Question for scraping google keyword tool

Discussion in 'PHP & Perl' started by whitetogrey, Jul 4, 2012.

  1. whitetogrey

    whitetogrey Newbie

    Hi there,

    Sorry for my bad English; I'm a German native speaker and work in SEO.

    I want to scrape the search volumes from Google's Keyword Tool, and I think I'm on the right track.

    I found these lines on another board:
    ============
    1) Log in to Google. I use this URL '/accounts/ServiceLoginAuth', but there are a few different ones.

    2) Request this URL "/um/StartNewLogin?sourceid=awo&subid=ww-en-et-gaia" and follow all the redirects; there are 6 (this sets the cookies needed for AdWords).

    3) When the last redirect completes and the HTML data is returned, extract the values for __u= and __c=. I use a regex, but any way will do.
    ============
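    Step 3 could be sketched in PHP roughly as follows. The function name extract_adwords_tokens() and the assumption that both values show up in the HTML as __u=<digits> and __c=<digits> are illustrative and not taken from the quoted post; the pattern would need adjusting to whatever the returned page actually contains.

```php
<?php
// Hypothetical helper: pull the __u and __c values out of the returned HTML.
// Assumption: each value is a run of digits directly after "__u=" or "__c=".
function extract_adwords_tokens(string $html): ?array
{
    $tokens = [];
    foreach (['__u', '__c'] as $name) {
        if (!preg_match('/' . preg_quote($name, '/') . '=(\d+)/', $html, $m)) {
            return null; // value missing, e.g. because the login page came back
        }
        $tokens[$name] = $m[1];
    }
    return $tokens;
}

// Made-up sample markup, for illustration only.
$sample = '<a href="?__u=123&amp;__c=456">Tool</a>';
print_r(extract_adwords_tokens($sample));
```

    A null result is a cheap way to detect that the script was sent back to the login page instead of the tool.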
    Step 1 is no problem at all: curl reports return code 200 and the cookie is set, so as far as I can tell the script is logged in.
    Requesting the second URL also returns 200, but the script ends up back on the login page, so I cannot extract __u= and __c=.

    My next thought was to try the checkSession URL, which reports that cookies are not being used. In my curl setup I use CURLOPT_COOKIEJAR as well as CURLOPT_COOKIEFILE:
    Code:
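    // Assumes $ch (the curl handle) and $cookie (the cookie file path) were set up for the login request.
    $url2 = "/um/StartNewLogin?sourceid=awo&subid=ww-en-et-gaia";

    curl_setopt($ch, CURLOPT_URL, $url2);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie); // read cookies from this file
    curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie);  // written only when the handle is closed
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_POST, false);         // back to GET after the login POST
    $output = curl_exec($ch);
    $info = curl_getinfo($ch);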
    
    Does anyone have a suggestion as to where my mistake is? Thanks for your help.

    BTW: as this is my first post, I can't write the whole URL. Before the / there is https google
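    One way to see where the redirect chain falls back to the login page is to enable CURLOPT_HEADER, so that the headers of every hop end up in the output, and then list the status and Location of each hop. A minimal sketch; trace_redirects() is a hypothetical helper and the sample headers are invented for illustration.

```php
<?php
// Hypothetical helper: summarise every hop found in the raw response headers.
// With CURLOPT_HEADER and CURLOPT_FOLLOWLOCATION enabled, curl_exec() returns
// the header block of each response in order, separated by blank lines.
function trace_redirects(string $rawHeaders): array
{
    $hops = [];
    $blocks = preg_split('/\r?\n\r?\n/', trim($rawHeaders));
    foreach ($blocks as $block) {
        if (!preg_match('#^HTTP/\S+\s+(\d{3})#', $block, $status)) {
            continue; // not a header block
        }
        $location = null;
        if (preg_match('/^Location:\s*(.+?)\s*$/mi', $block, $loc)) {
            $location = $loc[1];
        }
        $hops[] = [
            'status'      => (int) $status[1],
            'location'    => $location,
            'sets_cookie' => (bool) preg_match('/^Set-Cookie:/mi', $block),
        ];
    }
    return $hops;
}

// Invented sample: one redirect followed by the final page.
$raw = "HTTP/1.1 302 Found\r\nLocation: /next\r\nSet-Cookie: a=1; Path=/\r\n\r\n"
     . "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n";

foreach (trace_redirects($raw) as $i => $hop) {
    printf("%d: %d -> %s\n", $i + 1, $hop['status'], $hop['location'] ?? '(final)');
}
```

    In the real script the raw headers would be the first curl_getinfo($ch, CURLINFO_HEADER_SIZE) bytes of $output. A hop that sets cookies followed by a hop that redirects to the login URL would suggest the cookies are not being sent back.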
     
  2. Nattsurfaren

    Nattsurfaren Regular Member

    You do know it is CAPTCHA-protected?
     
    Last edited: Jul 6, 2012
  3. whitetogrey

    whitetogrey Newbie

    The normal login? Nope, there is only a CAPTCHA when you fire up the target explorer without a login.
    The plan is to log in, catch the cookie and grab the __u and __c values (this requires following the redirects... there are quite a few ;)).
    I don't know for sure whether this works. Isn't it possible to go through the login steps (this part works) and then move on to the target explorer with a working cookie?
     
  4. NIXMY

    NIXMY Regular Member Premium Member

    In my experience curl does not handle cookies properly. You'll have to write your own cookie parser and then pass those cookies in via curl's custom header option. Try my function below and just append the returned 'string_cookies' to your headers. I've tried curl's cookie support many times, without success, on various sites.

    Code:
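    function get_cookies($cookies = null) {

        $string_cookies = null;
        $pairs = array();

        // Match each Set-Cookie header, capturing only name=value
        // (attributes such as path or expires must not be sent back).
        if (preg_match_all('/^Set-Cookie:\s*([^;\r\n]+)/mi', (string) $cookies, $matches)) {

            foreach ($matches[1] as $cookie) {
                $pairs[] = trim($cookie);
            }

            // Combine everything into a single Cookie request header.
            $string_cookies = 'Cookie: ' . implode('; ', $pairs) . "\r\n";
        }

        return $string_cookies;
    }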
    
    I don't currently have time to code this for you, but here's what you should do: grab all cookies from the response headers and feed them back in via the function provided. A failed login on a form-based site almost always comes down to cookies and cookie values. When you first connect to Google you get a cookie, and upon successful authentication most likely either that cookie's value changes or you get a new cookie carrying the auth data.
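    For illustration, here is one way the grab-and-resend approach could look when the cookies are passed through CURLOPT_HTTPHEADER. cookie_header_from_response() is a hypothetical name, the sample headers are invented, and the sketch ignores cookie domain, path and expiry, which a full cookie parser would have to honour.

```php
<?php
// Hypothetical helper: turn the Set-Cookie lines of a raw response header
// block into one Cookie request header. A later cookie with the same name
// overwrites an earlier one; domain, path and expiry are ignored.
function cookie_header_from_response(string $rawHeaders): ?string
{
    $cookies = [];
    $pattern = '/^Set-Cookie:\s*([^=;\s]+)=([^;\r\n]*)/mi';
    if (preg_match_all($pattern, $rawHeaders, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $match) {
            $cookies[$match[1]] = $match[2];
        }
    }
    if (!$cookies) {
        return null; // the response set no cookies
    }
    $pairs = [];
    foreach ($cookies as $name => $value) {
        $pairs[] = $name . '=' . $value;
    }
    return 'Cookie: ' . implode('; ', $pairs);
}

// Invented sample headers.
$raw = "HTTP/1.1 200 OK\r\nSet-Cookie: SID=abc; Path=/; HttpOnly\r\nSet-Cookie: LANG=de\r\n\r\n";
$header = cookie_header_from_response($raw);
echo $header, "\n";

// The result can then be handed to curl for the next request:
// curl_setopt($ch, CURLOPT_HTTPHEADER, [$header]);
```

    Note that CURLOPT_HTTPHEADER takes an array of header lines without a trailing line break, which is why this variant returns the bare header.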
     
  5. tripper_john_md

    tripper_john_md Newbie

    Hi, curl seems to have problems with cookies. I ran into something similar while helping someone here a few months ago. Just save the cookie to a file and use it like this:
    PHP:
    curl_setopt_array($c, array(
        CURLOPT_COOKIE => file_get_contents('cookie.txt'), // raw "name=value; name2=value2" string
    ));
    Keep an eye on the speed of your requests: Google might track your requests per second and come up with some are-you-human fun...
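    One caveat about the CURLOPT_COOKIE snippet: that option expects a string of the form name=value; name2=value2, whereas a file written by CURLOPT_COOKIEJAR uses the tab-separated Netscape format. If cookie.txt was produced by the cookie jar, it would need converting first. A rough sketch; cookie_string_from_jar() is a hypothetical helper that ignores domain, path and expiry, and the sample jar is invented.

```php
<?php
// Hypothetical helper: convert the contents of a Netscape-format cookie jar
// (as written by CURLOPT_COOKIEJAR) into a "name=value; ..." string suitable
// for CURLOPT_COOKIE. Domain, path and expiry are ignored in this sketch.
function cookie_string_from_jar(string $jarContents): string
{
    $pairs = [];
    foreach (preg_split('/\r?\n/', $jarContents) as $line) {
        // curl marks HttpOnly cookies with a "#HttpOnly_" prefix;
        // any other line starting with "#" is a comment.
        if (strpos($line, '#HttpOnly_') === 0) {
            $line = substr($line, strlen('#HttpOnly_'));
        } elseif ($line === '' || $line[0] === '#') {
            continue;
        }
        // Fields: domain, subdomain flag, path, secure flag, expiry, name, value.
        $fields = explode("\t", $line);
        if (count($fields) === 7) {
            $pairs[] = $fields[5] . '=' . $fields[6];
        }
    }
    return implode('; ', $pairs);
}

// Invented sample jar contents.
$jar = "# Netscape HTTP Cookie File\n"
     . "example.com\tFALSE\t/\tFALSE\t0\tSID\tabc\n"
     . "#HttpOnly_example.com\tFALSE\t/\tTRUE\t0\tLANG\tde\n";
echo cookie_string_from_jar($jar), "\n";
```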