
Google Php Curl

Discussion in 'PHP & Perl' started by howard83, Jul 14, 2011.

  1. howard83

    howard83 Junior Member

    Joined:
    Oct 21, 2007
    Messages:
    114
    Likes Received:
    15
    Hello,

    I'm new to PHP cURL scripting and have been practicing for a while. My script was working fine scraping data from Google using cURL, but lately it just stops working, and when I check my log file it says:

    How can I get around this Google protection when using cURL?

    Thanks in advance.
     
  2. meannn

    meannn Supreme Member

    Joined:
    Apr 22, 2009
    Messages:
    1,461
    Likes Received:
    1,896
    Occupation:
    Unemployed Winner
    Location:
    TR
    Add a code snippet that uses cURL to scrape proxies from here: samair.ru/proxy/fresh-proxy-list.htm
     
  3. chatmasta

    chatmasta Junior Member

    Joined:
    Sep 1, 2007
    Messages:
    122
    Likes Received:
    38
    Be conscious of rate limiting and use proxies. Also try switching between different Google datacenter IPs (google for lists).
     
  4. AZer0

    AZer0 Newbie

    Joined:
    May 6, 2011
    Messages:
    26
    Likes Received:
    3
    I have been building my rank-tracking software in PHP, and here are a few problems I had to overcome:
    • Google limiting results (without proxies I recommend no more than 2 searches every 10 minutes).
    • Google captchas: when using proxies or searching a lot I tend to get a few, so make sure you check for one before scraping, and trash the proxy if found.
    • As said above, switch Google servers for every search.
    The easiest solution I came up with is to buy 10 private proxies (depending on your needs) and randomly select a proxy to rotate on every search.
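    A minimal sketch of that rotate-and-check approach. The proxy list and the captcha markers are illustrative assumptions, not exact values; Google's block page has historically lived under a "/sorry/" path and mentioned "unusual traffic", but verify against what you actually receive.

```php
<?php
// Check whether a Google response looks like the captcha/"sorry" page.
// These markers are assumptions based on Google's usual block page.
function is_captcha_page($html) {
    return strpos($html, 'unusual traffic') !== false
        || strpos($html, '/sorry/') !== false;
}

// Fetch one SERP through a randomly chosen proxy from the pool.
function fetch_via_random_proxy($url, $proxies) {
    $proxy = $proxies[array_rand($proxies)];
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_PROXY, $proxy);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $html = curl_exec($ch);
    curl_close($ch);
    // Caller should trash $proxy if $html is false or is_captcha_page($html)
    return array($proxy, $html);
}

// The captcha check on its own:
var_dump(is_captcha_page('Our systems have detected unusual traffic')); // bool(true)
?>
```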
     
  5. Baybo.it

    Baybo.it Registered Member

    Joined:
    Aug 9, 2011
    Messages:
    72
    Likes Received:
    39
    Occupation:
    Founder of Baybo.it
    Location:
    San Francisco
    Home Page:
    The same-origin policy comes up for many reasons, and it's a really obnoxious problem. It can happen when you try to execute JS in an iframe that is not on the same domain as your own. It can also happen if you try to access someone else's page or make an XML-RPC or XHR request to another domain.

    Sometimes a solution is to use JSONP (JSON with padding), which enables you to make XHRs to a different subdomain. That does not seem to be your problem, though.

    If you're scraping Google, they may have banned you for not using their Search API. I'd suggest using their Search API, or Bing's, since Google limits you to 100 searches a day and Bing has no limit.
     
  6. accelerator_dd

    accelerator_dd Jr. VIP Jr. VIP Premium Member

    Joined:
    May 14, 2010
    Messages:
    2,441
    Likes Received:
    1,005
    Occupation:
    SEO
    Location:
    IM Wonderland
    For me, all it took was playing with the referrer in cURL as well as the user agent. If you take care of those two, you should be good.
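    For reference, a sketch of setting those two options; the header values below are just examples of browser-like settings, not anything special.

```php
<?php
// Browser-like user agent and a Google referrer, kept in one options array
// so every request gets them. The values are illustrative.
$opts = array(
    CURLOPT_USERAGENT => 'Mozilla/5.0 (Windows NT 6.1; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3',
    CURLOPT_REFERER => 'http://www.google.com/',
    CURLOPT_RETURNTRANSFER => 1,
);
$ch = curl_init('http://www.google.com/search?hl=en&q=my+search+string');
curl_setopt_array($ch, $opts);
// $html = curl_exec($ch); // uncomment to actually fetch
curl_close($ch);
?>
```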
     
  7. maple_toast

    maple_toast Newbie

    Joined:
    Jul 23, 2008
    Messages:
    30
    Likes Received:
    50
    How would you switch Google datacenters if you're calling Google SERPs through cURL?
     
  8. xenon2010

    xenon2010 Regular Member

    Joined:
    Apr 27, 2010
    Messages:
    231
    Likes Received:
    48
    Occupation:
    web and desktop apps programmer
    Location:
    prison
    Home Page:
    use this user agent with your cURL:
    PHP:
    Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.20) Gecko/20110803 Firefox/3.6.20 ( .NET CLR 3.5.30729; .NET4.0E)
     
  9. relaxin

    relaxin Junior Member

    Joined:
    Aug 13, 2007
    Messages:
    100
    Likes Received:
    25
    Occupation:
    CEO
    If you use hundreds of proxies with different user agents you'll be OK.
     
  10. howard83

    howard83 Junior Member

    Joined:
    Oct 21, 2007
    Messages:
    114
    Likes Received:
    15
    Hello guys, thank you for answering my questions.

    Actually, I have noticed that my script works fine if I don't use proxies, but if I use proxies the script stops and the page shows:

    "Your browser's cookie functionality is turned off. Please turn it on"

    Any solutions to this? I need to use proxies so that my IP won't get blocked by Google.

    Thank you very much.
     
  11. gimme4free

    gimme4free Executive VIP Jr. VIP Premium Member

    Joined:
    Oct 22, 2008
    Messages:
    1,881
    Likes Received:
    1,932
    You can use cookies with CURL:
    PHP:
    curl_setopt($ch, CURLOPT_COOKIEJAR, '/cookies.txt');
    curl_setopt($ch, CURLOPT_COOKIEFILE, '/cookies.txt');
     
  12. howard83

    howard83 Junior Member

    Joined:
    Oct 21, 2007
    Messages:
    114
    Likes Received:
    15
    It writes cookies without a proxy, but if I use a proxy it cannot write cookies... that's the problem.
     
  13. gimme4free

    gimme4free Executive VIP Jr. VIP Premium Member

    Joined:
    Oct 22, 2008
    Messages:
    1,881
    Likes Received:
    1,932
    Using a proxy shouldn't prevent the cookies from being saved; I use cookies daily with PHP/cURL and proxies.

    Sample function:
    PHP:
    <?php
    // CURL function defaults
    $curl_defaults = array(
        CURLOPT_HEADER => 0,
        CURLOPT_FOLLOWLOCATION => 1,
        CURLOPT_AUTOREFERER => 1,
        CURLOPT_RETURNTRANSFER => 1,
        CURLOPT_CONNECTTIMEOUT => 5,
        CURLOPT_TIMEOUT => 10,
        CURLOPT_VERBOSE => 0,
        CURLOPT_SSL_VERIFYHOST => 0,
        CURLOPT_SSL_VERIFYPEER => 0
    );

    function Return_Content_From_URL($url,$accountid,$proxy,$port,$loginpassw,$proxytype,$referrer){
        global $curl_defaults,$silent,$user_agents;
        $ch = curl_init();
        curl_setopt_array($ch, $curl_defaults);
        $agent_num = substr($accountid,-1,1); // originally used to pick from $user_agents
        curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 6.1; pl; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3");
        curl_setopt($ch, CURLOPT_URL, $url);
        if($referrer != 0){ curl_setopt($ch, CURLOPT_REFERER, $referrer); }
        curl_setopt($ch, CURLOPT_PROXYPORT, $port);
        if($proxytype == "CURLPROXY_SOCKS5"){
            curl_setopt($ch, CURLOPT_PROXYTYPE, CURLPROXY_SOCKS5);
        }else{
            curl_setopt($ch, CURLOPT_PROXYTYPE, CURLPROXY_HTTP);
        }
        curl_setopt($ch, CURLOPT_PROXY, $proxy);
        if($loginpassw != "0:0"){
            curl_setopt($ch, CURLOPT_PROXYUSERPWD, $loginpassw);
        }
        // One cookie file per account ID
        curl_setopt($ch, CURLOPT_COOKIEJAR, str_replace('\\','/',dirname(__FILE__)).'/cookies/'.$accountid.'.txt');
        curl_setopt($ch, CURLOPT_COOKIEFILE, str_replace('\\','/',dirname(__FILE__)).'/cookies/'.$accountid.'.txt');
        $html = curl_exec($ch);
        $err = curl_errno($ch);
        curl_close($ch);
        if($err != 0){
            if($silent == 0){ echo "<b>Error Connecting To Proxy With Account ID: $accountid & Proxy: $proxy. CURL Error: $err</b><br />"; }
            return false;
        }
        return $html;
    }

    $url = "http://www.blackhatworld.com/";
    $accountid = 1;
    $proxy = "127.0.0.1";
    $port = "8080";
    $loginpassw = "0:0";
    $proxytype = "HTTP";
    $referrer = "http://www.google.com/";
    $content = Return_Content_From_URL($url,$accountid,$proxy,$port,$loginpassw,$proxytype,$referrer);
    ?>
     
  14. howard83

    howard83 Junior Member

    Joined:
    Oct 21, 2007
    Messages:
    114
    Likes Received:
    15
    Yeah, that's what I thought. I've been using cURL/PHP for a while too, and my other script also uses proxies and works fine.

    BTW, my script is a Gmail contact/email grabber... do you think this is because Gmail prevents logging in through proxies?

    Thank you very much for your sample code. I'll grab it for reference.

     
  15. gimme4free

    gimme4free Executive VIP Jr. VIP Premium Member

    Joined:
    Oct 22, 2008
    Messages:
    1,881
    Likes Received:
    1,932
    I don't log in to GMail with my scripts, although I do log in to Google in a couple of them with no issues. Just make sure to follow ALL of the redirects:
    Code:
    https://accounts.google.com/Login?hl=en
    https://accounts.google.com/ServiceLoginAuth
    Now scrape the location.replace URL & follow
    https://accounts.google.com/CheckCookie?hl=en&chtml=LoginDoneHtml&pstMsg=1
    https://accounts.google.co.uk/
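    Since CURLOPT_FOLLOWLOCATION only follows HTTP redirects, the location.replace step has to be scraped out of the returned HTML by hand. A sketch of that (the regex is an assumption; real pages may escape characters inside the URL, so adjust it to the markup Google actually returns):

```php
<?php
// Pull the URL out of a JavaScript location.replace("...") call.
function extract_location_replace($html) {
    if (preg_match('/location\.replace\(["\']([^"\']+)["\']\)/', $html, $m)) {
        return $m[1];
    }
    return false;
}

// Example:
$html = '<script>location.replace("https://accounts.google.com/CheckCookie?hl=en")</script>';
echo extract_location_replace($html); // https://accounts.google.com/CheckCookie?hl=en
?>
```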
    
     
  16. zbigbz

    zbigbz Newbie

    Joined:
    Apr 30, 2011
    Messages:
    38
    Likes Received:
    16
    I'm still new at cURL and proxies: just when I get one thing working and try to ramp up, I run into something else. It looks like cookies and querying Google too frequently may be what I'm running into.

    gimme4free's code above looks very interesting, but I have a question about the $accountid parameter: if I'm using multiple proxies, should each proxy have its own accountid? It sounds like overkill, but if I use the same accountid's cookies with a different proxy, won't that allow Google to link the proxies and flag my use?

    The reason I ask is that I am using a basic search string with a variety of keywords (e.g. hxxp:// www . googledotcom / search?hl=en&q=my+search+string) and multiple proxies.

    I get a few queries to go through, but then I get 301 and 302 Moved errors with a link that takes me to a captcha page that says, in part: "Our systems have detected unusual traffic from your computer network. This page checks to see if it's really you sending the requests and not a robot". The most frustrating thing is that the page keeps reporting my true IP and not the proxy's.

    I assume this is because I am running my queries too fast and/or the proxies aren't being applied, but how does Google see my actual IP?

    There could be something else I'm screwing up, and I'd welcome feedback before I implement 1, 2, or 100 $accountid's and get even more confused.
     
  17. gimme4free

    gimme4free Executive VIP Jr. VIP Premium Member

    Joined:
    Oct 22, 2008
    Messages:
    1,881
    Likes Received:
    1,932
    You don't have to use the accountid parameter; you could empty the cookies on each proxy change, or even set the accountid to the proxy, so that each proxy uses its own cookie file stored ready for its next use. That's just a function I took from another script of mine that logs into accounts.
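    Setting the accountid to the proxy could look something like this; the cookies/ directory and the naming scheme are assumptions, not part of the original function.

```php
<?php
// Derive one cookie file per proxy so each proxy keeps its own session.
function cookie_file_for_proxy($proxy) {
    // "1.2.3.4:8080" -> "cookies/1.2.3.4_8080.txt"
    return 'cookies/'.str_replace(':', '_', $proxy).'.txt';
}

// Usage with the sample function's cookie options:
// curl_setopt($ch, CURLOPT_COOKIEJAR, cookie_file_for_proxy($proxy));
// curl_setopt($ch, CURLOPT_COOKIEFILE, cookie_file_for_proxy($proxy));

echo cookie_file_for_proxy('1.2.3.4:8080'); // cookies/1.2.3.4_8080.txt
?>
```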
     
  18. howard83

    howard83 Junior Member

    Joined:
    Oct 21, 2007
    Messages:
    114
    Likes Received:
    15
    Thank you very much, but back to the main concern: my script works fine without a proxy :) ...

    I have noticed that when I'm using proxies, my cookies get "httponly" on them... I'm still searching for an answer about it; none found so far.
     
  19. xpwizard

    xpwizard Junior Member

    Joined:
    Nov 6, 2010
    Messages:
    198
    Likes Received:
    122
    Code:
    http://blog.php-security.org/archives/40-httpOnly-Cookies-in-Firefox-2.0.html
    The httponly flag is nothing to worry about; cURL handles those cookies fine.
     
  20. zbigbz

    zbigbz Newbie

    Joined:
    Apr 30, 2011
    Messages:
    38
    Likes Received:
    16
    Thanks again gimme4free - I've been experimenting with accountid = rand(0,10); that, plus making sure my proxies are truly elite (not just anonymous or "claimed" to be elite), is getting good results for me at last. I may switch to using the proxy IP as the accountid and randomly blanking the cookie jar, but that may be overkill.