Scraping Sites - Curl Method

Discussion in 'PHP & Perl' started by Lyscer, Jun 20, 2013.

  1. Lyscer

    Lyscer Junior Member

    Joined:
    Jun 29, 2012
    Messages:
    109
    Likes Received:
    46
    Occupation:
    Software Engineer
    I have been doing a lot of site scraping lately and have tweaked my generic "Curl" function like mad so that it fits many situations. I wanted to share it here and give back a little; hopefully it helps some of you out. Please feel free to ask any questions. I have used this method to do the following:

    - Log in to sites using basic HTTP authentication
    - Log in to sites using sessions and CSRF tokens
    - Log in to a site, switch pages and download a CSV file, which I then parsed
    - Many other things

    So, as you can see, the function can be applied to many situations; it just depends on how you want to use it.

    Code:
    /**
         * 
         * @param string $url the address of the page to fetch
         * @param array $params -> array('cookie_file'=>'path_to_cookie_file',
         *                               'start_new_cookie'=>false,
         *                               'cookie_jar'=>'cookie_file_name',
         *                               'user_agent'=>'the_user_agent',
         *                               'post_params'=>'var=1&var2=2',
         *                               'follow_location'=>false,
         *                               'http_referer'=>'specify a page that referred you',
         *                               'header'=>'header_array',
         *                               'debug'=>false,
         *                               'redirect_call_back'=>'function_to_call');
         * @return array('html'=>'the html of the page that was retrieved',
         *               'info'=>'the header information of the page that was retrieved');
         */
        private function getWebPage($url, $params=array()){
            $return = array();
    
    
            $ch = curl_init($url);
            
            if(isset($params['cookie_file']) && $params['cookie_file'] != '')
            {
                // Forces a new Session
                if(isset($params['start_new_cookie']) && $params['start_new_cookie'])
                    curl_setopt($ch, CURLOPT_COOKIESESSION, 1);
                
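                // CURLOPT_COOKIEFILE is the file cookies are read from; CURLOPT_COOKIEJAR is where they are written when the handle closes
                curl_setopt($ch, CURLOPT_COOKIEFILE, $params['cookie_file']);
                curl_setopt($ch, CURLOPT_COOKIEJAR, isset($params['cookie_jar']) ? $params['cookie_jar'] : $params['cookie_file']);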
            }
            if(isset($params['user_agent']))
                curl_setopt($ch, CURLOPT_USERAGENT, $params['user_agent']);
            if(isset($params['post_params']) && $params['post_params'] != '')
            {
                curl_setopt($ch, CURLOPT_POST, 1);
                curl_setopt($ch, CURLOPT_POSTFIELDS, $params['post_params']);
            }
            
            if(isset($params['header']))
            {
                curl_setopt($ch, CURLOPT_HTTPHEADER, $params['header']);
            }
            
            if(isset($params['debug']) && $params['debug'] == true)
            {
                curl_setopt($ch, CURLINFO_HEADER_OUT, true);
                curl_setopt($ch, CURLOPT_HEADER, 1);
                curl_setopt($ch, CURLOPT_VERBOSE, 1);
            }
            if(isset($params['http_referer']))
                curl_setopt($ch, CURLOPT_REFERER, $params['http_referer']);
            else
                curl_setopt($ch, CURLOPT_AUTOREFERER, 1);
            
            curl_setopt($ch, CURLOPT_TIMEOUT, 600);
            curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);  
            
            if(isset($params['follow_location']))
                curl_setopt($ch, CURLOPT_FOLLOWLOCATION, $params['follow_location']);
            else 
                curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 0);
            
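            // certificate checks are switched off so sites with self-signed or broken certificates still load;
            // this also removes protection against man-in-the-middle attacks, so turn them back on for anything sensitive
            curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
            curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);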
            
            $html = curl_exec($ch);
            $info = curl_getinfo($ch);
            $http_code = $info['http_code'];
            
            $return['html'] = $html;
            $return['info'] = $info;
            
            if(isset($params['debug']) && $params['debug'] == true)
            {
                echo "http_code: $http_code<br />";
                echo 'Debug: <br /><pre>'.print_r($return['info'], true).'</pre><br />';
                echo 'output: <br /><pre>'.htmlentities($return['html']).'</pre><br />';
                echo '----------------------------------------<br />';
            }
            
            if(isset($info['redirect_url']) && strlen($info['redirect_url']) > 4 ) {
                if(isset($params['redirect_call_back'])){ // callback
                    $redirectcallback = $params['redirect_call_back'];
                    $this->$redirectcallback($info['redirect_url'], $url);
                }
                
                $return = $this->getWebPage($info['redirect_url'], $params);
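            } else if($html !== false && strpos($html, 'CONTENT="0;') !== false)
            {
                // look for a meta refresh, if one exists, then redirect to it 
                $meta = substr($html, strpos($html, 'CONTENT="0;')+11);
                $redirect_url = substr($meta, 0, strpos($meta, '"'));
                // drop the optional "URL=" prefix that usually precedes the address
                $redirect_url = preg_replace('/^\s*url=/i', '', $redirect_url);
    
                if(isset($params['redirect_call_back'])){ // callback
                    $redirectcallback = $params['redirect_call_back'];
                    $this->$redirectcallback($redirect_url, $url);
                }
                
                if(isset($params['debug']) && $params['debug'] == true)
                    echo "Meta Refresh found!<br />";
                // follow the address taken from the meta tag ($info['redirect_url'] is empty in this branch)
                $return = $this->getWebPage($redirect_url, $params);
            }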
            
            return $return;
        }
    
    You would simply put that method in a class in your PHP project, and then you can call it as simply as:

    Code:
    $results = $this->getWebPage('http://www.google.com');
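
    The 'html' entry of the returned array is simply the raw response body, so a downloaded CSV file (as in the list at the top) can be parsed straight from it. A rough sketch, assuming a comma-separated file with a header row; parseCsvBody and the sample data are made up for illustration and are not part of the function above:

```php
<?php
// Hypothetical helper: turn a CSV response body into an array of rows keyed by the header names.
function parseCsvBody($csv)
{
    $rows = array();

    // Write the body to a temporary stream so fgetcsv can handle quoted fields properly.
    $fh = fopen('php://temp', 'r+');
    fwrite($fh, $csv);
    rewind($fh);

    $header = fgetcsv($fh, 0, ',', '"', '\\');
    if ($header === false || $header === array(null)) {
        fclose($fh);
        return $rows; // empty body or blank first line: nothing to parse
    }

    while (($line = fgetcsv($fh, 0, ',', '"', '\\')) !== false) {
        // Skip blank lines and rows whose column count does not match the header.
        if ($line === array(null) || count($line) !== count($header)) {
            continue;
        }
        $rows[] = array_combine($header, $line);
    }

    fclose($fh);
    return $rows;
}

// Sample data standing in for $results['html'] after a CSV download.
$body = "date,clicks\n2013-06-19,42\n2013-06-20,57\n";
$rows = parseCsvBody($body);

foreach ($rows as $row) {
    echo $row['date'], ': ', $row['clicks'], "\n";
}
```

    Every value comes back as a string, so cast the numeric columns before doing any maths on them.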
    
    If you are looking for something a little more complex and would like to log in to a site, pretend to be Firefox on a Mac, see debugging information and save the cookies:
    Code:
    $params = array('debug'=>true,
                               'post_params'=>'username=username&password=password',
                               'cookie_jar'=>'path/to/cookie/cookie.txt',
                               'cookie_file'=>'path/to/cookie/cookie.txt',
                               'user_agent'=>'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:17.0) Gecko/20100101 Firefox/17.0');
    $results = $this->getWebPage('http://www.domain.com/login', $params);
    
    ** Note: this is just a Curl helper method that I have been tweaking so that I can log in to whichever sites I point it at. I know that when I log in to site 'X' I always get content 'Y', so I use a tool, TamperHeaders in Firefox, to see which fields are submitted with the form, and then pass those as the post params. I have other functions that perform the processing; I just wanted to share this because it was a PITA learning it and figuring out how everything fits together.
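
    A rough sketch of that workflow for a form that also carries a CSRF token: pull the hidden fields out of the login page HTML, add the credentials, and build the post_params string. The helper name extractHiddenFields, the field name csrf_token and the sample form are assumptions for illustration; real sites name things differently.

```php
<?php
// Hypothetical helper: collect the name/value pairs of all hidden <input> fields in a page.
function extractHiddenFields($html)
{
    $fields = array();

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so keep libxml warnings out of the output.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('input') as $input) {
        $name = $input->getAttribute('name');
        if ($name !== '' && strtolower($input->getAttribute('type')) === 'hidden') {
            $fields[$name] = $input->getAttribute('value');
        }
    }

    return $fields;
}

// Sample login form standing in for the 'html' returned when fetching the login page.
$loginPage = '<form method="post" action="/login">'
           . '<input type="hidden" name="csrf_token" value="abc123">'
           . '<input type="text" name="username">'
           . '<input type="password" name="password">'
           . '</form>';

$fields = extractHiddenFields($loginPage);
$fields['username'] = 'my user';
$fields['password'] = 'p&ss';

// http_build_query URL-encodes each value, so special characters survive the POST.
$postParams = http_build_query($fields);
echo $postParams, "\n";
```

    The resulting string can be passed as 'post_params'. Use the same cookie file for the request that fetched the form and the request that submits it, so the session matches the token.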
     
  2. innozemec

    innozemec Jr. VIP Jr. VIP

    Joined:
    Aug 19, 2011
    Messages:
    5,290
    Likes Received:
    1,799
    Location:
    www.Indexification.com
  3. Conor

    Conor Jr. VIP Jr. VIP

    Joined:
    Nov 7, 2012
    Messages:
    3,373
    Likes Received:
    5,437
    Gender:
    Male
    Location:
    South Africa
  4. fopen

    fopen Newbie

    Joined:
    Aug 2, 2010
    Messages:
    4
    Likes Received:
    0
    Thanks for this great post, man! Do you know if it has problems working with the new PHP 5.4.0 or 5.5.0? I will run some tests and let you know. Since register_globals was removed I had to redo my scripts entirely.
     
  5. davids355

    davids355 Jr. VIP Jr. VIP Premium Member

    Joined:
    Apr 25, 2011
    Messages:
    8,802
    Likes Received:
    6,371
    very nice script mate. Was actually looking for something like this a while back.
     
  6. Gogol

    Gogol Elite Member

    Joined:
    Sep 10, 2010
    Messages:
    3,066
    Likes Received:
    2,872
    Gender:
    Male
    Good piece of code. Keep up the good work, dude :)
     
  7. rrocha80

    rrocha80 Junior Member

    Joined:
    Apr 19, 2013
    Messages:
    132
    Likes Received:
    24
    Great contribution, thank you!
     
  8. Lyscer

    I am currently running it on a machine with PHP 5.4.4 and have zero issues. I looked at the PHP changelog for 5.5 and it doesn't look like any code changes would be needed for it to work with 5.5. Glad you guys like it; let me know if you have any questions.
     
  9. Lyscer

    I thought that I should add a little bit to this just to clarify.

    Sometimes curl will tempt you to log in to a site and traverse the DOM to pull the values you need out of an HTML page. However, always check Google or the company's website first to make sure that they don't have an API readily available. If they have an API, the above code will let you make their API calls; you just have to call it correctly. If an API doesn't exist, then and only then should you put the time into traversing the DOM. This is because all it takes is a small HTML update and your traversals are screwed for the most part. Good luck and keep the questions coming.
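
    When an API does exist, the response is usually JSON rather than HTML, and reading it does not depend on the page layout. A minimal sketch, assuming a JSON body; the sample payload and its field names are invented, and $body stands in for the 'html' entry returned by getWebPage:

```php
<?php
// Invented payload standing in for $results['html'] after calling an API endpoint.
$body = '{"items":[{"id":1,"name":"first"},{"id":2,"name":"second"}]}';

// Decode into associative arrays; anything that is not a JSON object or array is treated as an error here.
$data = json_decode($body, true);
if (!is_array($data)) {
    throw new RuntimeException('Could not decode response: ' . json_last_error_msg());
}

// Values are addressed by key, so a redesign of the site's HTML does not affect this code.
$names = array();
foreach ($data['items'] as $item) {
    $names[] = $item['name'];
}

echo implode(', ', $names), "\n";
```

    Many APIs expect an Accept or Authorization header, which can be supplied through the 'header' param as an array of header lines.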
     
  10. nopme88

    nopme88 Registered Member

    Joined:
    Jul 30, 2013
    Messages:
    50
    Likes Received:
    7
    Occupation:
    Freelancer
    Keep it up mate! Thank you
     
  11. methylenebl

    methylenebl Newbie

    Joined:
    Aug 6, 2013
    Messages:
    22
    Likes Received:
    5
    Amazing work!
     
  12. arogers

    arogers Newbie

    Joined:
    Sep 15, 2013
    Messages:
    13
    Likes Received:
    0
    Very cool piece of code, just what I was looking for
     
  13. davids355

    Ah, forgot about this, pretty cool script.