Scraping Sites - Curl Method

 

  1. #1
    Lyscer's Avatar
    Lyscer is offline Jr. VIP
    Join Date
    Jun 2012
    Posts
    108
    Thanks
    9
    Thanked 40 Times in 23 Posts

    Default Scraping Sites - Curl Method

    I have been doing a large amount of site scraping lately and have tweaked my generic "Curl" function like mad so that it fits many situations. I wanted to share it on here and give back a little. Hopefully it helps some of you out. Please feel free to ask any questions. I have used this method to do the following:

    - Log in to sites using basic HTTP authentication
    - Log in to sites using sessions and CSRF tokens
    - Log in to a site, switch pages and download a CSV file, which I then parsed
    - Many other things

    So, as you can see, the function can be applied to many situations; it all depends on how you want to use it.

    Code:
    /**
         * 
         * @param string $url
         * @param array $params -> array('cookie_file'=>'path of the cookie file to read from',
         *                               'start_new_cookie'=>false,
         *                               'cookie_jar'=>'path of the cookie file to write to',
         *                               'user_agent'=>'the_user_agent',
         *                               'post_params'=>'var=1&var2=2',
         *                               'follow_location'=>false,
         *                               'http_referer'=>'specify a page that referred you',
         *                               'header'=>'header_array',
         *                               'debug'=>false,
         *                               'redirect_call_back'=>'name of a method on this class to call on redirect');
         * @return array('html'=>'the html of the page that was retrieved',
         *               'info'=>'the header information of the page that was retrieved');
         */
        private function getWebPage($url, $params=array()){
            $return = array();
    
    
            $ch = curl_init($url);
            
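            if(isset($params['cookie_file']) && $params['cookie_file'] != '')
            {
                // Forces a new Session
                if(isset($params['start_new_cookie']) && $params['start_new_cookie'])
                    curl_setopt($ch, CURLOPT_COOKIESESSION, 1);
                
                // COOKIEFILE is read before the request; COOKIEJAR is written when the handle is closed
                $cookie_jar = isset($params['cookie_jar']) ? $params['cookie_jar'] : $params['cookie_file'];
                curl_setopt($ch, CURLOPT_COOKIEFILE, $params['cookie_file']);
                curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie_jar);
            }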
            if(isset($params['user_agent']))
                curl_setopt($ch, CURLOPT_USERAGENT, $params['user_agent']);
            if(isset($params['post_params']) && $params['post_params'] != '')
            {
                curl_setopt($ch, CURLOPT_POST, 1);
                curl_setopt($ch, CURLOPT_POSTFIELDS, $params['post_params']);
            }
            
            if(isset($params['header']))
            {
                curl_setopt($ch, CURLOPT_HTTPHEADER, $params['header']);
            }
            
            if(isset($params['debug']) && $params['debug'] == true)
            {
                curl_setopt($ch, CURLINFO_HEADER_OUT, true);
                curl_setopt($ch, CURLOPT_HEADER, 1);
                curl_setopt($ch, CURLOPT_VERBOSE, 1);
            }
            if(isset($params['http_referer']))
                curl_setopt($ch, CURLOPT_REFERER, $params['http_referer']);
            else
                curl_setopt($ch, CURLOPT_AUTOREFERER, 1);
            
            curl_setopt($ch, CURLOPT_TIMEOUT, 600);
            curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);  
            
            if(isset($params['follow_location']))
                curl_setopt($ch, CURLOPT_FOLLOWLOCATION, $params['follow_location']);
            else 
                curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 0);
            
            // Certificate and host name checks are switched off so that self-signed sites still load
            curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
            curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
            
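            $html = curl_exec($ch);
            $info = curl_getinfo($ch);
            $http_code = $info['http_code'];
            
            // Release the handle so the cookie jar is written before any redirect below reads it
            curl_close($ch);
            unset($ch);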
            
            $return['html'] = $html;
            $return['info'] = $info;
            
            if(isset($params['debug']) && $params['debug'] == true)
            {
                echo "http_code: $http_code<br />";
                echo 'Debug: <br /><pre>'.print_r($return['info'], true).'</pre><br />';
                echo 'output: <br /><pre>'.htmlentities($return['html']).'</pre><br />';
                echo '----------------------------------------<br />';
            }
            
            if(isset($info['redirect_url']) && strlen($info['redirect_url']) > 4 ) {
                if(isset($params['redirect_call_back'])){ // callback
                    $redirectcallback = $params['redirect_call_back'];
                    $this->$redirectcallback($info['redirect_url'], $url);
                }
                
                $return = $this->getWebPage($info['redirect_url'], $params);
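            } else if(is_string($html) && ($pos = stripos($html, 'CONTENT="0;')) !== false)
            {
                // look for a meta refresh, if one exists, then redirect to it 
                $meta = substr($html, $pos + 11);
                $redirect_url = substr($meta, 0, strpos($meta, '"'));
                // drop the "URL=" prefix that precedes the address in a meta refresh
                $redirect_url = trim(preg_replace('/^\s*url\s*=\s*/i', '', $redirect_url));
    
                if(isset($params['redirect_call_back'])){ // callback
                    $redirectcallback = $params['redirect_call_back'];
                    $this->$redirectcallback($redirect_url, $url);
                }
                
                echo "Meta Refresh found!<br />";
                $return = $this->getWebPage($redirect_url, $params);
            }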
            
            return $return;
        }
    You would simply put that method in a class in your PHP project, and then calling it is as simple as:

    Code:
    $results = $this->getWebPage('http://www.google.com');
    If you are looking for something a little more complex and would like to log in to a site, pretend to be Firefox on a Mac, see debugging information and save the cookies:
    Code:
    $params = array('debug'=>true,
                               'post_params'=>'username=username&password=password',
                               'cookie_jar'=>'path/to/cookie/cookie.txt',
                               'cookie_file'=>'path/to/cookie/cookie.txt',
                               'user_agent'=>'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:17.0) Gecko/20100101 Firefox/17.0');
    $results = $this->getWebPage('http://www.domain.com/login', $params);
    ** Note: this is just a cURL helper method that I have been tweaking so that I can log in to the sites I point it at. I know that when I log in to site 'X' I always get content 'Y', so I use a tool, TamperHeaders in Firefox, to see which fields are submitted with the form and then pass those as the post params. I have other functions that do the processing; I just wanted to share this because it was a PITA learning it and figuring out how everything fits together.
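
    For the session/CSRF case, the hidden form fields usually have to be read from the login page first and sent back along with the credentials. A rough sketch of that step is below. The helper name `extractHiddenFields`, the field names and the sample markup are all made up for illustration; a real site will use its own names, and the resulting string is simply what would go in `post_params`.

```php
<?php
// Sketch only: extractHiddenFields() and every field name here are hypothetical.
function extractHiddenFields($html)
{
    $fields = array();
    if (!is_string($html) || trim($html) === '') {
        return $fields;
    }

    $doc = new DOMDocument();
    // Real-world markup is rarely valid, so keep libxml's warnings quiet.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Collect every named <input type="hidden"> (this is where CSRF tokens usually live).
    foreach ($doc->getElementsByTagName('input') as $input) {
        $name = $input->getAttribute('name');
        if (strtolower($input->getAttribute('type')) === 'hidden' && $name !== '') {
            $fields[$name] = $input->getAttribute('value');
        }
    }
    return $fields;
}

// Stand-in for the html returned by a first request to the login page.
$loginPage = '<form method="post" action="/login">'
           . '<input type="hidden" name="csrf_token" value="abc123">'
           . '<input type="text" name="username">'
           . '<input type="password" name="password">'
           . '</form>';

$fields = extractHiddenFields($loginPage);
$fields['username'] = 'my user';
$fields['password'] = 'p&ss=word';

$params = array(
    // http_build_query() URL-encodes the values, so "&" and "=" in a password are safe.
    'post_params' => http_build_query($fields),
    'cookie_file' => 'path/to/cookie/cookie.txt',
    'cookie_jar'  => 'path/to/cookie/cookie.txt',
);

echo $params['post_params'], "\n";
```

    Building the string with `http_build_query` rather than by hand avoids broken logins when a value contains characters such as `&` or a space.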





  3. #2
    innozemec's Avatar
    innozemec is online now ★★ InstantLinkIndexer.com
    Join Date
    Aug 2011
    Location
    www.Indexification.com
    Posts
    4,628
    Thanks
    1,050
    Thanked 1,462 Times in 1,200 Posts
    Blog Entries
    3

    Default Re: Scraping Sites - Curl Method

    Looking good bro, thanks for contributing to the community!

  4. #3
    Treeofl1's Avatar
    Treeofl1 is offline Web Design bit.ly/lizdes
    Join Date
    Nov 2012
    Location
    South Africa
    Age
    21
    Posts
    1,766
    Thanks
    3,500
    Thanked 2,580 Times in 1,152 Posts

    Default Re: Scraping Sites - Curl Method

    Good contribution dude, keep it up!

  5. #4
    fopen's Avatar
    fopen is offline Newbies
    Join Date
    Aug 2010
    Posts
    4
    Thanks
    9
    Thanked 0 Times in 0 Posts

    Default Re: Scraping Sites - Curl Method

    Thanks for this great post, man! Do you know if it has problems working with the new PHP 5.4.0 or 5.5.0? I will run some tests and let you know. Since register_globals was removed, I had to redo my scripts entirely.

  6. #5
    davids355's Avatar
    davids355 is online now http://wordai.com/?ref=62
    Join Date
    Apr 2011
    Location
    /root
    Age
    31
    Posts
    4,680
    Thanks
    3,804
    Thanked 3,244 Times in 1,907 Posts

    Default Re: Scraping Sites - Curl Method

    very nice script mate. Was actually looking for something like this a while back.

  7. #6
    g0g0l is offline Jr. VIP
    Join Date
    Sep 2010
    Posts
    2,233
    Thanks
    3,676
    Thanked 1,995 Times in 1,090 Posts
    Blog Entries
    3

    Default Re: Scraping Sites - Curl Method

    Good piece of code. Keep up the good work, dude.

  8. #7
    rrocha80's Avatar
    rrocha80 is offline Jr. VIP
    Join Date
    Apr 2013
    Posts
    134
    Thanks
    0
    Thanked 24 Times in 22 Posts

    Default Re: Scraping Sites - Curl Method

    Great contribution, thank you!

  9. #8
    Lyscer's Avatar
    Lyscer is offline Jr. VIP
    Join Date
    Jun 2012
    Posts
    108
    Thanks
    9
    Thanked 40 Times in 23 Posts

    Default Re: Scraping Sites - Curl Method

    Quote Originally Posted by fopen View Post
    Thanks for this great post, man! Do you know if it has problems working with the new PHP 5.4.0 or 5.5.0? I will run some tests and let you know. Since register_globals was removed, I had to redo my scripts entirely.
    I am currently running it on a machine with PHP 5.4.4 and have zero issues. I looked at the PHP changelog for 5.5 and it doesn't look like any code changes would be needed for it to work there. Glad you guys like it; let me know if you have any questions.

  10. #9
    Lyscer's Avatar
    Lyscer is offline Jr. VIP
    Join Date
    Jun 2012
    Posts
    108
    Thanks
    9
    Thanked 40 Times in 23 Posts

    Default Re: Scraping Sites - Curl Method

    I thought that I should add a little bit to this just to clarify.

    Sometimes curl will tempt you to log in to a site and traverse the DOM to pull the values you need out of an HTML page. However, always check Google or the company's website first to make sure that they don't have an API readily available. If they have an API, the above code will let you make their API calls; you just have to call it correctly. If an API doesn't exist, then and only then should you put the time into traversing the DOM. This is because all it takes is a small HTML update and your traversals are screwed for the most part. Good luck and keep the questions coming.
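
    To give an idea of what DOM traversal looks like, and why it is brittle, here is a minimal sketch using PHP's bundled DOMDocument and DOMXPath classes. The function name `extractLinks`, the sample markup and the `result` class are assumptions for illustration only; in practice the html would come from the `html` key returned by the function above.

```php
<?php
// Sketch only: extractLinks(), the markup and the XPath are invented for this example.
function extractLinks($html)
{
    $links = array();
    if (!is_string($html) || trim($html) === '') {
        return $links;
    }

    $doc = new DOMDocument();
    // Suppress parse warnings from messy real-world markup.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // The query is tied to the page layout: it only matches links inside
    // <div class="result">, so renaming that class breaks the scraper.
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//div[@class="result"]//a[@href]') as $a) {
        $links[] = array(
            'text' => trim($a->textContent),
            'href' => $a->getAttribute('href'),
        );
    }
    return $links;
}

$html = '<div class="result"><a href="/item/1"> First </a></div>'
      . '<div class="other"><a href="/skip">Skip</a></div>'
      . '<div class="result"><a href="/item/2">Second</a></div>';

print_r(extractLinks($html));
```

    If the site renames the `result` class, the query quietly returns nothing, which is exactly the kind of small HTML update that breaks a scraper.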

  11. #10
    nopme88's Avatar
    nopme88 is offline Registered Member
    Join Date
    Jul 2013
    Posts
    50
    Thanks
    13
    Thanked 7 Times in 4 Posts

    Default Re: Scraping Sites - Curl Method

    Keep it up mate! Thank you

  12. #11
    methylenebl is offline Newbies
    Join Date
    Aug 2013
    Posts
    22
    Thanks
    3
    Thanked 4 Times in 4 Posts

    Default Re: Scraping Sites - Curl Method

    Amazing work!

  13. #12
    arogers is offline Newbies
    Join Date
    Sep 2013
    Posts
    13
    Thanks
    21
    Thanked 0 Times in 0 Posts

    Default Re: Scraping Sites - Curl Method

    Very cool piece of code, just what I was looking for.

  14. #13
    davids355's Avatar
    davids355 is online now http://wordai.com/?ref=62
    Join Date
    Apr 2011
    Location
    /root
    Age
    31
    Posts
    4,680
    Thanks
    3,804
    Thanked 3,244 Times in 1,907 Posts

