[Tutorial] How to scrape web pages using PHP

Discussion in 'PHP & Perl' started by WebmasterHacks, Sep 8, 2015.

  1. WebmasterHacks

    WebmasterHacks Newbie

    Joined:
    Sep 6, 2015
    Messages:
    10
    Likes Received:
    14
    Hello BHW community!


    Today I'm gonna show you how to scrape (parse) web pages using PHP. As an example I chose PornHub :)


    Why?


    1. A lot of people use content from adult websites for their projects.
    2. It's a really good example for parsing. It has video pages, categories, tags, pornstar pages and so on.


    Why should you read this tutorial?


    I know there are a lot of articles about how to use PHP for parsing, but in 95% of cases they suck, simply because people are using old tools. I want to show you what modern PHP development looks like.


    I assume you have PHP installed and can run it from the terminal. I'm using Ubuntu. If you are using Windows everything will work fine, just make sure that you can run PHP from the terminal (cmd). Just google how to add PHP to your PATH.


    Also, I can't post links, so I will use URLs like: dirtysite/view_video.php?viewkey=1337. Don't forget to change "dirtysite" to the real domain.
     
  2. WebmasterHacks

    WebmasterHacks Newbie

    Joined:
    Sep 6, 2015
    Messages:
    10
    Likes Received:
    14
    Part 1. Composer.

    We will use Composer for dependency management.

    Why the hell would you reinvent the wheel when there are a lot of great libraries that have been used and tested by thousands of people?
    Even better! We just say which libraries we need and Composer downloads them for us. Go and download Composer right now.

    Now we need to decide which library to use for scraping. We could use one library to download HTML and another to manipulate the DOM, or we can use Goutte.

    Description from github: Goutte is a screen scraping and web crawling library for PHP. Goutte provides a nice API to crawl websites and extract data from the HTML/XML responses.

    Sounds good to me.

    Let's add it to our dependencies. Create a folder for your project (for example "pornhub_parser") and inside it create a file named "composer.json".

    Content of that file should look like this:

    Code:
    {
        "require": {
            "fabpot/goutte": "3.1.*"
        }
    }
    If you don't know wtf this is, just relax. I will explain it later.

    Now just run:

    Code:
    composer install
    Composer will look at what dependencies your project has and download them into the "vendor" folder. It will also create a "vendor/autoload.php" file. Why?
    Say you are using 10 different libraries in your project. A library may contain hundreds of classes. Requiring them one by one would be a real pain. That's why Composer creates this autoload file. One file to require them all.

    So next time we need other libraries, we will just add the dependencies to "composer.json" and run "composer update".

    Now it's time to write some code!

    Let's start from something really simple. Getting title of the video.

    Create "script.php" file in "pornhub_parser" folder and add this code:

    PHP:
    <?php

    require 'vendor/autoload.php';

    use Goutte\Client;

    $client = new Client;
    $crawler = $client->request('GET', 'dirtysite/view_video.php?viewkey=973043790');
    $title = $crawler->filter('.video-wrapper .title-container .title')->first()->text();
    echo $title . PHP_EOL;
    Now run it:

    Code:
    php script.php
    You should see title of our video.

    How cool is that? We didn't have to initialize cURL. We didn't use regular expressions to get the video title from HTML. We only used one CSS selector to get the elements we need. In the next part I will show other great features that you get with Goutte.
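    To make the "one CSS selector" point a bit more concrete, here is a rough sketch of the same kind of lookup done with PHP's built-in DOM extension instead of Goutte. The HTML snippet is made up for illustration (the real page markup may differ), the hasClass() helper is a hypothetical name, and the XPath is a hand-written approximation of what the CSS selector means. It assumes the DOM extension is enabled.

```php
<?php
// Hypothetical markup, only loosely modelled on the selector used above.
$html = <<<HTML
<div class="video-wrapper">
  <div class="title-container">
    <h1 class="title"> Example video title </h1>
  </div>
</div>
HTML;

$dom = new DOMDocument();
// Suppress warnings about imperfect real-world HTML.
@$dom->loadHTML($html);

$xpath = new DOMXPath($dom);

// Builds an XPath test for "element has this CSS class".
function hasClass($class)
{
    return "contains(concat(' ', normalize-space(@class), ' '), ' " . $class . " ')";
}

// Hand-written XPath roughly equivalent to
// ".video-wrapper .title-container .title".
$query = '//*[' . hasClass('video-wrapper') . ']'
       . '//*[' . hasClass('title-container') . ']'
       . '//*[' . hasClass('title') . ']';

$nodes = $xpath->query($query);

// Guard against an empty result instead of assuming the node exists.
$title = $nodes->length > 0 ? trim($nodes->item(0)->textContent) : null;

echo $title . PHP_EOL;
```

    Compare the length of that XPath with the one-line CSS selector and you can see why a library that accepts CSS selectors is nicer to work with.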

    If you are familiar with Composer you already know how much easier life can be. If you aren't, spend some time reading about it. Trust me, you will get a good productivity boost.

    So your task for today is to read about Composer and some documentation/examples about Goutte.

    If you have any questions, do not hesitate to ask.
     
    Last edited: Sep 8, 2015
  3. fidodido

    fidodido Junior Member

    Joined:
    Aug 12, 2015
    Messages:
    113
    Likes Received:
    27
    Thanks, this is looking promising. I have never done PHP programming before, but I am well interested
     
  4. lord1027

    lord1027 Elite Member

    Joined:
    Sep 20, 2013
    Messages:
    3,177
    Likes Received:
    2,238
    When are the other parts coming out?
     
  5. davids355

    davids355 Jr. VIP Jr. VIP

    Joined:
    Apr 25, 2011
    Messages:
    9,833
    Likes Received:
    7,438
    Home Page:
    Sorry to be so basic, but what is Composer and what are its benefits? I know a little basic PHP and I keep hearing about Composer. It's like a framework, right?
     
  6. jazzc

    jazzc Moderator Staff Member Moderator Jr. VIP

    Joined:
    Jan 27, 2009
    Messages:
    2,566
    Likes Received:
    11,026
    Occupation:
    Pusillanimous Knitter
    Location:
    Buenos Aires
    It's a package manager: https://getcomposer.org/doc/00-intro.md
     
  7. ChanzGrande

    ChanzGrande Elite Member

    Joined:
    Feb 16, 2008
    Messages:
    2,484
    Likes Received:
    1,172
    Occupation:
    Accountant
    Location:
    Northern Woods Counting Money
    Looking forward to more scraping with PHP information in the coming days. It would be most excellent if OP didn't make it so basic. We can proceed past 1-2 steps at a time here at BHW, and are most familiar with complete walkthroughs, so lay it on us, dude ... we all want to know how to use the most modern approach to quickly and easily scrape this kind of content.
     
  8. WebmasterHacks

    WebmasterHacks Newbie

    Joined:
    Sep 6, 2015
    Messages:
    10
    Likes Received:
    14
    Part 2. Video page.

    We already know how to get title from page, let's scrape everything else.

    Porn stars... If you open your browser's developer tools you will see that the porn star links are inside an element with the class "video-info-row" that contains the word "Pornstars". There is also a "Suggest" link, which we don't need.

    PHP:
    $pornstars = $crawler->filter('.video-info-row:contains("Pornstars:") a:not(:contains("Suggest"))')->each(function($node) {
        return $node->text();
    });
    Easy. Note that I'm not creating an empty array and then populating it with elements. I'm using the "each" method, which works much like the "array_map" function.
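    If the comparison with "array_map" is not obvious, this small sketch shows the two styles side by side. Plain strings stand in for the link texts here (the names are made up), so it runs without Goutte.

```php
<?php
// Plain strings stand in for the texts that $node->text() would return.
$rawNames = [" Alice Example \n", "Bob Sample  "];

// Style 1: create an empty array and fill it in a loop.
$namesLoop = [];
foreach ($rawNames as $name) {
    $namesLoop[] = trim($name);
}

// Style 2: map every element to a new value and get the array back,
// which is the style that "each" gives you for crawler nodes.
$namesMapped = array_map(function ($name) {
    return trim($name);
}, $rawNames);

// Both styles produce the same array.
var_dump($namesLoop === $namesMapped);
```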

    The next step is categories. The only difference is that the element contains the word "Categories".

    PHP:
    $categories = $crawler->filter('.video-info-row:contains("Categories:") a:not(:contains("Suggest"))')->each(function($node) {
        return $node->text();
    });
    Same stuff with tags. Too easy.

    Now we only need the URL of the video file. Open the page's source code and look for "mp4" there. You will find that our URL is in JavaScript: "var player_quality_720p = 'OUR URL'".

    It's time to use regular expressions. In our case it's super easy to scrape the URL.

    PHP:
    preg_match('/var player_quality_720p = \'(?<mp4>.*?)\'/', $crawler->html(), $matches);
    $mp4 = $matches['mp4'];
    That's it. You have every important thing about your video.

    Our next task is to make this code reusable. How can we do that? Create a class. Why? I will show you.

    It's important to have objects that are very intuitive. Example:

    PHP:
    $video = new Video($client, 'dirtysite/view_video.php?viewkey=973043790');
    $video->getTitle(); // returns video title
    $video->getPornstars(); // returns array of pornstars
    $video->toArray(); // returns proper associative array
    $video->toJson(); // returns JSON string 
    Now you don't need to remember how your code works, which CSS selector you should use, or what the proper JavaScript variable name in the regular expression was. The class will remember everything and provide you with the methods you need.

    Time to create a simple class!

    PHP:
    <?php

    require __DIR__ . '/vendor/autoload.php';

    use Goutte\Client;

    class Video {

        protected $client;
        protected $url;
        protected $crawler;

        // Filled in by the parse* methods below.
        protected $title;
        protected $pornstars;
        protected $categories;
        protected $tags;
        protected $mp4;

        public function __construct($client, $url)
        {
            $this->client = $client;
            $this->url = $url;
            $this->crawler = $client->request('GET', $url);

            $this->parseTitle();
            $this->parsePornstars();
            $this->parseCategories();
            $this->parseTags();
            $this->parseMp4();
        }

        public function getUrl()
        {
            return $this->url;
        }

        public function getTitle()
        {
            return $this->title;
        }

        public function getPornstars()
        {
            return $this->pornstars;
        }

        public function getCategories()
        {
            return $this->categories;
        }

        public function getTags()
        {
            return $this->tags;
        }

        public function getMp4()
        {
            return $this->mp4;
        }

        public function toArray()
        {
            return [
                'title' => $this->getTitle(),
                'pornstars' => $this->getPornstars(),
                'categories' => $this->getCategories(),
                'tags' => $this->getTags(),
                'mp4' => $this->getMp4()
            ];
        }

        public function toJson()
        {
            return json_encode($this->toArray());
        }

        protected function parseTitle()
        {
            $this->title = trim($this->crawler->filter('.video-wrapper .title-container .title')->first()->text());
        }

        protected function parsePornstars()
        {
            $this->pornstars = $this->crawler->filter('.video-info-row:contains("Pornstars:") a:not(:contains("Suggest"))')->each(function($node) {
                return trim($node->text());
            });
        }

        protected function parseCategories()
        {
            $this->categories = $this->crawler->filter('.video-info-row:contains("Categories:") a:not(:contains("Suggest"))')->each(function($node) {
                return trim($node->text());
            });
        }

        protected function parseTags()
        {
            $this->tags = $this->crawler->filter('.video-info-row:contains("Tags:") a:not(:contains("Suggest"))')->each(function($node) {
                return trim($node->text());
            });
        }

        protected function parseMp4()
        {
            preg_match('/var player_quality_720p = \'(?<mp4>.*?)\'/', $this->crawler->html(), $matches);
            $this->mp4 = $matches['mp4'];
        }

    }
    Let's try to use it.

    PHP:
    $client = new Client;

    $video = new Video($client, 'dirtysite/view_video.php?viewkey=973043790');
    var_dump($video->getTitle());
    var_dump($video->getPornstars());
    var_dump($video->getCategories());
    var_dump($video->getTags());
    var_dump($video->getMp4());
    var_dump($video->toArray());
    var_dump($video->toJson());
    That's it!

    If you are not familiar with OOP (Object Oriented Programming) you definitely should take a look at it. Don't dig too deep. You only need to understand what "private", "protected", "public", "$this", "self", "static" and "__construct" mean.

    Also, if you are wondering why I create "$client" and pass it into the constructor (I could just create $client inside the constructor), stay tuned :)

    Any questions?

    P.S. I created a repo on GitHub. You can check what your code should look like in the end. I can't use links yet, so try to find my nickname there.
     
  9. nocare

    nocare Junior Member

    Joined:
    Apr 29, 2013
    Messages:
    164
    Likes Received:
    81
    Location:
    Deep Code
    Here ya go: https://github.com/WebmasterHacks/pornhub
    Might be useful to show people how to build your css selectors as well. You kind of just threw them out there.
    A screen recorder and virtual dub can make you some nice gifs...
     
  10. Zekie

    Zekie Regular Member

    Joined:
    Jul 19, 2014
    Messages:
    265
    Likes Received:
    76
    Occupation:
    Web & Software Developer
    Location:
    Detroit
    Can this be used to scrape, lets say, videos on Youtube? If so how long does it take to do an API call for a keyword related video, could it be done on pageload and still manage to load in a decent amount of time? Thanks for your opinion!

    Peace,
    Z
     
  11. Techbiggy

    Techbiggy BANNED BANNED

    Joined:
    Aug 28, 2015
    Messages:
    54
    Likes Received:
    14
    Tested already thank you :)
     
  12. WebmasterHacks

    WebmasterHacks Newbie

    Joined:
    Sep 6, 2015
    Messages:
    10
    Likes Received:
    14
    Yes, that's true. I really don't know the skill level of the readers, that's why I'm skipping some parts. But I will keep in mind that this can be a problem. That's why feedback is so important. Maybe I will do an additional part about CSS selectors.

    This is the next step :D I'm just checking if anyone is interested in this stuff.

    To check how fast it will work you can open Youtube with JavaScript disabled. We are not using a real browser, so no additional AJAX requests will be fired. Also no images will be downloaded. So it should work fast. But some content may not be available without JavaScript. For example, comments are loaded dynamically and you would not be able to see them in static HTML. But you have the option to profile which requests are made while the page is loading and make them manually. It really depends on your task.

    Thank you! :)
     
  13. WebmasterHacks

    WebmasterHacks Newbie

    Joined:
    Sep 6, 2015
    Messages:
    10
    Likes Received:
    14
    Part 3. Porn star page.

    Now we know how to scrape data from video page URL. It's time to parse those URLs. Let's start with porn star page.

    PHP:
    $client = new Client;

    $crawler = $client->request('GET', 'dirtysite/pornstar/madison-ivy');

    $urls = $crawler->filter('.videos .videoblock .title a')->each(function($node) {
        return $node->link()->getUri();
    });

    var_dump($urls);
    Done. We will get URLs from first page. Now we can loop through them and create video object for each URL to get needed data.

    PHP:
    foreach ($urls as $url) {
        $video = new Video($client, $url);
        echo $video->getTitle() . PHP_EOL;
    }
    That was easy. But now we will get a few errors.

    First one:

    Code:
    PHP Notice: Undefined index: mp4 in ../script.php
    This means that our regular expression does not match every time. Let's fix it.

    As you remember, our pattern was: var player_quality_720p = 'OUR LINK', but sometimes there will be no 720p quality variable. By looking in the source code you could find "var player_quality_240p" or "var player_quality_480p" or something else.

    There are a few options here.

    1. Find first URL and return it.
    2. Find all URLs and return array.
    3. Find all URLs and return best quality URL.

    For the sake of example I will use first option.

    Let's change our regular expression.

    PHP:
    preg_match('/var player_quality_\d+p = \'(?<mp4>.*?)\'/', $this->crawler->html(), $matches);
    \d+ means: one or more digit. This pattern will match "player_quality_720p", "player_quality_1p", "player_quality_1337p", "player_quality_1234567890p" and so on.
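    The tutorial goes with the first option, but for anyone curious, options 2 and 3 from the list above could be sketched like this. The page source below is invented for the example and bestQualityUrl() is a hypothetical helper name. It also returns null when nothing matches, which avoids the "Undefined index" notice.

```php
<?php
// Made-up page source; the variable names follow the pattern from the post.
$source = <<<'JS'
var player_quality_240p = 'http://example.com/v240.mp4';
var player_quality_720p = 'http://example.com/v720.mp4';
var player_quality_480p = 'http://example.com/v480.mp4';
JS;

/**
 * Returns the URL with the highest quality number, or null if none is found.
 */
function bestQualityUrl($source)
{
    $pattern = '/var player_quality_(?<quality>\d+)p = \'(?<mp4>.*?)\'/';

    // PREG_SET_ORDER gives one array per match, with both named groups.
    if (!preg_match_all($pattern, $source, $matches, PREG_SET_ORDER)) {
        return null;
    }

    $best = null;
    foreach ($matches as $match) {
        if ($best === null || (int) $match['quality'] > (int) $best['quality']) {
            $best = $match;
        }
    }

    return $best['mp4'];
}

var_dump(bestQualityUrl($source));
var_dump(bestQualityUrl('no player variables here'));
```

    Inside the loop, $matches already holds every URL, so returning the whole list (option 2) would only need a different return value.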

    Now you can run the script again and see that everything works fine. The next step is to create a class, just to make life easier.

    PHP:
    // Note: the Crawler type hint below needs
    // "use Symfony\Component\DomCrawler\Crawler;" at the top of the file.
    class Pornstar {

        protected $client;
        protected $url;
        protected $crawler;

        public function __construct($client, $url)
        {
            $this->client = $client;
            $this->url = $url;
            $this->crawler = $this->client->request('GET', $url);
        }

        public function getUrl()
        {
            return $this->url;
        }

        public function getVideoUrls()
        {
            return $this->crawler->filter('.videos .videoblock .title a')->each(function(Crawler $node) {
                return $node->link()->getUri();
            });
        }

    }
    So now we have the URLs from the first page. What about the other pages?

    Our scraper should check if there is a "Next" page link and, if there is, go there.

    Let's create a method that checks if there is a "Next" page link.

    PHP:
    public function hasNextPage()
    {
        return $this->crawler->filter('.pagination3 .page_next')->count() > 0;
    }
    Cool. Next method will go to next page.

    PHP:
    public function goToNextPage()
    {
        $link = $this->crawler->filter('.pagination3 .page_next a')->link();
        $this->crawler = $this->client->click($link);
    }
    Done. Let's check if everything works fine.

    PHP:
    $client = new Client;
    $pornstar = new Pornstar($client, 'dirtysite/pornstar/madison-ivy');
    var_dump($pornstar->getVideoUrls());
    $pornstar->goToNextPage();
    var_dump($pornstar->getVideoUrls());
    Run script and you should see 2 different arrays. Cool. So how do we scrape all URLs? Easy.

    PHP:
    $client = new Client;
    $pornstar = new Pornstar($client, 'dirtysite/pornstar/madison-ivy');
    var_dump($pornstar->getVideoUrls());
    while ($pornstar->hasNextPage()) {
        $pornstar->goToNextPage();
        var_dump($pornstar->getVideoUrls());
    }
    Simple. But let's make it even simpler. Let's create a method that returns all URLs so we don't need to write loops ourselves.

    PHP:
    public function getAllVideoUrls()
    {
        $urls = $this->getVideoUrls();

        while ($this->hasNextPage()) {
            $this->goToNextPage();
            $urls = array_merge($urls, $this->getVideoUrls());
        }

        return $urls;
    }
    Now test our method.

    PHP:
    $client = new Client;
    $pornstar = new Pornstar($client, 'dirtysite/pornstar/madison-ivy');
    var_dump($pornstar->getAllVideoUrls());
    This looks better.
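    One thing to keep in mind: a loop like this trusts the site to eventually stop showing a "Next" link. Here is a rough sketch of the same accumulate-while-there-is-a-next-page pattern with a page cap as a safety net. It runs against a made-up in-memory array instead of real pages, and collectAllUrls() and $maxPages are hypothetical names, not part of the tutorial's class.

```php
<?php
// Hypothetical stand-in for the paginated site: each "page" lists its
// video URLs and the key of the next page (null on the last page).
$pages = [
    'page1' => ['urls' => ['/video/1', '/video/2'], 'next' => 'page2'],
    'page2' => ['urls' => ['/video/3'], 'next' => 'page3'],
    'page3' => ['urls' => ['/video/4', '/video/5'], 'next' => null],
];

/**
 * Collects URLs from all pages, stopping after $maxPages as a safety net.
 */
function collectAllUrls(array $pages, $start, $maxPages = 100)
{
    $urls = [];
    $current = $start;
    $visited = 0;

    while ($current !== null && $visited < $maxPages) {
        $urls = array_merge($urls, $pages[$current]['urls']);
        $current = $pages[$current]['next'];
        $visited++;
    }

    return $urls;
}

print_r(collectAllUrls($pages, 'page1'));
```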

    Now we can scrape data from video pages like this.

    PHP:
    $client = new Client;
    $pornstar = new Pornstar($client, 'dirtysite/pornstar/madison-ivy');

    foreach ($pornstar->getAllVideoUrls() as $url) {
        $video = new Video($client, $url);
        echo $video->getTitle() . PHP_EOL;
    }
    Everything works great, but sometimes we get a strange error.

    Code:
    PHP Fatal error:  Uncaught exception 'InvalidArgumentException' with message 'The current node list is empty.'
    It means that sometimes we are trying to work with unexpected HTML. Before diving into code let's think a little bit. When you make request with real browser you are sending User Agent string that contains information about your browser. We are not using real browser so what do we send in User Agent string? "Symfony2 BrowserKit". Let's change that to a normal User Agent string.

    PHP:
    $client = new Client;
    $client->setHeader('User-Agent', 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/44.0.2403.89 Chrome/44.0.2403.89 Safari/537.36');
    Sometimes it's really important to look like a real browser... or google bot :)

    Also it's important to pause between requests. Right now we will use the "sleep" function. It's not the best way and we will improve that later.
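    Just as an illustration of the idea (not the improvement promised for later), a pause between requests could be wrapped in a tiny helper like the one below. The Throttle class and the 0.2 second interval are made up for this sketch, and the echo stands in for the actual request.

```php
<?php
/**
 * Hypothetical helper: makes sure at least $minInterval seconds pass
 * between two calls to wait().
 */
class Throttle
{
    protected $minInterval;
    protected $lastCall = null;

    public function __construct($minInterval)
    {
        $this->minInterval = $minInterval;
    }

    public function wait()
    {
        if ($this->lastCall !== null) {
            $elapsed = microtime(true) - $this->lastCall;
            $remaining = $this->minInterval - $elapsed;
            if ($remaining > 0) {
                // usleep() takes microseconds.
                usleep((int) round($remaining * 1000000));
            }
        }

        $this->lastCall = microtime(true);
    }
}

$throttle = new Throttle(0.2);
$start = microtime(true);

foreach (['/video/1', '/video/2', '/video/3'] as $url) {
    // In a real scraper this would sit right before $client->request(...).
    $throttle->wait();
    echo 'fetching ' . $url . PHP_EOL;
}

$elapsed = microtime(true) - $start;
```

    Unlike a plain sleep(1) after every request, this only waits for whatever is left of the interval, so slow requests are not delayed even further.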

    The next step is to add some sugar.

    Look at this code:

    PHP:
    foreach ($pornstar->getAllVideoUrls() as $url) {
        $video = new Video($client, $url);
        echo $video->getTitle() . PHP_EOL;
    }
    Can we make it better? Yes we can! I really want to use my scraper like this:

    PHP:
    foreach ($pornstar->getVideos() as $video) {
        echo $video->getTitle() . PHP_EOL;
    }
    Well, it's kinda easy to implement, but we will have a little problem there.

    For example, a porn star may have 10k videos (a really hardworking person). We are scraping each video page, so we get the mp4 URL from the first page right away. It's important to say that this URL will not work forever. Adult tubes use temporary URLs for video streaming. You can't scrape one and put it on your website. It may work for some time, but eventually the URL will die and you will have a tube with dead videos. Just imagine that you are scraping the 1000th video page and the URL from the first one has already died.

    A few options here.

    1. Do not make the request in the constructor. We could have a Video object initialized, but the data about the video would not be available yet. When we ask for it, scraping will happen. So we could create an array of Video objects and then loop through them, getting fresh data each time.
    2. Add some callback functionality. The callback will be fired when a new video page has been scraped. Example:

    PHP:
    $pornstar->eachVideo(function($video) {
        echo $video->getTitle();
    });
    I will go with first option.
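    For comparison, the callback idea from option 2 could be sketched roughly like this. It is only an illustration: eachVideo() is written here as a standalone function, and the $fetch callable is a made-up stand-in for the real page request, so nothing is downloaded.

```php
<?php
/**
 * Hypothetical sketch of option 2: call $callback for every item as soon
 * as it is "scraped". $fetch stands in for the real page request.
 */
function eachVideo(array $urls, callable $fetch, callable $callback)
{
    foreach ($urls as $url) {
        // The callback sees fresh data right after it was fetched.
        $callback($fetch($url));
    }
}

$titles = [];

eachVideo(
    ['/video/1', '/video/2'],
    function ($url) {
        // Pretend this is the scraped data of one video page.
        return ['url' => $url, 'title' => 'Title of ' . $url];
    },
    function ($video) use (&$titles) {
        $titles[] = $video['title'];
        echo $video['title'] . PHP_EOL;
    }
);
```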

    Let's change our Video class.

    Modify constructor:

    PHP:
    public function __construct(Client $client, $url)
    {
        $this->client = $client;
        $this->url = $url;
    }
    Add new method:

    PHP:
    protected function parseIfNeeded()
    {
        if ($this->parsed) return;

        $this->crawler = $this->client->request('GET', $this->getUrl());

        $this->parseTitle();
        $this->parsePornstars();
        $this->parseCategories();
        $this->parseTags();
        $this->parseMp4();

        $this->parsed = true;
    }
    Modify getTitle method:

    PHP:
    public function getTitle()
    {
        $this->parseIfNeeded();

        return $this->title;
    }
    Same with all get* methods.

    Now we can create a Video object without triggering any request. Later, when we ask for data (the title, for example), everything will be scraped.
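    If you want to convince yourself that nothing is requested until a getter is called, here is a stripped-down sketch of the same lazy-loading pattern. LazyPage and $fetcher are made-up names, and the fetcher is a counting closure that stands in for the real HTTP request.

```php
<?php
/**
 * Stripped-down illustration of lazy loading. LazyPage and $fetcher are
 * hypothetical; $fetcher stands in for the real HTTP request.
 */
class LazyPage
{
    protected $url;
    protected $fetcher;
    protected $title;
    protected $parsed = false;

    public function __construct($url, callable $fetcher)
    {
        // Nothing is fetched here; we only remember what to fetch.
        $this->url = $url;
        $this->fetcher = $fetcher;
    }

    public function getTitle()
    {
        $this->parseIfNeeded();

        return $this->title;
    }

    protected function parseIfNeeded()
    {
        if ($this->parsed) {
            return;
        }

        $this->title = call_user_func($this->fetcher, $this->url);
        $this->parsed = true;
    }
}

$requests = 0;
$fetcher = function ($url) use (&$requests) {
    $requests++;

    return 'Title of ' . $url;
};

$page = new LazyPage('/video/1', $fetcher);
$requestsBefore = $requests;      // still 0: no request yet

echo $page->getTitle() . PHP_EOL;
echo $page->getTitle() . PHP_EOL; // second call reuses the parsed data
echo $requests . PHP_EOL;         // 1
```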

    Our Video class will look like this:

    PHP:
    // Note: the Crawler type hints below need
    // "use Symfony\Component\DomCrawler\Crawler;" at the top of the file.
    class Video {

        protected $client;
        protected $url;
        protected $crawler;
        protected $title;
        protected $pornstars;
        protected $categories;
        protected $tags;
        protected $mp4;
        protected $parsed = false;

        public function __construct(Client $client, $url)
        {
            $this->client = $client;
            $this->url = $url;
        }

        public function getUrl()
        {
            return $this->url;
        }

        public function getTitle()
        {
            $this->parseIfNeeded();

            return $this->title;
        }

        public function getPornstars()
        {
            $this->parseIfNeeded();

            return $this->pornstars;
        }

        public function getCategories()
        {
            $this->parseIfNeeded();

            return $this->categories;
        }

        public function getTags()
        {
            $this->parseIfNeeded();

            return $this->tags;
        }

        public function getMp4()
        {
            $this->parseIfNeeded();

            return $this->mp4;
        }

        public function toArray()
        {
            return [
                'title' => $this->getTitle(),
                'pornstars' => $this->getPornstars(),
                'categories' => $this->getCategories(),
                'tags' => $this->getTags(),
                'mp4' => $this->getMp4()
            ];
        }

        public function toJson()
        {
            return json_encode($this->toArray());
        }

        protected function parseIfNeeded()
        {
            if ($this->parsed) return;

            $this->crawler = $this->client->request('GET', $this->getUrl());

            $this->parseTitle();
            $this->parsePornstars();
            $this->parseCategories();
            $this->parseTags();
            $this->parseMp4();

            $this->parsed = true;
        }

        protected function parseTitle()
        {
            $this->title = trim($this->crawler->filter('.video-wrapper .title-container .title')->first()->text());
        }

        protected function parsePornstars()
        {
            $this->pornstars = $this->crawler->filter('.video-info-row:contains("Pornstars:") a:not(:contains("Suggest"))')->each(function(Crawler $node) {
                return trim($node->text());
            });
        }

        protected function parseCategories()
        {
            $this->categories = $this->crawler->filter('.video-info-row:contains("Categories:") a:not(:contains("Suggest"))')->each(function(Crawler $node) {
                return trim($node->text());
            });
        }

        protected function parseTags()
        {
            $this->tags = $this->crawler->filter('.video-info-row:contains("Tags:") a:not(:contains("Suggest"))')->each(function(Crawler $node) {
                return trim($node->text());
            });
        }

        protected function parseMp4()
        {
            preg_match('/var player_quality_\d+p = \'(?<mp4>.*?)\'/', $this->crawler->html(), $matches);
            $this->mp4 = $matches['mp4'];
        }

    }
    Let's add "getVideos" and "getAllVideos" to Pornstar class.

    PHP:
    public function getVideos()
    {
        return array_map(function($url) {
            return new Video($this->client, $url);
        }, $this->getVideoUrls());
    }

    public function getAllVideos()
    {
        return array_map(function($url) {
            return new Video($this->client, $url);
        }, $this->getAllVideoUrls());
    }
    Let's test how it works:

    PHP:
    $client = new Client;
    $client->setHeader('User-Agent', 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/44.0.2403.89 Chrome/44.0.2403.89 Safari/537.36');
    $pornstar = new Pornstar($client, 'dirtysite/pornstar/madison-ivy');

    foreach ($pornstar->getAllVideos() as $video) {
        echo $video->getTitle() . PHP_EOL;
    }
    Cool. Sometimes you will get an error. Just use "sleep" inside the loop.

    PHP:
    foreach ($pornstar->getAllVideos() as $video) {
        echo $video->getTitle() . PHP_EOL;
        sleep(1);
    }
    Later we will improve this.

    P.S. Github repo updated.
     
  14. Mutikasa

    Mutikasa Power Member

    Joined:
    May 23, 2011
    Messages:
    581
    Likes Received:
    207
    how to get src attribute from img?
     
  15. WebmasterHacks

    WebmasterHacks Newbie

    Joined:
    Sep 6, 2015
    Messages:
    10
    Likes Received:
    14
    Example:

    PHP:
    $crawler->filter('.withBio img')->first()->attr('src');
     
  16. Skyebug77

    Skyebug77 Jr. VIP Jr. VIP

    Joined:
    Mar 22, 2012
    Messages:
    1,931
    Likes Received:
    1,354
    Occupation:
    Marketing
    Location:
    Portland,Or
    Very Nice Share. Sure lots of people will find this helpful.
     
  17. akssiv2007

    akssiv2007 Senior Member

    Joined:
    Jul 11, 2013
    Messages:
    897
    Likes Received:
    198
    Gender:
    Male
    Occupation:
    Webber
    Location:
    Earth
    Nice share op. I have been using simplehtmldom parser to collect information from various websites, will try this too.

    Thanks
     
  18. godspeed007

    godspeed007 Jr. VIP Jr. VIP

    Joined:
    Jan 10, 2013
    Messages:
    731
    Likes Received:
    272
  19. WebmasterHacks

    WebmasterHacks Newbie

    Joined:
    Sep 6, 2015
    Messages:
    10
    Likes Received:
    14
    Part 4. Refactor.

    OK. Our scraper already works, but there is a lot to improve. First of all let's see what we have done.

    We have a Video object that represents a video page. You can ask it for data and it will return it.

    We have a Pornstar object that represents... well, pornstar page(s). You can ask it for videos.

    So Pornstar can return Videos. Shouldn't our Video also return Pornstars?

    You can do this:

    PHP:
    foreach ($pornstar->getVideos() as $video) {
        echo $video->getTitle();
    }
    But you can't do this:

    PHP:
    foreach ($video->getPornstars() as $pornstar) {
        echo $pornstar->getName();
    }
    Because "getPornstars" just returns an array of names. Well, in my opinion it should return an array of Pornstar objects that have "getVideos" methods. Even better: not an array, but a collection with some helpful methods.

    Also let's drop get* prefix from methods. In the end our code should look something like this:

    PHP:
    $pornstar->videos()->first()->tags()->each(function($tag) {
        echo $tag->name();
    });
    We will add collections later. It's not that important right now.
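    To give an idea of what "a collection with some helpful methods" could mean, here is a minimal sketch. It is a hypothetical class written for this example, not the one the tutorial adds later, and plain strings stand in for tag objects.

```php
<?php
/**
 * Minimal, hypothetical collection; not the class the tutorial ends up with.
 */
class Collection
{
    protected $items;

    public function __construct(array $items)
    {
        $this->items = array_values($items);
    }

    // Returns the first item, or null for an empty collection.
    public function first()
    {
        return isset($this->items[0]) ? $this->items[0] : null;
    }

    // Runs $callback for every item and returns the collection for chaining.
    public function each(callable $callback)
    {
        foreach ($this->items as $item) {
            $callback($item);
        }

        return $this;
    }

    // Returns a new collection with $callback applied to every item.
    public function map(callable $callback)
    {
        return new static(array_map($callback, $this->items));
    }

    public function toArray()
    {
        return $this->items;
    }
}

$tags = new Collection(['first tag', 'second tag']);
$upper = $tags->map('strtoupper');

$upper->each(function ($tag) {
    echo $tag . PHP_EOL;
});
```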

    Before we start I want to clean up our code a little bit. Right now there are 2 classes in 1 file. Not cool. We should save them separately.
    Also we will use namespaces. Maybe later we will make this a Composer package, and we don't want to have problems with names. A user of our package may already have a Video class. Namespaces will help to solve this problem.

    I will use the PSR-4 autoload standard. Sounds fancy? There are some standards in the PHP community. PSR-4 is all about "where to find classes".

    Update your composer.json like this:

    Code:
    {
        "require": {
            "fabpot/goutte": "3.1.*"
        },
        "autoload": {
            "psr-4": {
                "WebmasterHacks\\": "src/WebmasterHacks/"
            }
        }
    }
    The autoload part says: map this namespace to this folder. That's it.
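    As a rough illustration of what that mapping means, the function below turns a class name into the file path where a PSR-4 autoloader would look for it. It is a simplified sketch and psr4Path() is a made-up name, not Composer's actual autoloader code.

```php
<?php
/**
 * Rough illustration of PSR-4 resolution: turn a fully qualified class
 * name into a file path, or return null if the prefix does not match.
 */
function psr4Path($class, $prefix, $baseDir)
{
    // The class must start with the namespace prefix from composer.json.
    if (strncmp($class, $prefix, strlen($prefix)) !== 0) {
        return null;
    }

    // Everything after the prefix becomes the path inside the base folder.
    $relativeClass = substr($class, strlen($prefix));

    return rtrim($baseDir, '/') . '/' . str_replace('\\', '/', $relativeClass) . '.php';
}

echo psr4Path('WebmasterHacks\\Pornhub\\Video', 'WebmasterHacks\\', 'src/WebmasterHacks/') . PHP_EOL;
```

    With the mapping from composer.json above, the class WebmasterHacks\Pornhub\Video resolves to src/WebmasterHacks/Pornhub/Video.php.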

    So we will move our Video class into the src/WebmasterHacks/Pornhub/Video.php file.

    Now we are using namespaces, so don't forget to add them. Top of Video.php file will look like this:

    PHP:
    <?php

    namespace WebmasterHacks\Pornhub;

    use Goutte\Client;
    use Symfony\Component\DomCrawler\Crawler;

    class Video
    {
        // class declaration
    }
    The Pornstar class will be moved to the src/WebmasterHacks/Pornhub/Pornstar.php file.

    Don't forget to add the namespace!

    So now we need Composer to update the autoload.php file.

    Code:
    composer dump-autoload
    We could also run:

    Code:
    composer update
    But I don't want to update the dependencies as well.

    Now our script.php should look like this:

    PHP:
    <?php

    require __DIR__ . '/vendor/autoload.php';

    use Goutte\Client;
    use WebmasterHacks\Pornhub\Pornstar;

    $client = new Client;
    $client->setHeader('User-Agent', 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/44.0.2403.89 Chrome/44.0.2403.89 Safari/537.36');
    $pornstar = new Pornstar($client, 'dirtysite/pornstar/madison-ivy');

    foreach ($pornstar->getAllVideos() as $video) {
        echo $video->getTitle() . PHP_EOL;
    }
    Also, if you are not comfortable with these changes and you are afraid to break something, just go to GitHub and download the code there. The latest working version is always there.

    OK. Let's start with the Pornstar class. Remove all get* prefixes. Done. Let's think a little bit. What do we need from this class? Do we really need the "hasNextPage", "goToNextPage", "videoUrls", "allVideoUrls" and "allVideos" methods? No.

    Really we only want the "url", "name", "image" and "videos" methods. We will not delete the other methods. We will hide them inside the class. So the user of our class will have only 4 methods to work with. Clean API. But inside we will still use those methods to get URLs, find "Next" page links and so on.

    Our task is to change the "public" keyword to "protected", so those methods can be executed only inside our class. Also we should rename the "allVideos" method to "videos". We can remove the old "videos" method. It returns videos from the first page only... not really useful.

    Done.
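    Here is a tiny sketch of what "hiding" means in practice. PornstarApiSketch is a made-up stand-in for the real class (no HTTP, the URLs are hard-coded placeholders), it only shows how a protected helper behaves from the outside.

```php
<?php

// Hypothetical stand-in for the real Pornstar class: no HTTP, hard-coded data.
class PornstarApiSketch
{
    // Public API: the only methods a user of the class can call.
    public function url()
    {
        return 'dirtysite/pornstar/example';
    }

    public function videos()
    {
        // Internally we still rely on the hidden helper.
        return $this->allVideoUrls();
    }

    // Hidden helper: callable from inside the class (and its children) only.
    protected function allVideoUrls()
    {
        return ['dirtysite/view_video.php?viewkey=1', 'dirtysite/view_video.php?viewkey=2'];
    }
}

$pornstar = new PornstarApiSketch;
echo count($pornstar->videos()) . PHP_EOL;

// is_callable() respects visibility from the calling scope (here: outside the class).
var_dump(is_callable([$pornstar, 'videos']));
var_dump(is_callable([$pornstar, 'allVideoUrls']));
```

    Calling $pornstar->allVideoUrls() directly from script.php would end in a fatal error, which is exactly what we want: nobody can depend on our internals.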

    Now we can use the Pornstar class like this:

    PHP:
    foreach ($pornstar->videos() as $video) {
        echo $video->getTitle() . PHP_EOL;
    }
    Nice. Let's go and change the Video class.

    Same thing with the get* prefixes. After that our script.php may look like this:

    PHP:
    foreach ($pornstar->videos() as $video) {
        echo $video->title() . PHP_EOL;
    }
    Much cleaner.

    Now we need to fix that problem with symmetry. If a Pornstar can return an array of Video objects, a Video object should return an array of Pornstar objects. Let's update the "parsePornstars" method:

    PHP:
    protected function parsePornstars()
    {
        $this->pornstars = $this->crawler->filter('.video-info-row:contains("Pornstars:") a:not(:contains("Suggest"))')->each(function(Crawler $node) {
            return new Pornstar($this->client, $node->link()->getUri());
        });
    }
    That's it! You can test it:

    PHP:
    $video = new Video($client, 'dirtysite/view_video.php?viewkey=158407481');

    foreach ($video->pornstars() as $pornstar) {
        echo $pornstar->url() . PHP_EOL;
    }
    Now you can do stupid stuff like this:

    PHP:
    foreach ($video->pornstars() as $pornstar) {
        foreach ($pornstar->videos() as $otherVideo) {
            foreach ($otherVideo->pornstars() as $otherPornstar) {
                echo $otherPornstar->url() . PHP_EOL;
            }
        }
    }
    Don't run it :) The important thing is that it works and it makes sense. Videos have pornstars, pornstars have videos that have pornstars... and so on.
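    Why not run it? Every Video and Pornstar object makes an HTTP request in its constructor, so nested loops multiply requests very quickly. A rough sketch with made-up stand-ins (CountingClient and PageSketch are hypothetical, and "3 links per page" is just an assumption for the demo):

```php
<?php

// Hypothetical stand-in for Goutte\Client that only counts requests.
class CountingClient
{
    public $requests = 0;

    public function request($method, $url)
    {
        $this->requests++;
    }
}

// Hypothetical page object: like Video and Pornstar, it fetches in its constructor.
class PageSketch
{
    protected $client;

    public function __construct(CountingClient $client)
    {
        $this->client = $client;
        $this->client->request('GET', 'dirtysite/example');
    }

    // Pretend every page links to 3 related pages (pornstars or videos).
    public function related()
    {
        $pages = [];

        for ($i = 0; $i < 3; $i++) {
            $pages[] = new PageSketch($this->client);
        }

        return $pages;
    }
}

$client = new CountingClient;
$video = new PageSketch($client);

foreach ($video->related() as $pornstar) {
    foreach ($pornstar->related() as $otherVideo) {
        foreach ($otherVideo->related() as $otherPornstar) {
            // Nothing to do here: building the objects is the expensive part.
        }
    }
}

echo $client->requests . PHP_EOL;
```

    With only 3 links per page that is already 1 + 3 + 9 + 27 = 40 requests. Real pornstar pages have pagination and many more videos, so the real number would be much bigger.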
     
  20. WebmasterHacks

    Part 5. Collections.

    Now we will add collections to our scraper. First, create the src/WebmasterHacks/Pornhub/VideoCollection.php file with this content:

    PHP:
    <?php

    namespace WebmasterHacks\Pornhub;

    class VideoCollection
    {
        protected $videos = [];

        public function add(Video $video)
        {
            array_push($this->videos, $video);
        }

        public function count()
        {
            return count($this->videos);
        }

        public function isEmpty()
        {
            return $this->count() === 0;
        }

        public function isNotEmpty()
        {
            return !$this->isEmpty();
        }

        public function each(\Closure $callback)
        {
            foreach ($this->videos as $video) {
                $callback($video);
            }
        }
    }
    Second, update the "videos" method in the Pornstar class:

    PHP:
    public function videos()
    {
        $videoCollection = new VideoCollection;

        foreach ($this->allVideoUrls() as $url) {
            $videoCollection->add(new Video($this->client, $url));
        }

        return $videoCollection;
    }
    Now "videos" will return a collection with helpful methods (we will add more later). Let's check that everything works fine.

    PHP:
    // The Video type hint needs "use WebmasterHacks\Pornhub\Video;" at the top of script.php
    $pornstar = new Pornstar($client, 'dirtysite/pornstar/madison-ivy');

    $pornstar->videos()->each(function(Video $video) {
        echo $video->title() . PHP_EOL;
    });
    Cool.

    Let's also update the "parsePornstars" method in the Video class. It should build a PornstarCollection. But first let's create the PornstarCollection class (in src/WebmasterHacks/Pornhub/PornstarCollection.php).

    PHP:
    <?php

    namespace WebmasterHacks\Pornhub;

    class PornstarCollection
    {
        protected $pornstars = [];

        public function add(Pornstar $pornstar)
        {
            array_push($this->pornstars, $pornstar);
        }

        public function count()
        {
            return count($this->pornstars);
        }

        public function isEmpty()
        {
            return $this->count() === 0;
        }

        public function isNotEmpty()
        {
            return !$this->isEmpty();
        }

        public function each(\Closure $callback)
        {
            foreach ($this->pornstars as $pornstar) {
                $callback($pornstar);
            }
        }
    }
    And the updated "parsePornstars" method:

    PHP:
    protected function parsePornstars()
    {
        $pornstarCollection = new PornstarCollection;

        $this->crawler->filter('.video-info-row:contains("Pornstars:") a:not(:contains("Suggest"))')->each(function(Crawler $node) use ($pornstarCollection) {
            $pornstarCollection->add(new Pornstar($this->client, $node->link()->getUri()));
        });

        $this->pornstars = $pornstarCollection;
    }
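    A side note on the use ($pornstarCollection) part of that closure. It works without & because the collection is an object, and a closure that captures an object shares it with the outside code. A plain array would be copied instead. A small sketch (BagSketch is a hypothetical mini collection, only there to show the difference):

```php
<?php

// Hypothetical minimal collection, just enough to show the capture behaviour.
class BagSketch
{
    public $items = [];

    public function add($item)
    {
        $this->items[] = $item;
    }
}

$names = ['first', 'second'];

// An object captured with use (...) is shared: add() inside the closure is visible outside.
$bag = new BagSketch;
array_walk($names, function($name) use ($bag) {
    $bag->add($name);
});

// An array captured by value is copied: the outer array stays empty...
$copy = [];
array_walk($names, function($name) use ($copy) {
    $copy[] = $name;
});

// ...unless it is captured by reference with &.
$shared = [];
array_walk($names, function($name) use (&$shared) {
    $shared[] = $name;
});

echo count($bag->items) . ' ' . count($copy) . ' ' . count($shared) . PHP_EOL;
```

    So if you ever go back to plain arrays inside a callback like this, remember the &.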
    Let's see how it works:

    PHP:
    $video = new Video($client, 'dirtysite/view_video.php?viewkey=ph55f4113f77d67');

    $video->pornstars()->each(function(Pornstar $pornstar) {
        echo $pornstar->url() . PHP_EOL;
    });
    Nice!

    Did you notice that VideoCollection and PornstarCollection look the same? Later we will create TagCollection and CategoryCollection, and they will look the same too. So let's DRY up our code (DRY = Don't Repeat Yourself). We will create a Collection class, and PornstarCollection will inherit its behavior from it.

    Create the src/WebmasterHacks/Pornhub/Collection.php file with this content:

    PHP:
    <?php

    namespace WebmasterHacks\Pornhub;

    abstract class Collection
    {
        protected $items = [];

        public function add($item)
        {
            array_push($this->items, $item);
        }

        public function count()
        {
            return count($this->items);
        }

        public function isEmpty()
        {
            return $this->count() === 0;
        }

        public function isNotEmpty()
        {
            return !$this->isEmpty();
        }

        public function each(\Closure $callback)
        {
            foreach ($this->items as $item) {
                $callback($item);
            }
        }
    }
    This is an abstract class, so we can't just instantiate it like this:

    PHP:
    $collection = new Collection;
    We can only inherit from it. PornstarCollection.php will now look like this:

    PHP:
    <?php

    namespace WebmasterHacks\Pornhub;

    class PornstarCollection extends Collection {}
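    If you want to see the abstract rule in action without touching the real files, here is a small sketch. The classes are trimmed-down copies with "Sketch" added to their names, and the try/catch assumes PHP 7 or newer, where this mistake throws an \Error (on PHP 5 it is a plain fatal error you can't catch).

```php
<?php

// Trimmed-down copies of the tutorial's classes, renamed to avoid clashes.
abstract class CollectionSketch
{
    protected $items = [];

    public function add($item)
    {
        $this->items[] = $item;
    }

    public function count()
    {
        return count($this->items);
    }
}

class PornstarCollectionSketch extends CollectionSketch {}

// Assumes PHP 7+: instantiating an abstract class throws \Error at runtime.
$failed = false;

try {
    $collection = new CollectionSketch;
} catch (\Error $e) {
    $failed = true;
}

// The empty child class inherits everything from its parent.
$pornstars = new PornstarCollectionSketch;
$pornstars->add('dirtysite/pornstar/example');

var_dump($failed);
echo $pornstars->count() . PHP_EOL;
```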
    Now check that everything still works properly:

    PHP:
    $video = new Video($client, 'dirtysite/view_video.php?viewkey=ph55f4113f77d67');

    $video->pornstars()->each(function(Pornstar $pornstar) {
        echo $pornstar->url() . PHP_EOL;
    });
    Everything works great. Also don't forget to update VideoCollection.php:

    PHP:
    <?php

    namespace WebmasterHacks\Pornhub;

    class VideoCollection extends Collection {}
    Now I want to add 2 methods to VideoCollection: "first" and "last". I can add them to the VideoCollection class OR to the Collection class. Well, PornstarCollection should have these 2 methods as well, so let's update their parent.

    PHP:
    public function first()
    {
        if ($this->isEmpty()) {
            return null;
        }

        return $this->items[0];
    }

    public function last()
    {
        if ($this->isEmpty()) {
            return null;
        }

        return $this->items[$this->count() - 1];
    }
    And a little test:

    PHP:
    $video = new Video($client, 'dirtysite/view_video.php?viewkey=ph55f4113f77d67');

    echo $video->pornstars()->first()->url() . PHP_EOL;
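    One thing to watch for in that test: "first" returns null for an empty collection, so chaining ->url() on a video with no pornstars listed would end in a fatal error. A minimal sketch of a safer call, using made-up stand-ins (CollectionStub and describeFirst() are hypothetical, and the items are plain strings instead of Pornstar objects):

```php
<?php

// Hypothetical stand-in with the same first() logic as the Collection class.
class CollectionStub
{
    protected $items;

    public function __construct(array $items = [])
    {
        $this->items = $items;
    }

    public function first()
    {
        return count($this->items) === 0 ? null : $this->items[0];
    }
}

// Hypothetical helper: describe the first item without risking a call on null.
function describeFirst(CollectionStub $collection)
{
    $first = $collection->first();

    if ($first === null) {
        return 'No pornstars listed for this video';
    }

    return $first;
}

echo describeFirst(new CollectionStub) . PHP_EOL;
echo describeFirst(new CollectionStub(['dirtysite/pornstar/example'])) . PHP_EOL;
```

    In the real script you could do the same check with isNotEmpty() before calling first().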
    The next step is to create CategoryCollection and TagCollection. Easy: create a class that extends Collection and it's done.

    CategoryCollection will contain Category objects, but we don't have a class for that yet. The Category class will represent a category page and it should return videos... just like the Pornstar class.

    Let's think about the Pornstar and Category classes. The pages look similar, and the classes will do the same job: find video URLs, try to find the "Next" page and so on. So they both should have "url" and "videos" methods. If you look at those two methods, they don't work with HTML directly, they work through other methods. Which method looks for links? "videoUrls". So we could copy our Pornstar class, paste it into src/WebmasterHacks/Pornhub/Category.php, rename it and only update the "videoUrls" method. We could, but we won't. Let's use inheritance.

    I will create an abstract Taxonomy class, and Pornstar, Category and Tag will inherit their main behavior from it.

    PHP:
    <?php

    namespace WebmasterHacks\Pornhub;

    use Goutte\Client;
    use Symfony\Component\DomCrawler\Crawler;

    abstract class Taxonomy
    {
        protected $client;
        protected $url;
        protected $crawler;

        public function __construct(Client $client, $url)
        {
            $this->client = $client;
            $this->url = $url;
            $this->crawler = $this->client->request('GET', $url);
        }

        public function url()
        {
            return $this->url;
        }

        public function videos()
        {
            $videoCollection = new VideoCollection;

            foreach ($this->allVideoUrls() as $url) {
                $videoCollection->add(new Video($this->client, $url));
            }

            return $videoCollection;
        }

        protected abstract function videoUrls();

        protected function hasNextPage()
        {
            return $this->crawler->filter('.pagination3 .page_next')->count() > 0;
        }

        protected function goToNextPage()
        {
            $link = $this->crawler->filter('.pagination3 .page_next a')->link();
            $this->crawler = $this->client->click($link);
        }

        protected function allVideoUrls()
        {
            $urls = $this->videoUrls();

            while ($this->hasNextPage()) {
                $this->goToNextPage();
                $urls = array_merge($urls, $this->videoUrls());
            }

            return $urls;
        }
    }
    Done. See that "protected abstract function videoUrls();"? This line of code tells us: "If you use this class as a parent, the child must implement this method".
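    To make the idea easier to play with, here is the same pattern without any HTTP. This is only a sketch: PagedListSketch and ArrayPagesSketch are hypothetical names, and the "pages" are plain arrays instead of crawled HTML. The parent owns the pagination loop, the child only says how to read URLs from the current page.

```php
<?php

// Hypothetical stand-in for Taxonomy: pages are arrays in memory, not HTTP responses.
abstract class PagedListSketch
{
    protected $pages;
    protected $current = 0;

    public function __construct(array $pages)
    {
        $this->pages = $pages;
    }

    // Children must say how to pull URLs out of the current page.
    abstract protected function videoUrls();

    protected function hasNextPage()
    {
        return isset($this->pages[$this->current + 1]);
    }

    protected function goToNextPage()
    {
        $this->current++;
    }

    // Same loop as in Taxonomy: read a page, follow "Next", repeat.
    public function allVideoUrls()
    {
        $urls = $this->videoUrls();

        while ($this->hasNextPage()) {
            $this->goToNextPage();
            $urls = array_merge($urls, $this->videoUrls());
        }

        return $urls;
    }
}

class ArrayPagesSketch extends PagedListSketch
{
    protected function videoUrls()
    {
        return $this->pages[$this->current];
    }
}

$list = new ArrayPagesSketch([['video-1', 'video-2'], ['video-3']]);
$urls = $list->allVideoUrls();

print_r($urls);
```

    If ArrayPagesSketch forgot to implement videoUrls(), PHP would refuse to load the class at all. That is the whole point of the abstract keyword.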

    Let's update Pornstar.php file:

    PHP:
    <?php

    namespace WebmasterHacks\Pornhub;

    class Pornstar extends Taxonomy
    {
        protected function videoUrls()
        {
            return $this->crawler->filter('.videos .videoblock .title a')->each(function($node) {
                return $node->link()->getUri();
            });
        }
    }
    That's it! Now it's time to create the Category class:

    PHP:
    class Category extends Taxonomy
    {
        protected function videoUrls()
        {
            return $this->crawler->filter('PROPER CSS SELECTOR')->each(function($node) {
                return $node->link()->getUri();
            });
        }
    }
    Cool. So what is the proper CSS selector? Open your browser's developer tools, inspect the HTML a little bit, and you will notice that we can use the same CSS selector as in the Pornstar class :D Well, maybe the selector for tag pages will be different. So just copy and paste the selector.

    Now we will update the "parseCategories" method in the Video class:

    PHP:
    protected function parseCategories()
    {
        $categoryCollection = new CategoryCollection;

        $this->crawler->filter('.video-info-row:contains("Categories:") a:not(:contains("Suggest"))')->each(function(Crawler $node) use ($categoryCollection) {
            $categoryCollection->add(new Category($this->client, $node->link()->getUri()));
        });

        $this->categories = $categoryCollection;
    }
    Let's test our work:

    PHP:
    // script.php also needs "use WebmasterHacks\Pornhub\Category;" for the type hint
    $video = new Video($client, 'dirtysite/view_video.php?viewkey=ph55f4113f77d67');

    echo 'Pornstars:' . PHP_EOL;

    $video->pornstars()->each(function(Pornstar $pornstar) {
        echo $pornstar->url() . PHP_EOL;
    });

    echo 'Categories:' . PHP_EOL;

    $video->categories()->each(function(Category $category) {
        echo $category->url() . PHP_EOL;
    });
    Awesome :p

    Now it's time to create the Tag class. Let's see which CSS selector we can use... same...

    OK. Now we see that our videoUrls method looks the same in all 3 classes. We can move it to the parent class and forget about implementing it in the children.

    So do the same with the Tag class as you did with the Category class.
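    Here is a rough picture of where that leaves us. It is a sketch only: the classes have "Sketch" in their names, and the "crawler" is faked with an array keyed by CSS selector so it runs without Goutte. In the real Taxonomy class you would move the filter(...)->each(...) body from Pornstar into the parent and drop the abstract declaration.

```php
<?php

// Hypothetical stand-in for Taxonomy: the "crawler" is an array keyed by CSS selector.
abstract class TaxonomySketch
{
    protected $page;

    public function __construct(array $page)
    {
        $this->page = $page;
    }

    // No longer abstract: one shared implementation, because every child
    // used the same selector.
    protected function videoUrls()
    {
        $selector = '.videos .videoblock .title a';

        return isset($this->page[$selector]) ? $this->page[$selector] : [];
    }

    public function videos()
    {
        return $this->videoUrls();
    }
}

// The children shrink to empty classes. They exist to give each page type its own name.
class PornstarSketch extends TaxonomySketch {}
class CategorySketch extends TaxonomySketch {}
class TagSketch extends TaxonomySketch {}

$page = ['.videos .videoblock .title a' => ['video-1', 'video-2']];

foreach ([new PornstarSketch($page), new CategorySketch($page), new TagSketch($page)] as $taxonomy) {
    echo get_class($taxonomy) . ': ' . count($taxonomy->videos()) . PHP_EOL;
}
```

    If one page type ever needs a different selector, that child can still override videoUrls() on its own.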

    And now we can run:

    PHP:
    // script.php also needs "use WebmasterHacks\Pornhub\Tag;" for the type hint
    $video = new Video($client, 'dirtysite/view_video.php?viewkey=ph55f4113f77d67');

    echo 'Pornstars:' . PHP_EOL;

    $video->pornstars()->each(function(Pornstar $pornstar) {
        echo $pornstar->url() . PHP_EOL;
    });

    echo 'Categories:' . PHP_EOL;

    $video->categories()->each(function(Category $category) {
        echo $category->url() . PHP_EOL;
    });

    echo 'Tags:' . PHP_EOL;

    $video->tags()->each(function(Tag $tag) {
        echo $tag->url() . PHP_EOL;
    });
    That's it.

    Now we have a much cleaner API to use. Even if you don't know what is happening behind the scenes, you can guess.

    Any questions?
     