[Tutorial] How to scrape web pages using PHP

WebmasterHacks · Sep 8, 2015

Hello BHW community!

Today I'm gonna show you how to scrape (parse) web pages using PHP. As an example I chose PornHub.

Why?

1. A lot of people use content from adult websites for their projects.
2. It's a really good example for parsing. It has video pages, categories, tags, pornstar pages and so on.

Why should you read this tutorial?

I know there are a lot of articles about how to use PHP for parsing, but in 95% of cases they suck, just because people are using old tools. I want to show you what modern PHP development looks like.

I assume you have PHP installed and you can run it from the terminal. I'm using Ubuntu. If you are using Windows everything will work fine, just make sure that you can run PHP from the terminal (cmd). Just google how to add PHP to your PATH.

Also I can't post links, so I will use URLs like: dirtysite/view_video.php?viewkey=1337. Don't forget to change "dirtysite" to the real domain.

WebmasterHacks · Sep 8, 2015

Part 1. Composer.

We will use Composer for dependency management.

Why the hell would you reinvent the wheel when there are a lot of great libraries that have been used and tested by thousands of people?
Even better! We will just say which libraries we need and Composer will download them for us. Just go and download Composer right now.

Now we need to decide which library we will use for scraping. We can use some library to download HTML, another library to manipulate DOM or we can use Goutte.

Description from github: Goutte is a screen scraping and web crawling library for PHP. Goutte provides a nice API to crawl websites and extract data from the HTML/XML responses.

Sounds good to me.

Let's add it to our dependencies. Create a folder for your project (for example "pornhub_parser") and inside it create a file named "composer.json".

Content of that file should look like this:

Code:

{
    "require": {
        "fabpot/goutte": "3.1.*"
    }
}

If you don't know wtf this is, just relax. I will explain it later.

Now just run:

Code:

composer install

Composer will look at what dependencies your project has and download them into the "vendor" folder. It will also create the "vendor/autoload.php" file. Why?
For example, you are using 10 different libraries in your project. A library may contain hundreds of classes. Requiring them one by one would be a real pain. That's why Composer creates this autoload file. One file to require them all.
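If you are curious what "autoloading" means in practice, here is a tiny sketch in plain PHP. It is not Composer's real autoloader, just an illustration of the hook it relies on: PHP calls a registered function whenever an unknown class is requested. The class name "Some\Missing\ClassName" is made up.

```php
<?php

// Collects every class name PHP asks the autoloader for.
$requested = [];

// Register a callback that PHP calls when an unknown class is needed.
// Composer's autoloader would map the name to a file and require it;
// this sketch only records the name.
spl_autoload_register(function ($class) use (&$requested) {
    $requested[] = $class;
});

// Asking about a class that is not loaded triggers the callback.
class_exists('Some\\Missing\\ClassName');

print_r($requested);
```

Composer generates the real version of that callback for you, so a single "require 'vendor/autoload.php';" is enough.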

So the next time we need other libraries, we will just add them to "composer.json" and run "composer update".

Now it's time to write some code!

Let's start with something really simple: getting the title of the video.

Create "script.php" file in "pornhub_parser" folder and add this code:

PHP:

<?php

require 'vendor/autoload.php';

use Goutte\Client;

$client = new Client;
$crawler = $client->request('GET', 'dirtysite/view_video.php?viewkey=973043790');
$title = $crawler->filter('.video-wrapper .title-container .title')->first()->text();
echo $title . PHP_EOL;

Now run it:

Code:

php script.php

You should see the title of our video.

How cool is that? We didn't initialize cURL. We didn't use regular expressions to get the video title from HTML. We only used one CSS selector to get the elements we need. In the next part I will show other great features that you get from Goutte.

If you are familiar with Composer you already know how much easier life can be. If you aren't, spend some time reading about it. Trust me, you will get a good productivity boost.

So your task for today is to read about Composer and some documentation/examples about Goutte.

If you have any questions, do not hesitate to ask.

fidodido · Sep 8, 2015

Thanks, this is looking promising. I have never done PHP programming before, but I am well interested

lord1027 · Sep 10, 2015

When are the other parts coming out?

davids355 · Sep 10, 2015

Sorry to be so basic but what is composer and what are its benefits? I know a little basic php and I keep hearing about composer. It's like a framework right?

jazzc · Sep 10, 2015

davids355 said:
Sorry to be so basic but what is composer and what are its benefits? I know a little basic php and I keep hearing about composer. It's like a framework right?

It's a package manager: https://getcomposer.org/doc/00-intro.md

ChanzGrande · Sep 10, 2015

Looking forward to more scraping with php information in the coming days. It would be most excellent if OP didn't make it so basic. We can proceed past 1-2 steps at a time here at BHW, and are most familiar with complete walkthroughs, so lay it on us dude ... we all want to know how to use the most modern approach to quickly and easily scrape this kind of content.

WebmasterHacks · Sep 11, 2015

Part 2. Video page.

We already know how to get the title from the page, so let's scrape everything else.

Porn stars... If you open your browser's developer tools you will see that the porn star links are inside an element with the class "video-info-row" that contains the word "Pornstars". There is also a "Suggest" link there, and we don't need it.

PHP:

$pornstars = $crawler->filter('.video-info-row:contains("Pornstars:") a:not(:contains("Suggest"))')->each(function($node) {
    return $node->text();
});

Easy. It's important to see that I'm not creating an empty array and then populating it with elements. I'm using the "each" method. It works the same as the "array_map" function.
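To make the comparison with "array_map" concrete, here is a small sketch in plain PHP (no Goutte involved, and the $names array is made-up sample data). Both versions build the same array, but the second one doesn't need an empty array and a loop.

```php
<?php

// Made-up sample data standing in for the text of scraped nodes.
$names = ['  Alice ', 'Bob', ' Carol'];

// Version 1: create an empty array and populate it in a loop.
$loopResult = [];
foreach ($names as $name) {
    $loopResult[] = trim($name);
}

// Version 2: let array_map build the array from the callback's return values.
$mapResult = array_map(function ($name) {
    return trim($name);
}, $names);

// Both approaches produce identical arrays.
var_dump($loopResult === $mapResult);
```

Goutte's "each" follows the second style: whatever the callback returns for a node ends up in the resulting array.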

The next step is categories. The only difference is that the element contains the word "Categories".

PHP:

$categories = $crawler->filter('.video-info-row:contains("Categories:") a:not(:contains("Suggest"))')->each(function($node) {
    return $node->text();
});

Same stuff with tags. Too easy.

Now we only need the URL of the video file. Open the page's source code and look for "mp4" there. You will find that our URL is in JavaScript: "var player_quality_720p = 'OUR URL'".

It's time to use regular expressions. In our case it's super easy to scrape the URL.

PHP:

preg_match('/var player_quality_720p = \'(?<mp4>.*?)\'/', $crawler->html(), $matches);
$mp4 = $matches['mp4'];

That's it. You have every important thing about your video.

Our next task is to make this code reusable. How can we do that? Create a class. Why? I will show you.

It's important to have objects that are very intuitive. Example:

PHP:

$video = new Video($client, 'dirtysite/view_video.php?viewkey=973043790');
$video->getTitle(); // returns video title
$video->getPornstars(); // returns array of pornstars
$video->toArray(); // returns proper associative array
$video->toJson(); // returns JSON string

Now you don't need to remember how your code works, what CSS selector you should use, or what the proper JavaScript variable name in the regular expression was. The class will remember everything and provide you with the methods you need.

Time to create a simple class!

PHP:

<?php

require __DIR__ . '/vendor/autoload.php';

use Goutte\Client;

class Video {

    protected $client;
    protected $url;
    protected $crawler;
    protected $title;
    protected $pornstars;
    protected $categories;
    protected $tags;
    protected $mp4;

    public function __construct($client, $url)
    {
        $this->client = $client;
        $this->url = $url;
        $this->crawler = $client->request('GET', $url);

        $this->parseTitle();
        $this->parsePornstars();
        $this->parseCategories();
        $this->parseTags();
        $this->parseMp4();
    }

    public function getUrl()
    {
        return $this->url;
    }

    public function getTitle()
    {
        return $this->title;
    }

    public function getPornstars()
    {
        return $this->pornstars;
    }

    public function getCategories()
    {
        return $this->categories;
    }

    public function getTags()
    {
        return $this->tags;
    }

    public function getMp4()
    {
        return $this->mp4;
    }

    public function toArray()
    {
        return [
            'title' => $this->getTitle(),
            'pornstars' => $this->getPornstars(),
            'categories' => $this->getCategories(),
            'tags' => $this->getTags(),
            'mp4' => $this->getMp4()
        ];
    }

    public function toJson()
    {
        return json_encode($this->toArray());
    }

    protected function parseTitle()
    {
        $this->title = trim($this->crawler->filter('.video-wrapper .title-container .title')->first()->text());
    }

    protected function parsePornstars()
    {
        $this->pornstars = $this->crawler->filter('.video-info-row:contains("Pornstars:") a:not(:contains("Suggest"))')->each(function($node) {
            return trim($node->text());
        });
    }

    protected function parseCategories()
    {
        $this->categories = $this->crawler->filter('.video-info-row:contains("Categories:") a:not(:contains("Suggest"))')->each(function($node) {
            return trim($node->text());
        });
    }

    protected function parseTags()
    {
        $this->tags = $this->crawler->filter('.video-info-row:contains("Tags:") a:not(:contains("Suggest"))')->each(function($node) {
            return trim($node->text());
        });
    }

    protected function parseMp4()
    {
        preg_match('/var player_quality_720p = \'(?<mp4>.*?)\'/', $this->crawler->html(), $matches);
        $this->mp4 = $matches['mp4'];
    }

}

Let's try to use it.

PHP:

$client = new Client;

$video = new Video($client, 'dirtysite/view_video.php?viewkey=973043790');
var_dump($video->getTitle());
var_dump($video->getPornstars());
var_dump($video->getCategories());
var_dump($video->getTags());
var_dump($video->getMp4());
var_dump($video->toArray());
var_dump($video->toJson());

That's it!

If you are not familiar with OOP (Object Oriented Programming) you definitely should take a look at it. Don't dig too deep. You only need to understand what "private", "protected", "public", "$this", "self", "static" and "__construct" mean.

Also, if you are wondering why I create "$client" and pass it into the constructor (I could just create $client inside the constructor), stay tuned

Any questions?

P.S. I created a repo on github. You can check what your code should look like in the end. I can't use links yet, so try to find my nickname there.

nocare · Sep 11, 2015

Here ya go: https://github.com/WebmasterHacks/pornhub
Might be useful to show people how to build your css selectors as well. You kind of just threw them out there.
A screen recorder and virtual dub can make you some nice gifs...

Zekie · Sep 11, 2015

Can this be used to scrape, lets say, videos on Youtube? If so how long does it take to do an API call for a keyword related video, could it be done on pageload and still manage to load in a decent amount of time? Thanks for your opinion!

Peace,
Z

Techbiggy · Sep 11, 2015

Tested already thank you

WebmasterHacks · Sep 11, 2015

nocare said:
Might be useful to show people how to build your css selectors as well. You kind of just threw them out there.

Yes, that's true. I really don't know the skill level of the readers, that's why I'm skipping some parts. But I will keep in mind that this can be a problem. That's why feedback is so important. Maybe I will do an additional part about CSS selectors.

nocare said:
A screen recorder and virtual dub can make you some nice gifs...

This is the next step

I'm just checking if anyone is interested in this stuff.

Zekie said:
Can this be used to scrape, lets say, videos on Youtube? If so how long does it take to do an API call for a keyword related video, could it be done on pageload and still manage to load in a decent amount of time? Thanks for your opinion!

To check how fast it will work you can open Youtube with JavaScript disabled. We are not using a real browser, so no additional AJAX requests will be fired. Also no images will be downloaded. So it should work fast. But maybe some content will not be available without JavaScript. For example, comments are loaded dynamically and you would not be able to see them in static HTML. But you have the option to profile which requests are made when the page is loading and just make them manually. It really depends on your task.

Techbiggy said:
Tested already thank you

Thank you!

WebmasterHacks · Sep 13, 2015

Part 3. Porn star page.

Now we know how to scrape data from a video page URL. It's time to collect those URLs. Let's start with the porn star page.

PHP:

$client = new Client;

$crawler = $client->request('GET', 'dirtysite/pornstar/madison-ivy');

$urls = $crawler->filter('.videos .videoblock .title a')->each(function($node) {
    return $node->link()->getUri();
});

var_dump($urls);

Done. We will get the URLs from the first page. Now we can loop through them and create a Video object for each URL to get the data we need.

PHP:

foreach ($urls as $url) {
    $video = new Video($client, $url);
    echo $video->getTitle() . PHP_EOL;
}

That was easy. But now we will get a few errors.

First one:

Code:

PHP Notice: Undefined index: mp4 in ../script.php

This means that our regular expression doesn't work every time. Let's fix it.

As you remember, our pattern was: var player_quality_720p = 'OUR LINK', but sometimes there will be no 720p quality variable. By looking at the source code you could find "var player_quality_240p" or "var player_quality_480p" or something else.

There are a few options here.

1. Find first URL and return it.
2. Find all URLs and return array.
3. Find all URLs and return best quality URL.

For the sake of example I will use the first option.

Let's change our regular expression.

PHP:

preg_match('/var player_quality_\d+p = \'(?<mp4>.*?)\'/', $this->crawler->html(), $matches);

\d+ means: one or more digits. This pattern will match "player_quality_720p", "player_quality_1p", "player_quality_1337p", "player_quality_1234567890p" and so on.
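For comparison, here is a rough sketch of how the third option (return the best quality URL) could look. It runs against a made-up HTML string, and "bestQualityUrl" is a hypothetical helper name, not something from Goutte or from the repo. It also returns null when nothing matches, so there is no "Undefined index" notice.

```php
<?php

// Made-up page source; the real page may look different.
$html = "var player_quality_240p = 'http://example.com/240.mp4';\n"
      . "var player_quality_720p = 'http://example.com/720.mp4';\n"
      . "var player_quality_480p = 'http://example.com/480.mp4';\n";

// Hypothetical helper: returns the URL with the highest quality, or null.
function bestQualityUrl($html)
{
    $pattern = '/var player_quality_(?<quality>\d+)p = \'(?<mp4>.*?)\'/';

    // preg_match_all returns the number of matches (0 if there are none).
    if (!preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
        return null;
    }

    // Keep the match with the biggest number in front of "p".
    $best = null;
    foreach ($matches as $match) {
        if ($best === null || (int) $match['quality'] > (int) $best['quality']) {
            $best = $match;
        }
    }

    return $best['mp4'];
}

var_dump(bestQualityUrl($html));
var_dump(bestQualityUrl('no player variables here'));
```

The tutorial itself keeps going with the first option, so treat this only as an illustration of the alternative.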

Now you can run the script again and see that everything works fine. The next step is to create a class, just to make life easier.

PHP:

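// Note: the Crawler type hint below needs
// "use Symfony\Component\DomCrawler\Crawler;" next to "use Goutte\Client;" at the top of the file.
class Pornstar {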

    protected $client;
    protected $url;
    protected $crawler;

    public function __construct($client, $url)
    {
        $this->client = $client;
        $this->url = $url;
        $this->crawler = $this->client->request('GET', $url);
    }

    public function getUrl()
    {
        return $this->url;
    }

    public function getVideoUrls()
    {
        return $this->crawler->filter('.videos .videoblock .title a')->each(function(Crawler $node) {
            return $node->link()->getUri();
        });
    }

}

So now we have URLs from first page, what about other pages?

Our scraper should check if there is a "Next" page link and, if there is, go there.

Let's create method that will check if there is "Next" page link.

PHP:

public function hasNextPage()
{
    return $this->crawler->filter('.pagination3 .page_next')->count() > 0;
}

Cool. Next method will go to next page.

PHP:

public function goToNextPage()
{
    $link = $this->crawler->filter('.pagination3 .page_next a')->link();
    $this->crawler = $this->client->click($link);
}

Done. Let's check if everything works fine.

PHP:

$client = new Client;
$pornstar = new Pornstar($client, 'dirtysite/pornstar/madison-ivy');
var_dump($pornstar->getVideoUrls());
$pornstar->goToNextPage();
var_dump($pornstar->getVideoUrls());

Run script and you should see 2 different arrays. Cool. So how do we scrape all URLs? Easy.

PHP:

$client = new Client;
$pornstar = new Pornstar($client, 'dirtysite/pornstar/madison-ivy');
var_dump($pornstar->getVideoUrls());
while ($pornstar->hasNextPage()) {
    $pornstar->goToNextPage();
    var_dump($pornstar->getVideoUrls());
}

Simple. But let's make it even simpler. Let's create a method that will return all URLs, so we will not need to write loops ourselves.

PHP:

public function getAllVideoUrls()
{
    $urls = $this->getVideoUrls();

    while ($this->hasNextPage()) {
        $this->goToNextPage();
        $urls = array_merge($urls, $this->getVideoUrls());
    }

    return $urls;
}

Now test our method.

PHP:

$client = new Client;
$pornstar = new Pornstar($client, 'dirtysite/pornstar/madison-ivy');
var_dump($pornstar->getAllVideoUrls());

This looks better.

Now we can scrape data from video pages like this.

PHP:

$client = new Client;
$pornstar = new Pornstar($client, 'dirtysite/pornstar/madison-ivy');

foreach ($pornstar->getAllVideoUrls() as $url) {
    $video = new Video($client, $url);
    echo $video->getTitle() . PHP_EOL;
}

Everything works great, but sometimes we get a strange error.

Code:

PHP Fatal error:  Uncaught exception 'InvalidArgumentException' with message 'The current node list is empty.'

It means that sometimes we are trying to work with unexpected HTML. Before diving into the code let's think a little bit. When you make a request with a real browser you are sending a User Agent string that contains information about your browser. We are not using a real browser, so what do we send in the User Agent string? "Symfony2 BrowserKit". Let's change that to a normal User Agent string.

PHP:

$client = new Client;
$client->setHeader('User-Agent', 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/44.0.2403.89 Chrome/44.0.2403.89 Safari/537.36');

Sometimes it's really important to look like real browser... or google bot

Also it's important to pause between requests. Right now we will use "sleep" function. Not the best way and we will improve that later.
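One possible variation on plain "sleep", shown only as a sketch: wait a random amount of time between requests instead of a fixed second. "politePause" is a hypothetical helper name, the URLs are made up, and no real requests are made here.

```php
<?php

// Hypothetical helper: sleeps for a random time between $minMs and $maxMs
// milliseconds and returns how long it decided to wait.
function politePause($minMs, $maxMs)
{
    $delayMs = mt_rand($minMs, $maxMs);
    usleep($delayMs * 1000); // usleep() takes microseconds

    return $delayMs;
}

// Made-up URLs; in the tutorial these would come from getAllVideoUrls().
$urls = ['dirtysite/video-1', 'dirtysite/video-2', 'dirtysite/video-3'];

foreach ($urls as $url) {
    echo 'Would request ' . $url . PHP_EOL;
    politePause(100, 300);
}
```

The numbers are arbitrary. Pick delays that make sense for the site you are working with.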

The next step is to add some sugar.

Look at this code:

PHP:

foreach ($pornstar->getAllVideoUrls() as $url) {
    $video = new Video($client, $url);
    echo $video->getTitle() . PHP_EOL;
}

Can we make it better? Yes we can! I really want to use my scraper like this:

PHP:

foreach ($pornstar->getVideos() as $video) {
    echo $video->getTitle() . PHP_EOL;
}

Well it's kinda easy to implement but we will have a little problem there.

For example, a porn star may have 10k videos (a really hardworking person). We are scraping each video page, so we get the mp4 URL from the first page early on. It's important to say that this URL will not work forever. Adult tubes have temporary URLs for video streaming. You can't scrape one and put it on your website. It may work for some time, but eventually the URL will die and you will have a tube with dead videos. Just imagine that you are scraping the 1000th video page and the URL from the first one has already died.

A few options here.

1. Do not make the request in the constructor. We could have a Video object initialized, but data about the video would not be available yet. When we ask for it, the scraping will happen. So now we could create an array of Video objects and then loop through them, getting fresh data.
2. Add some callback functionality. The callback will be fired when a new video page is scraped. Example:

PHP:

$pornstar->eachVideo(function($video) {
    echo $video->getTitle();
});

I will go with the first option.
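Just for comparison, the second (callback) option could be sketched roughly like this. "FakeListing" is a hypothetical stand-in for the Pornstar class that does not make any requests. It only shows the idea that each item is handed to the callback as soon as it is ready.

```php
<?php

// Hypothetical stand-in for the Pornstar class: it "scrapes" items one by one
// and hands each one to the callback as soon as it is ready.
class FakeListing
{
    protected $urls;

    public function __construct(array $urls)
    {
        $this->urls = $urls;
    }

    public function eachVideo(callable $callback)
    {
        foreach ($this->urls as $url) {
            // A real implementation would request and parse $url here.
            $video = ['url' => $url, 'title' => 'Title of ' . $url];
            $callback($video);
        }
    }
}

$listing = new FakeListing(['dirtysite/a', 'dirtysite/b']);

// The callback sees every video right after it was "scraped".
$seen = [];
$listing->eachVideo(function ($video) use (&$seen) {
    $seen[] = $video['title'];
});

print_r($seen);
```

With this approach the data is always fresh when the callback runs, but you lose the simple "foreach" style.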

Let's change our Video class.

Modify constructor:

PHP:

public function __construct(Client $client, $url)
{
    $this->client = $client;
    $this->url = $url;
}

Add new method:

PHP:

protected function parseIfNeeded()
{
    if ($this->parsed) return;

    $this->crawler = $this->client->request('GET', $this->getUrl());

    $this->parseTitle();
    $this->parsePornstars();
    $this->parseCategories();
    $this->parseTags();
    $this->parseMp4();

    $this->parsed = true;
}

Modify getTitle method:

PHP:

public function getTitle()
{
    $this->parseIfNeeded();

    return $this->title;
}

Same with all get* methods.

Now we can create Video object and no request will be triggered. Later we will ask for data (title for example) and everything will be scraped.

Our Video class will look like this:

PHP:

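// Note: the Crawler type hints below need
// "use Symfony\Component\DomCrawler\Crawler;" next to "use Goutte\Client;" at the top of the file.
class Video {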

    protected $client;
    protected $url;
    protected $crawler;
    protected $title;
    protected $pornstars;
    protected $categories;
    protected $tags;
    protected $mp4;
    protected $parsed = false;

    public function __construct(Client $client, $url)
    {
        $this->client = $client;
        $this->url = $url;
    }

    public function getUrl()
    {
        return $this->url;
    }

    public function getTitle()
    {
        $this->parseIfNeeded();

        return $this->title;
    }

    public function getPornstars()
    {
        $this->parseIfNeeded();

        return $this->pornstars;
    }

    public function getCategories()
    {
        $this->parseIfNeeded();

        return $this->categories;
    }

    public function getTags()
    {
        $this->parseIfNeeded();

        return $this->tags;
    }

    public function getMp4()
    {
        $this->parseIfNeeded();

        return $this->mp4;
    }

    public function toArray()
    {
        return [
            'title' => $this->getTitle(),
            'pornstars' => $this->getPornstars(),
            'categories' => $this->getCategories(),
            'tags' => $this->getTags(),
            'mp4' => $this->getMp4()
        ];
    }

    public function toJson()
    {
        return json_encode($this->toArray());
    }

    protected function parseIfNeeded()
    {
        if ($this->parsed) return;

        $this->crawler = $this->client->request('GET', $this->getUrl());

        $this->parseTitle();
        $this->parsePornstars();
        $this->parseCategories();
        $this->parseTags();
        $this->parseMp4();

        $this->parsed = true;
    }

    protected function parseTitle()
    {
        $this->title = trim($this->crawler->filter('.video-wrapper .title-container .title')->first()->text());
    }

    protected function parsePornstars()
    {
        $this->pornstars = $this->crawler->filter('.video-info-row:contains("Pornstars:") a:not(:contains("Suggest"))')->each(function(Crawler $node) {
            return trim($node->text());
        });
    }

    protected function parseCategories()
    {
        $this->categories = $this->crawler->filter('.video-info-row:contains("Categories:") a:not(:contains("Suggest"))')->each(function(Crawler $node) {
            return trim($node->text());
        });
    }

    protected function parseTags()
    {
        $this->tags = $this->crawler->filter('.video-info-row:contains("Tags:") a:not(:contains("Suggest"))')->each(function(Crawler $node) {
            return trim($node->text());
        });
    }

    protected function parseMp4()
    {
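        preg_match('/var player_quality_\d+p = \'(?<mp4>.*?)\'/', $this->crawler->html(), $matches);
        // No match means no mp4 URL on the page; store null instead of triggering a notice.
        $this->mp4 = isset($matches['mp4']) ? $matches['mp4'] : null;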
    }

}

Let's add "getVideos" and "getAllVideos" to Pornstar class.

PHP:

public function getVideos()
{
    return array_map(function($url) {
        return new Video($this->client, $url);
    }, $this->getVideoUrls());
}

public function getAllVideos()
{
    return array_map(function($url) {
        return new Video($this->client, $url);
    }, $this->getAllVideoUrls());
}

Let's test how it works:

PHP:

$client = new Client;
$client->setHeader('User-Agent', 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/44.0.2403.89 Chrome/44.0.2403.89 Safari/537.36');
$pornstar = new Pornstar($client, 'dirtysite/pornstar/madison-ivy');

foreach ($pornstar->getAllVideos() as $video) {
    echo $video->getTitle() . PHP_EOL;
}

Cool. Sometimes you will get an error. Just use "sleep" inside the loop.

PHP:

foreach ($pornstar->getAllVideos() as $video) {
    echo $video->getTitle() . PHP_EOL;
    sleep(1);
}

Later we will improve this.

P.S. Github repo updated.

Mutikasa · Sep 14, 2015

how to get src attribute from img?

WebmasterHacks · Sep 14, 2015

Mutikasa said:
how to get src attribute from img?

Example:

PHP:

$crawler->filter('.withBio img')->first()->attr('src');

DunDidIt2X · Sep 14, 2015

Very Nice Share. Sure lots of people will find this helpful.

MKSKS · Sep 14, 2015

Nice share op. I have been using simplehtmldom parser to collect information from various websites, will try this too.

Thanks

godspeed007 · Sep 14, 2015

Amazing job OP!!!!!

WebmasterHacks · Sep 15, 2015

Part 4. Refactor.

OK. Our scraper works already, but there is a lot to improve. First of all let's see what we have done.

We have a Video object that represents a video page. You can ask for data and it will return it.

We have a Pornstar object that represents... well, pornstar page(s). You can ask for videos.

So Pornstar can return Videos. Should our Video also return Pornstars?

You can do this:

PHP:

foreach ($pornstar->getVideos() as $video) {
    echo $video->getTitle();
}

But you can't do this:

PHP:

foreach ($video->getPornstars() as $pornstar) {
    echo $pornstar->getName();
}

Because "getPornstars" just returns an array of names. Well, in my opinion it should return an array of pornstars that have "getVideos" methods. Even better: not an array, but a collection with some helpful methods.

Also let's drop get* prefix from methods. In the end our code should look something like this:

PHP:

$pornstar->videos()->first()->tags()->each(function($tag) {
    echo $tag->name();
});

We will add collections later. It's not that important right now.

Before we start I want to clean up our code a little bit. Right now there are 2 classes in 1 file. Not cool. We should save them separately.
Also we will use namespaces. Maybe later we will make this a composer package and we don't want to have problems with names. A user of our package may already have a Video class. Namespaces will help to solve this problem.

I will use the psr-4 autoload standard. Sounds fancy? There are some standards in the PHP community. psr-4 is all about "where to find classes".

Update your composer.json like this:

Code:

{
    "require": {
        "fabpot/goutte": "3.1.*"
    },
    "autoload": {
        "psr-4": {
            "WebmasterHacks\\": "src/WebmasterHacks/"
        }
    }
}

The "autoload" part is saying: map this namespace to this folder. That's it.
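If the mapping feels like magic, here is a simplified sketch of the lookup. It is not Composer's actual code, and "psr4Path" is a hypothetical helper name. It only shows how a class name can be turned into a file path using the prefix and folder from the config above.

```php
<?php

// Hypothetical helper that mimics the PSR-4 lookup: strip the namespace
// prefix, then turn the remaining namespace separators into directories.
function psr4Path($class, $prefix, $baseDir)
{
    if (strpos($class, $prefix) !== 0) {
        return null; // class is outside of the mapped namespace
    }

    $relative = substr($class, strlen($prefix));

    return $baseDir . str_replace('\\', '/', $relative) . '.php';
}

// Prefix and folder are the same as in composer.json above.
echo psr4Path('WebmasterHacks\\Pornhub\\Video', 'WebmasterHacks\\', 'src/WebmasterHacks/') . PHP_EOL;
```

So the namespace of a class tells you which folder its file has to live in.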

So we will move our Video class to the src/WebmasterHacks/Pornhub/Video.php file.

Now we are using namespaces, so don't forget to add them. The top of the Video.php file will look like this:

PHP:

<?php

namespace WebmasterHacks\Pornhub;

use Goutte\Client;
use Symfony\Component\DomCrawler\Crawler;

class Video
{
    // class declaration
}

The Pornstar class will be moved to the src/WebmasterHacks/Pornhub/Pornstar.php file.

Don't forget to add the namespace thing!

So now we need composer to update the autoload.php file.

Code:

composer dump-autoload

Also we could run:

Code:

composer update

But I'm not interested in updating dependencies as well.

Now our script.php should look like this:

PHP:

<?php

require __DIR__ . '/vendor/autoload.php';

use Goutte\Client;
use WebmasterHacks\Pornhub\Pornstar;

$client = new Client;
$client->setHeader('User-Agent', 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/44.0.2403.89 Chrome/44.0.2403.89 Safari/537.36');
$pornstar = new Pornstar($client, 'dirtysite/pornstar/madison-ivy');

foreach ($pornstar->getAllVideos() as $video) {
    echo $video->getTitle() . PHP_EOL;
}

Also, if you are not comfortable with these changes and you are afraid of breaking something, just go to github and download the code from there. The latest working version is always there.

OK. Let's start with the Pornstar class. Remove all get* prefixes. Done. Let's think a little bit. What do we need from this class? Do we really need the "hasNextPage", "goToNextPage", "videoUrls", "allVideoUrls" and "allVideos" methods? No.

Really we want only the "url", "name", "image" and "videos" methods. We will not delete the other methods. We will hide them inside the class, so the user of our class will have only 4 methods to work with. Clean API. But inside we will still use those methods to get URLs, find "Next" page links and so on.

Our task is to change the "public" keyword to "protected", so these methods can be executed only inside our class. Also we should rename the "allVideos" method to "videos". We can remove the old "videos" method. It returns videos from the first page... not really useful.

Done.

Now we can use Pornstar class like this:

PHP:

foreach ($pornstar->videos() as $video) {
    echo $video->getTitle() . PHP_EOL;
}

Nice. Let's go and change Video class.

Same with get* prefix. After that our script.php may look like this:

PHP:

foreach ($pornstar->videos() as $video) {
    echo $video->title() . PHP_EOL;
}

Much cleaner.

Now we need to fix that problem with symmetry. If Pornstar can return array of Video objects, Video object should return array of Pornstar objects. Let's update "parsePornstars" method:

PHP:

protected function parsePornstars()
{
    $this->pornstars = $this->crawler->filter('.video-info-row:contains("Pornstars:") a:not(:contains("Suggest"))')->each(function(Crawler $node) {
        return new Pornstar($this->client, $node->link()->getUri());
    });
}

That's it! You can test it:

PHP:

$video = new Video($client, 'dirtysite/view_video.php?viewkey=158407481');

foreach ($video->pornstars() as $pornstar) {
    echo $pornstar->url() . PHP_EOL;
}

Now you can do stupid stuff like this:

PHP:

foreach ($video->pornstars() as $pornstar) {
    foreach ($pornstar->videos() as $otherVideo) {
        foreach ($otherVideo->pornstars() as $otherPornstar) {
            echo $otherPornstar->url() . PHP_EOL;
        }
    }
}

Don't run it

The important thing is that it works and it makes sense. Videos have pornstars, pornstars have videos that have pornstars... and so on.

WebmasterHacks · Sep 15, 2015

Part 5. Collections.

Now we will add collections to our scraper. First, create the src/WebmasterHacks/Pornhub/VideoCollection.php file with this content:

PHP:

<?php

namespace WebmasterHacks\Pornhub;

class VideoCollection
{
    protected $videos = [];

    public function add(Video $video)
    {
        array_push($this->videos, $video);
    }

    public function count()
    {
        return count($this->videos);
    }

    public function isEmpty()
    {
        return $this->count() === 0;
    }

    public function isNotEmpty()
    {
        return !$this->isEmpty();
    }

    public function each(\Closure $callback)
    {
        foreach ($this->videos as $video) {
            $callback($video);
        }
    }
}

Second, update the "videos" method in the Pornstar class:

PHP:

public function videos()
{
    $videoCollection = new VideoCollection;

    foreach ($this->allVideoUrls() as $url) {
        $videoCollection->add(new Video($this->client, $url));
    }

    return $videoCollection;
}

Now "videos" will return collection with helpful methods (we will add more later). Let's check that everything works fine.

PHP:

// The Video type hint below needs "use WebmasterHacks\Pornhub\Video;" at the top of script.php.
$pornstar = new Pornstar($client, 'dirtysite/pornstar/madison-ivy');

$pornstar->videos()->each(function(Video $video) {
    echo $video->title() . PHP_EOL;
});

Cool.

Also let's update "parsePornstars" method in Video class. It should return PornstarCollection. But first let's create PornstarCollection class.

PHP:

<?php

namespace WebmasterHacks\Pornhub;

class PornstarCollection
{
    protected $pornstars = [];

    public function add(Pornstar $pornstar)
    {
        array_push($this->pornstars, $pornstar);
    }

    public function count()
    {
        return count($this->pornstars);
    }

    public function isEmpty()
    {
        return $this->count() === 0;
    }

    public function isNotEmpty()
    {
        return !$this->isEmpty();
    }

    public function each(\Closure $callback)
    {
        foreach ($this->pornstars as $pornstar) {
            $callback($pornstar);
        }
    }
}

And updated "parsePornstars" method:

PHP:

protected function parsePornstars()
{
    $pornstarCollection = new PornstarCollection;

    $this->crawler->filter('.video-info-row:contains("Pornstars:") a:not(:contains("Suggest"))')->each(function(Crawler $node) use ($pornstarCollection) {
        $pornstarCollection->add(new Pornstar($this->client, $node->link()->getUri()));
    });

    $this->pornstars = $pornstarCollection;
}

Let's see how it works:

PHP:

$video = new Video($client, 'dirtysite/view_video.php?viewkey=ph55f4113f77d67');

$video->pornstars()->each(function(Pornstar $pornstar) {
    echo $pornstar->url() . PHP_EOL;
});

Nice!

Did you notice that VideoCollection and PornstarCollection look the same? Later we will create TagCollection and CategoryCollection. They will also look the same. So let's DRY our code (Don't Repeat Yourself). We will create a Collection class and PornstarCollection will inherit behavior from it.

Create src/WebmasterHacks/Pornhub/Collection.php file with this content:

PHP:

<?php

namespace WebmasterHacks\Pornhub;

abstract class Collection
{
    protected $items = [];

    public function add($item)
    {
        array_push($this->items, $item);
    }

    public function count()
    {
        return count($this->items);
    }

    public function isEmpty()
    {
        return $this->count() === 0;
    }

    public function isNotEmpty()
    {
        return !$this->isEmpty();
    }

    public function each(\Closure $callback)
    {
        foreach ($this->items as $item) {
            $callback($item);
        }
    }
}

This is an abstract class and we can't just use it like:

PHP:

$collection = new Collection;

We can only inherit from it. PornstarCollection.php will now look like:

PHP:

<?php

namespace WebmasterHacks\Pornhub;

class PornstarCollection extends Collection {}
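If you want to see the abstract class idea in isolation, here is a small sketch you can run without Goutte. The classes are simplified stand-ins for the ones above (no namespace, only two methods, plain strings instead of Pornstar objects), so treat it as an illustration and not as project code. I'm assuming PHP 7 or newer for the try/catch part; on PHP 5 the same line is a fatal error that stops the script.

```php
<?php

// Simplified stand-in for the tutorial's Collection (assumption: no namespace, fewer methods).
abstract class Collection
{
    protected $items = [];

    public function add($item)
    {
        $this->items[] = $item;
    }

    public function count()
    {
        return count($this->items);
    }
}

// The child declares nothing, but still gets add() and count() from the parent.
class PornstarCollection extends Collection {}

$pornstars = new PornstarCollection;
$pornstars->add('first item');
$pornstars->add('second item');

echo $pornstars->count() . PHP_EOL; // prints 2

// Creating the abstract parent directly is not allowed.
// On PHP 7+ this throws an Error that we can catch and print.
try {
    $collection = new Collection;
} catch (\Error $e) {
    echo $e->getMessage() . PHP_EOL;
}
```

So the child class is empty, but it behaves like a full collection, and nobody can create a "raw" Collection by mistake.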

Now check if everything works properly:

PHP:

$video = new Video($client, 'dirtysite/view_video.php?viewkey=ph55f4113f77d67');

$video->pornstars()->each(function(Pornstar $pornstar) {
    echo $pornstar->url() . PHP_EOL;
});

Everything works great. Also don't forget to update VideoCollection.php:

PHP:

<?php

namespace WebmasterHacks\Pornhub;

class VideoCollection extends Collection {}

Now I want to add 2 methods to VideoCollection: "first" and "last". I can add them to the VideoCollection class OR the Collection class. Well, PornstarCollection should have these 2 methods as well. So let's update their parent.

PHP:

public function first()
{
    if ($this->isEmpty()) {
        return null;
    }

    return $this->items[0];
}

public function last()
{
    if ($this->isEmpty()) {
        return null;
    }

    return $this->items[$this->count() - 1];
}

And a little test:

PHP:

$video = new Video($client, 'dirtysite/view_video.php?viewkey=ph55f4113f77d67');

echo $video->pornstars()->first()->url() . PHP_EOL;
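One thing to keep in mind: "first" and "last" return null for an empty collection, so chaining "->url()" right after them will crash on a video that has no pornstars listed. Here is a minimal sketch of a null check. DemoCollection and DemoItem are hypothetical stand-ins I made up so the example runs without Goutte; in the real project you would do the same check on $video->pornstars()->first().

```php
<?php

// Hypothetical stand-in for Collection, reduced to what the sketch needs.
class DemoCollection
{
    protected $items = [];

    public function add($item)
    {
        $this->items[] = $item;
    }

    public function isEmpty()
    {
        return count($this->items) === 0;
    }

    public function first()
    {
        return $this->isEmpty() ? null : $this->items[0];
    }
}

// Hypothetical stand-in for Pornstar: it only remembers its URL.
class DemoItem
{
    protected $url;

    public function __construct($url)
    {
        $this->url = $url;
    }

    public function url()
    {
        return $this->url;
    }
}

$empty = new DemoCollection;

// first() is null here, so we check before calling url().
$first = $empty->first();
echo ($first === null ? 'no pornstars listed' : $first->url()) . PHP_EOL;

$filled = new DemoCollection;
$filled->add(new DemoItem('dirtysite/pornstar/madison-ivy'));

$first = $filled->first();
echo ($first === null ? 'no pornstars listed' : $first->url()) . PHP_EOL;
```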

The next step is to create CategoryCollection and TagCollection. Easy. Create a class that extends Collection and it's done.

CategoryCollection will contain Category objects, but we don't have a class for that yet. The Category class will represent a category page and it should return videos... Just like the Pornstar class.

Let's think about the Pornstar and Category classes. The pages look similar. They will do the same job: find video URLs, try to find the "Next" page and so on. So they both should have "url" and "videos" methods. If you look at those methods, they don't work with HTML directly; they work with other methods. Which method looks for links? "videoUrls". So we could copy our Pornstar class, paste it into src/WebmasterHacks/Pornhub/Category.php, rename it and only update the "videoUrls" method. We could, but we won't. Let's use inheritance.

I will create an abstract Taxonomy class, and Pornstar, Category and Tag will inherit their main behavior from it.

PHP:

<?php

namespace WebmasterHacks\Pornhub;

use Goutte\Client;
use Symfony\Component\DomCrawler\Crawler;

abstract class Taxonomy
{
    protected $client;
    protected $url;
    protected $crawler;

    public function __construct(Client $client, $url)
    {
        $this->client = $client;
        $this->url = $url;
        $this->crawler = $this->client->request('GET', $url);
    }

    public function url()
    {
        return $this->url;
    }

    public function videos()
    {
        $videoCollection = new VideoCollection;

        foreach ($this->allVideoUrls() as $url) {
            $videoCollection->add(new Video($this->client, $url));
        }

        return $videoCollection;
    }

    protected abstract function videoUrls();

    protected function hasNextPage()
    {
        return $this->crawler->filter('.pagination3 .page_next')->count() > 0;
    }

    protected function goToNextPage()
    {
        $link = $this->crawler->filter('.pagination3 .page_next a')->link();
        $this->crawler = $this->client->click($link);
    }

    protected function allVideoUrls()
    {
        $urls = $this->videoUrls();

        while ($this->hasNextPage()) {
            $this->goToNextPage();
            $urls = array_merge($urls, $this->videoUrls());
        }

        return $urls;
    }
}

Done. See that "protected abstract function videoUrls();"? This line of code tells us: "If you use this class as a parent, the child must implement this method".
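To make that contract easier to see, here is a stripped-down sketch of the same idea that runs without any HTTP requests. DemoTaxonomy and DemoListing are hypothetical names, each "page" is just an array of links, and I made allVideoUrls public only so the demo can call it directly. The parent owns the paging loop and the child only says how to get links from the current page.

```php
<?php

// Hypothetical stand-in for Taxonomy: pages are plain arrays, no Goutte client.
abstract class DemoTaxonomy
{
    protected $pages;
    protected $current = 0;

    public function __construct(array $pages)
    {
        $this->pages = $pages;
    }

    // Every child must say how links are extracted from the current page.
    abstract protected function videoUrls();

    protected function hasNextPage()
    {
        return $this->current < count($this->pages) - 1;
    }

    protected function goToNextPage()
    {
        $this->current++;
    }

    // Public here only for the demo; the paging loop itself mirrors the tutorial.
    public function allVideoUrls()
    {
        $urls = $this->videoUrls();

        while ($this->hasNextPage()) {
            $this->goToNextPage();
            $urls = array_merge($urls, $this->videoUrls());
        }

        return $urls;
    }
}

class DemoListing extends DemoTaxonomy
{
    // The only thing the child has to provide.
    protected function videoUrls()
    {
        return $this->pages[$this->current];
    }
}

$listing = new DemoListing([['video-1', 'video-2'], ['video-3']]);
$urls = $listing->allVideoUrls();

echo implode(', ', $urls) . PHP_EOL;
```

If you delete the videoUrls method from the child, PHP refuses to load the class with a fatal error saying it must either be declared abstract or implement the remaining methods. That is exactly the safety net we want.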

Let's update Pornstar.php file:

PHP:

<?php

namespace WebmasterHacks\Pornhub;

class Pornstar extends Taxonomy
{
    protected function videoUrls()
    {
        return $this->crawler->filter('.videos .videoblock .title a')->each(function($node) {
            return $node->link()->getUri();
        });
    }
}

That's it! Now it's time to create Category class:

PHP:

class Category extends Taxonomy
{
    protected function videoUrls()
    {
        return $this->crawler->filter('PROPER CSS SELECTOR')->each(function($node) {
            return $node->link()->getUri();
        });
    }
}

Cool. So what is the proper CSS selector? Open your browser's developer tools, inspect the HTML a little bit, and you will notice that we can use the same CSS selector as in the Pornstar class.

Well, maybe the tag selector will be different. So for now just copy and paste the selector.

Now we will update "parseCategories" method in Video class:

PHP:

protected function parseCategories()
{
    $categoryCollection = new CategoryCollection;

    $this->crawler->filter('.video-info-row:contains("Categories:") a:not(:contains("Suggest"))')->each(function(Crawler $node) use ($categoryCollection) {
        $categoryCollection->add(new Category($this->client, $node->link()->getUri()));
    });

    $this->categories = $categoryCollection;
}

Let's test our work:

PHP:

$video = new Video($client, 'dirtysite/view_video.php?viewkey=ph55f4113f77d67');

echo 'Pornstars:' . PHP_EOL;

$video->pornstars()->each(function(Pornstar $pornstar) {
    echo $pornstar->url() . PHP_EOL;
});

echo 'Categories:' . PHP_EOL;

$video->categories()->each(function(Category $category) {
    echo $category->url() . PHP_EOL;
});

Awesome

Now it's time to create the Tag class. Let's see which CSS selector we can use... same...

OK. Now we see that our videoUrls method will look the same in all 3 classes. We can move it to the parent class and forget about implementing it in the children.

So do the same stuff with Tag class as you did with Category class.
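Here is a rough sketch of what moving the method to the parent means for the design. Again the classes are hypothetical stand-ins (arrays instead of crawled pages, a public videos method that just returns the links) so you can run it on its own. The point is that the children become empty, but any child can still override the method if its page turns out to be different.

```php
<?php

// Hypothetical stand-in for Taxonomy: a "page" is just an array of links.
abstract class SketchTaxonomy
{
    protected $links;

    public function __construct(array $links)
    {
        $this->links = $links;
    }

    // No longer abstract: the shared implementation lives in the parent.
    protected function videoUrls()
    {
        return $this->links;
    }

    public function videos()
    {
        return $this->videoUrls();
    }
}

// Pages that share the same markup need no code of their own.
class SketchCategory extends SketchTaxonomy {}
class SketchTag extends SketchTaxonomy {}

// A page with extra links can still override the shared method.
class SketchMixedPage extends SketchTaxonomy
{
    protected function videoUrls()
    {
        // Keep only links that point to a video page.
        return array_values(array_filter(parent::videoUrls(), function ($link) {
            return strpos($link, 'view_video') !== false;
        }));
    }
}

$tag = new SketchTag(['dirtysite/view_video.php?viewkey=1337']);
$mixed = new SketchMixedPage(['dirtysite/view_video.php?viewkey=1337', 'dirtysite/some-other-page']);

echo count($tag->videos()) . PHP_EOL;
echo count($mixed->videos()) . PHP_EOL;
```

In the real Taxonomy class this would mean replacing the abstract declaration with the body that currently sits in Pornstar.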

And now we can run:

PHP:

$video = new Video($client, 'dirtysite/view_video.php?viewkey=ph55f4113f77d67');

echo 'Pornstars:' . PHP_EOL;

$video->pornstars()->each(function(Pornstar $pornstar) {
    echo $pornstar->url() . PHP_EOL;
});

echo 'Categories:' . PHP_EOL;

$video->categories()->each(function(Category $category) {
    echo $category->url() . PHP_EOL;
});

echo 'Tags:' . PHP_EOL;

$video->tags()->each(function(Tag $tag) {
    echo $tag->url() . PHP_EOL;
});

That's it.

Now we have a much cleaner API to use. Even if you don't know what is happening behind the scenes, you can guess.

Any questions?
