A small guide to distributed crawling

Discussion in 'Programming' started by pasdoy, Feb 1, 2018.

  1. pasdoy

    pasdoy Senior Member

    Joined:
    Jul 17, 2008
    Messages:
    864
    Likes Received:
    275
    I know crawling interests a lot of people, so I decided to post a small example of how to scale it and what tech to use. Most threads I see, or tools in the BST section, highlight multi-threading as a feature and are proud of crawling 50,000 pages/hour, depending on your internet connection. But what if you want to crawl millions of pages per hour and gather stats about links, anchors, expired domains, etc.?

    To achieve that kind of scale, we have to work differently. Note that this setup does not crawl JavaScript-based websites built with Angular or React. This is not a step-by-step setup guide, mostly an overall architecture guide.


    1. Technology used
    - Multiple VPS, we probably all know about this
    - S3, AWS Simple Storage Service
    - EMR, AWS's on-demand Hadoop MapReduce service https://en.wikipedia.org/wiki/MapReduce
    - RabbitMQ (or AWS SQS), message brokers https://en.wikipedia.org/wiki/Message_broker
    - SQLite, simple relational database
    - Python, Golang

    Why Go? Because I've been using it for a while and it has good concurrency primitives. There is no JVM, and the compiled binary runs on any architecture.
    Why Python? Because of the library used to handle the clusters and jobs. Also, Python is a simple, straightforward language.

    The MapReduce is written in Python. Workers are written in Go.


    2. How they interact with each other

    [S3 with crawled logs] -> [dispatcher on EMR] -> [RabbitMQ with links to crawl] -> [HTTP GET / parser workers] -> [Produce logs] -> [S3]

    - S3 with crawled logs

    All logs from workers go into an S3 bucket. The key is s3://bucket/year/month/day/. This key scheme lets us select the last n days easily. There are 2 types of logged lines: SeenURL and CrawledURL. A SeenURL line records that we fetched and parsed a URL. CrawledURL lines record the links discovered on the crawled pages. These are written directly to a text file, which is why you need a very fast log library. I had to change it twice to find the fastest one, as writing logs can be CPU or I/O intensive, especially in Go. We compress the logs with bzip2, which Hadoop can read and split easily. Don't use GZip: Hadoop can't split gzip files across mappers.

    SeenURL message example
    Code:
    {"msg":"SeenURL","OriginalURL":"https://www.zappos.com/independence-day-clothing-co/Wg-wHvEEJNwDhCXkB5Ik1CTiAgEL.zso?s=recentSalesStyle/desc/","ResultURL":"https://www.zappos.com/independence-day-clothing-co/Wg-wHvEEJNwDhCXkB9QkkiTiAgEL.zso?s=recentSalesStyle/desc/","HTTPStatusCode":301}
    {"msg":"SeenURL","OriginalURL":"https://www.zappos.com/independence-day-clothing-co-women/Wgi9FIQjzQTUJMABAeABAeICAwsYHA.zso","ResultURL":"https://www.zappos.com/independence-day-clothing-co-women/Wgi9FNQkhCPNBMABAeABAeICAwsYHA.zso","HTTPStatusCode":301}
    
    CrawledURL message example
    Code:
    {"msg":"CrawledURL","OnPage":"https://www.zappos.com/independence-day-clothing-co-men-shirts-tops/CKvXARDL1wFaCO8LxQXBI9QkegLkBIIBA9vuBMABAuICBQECCxgP.zso","OriginalLink":"/men-hats/COfWARCJ1wHAAQLiAgMBAhg.zso","CleanedLink":"https://www.zappos.com/men-hats/COfWARCJ1wHAAQLiAgMBAhg.zso","LinkRel":"","Anchor":"Hats"}
    
    Now that we have logs of seen URLs and newly discovered ones, we need to dispatch the next batch of links to crawl into the queue. At the speed we crawl, we end up with terabytes of logs to parse. This is why we need MapReduce.
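    As a reference, here is a minimal sketch of how a worker can emit these two line types with Zap (the logger we ended up using, see the notes further down). The field names match the examples above; the log file path and encoder setup are illustrative, not the exact worker code.
    Code:
    package main

    import (
        "os"

        "go.uber.org/zap"
        "go.uber.org/zap/zapcore"
    )

    func newCrawlLogger(path string) (*zap.Logger, error) {
        f, err := os.OpenFile(path, os.O_CREATE|os.O_APPEND|os.O_WRONLY, 0644)
        if err != nil {
            return nil, err
        }
        // Only MessageKey is set, so lines stay as compact as the examples above;
        // leaving the level/time keys empty drops those fields.
        enc := zapcore.NewJSONEncoder(zapcore.EncoderConfig{MessageKey: "msg"})
        return zap.New(zapcore.NewCore(enc, zapcore.AddSync(f), zapcore.InfoLevel)), nil
    }

    func main() {
        logger, err := newCrawlLogger("/var/log/crawler/crawl.log") // placeholder path
        if err != nil {
            panic(err)
        }
        defer logger.Sync()

        // One SeenURL line per fetched URL.
        logger.Info("SeenURL",
            zap.String("OriginalURL", "https://example.com/a"),
            zap.String("ResultURL", "https://example.com/b"),
            zap.Int("HTTPStatusCode", 301),
        )

        // One CrawledURL line per link discovered on a page.
        logger.Info("CrawledURL",
            zap.String("OnPage", "https://example.com/b"),
            zap.String("OriginalLink", "/hats"),
            zap.String("CleanedLink", "https://example.com/hats"),
            zap.String("LinkRel", ""),
            zap.String("Anchor", "Hats"),
        )
    }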


    - Dispatcher on EMR

    EMR is an AWS service that lets you spawn clusters to run Hadoop MapReduce jobs on demand. Combined with Spot instances, you get cheap compute power. EMR is now billed by the second instead of by the hour, so there is no need to pad your jobs to take a full hour.

    We use the mrjob library to handle everything, from creating the cluster to running the job and checking the output: https://github.com/Yelp/mrjob.

    The dispatcher makes sure we only queue links we haven't seen and are authorized to crawl. This is where SQLite comes into action. SQLite holds the robots.txt info we gather for each hostname. One VPS is dedicated to listening to the "robots" queue; it reads each hostname's robots.txt, if it exists, and saves it in the SQLite database. It uploads the SQLite database to S3 regularly so it can be used in the MapReduce. On every MapReduce run, the SQLite database is used to decide whether we can crawl specific URLs. If the hostname isn't in the database, we queue it so its robots.txt gets crawled. If it is, we queue the links we are allowed to crawl according to its robots.txt.

    Issues with the job:
    - The reducer uses the hostname as key. A hostname with millions of URLs takes much longer for its reducer process to work through, so the job can end up waiting on the last reducer while all the other processes sit idle.
    - Random connection issues with RabbitMQ would make the job fail.

    - RabbitMQ with links to crawl

    RabbitMQ is a message broker used to dispatch links among our VPS workers. If you have the budget, you can use SQS instead and skip the ops work entirely. The RabbitMQ approach requires an extra VPS with a good amount of RAM, 32-128 GB, to hold and dispatch our messages.

    We have 2 queues, robots and tocrawl.

    robots holds hostnames whose robots.txt needs to be fetched.
    tocrawl holds URLs that need to be crawled.
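    As a minimal sketch, declaring both queues with the streadway/amqp client looks like this; the broker URL and durability flags are assumptions, not the exact production setup.
    Code:
    package main

    import (
        "log"

        "github.com/streadway/amqp"
    )

    func main() {
        // Placeholder broker URL; point it at the RabbitMQ VPS.
        conn, err := amqp.Dial("amqp://guest:guest@rabbitmq-host:5672/")
        if err != nil {
            log.Fatal(err)
        }
        defer conn.Close()

        ch, err := conn.Channel()
        if err != nil {
            log.Fatal(err)
        }
        defer ch.Close()

        // Durable queues so messages survive a broker restart.
        for _, name := range []string{"robots", "tocrawl"} {
            if _, err := ch.QueueDeclare(name, true, false, false, false, nil); err != nil {
                log.Fatal(err)
            }
        }
    }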


    - HTTP GET / parser workers

    Robots.txt
    One VPS is dedicated to listening to the "robots" queue. It fetches each hostname's robots.txt and saves it in the SQLite database. It regularly uploads the SQLite database to S3 so the MapReduce can use it.
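    A rough sketch of that robots worker loop, assuming streadway/amqp, mattn/go-sqlite3 and one hostname per message; the table and column names here are placeholders (the real schema is in the SQLite gist linked at the end) and the periodic S3 upload is left out.
    Code:
    package main

    import (
        "database/sql"
        "io"
        "log"
        "net/http"

        _ "github.com/mattn/go-sqlite3"
        "github.com/streadway/amqp"
    )

    func main() {
        db, err := sql.Open("sqlite3", "robots.db")
        if err != nil {
            log.Fatal(err)
        }
        // Placeholder schema; the real one is in the SQLite schema gist below.
        db.Exec(`CREATE TABLE IF NOT EXISTS robots (hostname TEXT PRIMARY KEY, body TEXT)`)

        conn, err := amqp.Dial("amqp://guest:guest@rabbitmq-host:5672/") // placeholder broker URL
        if err != nil {
            log.Fatal(err)
        }
        ch, err := conn.Channel()
        if err != nil {
            log.Fatal(err)
        }
        msgs, err := ch.Consume("robots", "", false, false, false, false, nil)
        if err != nil {
            log.Fatal(err)
        }

        for d := range msgs {
            host := string(d.Body) // assuming one hostname per message
            body := ""
            if resp, err := http.Get("http://" + host + "/robots.txt"); err == nil {
                b, _ := io.ReadAll(resp.Body)
                resp.Body.Close()
                if resp.StatusCode == 200 {
                    body = string(b)
                }
            }
            // Store even empty bodies so the dispatcher knows the hostname was checked.
            db.Exec(`INSERT OR REPLACE INTO robots (hostname, body) VALUES (?, ?)`, host, body)
            d.Ack(false)
        }
    }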

    Crawlers
    N VPSes all run our worker executable. They listen to the "tocrawl" queue, crawl the URLs and write the logs. We use logrotate https://linux.die.net/man/8/logrotate to rotate and upload the logs every 250 MB. Note that we compress with bzip2 at level 6, which gives a good trade-off between compression speed and archive size. The logs are periodically shipped to S3 this way, which keeps the filesystem from filling up.
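    For illustration, a logrotate stanza along these lines does the job. The log path and S3 bucket/key are placeholders; the 250 MB threshold and bzip2 level 6 match what we use. Shipping the compressed archives could just as well be done from a small cron instead of postrotate.
    Code:
    # Placeholder log path; rotate every 250 MB and compress with bzip2 -6.
    /var/log/crawler/*.log {
        size 250M
        rotate 20
        missingok
        notifempty
        compress
        compresscmd /usr/bin/bzip2
        compressext .bz2
        compressoptions -6
        postrotate
            # Ship already-compressed archives to the crawl-log bucket (placeholder path).
            aws s3 mv /var/log/crawler/ s3://bucket/$(date +%Y/%m/%d)/ --recursive --exclude "*" --include "*.bz2"
        endscript
    }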

    Notes
    - We catch 3xx redirects and don't follow them, so the dispatcher can first check whether we have already crawled the target URL. It also gives good stats on where X is redirected.
    - We use Zap instead of Logrus for logging speed. Logrus was our bottleneck at first, thanks to its use of reflection.
    - We use the Googlebot user-agent to get better data.
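    As a sketch, the redirect and user-agent handling from the notes above looks roughly like this in Go; the fetched URL is a placeholder.
    Code:
    package main

    import (
        "fmt"
        "log"
        "net/http"
    )

    // Client that surfaces 3xx responses instead of following them,
    // so the redirect target goes back through the normal dispatch path.
    var client = &http.Client{
        CheckRedirect: func(req *http.Request, via []*http.Request) error {
            return http.ErrUseLastResponse
        },
    }

    func fetch(url string) (*http.Response, error) {
        req, err := http.NewRequest("GET", url, nil)
        if err != nil {
            return nil, err
        }
        // Googlebot user-agent, as mentioned in the notes.
        req.Header.Set("User-Agent", "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)")
        return client.Do(req)
    }

    func main() {
        resp, err := fetch("https://example.com/") // placeholder URL
        if err != nil {
            log.Fatal(err)
        }
        defer resp.Body.Close()
        // A 301/302 shows up here directly; the Location header holds the target.
        fmt.Println(resp.StatusCode, resp.Header.Get("Location"))
    }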


    - Get the machine started
    Once you have all your VPSes up, you need to insert some seed links into the "tocrawl" queue. It's best to use different domains. The workers will crawl those links and log the new ones they discover. On your first MapReduce run, all the robots.txt entries will be missing. Fear not, the MapReduce will send the hostnames to the "robots" queue. No new links will be queued on that run, since they were all missing a robots.txt entry, but on the next run they will be. At first you will find yourself running the job manually fairly often because there aren't a lot of links yet. Soon you will see 100 million+ queued URLs.
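    Seeding can be as simple as publishing a handful of URLs from different domains to the tocrawl queue. A minimal sketch with streadway/amqp, assuming one plain URL per message (the broker URL and seed list are placeholders):
    Code:
    package main

    import (
        "log"

        "github.com/streadway/amqp"
    )

    func main() {
        conn, err := amqp.Dial("amqp://guest:guest@rabbitmq-host:5672/") // placeholder broker URL
        if err != nil {
            log.Fatal(err)
        }
        defer conn.Close()

        ch, err := conn.Channel()
        if err != nil {
            log.Fatal(err)
        }
        defer ch.Close()

        // A few seeds from different domains work best; placeholder list.
        seeds := []string{
            "https://www.zappos.com/",
            "https://en.wikipedia.org/",
            "https://www.reddit.com/",
        }
        for _, u := range seeds {
            err := ch.Publish("", "tocrawl", false, false, amqp.Publishing{
                ContentType: "text/plain",
                Body:        []byte(u),
            })
            if err != nil {
                log.Fatal(err)
            }
        }
    }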


    3. More info.
    - I tested this in production for 2 weeks. I ran out of dollars to continue, since it was done only to post here.
    - I used AWS Lightsail instances. I created a base snapshot and would replicate it when needed. AWS throttled my CPU since I was consuming too much of it (Lightsail instances are burstable). Be vigilant.
    - RabbitMQ has a neat admin view; you should enable it to track message counts and find dead workers, if any.
    - The MapReduce was a cron job run every 12 hours. This could change depending on how long you crawl. It would run for anywhere from 5 minutes early on to 115 minutes by the end of the 2 weeks. I was using a 1 master and 10 slaves setup. You could speed this up with more slaves.
    - You could cache all pages in S3 if you have the need and money for it.
    - The SQLite database is also used to ban hostnames. When a hostname is banned, it is simply ignored during dispatching. Maybe you don't want to crawl Amazon or Google.
    - Some links are crawled twice because of the log upload delay. It's OK for now.
    - There's a lot of devops needed to run this. I used Datadog to track everything.


    4. What to do from here
    - Add a pacing system; it would avoid hitting the same website too quickly.
    - Fix a bug in anchor parsing when the anchor content is an HTML tag instead of plain text.

    5. Enough talking, let the code speak:

    - Golang Crawler Worker
    https://gist.github.com/pasdoy/66421b4147c7dd15f4d15d7ab0c6a31a

    - Golang Robots.txt crawler
    https://gist.github.com/pasdoy/a6c791b9ca10ad4254804c28d9612ecc

    - mrjob dispatch job
    https://gist.github.com/pasdoy/241720d0a2f68cc711820ba6f3ba5633

    - SQLite schema
    https://gist.github.com/pasdoy/ede350ca44430ef166e85703197fff39

    If you have related questions about distributed crawling, you can ask here. I tried to be concise; this could easily have ended up as a 5,000+ word essay.
     
    • Thanks Thanks x 9
  2. TheAlmightyDada

    TheAlmightyDada Jr. VIP Jr. VIP

    Joined:
    Jan 12, 2016
    Messages:
    362
    Likes Received:
    389
    Gender:
    Male
    Occupation:
    Business consultant
    Location:
    Sunny old England
    This looks pretty comprehensive, thanks a lot. Commenting now so I don't lose it as I'm on my phone.
     
  3. Cititechno

    Cititechno Jr. VIP Jr. VIP

    Joined:
    Oct 1, 2015
    Messages:
    183
    Likes Received:
    45
    Occupation:
    Potato
    Location:
    London, UK
    Home Page:
    awesome guide!
     
  4. pasdoy

    pasdoy Senior Member

    Joined:
    Jul 17, 2008
    Messages:
    864
    Likes Received:
    275
    Thanks, let me know if you have questions on a specific section.
     
  5. jamie3000

    jamie3000 Elite Member Premium Member

    Joined:
    Jun 30, 2014
    Messages:
    2,064
    Likes Received:
    953
    Occupation:
    Owner of BigGuestPosting.com
    Location:
    uk
    Home Page:
    Excellent share mate. This will go over the heads of 99% of the people on BHW though. Still, it's nice to see someone contributing real technical expertise and not just "methods".
     
    • Thanks Thanks x 1
  6. ScrapeboxWorker

    ScrapeboxWorker Regular Member

    Joined:
    Jul 23, 2012
    Messages:
    489
    Likes Received:
    278
    Home Page:
  7. pasdoy

    pasdoy Senior Member

    Joined:
    Jul 17, 2008
    Messages:
    864
    Likes Received:
    275
    Nice lib! Personally I am not a Docker fan when a plain exe can do the job, though for sure it can be handy for scaling easily. I didn't know about colly. For the scope of this project I preferred to handle the crawling myself to be sure it's optimized for my needs: speed and CPU. For the sake of scaling I don't want the crawler to check robots.txt by itself. I'll check if I can just give it a request body and have it extract all the links with anchors. Thanks for contributing.
     
  8. XoC--

    XoC-- Jr. VIP Jr. VIP

    Joined:
    Mar 5, 2009
    Messages:
    218
    Likes Received:
    116
    Cool tutorial, appreciate the time it would take to test this.

    How much did this cost you to run for 2 weeks? This must have cost quite a lot running on AWS.
     
  9. ScrapeboxWorker

    ScrapeboxWorker Regular Member

    Joined:
    Jul 23, 2012
    Messages:
    489
    Likes Received:
    278
    Home Page:
    Yes but they can be easier to deploy, check https://aws.amazon.com/getting-started/tutorials/deploy-docker-containers/
     
  10. pasdoy

    pasdoy Senior Member

    Joined:
    Jul 17, 2008
    Messages:
    864
    Likes Received:
    275
    Not so bad for 2 weeks: $120 of Lightsail, $50 of S3 and ~$250 of EMR using spot instances, which reduced the EMR cost by ~70%.
     
    • Thanks Thanks x 1