How to write a web crawler from scratch with Proxy support

Gazo

Newbie
Joined
Apr 12, 2018
Messages
11
Reaction score
22
Overview

Most Python web crawling/scraping tutorials use some kind of crawling library. This is great if you want to get things done quickly, but if you don't understand how scraping works under the hood, it will be difficult to know how to fix problems when they arise.

In this tutorial I will be going over how to write a web crawler completely from scratch in Python using only the Python Standard Library and the requests module (https://pypi.org/project/requests/2.7.0/). I will also be going over how you can use a proxy API (https://proxyorbit.com) to prevent your crawler from getting blacklisted.

This is mainly for educational purposes, but with a little attention and care this crawler can become as robust and useful as any scraper written using a library. Not only that, but it will most likely be lighter and more portable as well.

I am going to assume that you have a basic understanding of Python and programming in general. Understanding of how HTTP requests work and how Regular Expressions work will be needed to fully understand the code. I won't be going into deep detail on the implementation of each individual function. Instead, I will give high level overviews of how the code samples work and why certain things work the way they do.

The crawler that we'll be making in this tutorial will have the goal of "indexing the internet" similar to the way Google's crawlers work. Obviously we won't be able to index the internet, but the idea is that this crawler will follow links all over the internet and save those links somewhere as well as some information on the page.

Start Small

The first task is to set the groundwork of our scraper. We're going to use a class to house all our functions. We'll also need the re and requests modules, so we'll import them.

Code:
import requests
import re

class PyCrawler(object):
    def __init__(self, starting_url):
        self.starting_url = starting_url
        self.visited = set()

    def start(self):
        pass

if __name__ == "__main__":
    crawler = PyCrawler("https://google.com")
    crawler.start()

You can see that this is very simple to start. It's important to build these kinds of things incrementally. Code a little, test a little, etc.

We have two instance variables that will help us in our crawling endeavors later.

Code:
starting_url


Is the initial URL that our crawler will start from.

Code:
visited

This keeps track of the URLs we have already visited so we don't visit the same URL twice. Using a set() keeps visited URL lookups at O(1) time, making them very fast.
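
As a quick illustration (the URLs below are just placeholder values), duplicate URLs fall out naturally when you store them in a set:

Code:
visited = set()

for url in ["https://example.com", "https://example.com/about", "https://example.com"]:
    if url in visited:   # O(1) membership check
        continue         # skip URLs we've already seen
    visited.add(url)

print(visited)  # the duplicate URL only appears once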

Crawl Sites

Now we will get started actually writing the crawler. The code below will make a request to the starting_url and extract all links on the page. Then it will iterate over those new links and gather new links from the new pages. It will continue this recursive process until every link reachable from the starting point has been scraped. Some websites don't link outside of themselves, so those sites will be exhausted sooner than sites that do link out.

Code:
import requests
import re
from urllib.parse import urlparse

class PyCrawler(object):
    def __init__(self, starting_url):
        self.starting_url = starting_url
        self.visited = set()

    def get_html(self, url):
        # Fetch the raw HTML for a URL, returning an empty string on failure
        try:
            html = requests.get(url)
        except Exception as e:
            print(e)
            return ""
        return html.content.decode('latin-1')

    def get_links(self, url):
        # Pull every href out of the page and turn relative links into absolute ones
        html = self.get_html(url)
        parsed = urlparse(url)
        base = f"{parsed.scheme}://{parsed.netloc}"
        links = re.findall(r'''<a\s+(?:[^>]*?\s+)?href="([^"]*)"''', html)
        for i, link in enumerate(links):
            if not urlparse(link).netloc:
                link_with_base = base + link
                links[i] = link_with_base

        return set(filter(lambda x: 'mailto' not in x, links))

    def extract_info(self, url):
        # Placeholder for now; we'll pull page data out of the HTML later
        html = self.get_html(url)
        return None

    def crawl(self, url):
        # Recursively follow every link we haven't visited yet
        for link in self.get_links(url):
            if link in self.visited:
                continue
            print(link)
            self.visited.add(link)
            info = self.extract_info(link)
            self.crawl(link)

    def start(self):
        self.crawl(self.starting_url)

if __name__ == "__main__":
    crawler = PyCrawler("https://google.com")
    crawler.start()

As we can see, a fair bit of new code has been added.

To start, the get_html, get_links, crawl, and extract_info methods were added.

Code:
get_html()

Is used to get the HTML at the current link.

Code:
get_links()

Extracts links from the current page.

Code:
extract_info()

Will be used to extract specific info on the page.

the
Code:
crawl()
function has also been added, and it is probably the most important and complicated piece of this code. "crawl" works recursively. It starts at the starting_url, extracts links from that page, iterates over those links, and then feeds the links back into itself recursively.

If you think of the web like a series of doors and rooms, then essentially what this code is doing is looking for those doors and walking through them until it gets to a room with no doors. When this happens it works its way back to a room that has unexplored doors and enters that one. It does this forever until all doors accessible from the starting location have been accessed. This kind of process lends itself very nicely to recursive code.

If you run this script now as is, it will explore and print all the new URLs it finds, starting from google.com.

Extract Content

Now we will extract data from the pages. What this method (extract_info) does is largely based on what you are trying to do with your scraper. For the sake of this tutorial, all we are going to do is extract meta tag information if we can find it on the page.

Code:
import requests
import re
from urllib.parse import urlparse

class PyCrawler(object):
    def __init__(self, starting_url):
        self.starting_url = starting_url
        self.visited = set()

    def get_html(self, url):
        try:
            html = requests.get(url)
        except Exception as e:
            print(e)
            return ""
        return html.content.decode('latin-1')

    def get_links(self, url):
        html = self.get_html(url)
        parsed = urlparse(url)
        base = f"{parsed.scheme}://{parsed.netloc}"
        links = re.findall(r'''<a\s+(?:[^>]*?\s+)?href="([^"]*)"''', html)
        for i, link in enumerate(links):
            if not urlparse(link).netloc:
                link_with_base = base + link
                links[i] = link_with_base

        return set(filter(lambda x: 'mailto' not in x, links))

    def extract_info(self, url):
        html = self.get_html(url)
        meta = re.findall(r"""<meta .*?name=["'](.*?)['"].*?content=["'](.*?)['"].*?>""", html)
        return dict(meta)

    def crawl(self, url):
        for link in self.get_links(url):
            if link in self.visited:
                continue
            self.visited.add(link)
            info = self.extract_info(link)

            print(f"""Link: {link}
Description: {info.get('description')}
Keywords: {info.get('keywords')}
            """)

            self.crawl(link)

    def start(self):
        self.crawl(self.starting_url)

if __name__ == "__main__":
    crawler = PyCrawler("https://google.com")
    crawler.start()

Not much has changed here besides the new print formatting and the extract_info method.

The magic here is in the regular expression in the extract_info method. It searches the HTML for all meta tags that follow the format <meta name=X content=Y> and returns a Python dictionary of the format {X: Y}.

This information is then printed to the screen for every URL that is crawled.
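
If you want to see the regular expression in isolation, here is a quick sketch run against a made-up HTML snippet (the tags below are just example values):

Code:
import re

# Made-up HTML snippet used only to demonstrate the meta tag regular expression
html = '''
<meta name="description" content="A page about web crawlers">
<meta name="keywords" content="python, crawler, scraping">
'''

meta = re.findall(r"""<meta .*?name=["'](.*?)['"].*?content=["'](.*?)['"].*?>""", html)
print(dict(meta))
# {'description': 'A page about web crawlers', 'keywords': 'python, crawler, scraping'}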

Integrate Rotating Proxy API

One of the main problems with web crawling and web scraping is that sites will ban you if you make too many requests, don't use an acceptable user agent, and so on. One way to limit this is by using proxies and setting a different user agent for the crawler. Normally the proxy approach requires you to go out and purchase, or manually source, a list of proxies from somewhere else. A lot of the time these proxies don't even work or are incredibly slow, making web crawling much more difficult.

To avoid this problem we are going to be using what is called a "rotating proxy API". A rotating proxy API is an API that takes care of managing the proxies for us. All we have to do is make a request to their API endpoint and, boom, we get a new working proxy for our crawler. Integrating the service into our crawler will require no more than a few extra lines of Python.

The service we will be using is Proxy Orbit (https://proxyorbit.com). Full disclosure, I do own and run Proxy Orbit. The service specializes in creating proxy solutions for web crawling applications. The proxies are checked continually to make sure that only the best working proxies are in the pool.

Code:
import requests
import re
from urllib.parse import urlparse
import os

class PyCrawler(object):
    def __init__(self, starting_url):
        self.starting_url = starting_url
        self.visited = set()
        self.proxy_orbit_key = os.getenv("PROXY_ORBIT_TOKEN")
        self.user_agent = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"
        self.proxy_orbit_url = f"https://api.proxyorbit.com/v1/?token={self.proxy_orbit_key}&ssl=true&rtt=0.3&protocols=http&lastChecked=30"

    def get_html(self, url):
        try:
            # Grab a fresh proxy from Proxy Orbit, then fetch the page through it
            proxy_info = requests.get(self.proxy_orbit_url).json()
            proxy = proxy_info['curl']
            html = requests.get(url, headers={"User-Agent": self.user_agent}, proxies={"http": proxy, "https": proxy}, timeout=5)
        except Exception as e:
            print(e)
            return ""
        return html.content.decode('latin-1')

    def get_links(self, url):
        html = self.get_html(url)
        parsed = urlparse(url)
        base = f"{parsed.scheme}://{parsed.netloc}"
        links = re.findall(r'''<a\s+(?:[^>]*?\s+)?href="([^"]*)"''', html)
        for i, link in enumerate(links):
            if not urlparse(link).netloc:
                link_with_base = base + link
                links[i] = link_with_base

        return set(filter(lambda x: 'mailto' not in x, links))

    def extract_info(self, url):
        html = self.get_html(url)
        meta = re.findall(r"""<meta .*?name=["'](.*?)['"].*?content=["'](.*?)['"].*?>""", html)
        return dict(meta)

    def crawl(self, url):
        for link in self.get_links(url):
            if link in self.visited:
                continue
            self.visited.add(link)
            info = self.extract_info(link)

            print(f"""Link: {link}
Description: {info.get('description')}
Keywords: {info.get('keywords')}
            """)

            self.crawl(link)

    def start(self):
        self.crawl(self.starting_url)

if __name__ == "__main__":
    crawler = PyCrawler("https://google.com/")
    crawler.start()

As you can see, not much has really changed here. Three new instance variables were created: proxy_orbit_key, user_agent, and proxy_orbit_url.

proxy_orbit_key gets the Proxy Orbit API token from an environment variable named "PROXY_ORBIT_TOKEN".

user_agent sets the User-Agent of the crawler to a Chrome User-Agent string so requests look like they are coming from a real browser.

proxy_orbit_url is the Proxy Orbit API endpoint that we will be hitting. We will be filtering our results, requesting only HTTP proxies that support SSL and have been checked in the last 30 minutes.

In get_html, a new HTTP request is made to the Proxy Orbit API URL to get a random proxy, which is then passed into the requests call so the URL we are trying to crawl is fetched from behind a proxy.
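
If you want to confirm that your token works before running the full crawler, a quick standalone check (assuming the same endpoint and 'curl' response field used in get_html above) looks like this:

Code:
import os
import requests

# Standalone sanity check for the Proxy Orbit token.
# Assumes the same endpoint and response shape used in get_html above.
token = os.getenv("PROXY_ORBIT_TOKEN")
url = f"https://api.proxyorbit.com/v1/?token={token}&ssl=true&rtt=0.3&protocols=http&lastChecked=30"

proxy_info = requests.get(url).json()
print(proxy_info['curl'])  # the proxy value the crawler will pass to requests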

If all goes well then that's it! We should now have a real working web crawler that pulls data from web pages and supports rotating proxies.

If you have any questions, feel free to either comment below or send me a PM.
 

PIEGE

Newbie
Joined
Jul 4, 2019
Messages
35
Reaction score
7
Pay attention. Some nice work being done here.
 

soulSEO

Regular Member
Joined
May 17, 2019
Messages
403
Reaction score
140
First off, thanks for the nice tutorial!

I want to make a website that takes content from another site (not mine) and uploads it automatically every day.

Will this script work for my purpose?
 

nextlvlig

Regular Member
Joined
Jun 24, 2018
Messages
452
Reaction score
156
When I try to go to the proxy_orbit_url link, it says the token is invalid. Is this supposed to happen?
 

zaogord

BANNED
Joined
Jul 18, 2019
Messages
195
Reaction score
164
First off, thanks for the nice tutorial!

I want to make a website that takes content from another site (not mine) and uploads it automatically every day.

Will this script work for my purpose?

https://www.reddit.com/r/learnpython/comments/2mroka/easiest_way_to_get_python_script_to_run_on_webpage/

https://www.youtube.com/watch?v=ERMRVORGvZM
 

Gazo

Newbie
Joined
Apr 12, 2018
Messages
11
Reaction score
22
First off, thanks for the nice tutorial!

I want to make a website that takes content from another site (not mine) and uploads it automatically every day.

Will this script work for my purpose?

With a bit of tweaking this can do that, but not out of the box.
 

Gogol

Jr Vip
Jr. VIP
Joined
Sep 10, 2010
Messages
8,282
Reaction score
11,842
Website
LINKS-THAT-RANKS.shop
Good read. Bookmarked for later. Nice to see someone explaining stuff with complexity (i.e. big O). That is pretty important for scaling it up.

I'm sure I can pick up something from here.

Quick question (sorry, I didn't read past the first paragraph yet, you might already have described it): do you know of a way to use requests to create a JS-enabled crawler? Of course I can use Selenium, but that takes way more memory than requests does.
 

Gazo

Newbie
Joined
Apr 12, 2018
Messages
11
Reaction score
22
Good read. Bookmarked for later. Nice to see someone explaining stuff with complexity (i.e. big O). That is pretty important for scaling it up.

I'm sure I can pick up something from here.

Quick question (sorry, I didn't read past the first paragraph yet, you might already have described it): do you know of a way to use requests to create a JS-enabled crawler? Of course I can use Selenium, but that takes way more memory than requests does.

I did not describe it in the post, so good question. What I normally do in these cases is use the PhantomJS web driver in Selenium. PhantomJS is a lightweight headless browser that renders Javascript, so you get the best of both worlds: a lightweight scraper and the power of Selenium.
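
For reference, a minimal sketch of that approach (assuming Selenium 3.x or earlier, where the PhantomJS driver is still available, and that the phantomjs executable is on your PATH):

Code:
from selenium import webdriver

# Load a page with PhantomJS so JavaScript runs before we read the HTML
driver = webdriver.PhantomJS()
driver.get("https://google.com")
html = driver.page_source  # HTML after JavaScript has executed
driver.quit()

You could then feed that html into the same regex-based link extraction used above.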
 

Gazo

Newbie
Joined
Apr 12, 2018
Messages
11
Reaction score
22
Isn't crawling the web illegal?

It's not illegal. There is a protocol called robots.txt that literally tells your crawler which parts of a website it is allowed to crawl. Even Blackhat World has a robots.txt file (https://www.blackhatworld.com/robots.txt).

It only becomes illegal when you STEAL information from other websites. For example, if a website is hosting premium content and you make a crawler that scrapes that premium content and puts it on your website to be sold without permission, then that is illegal. But the act of scraping the web itself is not illegal. In some ways it is even encouraged, since it is how sites get ranked within search engines.
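
If you want your crawler to respect robots.txt, the standard library's urllib.robotparser makes the check a few lines. Here is a minimal sketch (the BHW URL is just the example mentioned above):

Code:
from urllib.robotparser import RobotFileParser

# Check a site's robots.txt before crawling one of its URLs
rp = RobotFileParser()
rp.set_url("https://www.blackhatworld.com/robots.txt")
rp.read()

print(rp.can_fetch("*", "https://www.blackhatworld.com/forums/"))  # True or False

can_fetch returns False for any path the site's robots.txt disallows for your user agent, so you can skip those links in the crawl loop.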
 

Gogol

Jr Vip
Jr. VIP
Joined
Sep 10, 2010
Messages
8,282
Reaction score
11,842
Website
LINKS-THAT-RANKS.shop
I did not describe it in the post, so good question. What I normally do in these cases is use the PhantomJS web driver in Selenium. PhantomJS is a lightweight headless browser that renders Javascript, so you get the best of both worlds: a lightweight scraper and the power of Selenium.
Gotta try it, really. I always use either the Chrome driver (if crawling) or the Firefox driver (if unit testing... I also use the Chrome driver in this case). Never really used PhantomJS yet. Thanks for the tip. :)
 