Amazon Scraper Issues

Pecola · Aug 23, 2016

Hi guys,

I built an Amazon scraper that scrapes the product pages of the top-selling products in a category of my choice (e.g., Baby Products), including the top-selling products in its subcategories.

However, Amazon keeps sending me CAPTCHAs almost as soon as I fire up the scraper (within the first 3-5 pages). I've taken what I believe are the necessary steps to avoid detection: I route requests through Tor to mask my IP and request a new Tor exit node after every page, which effectively rotates the IP address. I've also throttled the scraper to one page every 20-40 seconds (the delay is re-randomized for each page), and I've randomized the order in which pages are scraped to a certain degree.
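For reference, the pacing described above (a random 20-40 second wait per page and a shuffled page order) might be sketched roughly as follows. This is only an illustration, not the poster's actual code: the helper names `shuffled_pages` and `next_delay` and the `seed` parameter are hypothetical.

```python
import random


def shuffled_pages(urls, seed=None):
    """Return a shuffled copy of the URL list; the input list is left untouched."""
    rng = random.Random(seed)  # a seed is only useful for reproducible testing
    pages = list(urls)
    rng.shuffle(pages)
    return pages


def next_delay(low=20.0, high=40.0):
    """Pick a wait time in seconds, uniformly distributed between low and high."""
    return random.uniform(low, high)
```

In a plain loop, this would be used by calling `time.sleep(next_delay())` between requests while iterating over `shuffled_pages(urls)`.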

I've double-checked that my requests are being routed through Tor, and that I do get a new exit node whenever I request one. I'm at a loss as to how Amazon is able to detect my scraper and send CAPTCHAs so quickly. I run the scraper a few times a week and keep seeing the same results. The only explanation I can think of is that Amazon has figured out the general pattern in which my scraper visits pages. I've somewhat randomized the scraping order, but it could still be improved.

Does anyone have any other suggestions or advice? This is driving me nuts!

CialisBilligInternet · Aug 23, 2016

Cookies?

Pecola · Aug 23, 2016

CialisBilligInternet said:
Cookies?

I'm using the Scrapy library for Python, and I've explicitly disabled cookies in my spider's settings.
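As a rough sketch of what that configuration could look like (the values are assumptions, not the poster's actual settings), Scrapy exposes `COOKIES_ENABLED`, `DOWNLOAD_DELAY`, and `RANDOMIZE_DOWNLOAD_DELAY` as settings. The dictionary name `CUSTOM_SETTINGS` is hypothetical; in a real project the dictionary would typically be assigned to a spider's `custom_settings` class attribute, or the keys placed in `settings.py`.

```python
# Hypothetical settings dictionary; the keys are standard Scrapy setting names.
CUSTOM_SETTINGS = {
    # Do not store or send cookies between requests.
    "COOKIES_ENABLED": False,
    # Base delay in seconds between requests to the same site (assumed value).
    "DOWNLOAD_DELAY": 30,
    # Scrapy then waits a random 0.5x to 1.5x of DOWNLOAD_DELAY per request.
    "RANDOMIZE_DOWNLOAD_DELAY": True,
}
```

Note that with a base delay of 30, Scrapy's built-in randomization yields waits of roughly 15-45 seconds, which is close to, but not the same as, the 20-40 second range mentioned earlier in the thread.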
