Amazon Scraper Issues

Discussion in 'General Programming Chat' started by Pecola, Aug 23, 2016.

  1. Pecola

    Pecola Newbie

    Joined:
    Aug 23, 2016
    Messages:
    11
    Likes Received:
    0
    Gender:
    Male
    Hi guys,

    I created an Amazon scraper to scrape the product pages of the top-selling Amazon products in a category of my choice (e.g., Baby Products), including the top-selling products in its subcategories.

    However, I've been running into an issue where Amazon keeps sending me CAPTCHAs almost as soon as I fire up my scraper (within the first 3-5 pages I scrape). I've taken what I believe to be the necessary steps to avoid detection: I use Tor as a proxy to mask my IP, and I request a new Tor exit node after every page I scrape, which effectively rotates the IP address. I've also throttled my scraper to one page every 20-40 seconds (the actual delay is randomized for each page), and I've randomized the order in which pages are scraped to a certain degree.

    I've double-checked that my requests are being routed through Tor, and that I'm indeed getting a new Tor exit node when I request one. I'm at a loss as to how Amazon is able to detect my scraper and send CAPTCHAs so quickly. I run the scraper a few times a week and I keep seeing the same results. The only explanation I can think of is that Amazon has figured out the general pattern in which my scraper requests pages. Although I've partly randomized the scraping order, it could still be improved.

    Does anyone have any other suggestions or advice? This is driving me nuts!
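
For readers following along, the randomized delay and page ordering described in this post could look roughly like the sketch below. It is only an illustration under stated assumptions: `plan_crawl`, its parameters, and the `example.com` URLs are hypothetical names, not the original poster's code, and the sketch only plans the crawl (it does not send requests or talk to Tor).

```python
import random


def plan_crawl(urls, min_delay=20.0, max_delay=40.0, seed=None):
    """Return (url, delay_seconds) pairs in a shuffled order.

    Hypothetical helper: each URL gets its own delay drawn uniformly
    from [min_delay, max_delay], matching the 20-40 second range
    mentioned in the post.
    """
    rng = random.Random(seed)
    shuffled = list(urls)      # copy so the caller's list is not mutated
    rng.shuffle(shuffled)      # randomize the visiting order
    return [(url, rng.uniform(min_delay, max_delay)) for url in shuffled]


# Placeholder URLs, used only for illustration.
pages = [f"https://example.com/page/{i}" for i in range(5)]
plan = plan_crawl(pages, seed=42)

for url, delay in plan:
    print(f"wait {delay:.1f}s, then fetch {url}")
```

A real spider would sleep for each delay (or let the framework handle throttling) before issuing the request.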
     
  2. CialisBilligInternet

    CialisBilligInternet BANNED

    Joined:
    Mar 12, 2009
    Messages:
    201
    Likes Received:
    40
    Gender:
    Male
    Cookies?
     
  3. Pecola

    Pecola Newbie

    Joined:
    Aug 23, 2016
    Messages:
    11
    Likes Received:
    0
    Gender:
    Male
    I'm using the Scrapy library for Python, and I've explicitly disabled cookies in my spider's settings.
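
For reference, disabling cookies in a Scrapy project is typically done in the project's `settings.py`. The fragment below is a sketch of what such settings might look like, not the poster's actual configuration; the delay value is an assumption chosen to approximate the 20-40 second range mentioned earlier in the thread.

```python
# settings.py (sketch of a Scrapy project settings file; values are assumptions)

# Stop Scrapy's cookies middleware from storing and resending cookies.
COOKIES_ENABLED = False

# Base delay in seconds between requests to the same site.
DOWNLOAD_DELAY = 30

# With this enabled (Scrapy's default), the actual wait is a random value
# between 0.5 * DOWNLOAD_DELAY and 1.5 * DOWNLOAD_DELAY, i.e. 15-45 s here.
RANDOMIZE_DOWNLOAD_DELAY = True
```

Note that Scrapy's built-in randomization spans 0.5x to 1.5x of the base delay, so an exact 20-40 second window would need custom logic.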