Trying to scrape amazon

maxibaby

Hey guys, I'm trying to scrape Amazon products.
Python:
Requests + Beautifulsoup4

For example...

amazon /dp/B00OZQZUJ6?psc=1
// Can't post the link

The problem comes right when we want to scrape the description:
the frame is populated directly by JS instead of having its own source.

Does anyone know how I could get that?
Or at least, what should I emulate to get the info?
How can I find out which JS is populating it?

Thanks
 
Either study the code and reconstruct what it does client-side, or use a full browser (like PhantomJS).
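For the "full browser" route, here is a minimal sketch assuming Selenium with the PhantomJS driver (the old pre-Selenium-4 API) and the product URL from this thread; the #productDescription selector is an assumption about the page layout:

Python:
# Minimal sketch: render the page with PhantomJS via Selenium so the page's
# JavaScript actually runs, then read the JS-populated description.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.PhantomJS()
browser.get('https://www.amazon.com/dp/B00OZQZUJ6?psc=1')

# wait until the description frame has been populated before reading it
description = WebDriverWait(browser, 10).until(
    EC.presence_of_element_located((By.ID, 'productDescription'))
)
print(description.text)

browser.quit()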
 
Hello,

Any reading material you would recommend?
 
Just read up on it, and I've already installed it and am trying some things.
Thanks

Just want to ask.
From what I understood, the main problem with using requests is that, since it's not a real browser, it doesn't execute the page's JavaScript.
PhantomJS, on the other hand, does, so the newly generated content should be there.

And it actually is, since I just took a screenshot with Selenium + PhantomJS. Thanks!
 
I have another question:

I already had some things built with BeautifulSoup + requests.
So basically... if I get the content now with PhantomJS,
my BeautifulSoup code should work the same... but it doesn't(?)
None of the selectors I had are working. But if I fetch the site with requests, they do.

How is that possible?

Actually, it looks like BeautifulSoup is only searching in the head, not the complete body?

Or should I be using PhantomJS for the scraping as well:
title = browser.find_element_by_css_selector('#productTitle').text
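One way to keep the existing BeautifulSoup code is to hand it the rendered DOM instead of the raw requests response. A minimal sketch, assuming Selenium + PhantomJS and the #productTitle selector used above:

Python:
from bs4 import BeautifulSoup
from selenium import webdriver

browser = webdriver.PhantomJS()
browser.get('https://www.amazon.com/dp/B00OZQZUJ6?psc=1')

# page_source is the DOM after the JavaScript has run, not the original HTML,
# so the JS-generated parts are there for BeautifulSoup to find
soup = BeautifulSoup(browser.page_source, 'html.parser')

title = soup.select_one('#productTitle')
print(title.get_text(strip=True) if title else 'title not found')

browser.quit()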
 
Well, I made it work with Selenium selectors.
But I don't understand why BeautifulSoup wasn't working anymore and was only finding things in the head.
Maybe I'm missing some theory. Do you know?
 
The product description is inside the page source.
Try a regex like /var iframeContent = "(.*?)"/ to get it,
then URL-decode that.
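A minimal sketch of that regex approach with requests, assuming the page really embeds the description in a var iframeContent = "..." variable as the post says (the variable name comes from the post, not verified here):

Python:
import re
import urllib.parse

import requests

html = requests.get(
    'https://www.amazon.com/dp/B00OZQZUJ6?psc=1',
    headers={'User-Agent': 'Mozilla/5.0'},
).text

match = re.search(r'var iframeContent = "(.*?)"', html)
if match:
    # the captured value is URL-encoded HTML, so decode it before using it
    description_html = urllib.parse.unquote(match.group(1))
    print(description_html)
else:
    print('iframeContent not found in the page source')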
 
It wasn't even matching anything outside the head.
But whatever.

I made it work; if anyone is interested in the code, send me a PM.

You can give it a search link and it creates a new folder for each item, downloads all the images, and saves the product description, title and features in an HTML file.

It's only half working, since it looks like Amazon has two types of product pages with different classes, so I just have to modify it to handle both.
 
Whenever you're using BeautifulSoup and want to get content that is generated by JavaScript, just use a headless browser such as PhantomJS.
 
IIRC Amazon used to offer an XML API you could use. Scraping their site was unnecessary once that became available. Dunno if it still exists though.
 
Use html5lib for scraping via BeautifulSoup.

Also use Selenium's find elements by XPath,
and check what you actually want to grab or click by cross-checking the element's .text and .get_attribute().
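A minimal sketch of both suggestions, assuming Selenium + PhantomJS (old find_element_by_* API) and that html5lib is installed; #productTitle is just an example target:

Python:
from bs4 import BeautifulSoup
from selenium import webdriver

browser = webdriver.PhantomJS()
browser.get('https://www.amazon.com/dp/B00OZQZUJ6?psc=1')

# html5lib parses the markup the way a browser would, which is more forgiving
# with messy HTML than the default parser
soup = BeautifulSoup(browser.page_source, 'html5lib')
title = soup.select_one('#productTitle')
print(title.get_text(strip=True) if title else 'not found')

# the same element through Selenium by XPath, cross-checking .text and an attribute
element = browser.find_element_by_xpath('//span[@id="productTitle"]')
print(element.text, element.get_attribute('id'))

browser.quit()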
 
Scraping Amazon is a little difficult; they have a lot of anti-scraping measures. We figured out a way around it using proxy rotators. Maybe our developer can give you some consulting on the project if you're interested.

Send a PM if interested
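A minimal sketch of the proxy-rotation idea with requests; the proxy addresses below are placeholders, not real endpoints:

Python:
import itertools

import requests

# placeholder proxies; swap in real rotating-proxy endpoints
PROXIES = itertools.cycle([
    'http://proxy1.example.com:8080',
    'http://proxy2.example.com:8080',
    'http://proxy3.example.com:8080',
])

def fetch(url):
    # each request goes out through the next proxy in the rotation,
    # so repeated hits don't all come from one IP
    proxy = next(PROXIES)
    return requests.get(
        url,
        proxies={'http': proxy, 'https': proxy},
        headers={'User-Agent': 'Mozilla/5.0'},
        timeout=10,
    )

response = fetch('https://www.amazon.com/dp/B00OZQZUJ6?psc=1')
print(response.status_code)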
 
Boom 5 minutes with NodeJS, time to switch:

Code:
const osmosis = require('osmosis');
const header = {
  "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
  "accept-encoding": "br;q=1.0, gzip;q=0.8, *;q=0.1",
  "accept-language": "en,en-US,en-CA;q=0.9",
  "cache-control": "public,stale-while-revalidate,must-revalidate, max-age=86400",
  "content-security-policy": "block-all-mixed-content; object-src 'none'; worker-src 'none'; img-src https:;default-src https:",
  "referer": "https://www.amazon.com",
  "referrer-policy": "origin",
  "strict-transport-security": "max-age=31536000; includeSubDomains; preload;",
  "upgrade-insecure-requests": "1",
  "user-agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36",
  "x-content-type-options": "nosniff",
  "x-dns-prefetch-control": "on",
  "x-frame-options": "SAMEORIGIN",
  "x-xss-protection": "1"
};

product_description('https://www.amazon.com/dp/B00OZQZUJ6?psc=1');


function product_description(url) {
  osmosis
  .get(url)
  .config('headers', header)
  .click('#productDescription_feature_div > h2')
  .set({
    'description': '#productDescription p',
  })
  .data(function(data) {
    console.log(`Product description: \n\n\n${data.description}`);
  })
}
 
Can osmosis execute the page's JavaScript?
 