Trying to scrape amazon

maxibaby

Hey guys, I'm trying to scrape Amazon products.
Python:
Requests + Beautifulsoup4

For example...

amazon /dp/B00OZQZUJ6?psc=1
// Can't post the link

The problem comes right when we want to scrape the description:
the frame is populated directly by JS instead of having its own source.

Does anyone know how I could get that?
Or at least, what should I emulate to get the info?
How can I find out which JS is populating it?

Thanks
 
Either study the code and reconstruct what it does client-side, or use a full browser (like PhantomJS).
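For the "full browser" route, here is a minimal sketch assuming Selenium with the PhantomJS driver (the old pre-Selenium-4 API) and the product URL from this thread; the #productDescription selector is an assumption about the page layout:

Python:
# Minimal sketch: render the page with PhantomJS via Selenium so the page's
# JavaScript actually runs, then read the JS-populated description.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.PhantomJS()
browser.get('https://www.amazon.com/dp/B00OZQZUJ6?psc=1')

# wait until the description frame has been populated before reading it
description = WebDriverWait(browser, 10).until(
    EC.presence_of_element_located((By.ID, 'productDescription'))
)
print(description.text)

browser.quit()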
 
Hello,

Any reading material you would recommend?
 
Just read up on it, and I've already installed it and am trying some things.
Thanks

Just want to ask.
From what I understood, the main problem with using requests is that, since it's not a real browser, it doesn't execute the page's JavaScript.
PhantomJS, on the other hand, does, so the newly generated content should be there.

And it actually is, since I just took a screenshot with Selenium + PhantomJS. Thanks!
 
I have another question:

I already had some things built with BeautifulSoup + requests.
So basically... if I get the content now with PhantomJS,
my BeautifulSoup code should work the same... but it doesn't(?)
None of the selectors I had are working. But if I fetch the site with requests, they do.

How is that possible?

Actually, it looks like BeautifulSoup is only searching in the head, not the complete body?

Or should I be using PhantomJS for the scraping as well:
title = browser.find_element_by_css_selector('#productTitle').text
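One way to keep the existing BeautifulSoup code is to hand it the rendered DOM instead of the raw requests response. A minimal sketch, assuming Selenium + PhantomJS and the #productTitle selector used above:

Python:
from bs4 import BeautifulSoup
from selenium import webdriver

browser = webdriver.PhantomJS()
browser.get('https://www.amazon.com/dp/B00OZQZUJ6?psc=1')

# page_source is the DOM after the JavaScript has run, not the original HTML,
# so the JS-generated parts are there for BeautifulSoup to find
soup = BeautifulSoup(browser.page_source, 'html.parser')

title = soup.select_one('#productTitle')
print(title.get_text(strip=True) if title else 'title not found')

browser.quit()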
 
Well, I made it work with Selenium selectors.
But I don't understand why BeautifulSoup wasn't working anymore and was only finding things in the head.
Maybe I'm missing some theory. Do you know?
 
The product description is inside the page source.
Try a regex like /var iframeContent = "(.*?)"/ to get it,
then URL-decode that.
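A minimal sketch of that regex approach with requests, assuming the page really embeds the description in a var iframeContent = "..." variable as the post says (the variable name comes from the post, not verified here):

Python:
import re
import urllib.parse

import requests

html = requests.get(
    'https://www.amazon.com/dp/B00OZQZUJ6?psc=1',
    headers={'User-Agent': 'Mozilla/5.0'},
).text

match = re.search(r'var iframeContent = "(.*?)"', html)
if match:
    # the captured value is URL-encoded HTML, so decode it before using it
    description_html = urllib.parse.unquote(match.group(1))
    print(description_html)
else:
    print('iframeContent not found in the page source')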
 
It wasn't even matching anything outside the head.
But whatever.

I made it work; if anyone is interested in the code, send me a PM.

You can give it a search link and it creates a new folder for each item, downloads all the images, and saves the product description, title and features in an HTML file.

It's only half working, since it looks like Amazon has two types of product pages with different classes, so I just have to modify it to handle both.
 
Whenever you're using BeautifulSoup and want to get content that is generated by JavaScript, just use a headless browser such as PhantomJS.
 
IIRC Amazon used to offer an XML API you could use. Scraping their site was unnecessary once that became available. Dunno if it still exists though.
 
Use html5lib for scraping via BeautifulSoup.

Also use Selenium's find elements by XPath,
and check what you actually want to grab or click by cross-checking the element's .text and .get_attribute().
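A minimal sketch of both suggestions, assuming Selenium + PhantomJS (old find_element_by_* API) and that html5lib is installed; #productTitle is just an example target:

Python:
from bs4 import BeautifulSoup
from selenium import webdriver

browser = webdriver.PhantomJS()
browser.get('https://www.amazon.com/dp/B00OZQZUJ6?psc=1')

# html5lib parses the markup the way a browser would, which is more forgiving
# with messy HTML than the default parser
soup = BeautifulSoup(browser.page_source, 'html5lib')
title = soup.select_one('#productTitle')
print(title.get_text(strip=True) if title else 'not found')

# the same element through Selenium by XPath, cross-checking .text and an attribute
element = browser.find_element_by_xpath('//span[@id="productTitle"]')
print(element.text, element.get_attribute('id'))

browser.quit()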
 
Scraping Amazon is a little difficult; they have a lot of anti-scraping measures. We figured out a way around it using proxy rotators. Maybe our developer can give you some consulting on the project if you're interested.

Send a PM if interested
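A minimal sketch of the proxy-rotation idea with requests; the proxy addresses below are placeholders, not real endpoints:

Python:
import itertools

import requests

# placeholder proxies; swap in real rotating-proxy endpoints
PROXIES = itertools.cycle([
    'http://proxy1.example.com:8080',
    'http://proxy2.example.com:8080',
    'http://proxy3.example.com:8080',
])

def fetch(url):
    # each request goes out through the next proxy in the rotation,
    # so repeated hits don't all come from one IP
    proxy = next(PROXIES)
    return requests.get(
        url,
        proxies={'http': proxy, 'https': proxy},
        headers={'User-Agent': 'Mozilla/5.0'},
        timeout=10,
    )

response = fetch('https://www.amazon.com/dp/B00OZQZUJ6?psc=1')
print(response.status_code)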
 
Boom 5 minutes with NodeJS, time to switch:

Code:
const osmosis = require('osmosis');
const header = {
  "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
  "accept-encoding": "br;q=1.0, gzip;q=0.8, *;q=0.1",
  "accept-language": "en,en-US,en-CA;q=0.9",
  "cache-control": "public,stale-while-revalidate,must-revalidate, max-age=86400",
  "content-security-policy": "block-all-mixed-content; object-src 'none'; worker-src 'none'; img-src https:;default-src https:",
  "referer": "https://www.amazon.com",
  "referrer-policy": "origin",
  "strict-transport-security": "max-age=31536000; includeSubDomains; preload;",
  "upgrade-insecure-requests": "1",
  "user-agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36",
  "x-content-type-options": "nosniff",
  "x-dns-prefetch-control": "on",
  "x-frame-options": "SAMEORIGIN",
  "x-xss-protection": "1"
};

product_description('https://www.amazon.com/dp/B00OZQZUJ6?psc=1');


function product_description(url) {
  osmosis
  .get(url)
  .config('headers', header)
  .click('#productDescription_feature_div > h2')
  .set({
    'description': '#productDescription p',
  })
  .data(function(data) {
    console.log(`Product description: \n\n\n${data.description}`);
  })
}
 
Can osmosis execute the page's JavaScript?
 