
Trying to scrape amazon

Discussion in 'Other Languages' started by maxibaby, Feb 17, 2015.

  1. maxibaby

    maxibaby Junior Member

    Joined:
    Apr 20, 2013
    Messages:
    103
    Likes Received:
    40
    Occupation:
    Student
    Location:
    Venezuela
    Hey guys, I'm trying to scrape Amazon products.
    Python:
    Requests + BeautifulSoup4

    For example...

    amazon /dp/B00OZQZUJ6?psc=1
    // Can't post link

    The problem comes right when we want to scrape the description:
    the frame is populated directly by JS instead of having a source.

    Does anyone know how I could get that?
    Or at least help me understand what I should emulate to get the info?
    How can I find out which JS is populating it?

    Thanks
     
  2. jazzc

    jazzc Moderator Staff Member Moderator Jr. VIP

    Joined:
    Jan 27, 2009
    Messages:
    2,642
    Likes Received:
    11,355
    Occupation:
    Pusillanimous Knitter
    Location:
    Buenos Aires
    Either study the code and reconstruct what it does client-side, or use a full browser (like PhantomJS).
     
  3. maxibaby

    maxibaby Junior Member

    Hello,

    Any reading material you would recommend?
     
  4. jazzc

    jazzc Moderator Staff Member Moderator Jr. VIP

  5. maxibaby

    maxibaby Junior Member

    Just read it, already installed it, and I'm trying some things.
    Thanks

    Just want to ask.
    From what I understood, the main problem with requests is that, since it is not a real browser, it doesn't execute JavaScript.
    PhantomJS, on the other hand, does, so the newly generated content should now be there.

    And it actually is, since I just took a screenshot with Selenium + PhantomJS. Thanks!
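
To illustrate the point above, here is a minimal sketch of what a client that does not run JavaScript gets to see. The markup and the iframe id are made up for the example (they are not copied from Amazon); only the `iframeContent` variable name comes from this thread. The parser finds the iframe, but it has no `src`, and the text that would fill it is still sitting inside the script.

```python
from html.parser import HTMLParser

# Hypothetical static markup, standing in for what a plain HTTP client receives.
STATIC_HTML = """
<html><body>
  <iframe id="product-description-iframe"></iframe>
  <script>
    var iframeContent = "%3Cp%3EHello%3C%2Fp%3E";
    // a real browser would run code here that writes iframeContent into the iframe
  </script>
</body></html>
"""


class IframeCollector(HTMLParser):
    """Record the attributes of every <iframe> start tag."""

    def __init__(self):
        super().__init__()
        self.iframes = []

    def handle_starttag(self, tag, attrs):
        if tag == "iframe":
            self.iframes.append(dict(attrs))


def find_iframes(html):
    """Return one attribute dict per <iframe> found in the markup."""
    collector = IframeCollector()
    collector.feed(html)
    return collector.iframes


iframes = find_iframes(STATIC_HTML)
print(iframes)  # the iframe is there, but without a src to follow
```

A browser driven through Selenium runs the script first, which is why the rendered page contains the description while the raw response does not.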
     
  6. jazzc

    jazzc Moderator Staff Member Moderator Jr. VIP

    Exactly, glad you figured it out :)
     
  7. maxibaby

    maxibaby Junior Member

    I have another question:

    I already had some things made with BeautifulSoup + requests.
    So basically... if I get the content with PhantomJS now,
    my BeautifulSoup code should work the same... but it doesn't(?)
    None of the selectors I had are working. But if I fetch the site with requests, they do.

    How is that possible?

    It actually looks like BeautifulSoup is only searching in the head, not the complete body?

    Or should I be using PhantomJS for the scraping as well:
    title = browser.find_element_by_css_selector('#productTitle').text
     
    Last edited: Feb 17, 2015
  8. maxibaby

    maxibaby Junior Member

    Well, I made it work with Selenium selectors.
    But I don't understand why BeautifulSoup stopped working and was only searching in the head.
    Maybe I'm missing some theory. Do you know?
     
  9. sockpuppet

    sockpuppet Junior Member

    Joined:
    Nov 7, 2011
    Messages:
    155
    Likes Received:
    145
    The product description is inside the page's code.
    Try a regex like /var iframeContent = "(.*?)"/ to get it,
    then URL-decode the match.
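
A rough sketch of that regex approach in Python, using only the standard library. The sample string is invented for the example; whether the real page still defines `iframeContent` in exactly this form is an assumption, so treat the pattern as a starting point.

```python
import re
from urllib.parse import unquote

# Pattern taken from the suggestion above; non-greedy so it stops at the first closing quote.
IFRAME_CONTENT_RE = re.compile(r'var iframeContent = "(.*?)"')


def extract_description(page_source):
    """Return the URL-decoded iframeContent value, or None if it is not in the page."""
    match = IFRAME_CONTENT_RE.search(page_source)
    if match is None:
        return None
    return unquote(match.group(1))


# Hypothetical snippet standing in for the HTML returned by requests.
SAMPLE = 'var x = 1;\nvar iframeContent = "%3Cdiv%3EGreat%20product%3C%2Fdiv%3E";\n'

description = extract_description(SAMPLE)
print(description)
```

The decoded value is itself HTML, so it can be handed to BeautifulSoup afterwards if only the text is wanted.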
     
  10. maxibaby

    maxibaby Junior Member

    It wasn't even matching anything outside the head.
    But whatever.

    I got it working; if anyone is interested in the code, send me a PM.

    You give it a search link, and it creates a new folder for each item, downloads all the images, and saves the product description, title, and features in an HTML file.

    It's only half working, since it looks like Amazon has two types of pages with different classes, so I just have to modify it to handle both.
     
  11. TheVegan

    TheVegan Junior Member

    Joined:
    Mar 6, 2013
    Messages:
    179
    Likes Received:
    33
    Occupation:
    blackhat
    Location:
    Prague
    Whenever you're using BeautifulSoup and want content that is generated by JavaScript, just use a headless browser such as PhantomJS.
     
  12. bartosimpsonio

    bartosimpsonio Jr. VIP Jr. VIP Premium Member

    Joined:
    Mar 21, 2013
    Messages:
    12,748
    Likes Received:
    11,414
    Occupation:
    COINZ
    Location:
    BUYAH
    IIRC Amazon used to offer an XML API you could use. Scraping their site has been unnecessary since that became available. Don't know if it still exists though.
     
  13. sohom

    sohom Senior Member

    Joined:
    May 26, 2013
    Messages:
    990
    Likes Received:
    175
    Location:
    not in Past
    Use html5lib as the parser when scraping via BeautifulSoup.

    Also, use Selenium to find elements by XPath,
    and check what you actually want to grab or click by inspecting the element's .text and .get_attribute() values.
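
The cross-checking step could look something like the sketch below. `inspect_elements` is a hypothetical helper, not part of Selenium; it only relies on `find_elements(by, value)`, `.text` and `.get_attribute(name)`, which Selenium drivers and elements provide. `FakeDriver` and `FakeElement` are stand-ins so the example runs without a browser.

```python
XPATH = "xpath"  # same string value as selenium.webdriver.common.by.By.XPATH


def inspect_elements(driver, xpath, attribute="href"):
    """Return (text, attribute value) for every element matching the XPath."""
    results = []
    for element in driver.find_elements(XPATH, xpath):
        results.append((element.text, element.get_attribute(attribute)))
    return results


# --- Stand-ins for a real browser session (hypothetical, for illustration only) ---
class FakeElement:
    def __init__(self, text, attributes):
        self.text = text
        self._attributes = attributes

    def get_attribute(self, name):
        # Like Selenium, return None when the attribute is missing.
        return self._attributes.get(name)


class FakeDriver:
    def __init__(self, elements):
        self._elements = elements

    def find_elements(self, by, value):
        # A real driver would evaluate the XPath; the stub returns everything.
        return list(self._elements)


driver = FakeDriver([
    FakeElement("Buy now", {"href": "/cart"}),
    FakeElement("Details", {}),
])
report = inspect_elements(driver, "//a")
print(report)
```

With a real session, the same function would be called with the Selenium driver object in place of `FakeDriver`, which makes it easy to see whether an XPath matches the element you meant before clicking or scraping it.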
     
  14. mister_digital

    mister_digital Junior Member

    Joined:
    Jun 22, 2016
    Messages:
    102
    Likes Received:
    6
    Scraping Amazon is a little difficult; they have a lot of anti-scraping software. We figured out a way around it using proxy rotators. Maybe our developer can give you some consulting on the project if you're interested.

    Send a PM if interested