
Best Languages for Web Scraping

Discussion in 'General Programming Chat' started by abd0gheist, Dec 19, 2015.

  1. abd0gheist

    abd0gheist Newbie

    Joined:
    Dec 19, 2015
    Messages:
    2
    Likes Received:
    0
    Hello all,

    As my title suggests, I'm wondering which programming languages are best for developing web scrapers. I don't know much about developing or using them, so please pardon the vagueness of my question. Are there particular advantages to using a general-purpose language such as Python for building a scraper, versus a more specialized language?
     
  2. kahuna74

    kahuna74 Regular Member

    Joined:
    Aug 19, 2014
    Messages:
    270
    Likes Received:
    102
    Gender:
    Male
    Occupation:
    Software Developer
    Location:
    Grand Rapids, MI
    Whatever language allows you to send HTTP requests and parse (X)HTML or JSON. Whatever you use to parse (X)HTML should be forgiving, because there is a lot of badly formed HTML out there.
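
    To illustrate what "forgiving" means here, below is a minimal sketch using only Python's standard library `html.parser`. The `LinkCollector` class, the `extract_links` helper, and the `BAD_HTML` sample are made up for this example; the markup is deliberately broken (unclosed tags, an unquoted attribute) to show that a tolerant parser can still pull the links out.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, tolerating sloppy markup."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value can be None
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical sample page: unclosed <p>, <b> and <a>, an unquoted
# attribute value, and no closing </html>.
BAD_HTML = """
<html><body>
<p>First <b>item <a href=/page/1>one</a>
<p>Second item <a href="/page/2">two
</body>
"""


def extract_links(html_text):
    """Feed the text to the parser and return every href it saw."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links


print(extract_links(BAD_HTML))
```

    In a real scraper the text would come from an HTTP request (for example `urllib.request.urlopen(url).read().decode()`), which is left out here so the sketch runs offline.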

    I've used Ruby, Python, Clojure, and even shell scripts to scrape. They all work pretty well.
     
  3. ekapek

    ekapek Jr. VIP Premium Member

    Joined:
    Aug 2, 2010
    Messages:
    273
    Likes Received:
    48
    Home Page:
    There are many good scrapers available. You can try Scrapy for Python.
     
  4. timtamboy63

    timtamboy63 Newbie

    Joined:
    Dec 26, 2015
    Messages:
    18
    Likes Received:
    0
    Pretty much any language will have a web scraping package. I've used Ruby + Nokogiri myself, but Python has a few good ones too.
     
  5. Bane Bentley

    Bane Bentley Jr. VIP

    Joined:
    Jun 13, 2013
    Messages:
    180
    Likes Received:
    35
    I use VB.Net without a problem.

    It all depends on how much time you have to invest in a given technology.
     
  6. Sheraf

    Sheraf Registered Member

    Joined:
    Jan 19, 2014
    Messages:
    61
    Likes Received:
    8
    Python: use python-requests to query pages and python-lxml to parse the HTML (you can use XPath or cssselect to select the elements you'd like to extract).
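
    For a feel of what XPath-style selection looks like, here is a rough sketch. Note the swap: it does not use python-requests or python-lxml. It uses the standard library's `xml.etree.ElementTree`, which supports only a limited XPath subset and needs well-formed markup, so it is not a forgiving parser. The `XHTML` sample and the `select_titles` helper are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (ElementTree rejects broken markup).
XHTML = """
<html>
  <body>
    <div class="post"><a class="title" href="/t/1">First thread</a></div>
    <div class="post"><a class="title" href="/t/2">Second thread</a></div>
    <div class="ad"><a href="/sponsor">Sponsored</a></div>
  </body>
</html>
"""


def select_titles(markup):
    """Return (text, href) for every <a> whose class attribute is 'title'."""
    root = ET.fromstring(markup)
    # Limited XPath: any <a> descendant with class="title"
    matches = root.findall(".//a[@class='title']")
    return [(a.text, a.get("href")) for a in matches]


for text, href in select_titles(XHTML):
    print(text, href)
```

    The same kind of path expression should carry over to lxml, which also copes with messy real-world HTML.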
     
  7. rubymooree

    rubymooree Registered Member

    Joined:
    Jul 16, 2011
    Messages:
    96
    Likes Received:
    13
    Occupation:
    In House SEO
    Python is a good programming language to get started with, and it's very good at scraping.
     
  8. blueboy121

    blueboy121 Newbie

    Joined:
    Oct 21, 2015
    Messages:
    37
    Likes Received:
    1
    I've been scraping using Python's requests and BeautifulSoup modules. Is there any benefit to using Scrapy? Is it standalone software or more like a module?
     
  9. immaletyoufinish

    immaletyoufinish Regular Member

    Joined:
    Mar 3, 2016
    Messages:
    219
    Likes Received:
    113
    There are plenty of choices. I often write scrapers in bash (shell scripts). I just use either cURL or wget to hit the URL and download the page, then extract the content I want as needed using regex with grep and sed. It's quick and dirty, but it's magic.

    I also use iMacros in combination with JavaScript. I find that once you learn the iMacros syntax, macros can be very fast to whip up.

    Another tech stack you could scrape with is Java + Selenium + PhantomJS.

    The sky is the limit. A protip when writing a scraper for a given site is to hit F12 in your browser to bring up the dev tools, then use the selection mode and hover over the text or image you are interested in scraping. The dev tools should give you an indication of which CSS selector you need to target to extract that bit of data.

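    And if you're writing scrapers using regex, one gotcha to watch out for is greedy pattern matching. Newbies might find their neatly crafted regex matches the entire page because they ended it with a " or a > and the greedy quantifier in front of it ran all the way to the last one on the page instead of the next one.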
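
    The greedy gotcha is easy to see in a small sketch with Python's `re` module. The `PAGE` string is a made-up stand-in for a downloaded page; the three patterns differ only in how they reach the closing quote.

```python
import re

# Hypothetical one-line "page" with two links.
PAGE = '<a href="/one">One</a> <a href="/two">Two</a>'

# Greedy: .* runs to the LAST double quote on the line.
greedy = re.findall(r'href="(.*)"', PAGE)

# Lazy: .*? stops at the NEXT double quote.
lazy = re.findall(r'href="(.*?)"', PAGE)

# Negated character class: never crosses a double quote at all.
negated = re.findall(r'href="([^"]*)"', PAGE)

print(greedy)   # one oversized match spanning both links
print(lazy)     # ['/one', '/two']
print(negated)  # ['/one', '/two']
```

    The negated class is usually the safer habit, since by default `.` stops at newlines and so `.*?` will not follow an attribute that wraps onto the next line.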
     
  10. tompots

    tompots Elite Member Premium Member

    Joined:
    Dec 11, 2011
    Messages:
    4,371
    Likes Received:
    3,964
    Gender:
    Male
    Occupation:
    Full Time Bot Developer
    Location:
    Automation Alternatives
    Home Page:
    Here is a great list of tools that may interest you
    Code:
    https://www.quora.com/Which-are-some-of-the-best-web-data-scraping-tools
    https://www.google.com/#q=top+web+scraper
    
     
  11. collegeguys4

    collegeguys4 Junior Member

    Joined:
    Feb 25, 2016
    Messages:
    124
    Likes Received:
    15
    Occupation:
    computer engineer
    Location:
    New York
    Home Page:
    I have used Perl over the years to scrape many sites, log into sites, etc. With the prevalence of JavaScript on sites, I have found a headless browser like PhantomJS the best way to get at the final rendered page.
     
  12. Bahmer

    Bahmer Regular Member

    Joined:
    Jul 8, 2015
    Messages:
    261
    Likes Received:
    60
    This is good advice. For programming-related questions, though, it's best to look at places like Stack Overflow before asking on here.
     
  13. immaletyoufinish

    immaletyoufinish Regular Member

    Joined:
    Mar 3, 2016
    Messages:
    219
    Likes Received:
    113
    I tend to find that questions about scraping get heavily downvoted on SO. That's mainly because, as a developer, if it's come to the point where you have to ask something on Stack Overflow, it's usually because it's a really challenging problem and you haven't been able to solve it yourself, so you ask for some help. Then along comes some clueless script kiddie who's like 'need to scrape 1000 porn urls pls help', and that kind of post among a group of professionals just comes off the wrong way.

    Though, if you are going to ask scraping-related questions on Stack Overflow, make them look more professional. Ask specifically about the library you are using for scraping (and make sure to tag the question with that library's tag). If possible, try not to use the word scraping. You could phrase question titles like 'having trouble targeting css selector' or 'regex accidentally matches the whole page, why?'. You will likely get help without getting downvoted this way.