How to start with Scraping

Discussion in 'General Programming Chat' started by Whits Simpson, Aug 22, 2017.

  1. Whits Simpson

    Whits Simpson Newbie

    Joined:
    Jul 11, 2017
    Messages:
    11
    Likes Received:
    3
    Gender:
    Male
    Hello everyone!

    I am a student learning PHP and Java, and I already know both fairly well. I would like to start with web scraping, so I would like to hear some tips and suggestions from you. What is the best way to start with scraping? What tools are you using? Are you writing custom scripts, or are you using scraping software?
    How should I start?
     
  2. bartosimpsonio

    bartosimpsonio Jr. VIP

    Joined:
    Mar 21, 2013
    Messages:
    15,342
    Likes Received:
    13,707
    Occupation:
    LOST MY HOUSE
    Location:
    IN CRYPTO
    I like your username.
     
  3. Whits Simpson

    Family members :)
     
  4. knaitas

    knaitas BANNED

    Joined:
    Jul 26, 2016
    Messages:
    176
    Likes Received:
    135
    Gender:
    Male
    echo file_get_contents('https://www.blackhatworld.com');

    is all you need bro
     
  5. Yildiz

    Yildiz Regular Member

    Joined:
    Mar 9, 2012
    Messages:
    443
    Likes Received:
    152
    Occupation:
    Software Engineer
    Location:
    Boston, MA
    If you've got a background in Java, there's a library called JSoup that I've used before.
    It really makes scraping the web easy, and if you need to automate tasks you can use Selenium, a browser automation framework with Java bindings.
    Although neither is a ready-made "tool", these two resources will easily let you create any scraper / web automation tool you'd like.
     
  6. bluehatface

    bluehatface Regular Member

    Joined:
    Oct 19, 2013
    Messages:
    279
    Likes Received:
    118
    Location:
    Here
    Start with an idea of what you want to scrape. If it's a simple, non-JS site, then you can request the page with cURL, spoofing the headers and user agent, and parse the result using regex or some clever string manipulation. If it's a more difficult site, have a look at Selenium for Java or PHP, or at CasperJS or PhantomJS for a headless browser.
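
    A minimal sketch of the simple, non-JS case in plain Java 11+ (standard library only, no cURL binding), in case it helps to see it written down. The class name, the User-Agent string, the extra header values and the href regex are all assumptions made up for illustration; the regex only handles double-quoted href attributes and is not a general HTML parser.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SimpleScraperSketch {

    // Hypothetical browser-like User-Agent; substitute whatever a real browser sends.
    static final String USER_AGENT =
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:55.0) Gecko/20100101 Firefox/55.0";

    // Naive pattern for double-quoted href attributes inside <a> tags.
    static final Pattern HREF =
            Pattern.compile("<a\\s[^>]*?href=\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Builds a GET request that carries browser-like headers.
    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(20))
                .header("User-Agent", USER_AGENT)
                .header("Accept", "text/html,application/xhtml+xml")
                .header("Accept-Language", "en-US,en;q=0.5")
                .GET()
                .build();
    }

    // Sends the request and returns the response body as a string.
    static String fetch(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        return client.send(buildRequest(url), HttpResponse.BodyHandlers.ofString()).body();
    }

    // Collects every href value the pattern finds, in document order.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) throws Exception {
        // With no argument, parse a built-in sample so the sketch runs offline.
        String html = args.length > 0
                ? fetch(args[0])
                : "<p><a href=\"/page1\">One</a> <a class=\"x\" href=\"/page2\">Two</a></p>";
        extractLinks(html).forEach(System.out::println);
    }
}
```

    Run with a URL as the first argument, it fetches that page and prints the links it finds; run with no arguments, it parses the built-in sample and prints /page1 and /page2.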
     
  7. xrfanatic

    xrfanatic Jr. VIP

    Joined:
    Aug 28, 2010
    Messages:
    424
    Likes Received:
    177
    Gender:
    Male
    Location:
    http://bit.ly/slb64
    C# + Selenium WebDriver (Firefox) does the job properly in my case. If you work with Java and you are OK with gathering data through a browser (which is what Selenium does), you can give it a try.
     
  8. Crawlie

    Crawlie Registered Member

    Joined:
    Jan 2, 2017
    Messages:
    52
    Likes Received:
    11
    Gender:
    Male
    For Java, check out HtmlUnit. It's lightweight and fast, and it executes JavaScript, so it can handle JavaScript-generated content. JSoup only parses static HTML. You can also use Selenium, but that eats more resources. A basic scraping script for a site is only a few lines of code.
     
  9. Paybis123

    Paybis123 Newbie

    Joined:
    Aug 18, 2017
    Messages:
    4
    Likes Received:
    0
    Gender:
    Male
    You can use Import.io, a free web scraper for non-programmers. I have used it before to scrape some forums. You can check this tutorial on how to use it.
     
  10. malayguru

    malayguru Regular Member

    Joined:
    Oct 29, 2012
    Messages:
    386
    Likes Received:
    60
    Gender:
    Male
    Occupation:
    Entrepreneur
    Location:
    Singapore
    What do you want to scrape? You can use one of the free or paid scraper tools available; don't reinvent the wheel.
     
  11. DrPorn

    DrPorn Junior Member

    Joined:
    Mar 20, 2016
    Messages:
    136
    Likes Received:
    195
    I highly recommend Python and Scrapy. Seriously, Scrapy is the best scraper on the planet if you know how to code in Python. If not, learn.
     
  12. YesAndNo

    YesAndNo BANNED

    Joined:
    Nov 20, 2017
    Messages:
    25
    Likes Received:
    35
    Gender:
    Male
    How long have you been programming with Scrapy?
     
  13. bigot

    bigot Registered Member

    Joined:
    May 9, 2017
    Messages:
    76
    Likes Received:
    39
    Gender:
    Male
    Occupation:
    Programmer
    Location:
    Canada
    That's a good example of what not to do, lol.

    It depends on your experience level with PHP and Java, your understanding of HTTP requests/responses, and HTML.

    If you're using PHP, definitely opt for cURL instead of file_get_contents. PHP's cURL implementation is fairly easy to learn. If you already understand HTTP, it's a breeze. Then for extracting data, you can use DOM if you prefer object-oriented programming (if you like Java, you probably love OOP). If you prefer procedural programming, go with regex. Regardless of your preference, you should consider the project at hand and what would work better.

    If you're using Java, there's more setup and crap to deal with. But it's well worth it if you're building a program you'll use often; it's easy to thread, and you can throw the .jar on a VPS very quickly and easily.
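
    To illustrate the "easy to thread" point, here is a rough sketch using a fixed thread pool from java.util.concurrent. The class name, the method name fetchAll, the pool size and the stand-in fetcher are all hypothetical; in a real scraper the fetcher would be whatever HTTP call you already have.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

class ThreadedFetchSketch {

    // Runs `fetcher` for every URL on a fixed-size pool and returns url -> result
    // in input order. Duplicate URLs collapse into a single map entry.
    static Map<String, String> fetchAll(List<String> urls,
                                        Function<String, String> fetcher,
                                        int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit everything first so the downloads overlap.
            Map<String, Future<String>> pending = new LinkedHashMap<>();
            for (String url : urls) {
                pending.put(url, pool.submit(() -> fetcher.apply(url)));
            }
            // Then wait for each result in turn.
            Map<String, String> results = new LinkedHashMap<>();
            for (Map.Entry<String, Future<String>> e : pending.entrySet()) {
                try {
                    results.put(e.getKey(), e.getValue().get());
                } catch (InterruptedException ex) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted on " + e.getKey(), ex);
                } catch (ExecutionException ex) {
                    throw new IllegalStateException("fetch failed for " + e.getKey(), ex.getCause());
                }
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Stand-in fetcher so the sketch runs offline; swap in a real HTTP call.
        List<String> urls = List.of("https://example.com/a", "https://example.com/b");
        Map<String, String> pages = fetchAll(urls, url -> "<html>" + url + "</html>", 4);
        pages.forEach((url, body) -> System.out.println(url + " -> " + body.length() + " chars"));
    }
}
```

    Passing the fetcher in as a function keeps the threading separate from the HTTP code, so the same pool logic works whether the pages come from an HTTP client, JSoup, or a test stub.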

    All of this assumes you are scraping a site that doesn't heavily rely on JavaScript to generate the content you want to scrape. If it does, see some of the other responses for JS engines. Though if you investigate the site you want to scrape, more likely than not the content comes from an AJAX request that returns easily readable JSON. Then you can request that endpoint directly and skip the main page altogether.

    The best thing to do is just dive in. If you have no practical use for scraping right now, make something up. For example, try scraping phone numbers/names off YellowPages.

    Best of luck!
     
  14. itz_styx

    itz_styx Jr. VIP

    Joined:
    May 8, 2012
    Messages:
    1,026
    Likes Received:
    527
    Occupation:
    CEO / Admin / Developer
    Location:
    /dev/mem
    That is if you want something easy for small-scale scraping. If you want to scrape millions of sites, for example, then a scripting language isn't a good solution.
    Python + Scrapy is far from the best solution, sorry.
     
  15. A1coder

    A1coder Newbie

    Joined:
    Nov 21, 2017
    Messages:
    10
    Likes Received:
    1
    Gender:
    Male
    If you have some Java or Python skills, this would help you.
     
  16. majky538

    majky538 Registered Member

    Joined:
    Mar 4, 2014
    Messages:
    85
    Likes Received:
    3
    For easy scraping, use PHP. Download the page with the file_get_contents() function or, preferably, cURL, which gives you more options for setting headers and so on, or use GuzzleHttp\Client for async requests and other features. Next, you can parse the page with the DOM classes; parts of that API are fairly old and a little painful to use, but they should work well.

    For dynamic pages I recommend C# with Selenium, as mentioned above. It makes rendering the page pretty easy and works even with JavaScript.