WebScraper: need some advice

Discussion in 'C, C++, C#' started by xuinia, Mar 9, 2016.

  1. xuinia

    xuinia Newbie

    Joined:
    Mar 9, 2016
    Messages:
    4
    Likes Received:
    0
    I need to build a little web scraping tool. The problem is that I haven't done any C++ programming for many years, so I have no idea where to even start now. A few years back I made such a scraper with PHP, but it's a pain in the a** to use, so I want to recode it in C++ and compile a standalone application. So I have a few questions.
    1. Before, I used C++ Builder to code. Should I stick with that again? Or is there something better (preferably free)?
    2. I will need some sort of library to fetch website contents. Any advice on that?
    3. I will need to go through the HTML to find the data I need. In PHP I used simple_html_dom. Is there anything like that in C++?
    4. Any other advice is very welcome :)

    P.S. The scraped data will go into a MySQL database.
     
    Last edited: Mar 9, 2016
  2. Bahmer

    Bahmer Regular Member

    Joined:
    Jul 8, 2015
    Messages:
    261
    Likes Received:
    60
    I would use Python for a web scraping tool. Personally, I could do it much more easily with Python because of all the sweet, sweet modules and add-ons.
     
  3. Des_cartes

    Des_cartes Junior Member

    Joined:
    Jan 19, 2012
    Messages:
    160
    Likes Received:
    64
    +1 for Python + Scrapy, it will make things way easier for you.
     
  4. Bahmer

    Bahmer Regular Member

    Even Beautiful Soup or any of those would be better than writing it in C++. God, that just gives me the heebie-jeebies thinking about it haha.
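
    For a rough idea of what the HTML-parsing step (question 3 in the first post) can look like in Python, here is a minimal sketch. It deliberately uses the standard library's html.parser rather than Beautiful Soup or Scrapy, so it runs without installing anything. The LinkCollector class, the extract_links helper and the sample HTML are made up for illustration only.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects (href, text) pairs for every <a href=...> tag in a document."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> tag we are currently inside
        self._text = []      # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text while we are inside a link that has an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Hypothetical helper: return all (href, text) pairs found in html."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up sample page, standing in for downloaded website contents.
SAMPLE_HTML = """
<html><body>
  <a href="/page1">First page</a>
  <p>No link here</p>
  <a href="/page2">Second <b>page</b></a>
</body></html>
"""

if __name__ == "__main__":
    for href, text in extract_links(SAMPLE_HTML):
        print(href, "->", text)
```

    Beautiful Soup or Scrapy selectors would make this shorter; the sketch only shows that walking the tags is not much work even with the built-in parser.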
     
  5. xuinia

    xuinia Newbie

    The only reason I'm thinking of C++ is because I did some "advanced beginner's :)" programming a few years back, so I thought it would be easier.
    I don't know anything about Python. Is it just scripting (similar to PHP), or can you compile an actual app?
     
  6. 9to5destroyer

    9to5destroyer Jr. VIP

    Joined:
    Nov 14, 2011
    Messages:
    359
    Likes Received:
    206
    C++ will be harder if you haven't built any type of scraper before and have only done advanced-beginner stuff.

    You mention compiling a standalone app. What's your reason for this: selling it as a product, or distributing it to staff?

    thanks
    9to5
     
  7. xuinia

    xuinia Newbie

    A few people will use this on different machines, so a standalone app would be the best solution, I guess.
     
  8. 9to5destroyer

    9to5destroyer Jr. VIP

    I would say C# then for a standalone app, but that's my preferred language, so I'm slightly biased.
     
  9. cnick79

    cnick79 Jr. VIP

    Joined:
    Jun 10, 2010
    Messages:
    686
    Likes Received:
    369
    Location:
    Google's SandBox
    If you have JavaScript experience, then have a look at Node.js. There's no better way to scrape HTML than using something built to traverse the DOM.
     
  10. jimbobo2779

    jimbobo2779 Jr. VIP

    Joined:
    Sep 17, 2008
    Messages:
    3,644
    Likes Received:
    2,618
    Occupation:
    Software Engineer
    Location:
    UK
    If you use C#, you can use Microsoft Visual Studio Express, which is a really good free IDE. If you only have a basic knowledge of coding, you may find it easier in the long run to learn some C#, as it is a fairly quick language to pick up, more so than C++.

    With C# you could then use the HtmlAgilityPack, again free, to parse through the document as if it were an XML document, which is really easy.

    Whenever I have gotten stuck with something in C#, there is often a tutorial (admittedly there are tuts for pretty much all modern languages), and it doesn't sound like you are looking to do anything too crazy.

    For interfacing with a MySQL DB you can use MySQL Connector/Net, again free, which makes it just as easy as it would be in something like PHP.

    One of my biggest bugbears is creating a functional yet not overly hideous UI for my software, and having something like Visual Studio Express makes at least basic UIs quick and easy to create, so you can get down to the nitty-gritty of coding much quicker.
     
  11. ToxicBlack

    ToxicBlack Regular Member

    Joined:
    Mar 25, 2016
    Messages:
    223
    Likes Received:
    57
    Occupation:
    Programming custom bots and tools.
    Location:
    botland
    Well, C++ is the hard way compared to C#/Python, but it can be very powerful.

    Today C++ with Qt is very simple to work with, and they even have a very nice IDE for programming (Qt Creator).

    Check their documentation and you will see...
     
  12. blackcodez

    blackcodez Newbie

    Joined:
    Oct 10, 2014
    Messages:
    28
    Likes Received:
    6
    Definitely use Python. Google uses it heavily (it was created by Guido van Rossum, not by Google), and it is very popular for math (algorithms) and data collection (scraping). With Python you can then run multi-threaded workers ;)
    For the guy that said Node.js/JavaScript... bro... you can't. Well, technically you can, but it requires too much CPU to process everything, so it would only be cost-effective to run it in a single thread.
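
    The "multi-threaded workers" idea can be sketched with the standard library's concurrent.futures module. This is only an illustration under assumptions: fetch_all and fake_fetch are hypothetical names, and fake_fetch stands in for a real download function so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=4):
    """Call fetch(url) for every URL on a pool of worker threads.

    Returns a dict mapping each URL to whatever fetch returned for it.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input URLs.
        results = list(pool.map(fetch, urls))
    return dict(zip(urls, results))


def fake_fetch(url):
    # Placeholder for a real download, e.g. urllib.request.urlopen(url).read()
    return "<html>contents of %s</html>" % url


if __name__ == "__main__":
    pages = fetch_all(["http://example.com/a", "http://example.com/b"], fake_fetch)
    for url, body in pages.items():
        print(url, len(body))
```

    Threads tend to pay off here because a scraper spends most of its time waiting on the network rather than computing.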
     
  13. Dev Warrior

    Dev Warrior Jr. VIP Premium Member

    Joined:
    Oct 13, 2015
    Messages:
    253
    Likes Received:
    30
    I would suggest you start with C# because it has a number of base classes/functions already written, which lets you write code with more ease and performance. Also, there are a lot of third-party frameworks/APIs already shared for free, like HtmlAgilityPack, CsQuery, AngleSharp, etc., for processing HTML pages. Visual Web Express is a lightweight and FREE IDE to write code fast!
     
  14. alphawow

    alphawow Newbie

    Joined:
    Jul 17, 2013
    Messages:
    10
    Likes Received:
    0
    The best solution would be PhantomJS with Python or C#: easy and fast.
     
  15. ekapek

    ekapek Jr. VIP Premium Member

    Joined:
    Aug 2, 2010
    Messages:
    266
    Likes Received:
    47
    There are a few very good open source solutions for scraping, like the already-mentioned Scrapy (Python). It is highly configurable and has very good performance. You can set it up and forget it.
     
  16. virtualprotect

    virtualprotect Newbie

    Joined:
    Dec 21, 2013
    Messages:
    25
    Likes Received:
    1
    I used C# to make my proxy scraper. I recommend you use C# or VB.NET if you're a beginner.