List of Web Scraping Tools to Extract Online Data

crossline

Elite Member
Premium Member
Apr 20, 2018
Some of them are free, and some have trial periods and premium plans. Typical use cases:
1. Collect Data for Market Research
2. Extract Contact Info
3. Download Solutions from Stack Overflow
4. Look for Jobs or Candidates
5. Track Prices from Multiple Markets
Code:
https://www.import.io/
https://dexi.io/
http://scrapinghub.com/
https://www.parsehub.com/
http://80legs.com/
https://www.scraperapi.com/
https://www.scrapesimple.com/
https://www.octoparse.com/
https://scrapy.org
https://www.diffbot.com
https://cheerio.js.org
https://www.crummy.com/software/BeautifulSoup/ 
https://github.com/GoogleChrome/puppeteer
https://www.mozenda.com/

There may be better suggestions from the members here.
 
Nice share! I was wondering whether any of them can be used for generating LinkedIn leads?
 
In every thread about Selenium, I see someone mention that it's easy to detect. Is that true?
That's true, but there are ways to modify the driver to avoid detection. For example, the $cdc_ flags can be changed with a hex editor (thanks to @rootjazz for the idea). Out of the box, ChromeDriver is pretty easy to detect. I'm not sure about the Firefox driver, by the way; I don't use it often.
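The hex-editor trick above can also be automated. A minimal sketch, assuming the replacement prefix keeps the same byte length (so offsets in the binary don't shift); the driver path is whatever your setup uses:

```python
# Sketch of the $cdc_ patch: ChromeDriver embeds JavaScript variable
# names prefixed with "cdc_" that pages can probe for. Overwriting the
# prefix in the binary with a SAME-LENGTH replacement removes that
# fingerprint without corrupting the executable.
from pathlib import Path

def patch_cdc(driver_path, new_prefix=b"xyz_"):
    """Replace every b'cdc_' in the chromedriver binary; return the count patched."""
    assert len(new_prefix) == len(b"cdc_"), "replacement must keep the byte length"
    data = Path(driver_path).read_bytes()
    count = data.count(b"cdc_")
    Path(driver_path).write_bytes(data.replace(b"cdc_", new_prefix))
    return count
```

You would need to re-run this after every ChromeDriver update, since each new binary ships with the prefix restored.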
 
That's elite Black Hat right there. Thanks for the tip!
There are several other ways, though; $cdc_ is just one of them. There's window.navigator.webdriver, for example (or something like that; I'm on mobile right now, so I'm too lazy to Google it). Plus the browser's user-agent string, initial cursor position, initial window dimensions, and much more. Google these and fix them before you try automating with Selenium, or you might lose your precious accounts.
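A couple of the signals listed above can be fixed with plain Chrome command-line switches before the driver even starts. A sketch, assuming a reasonably recent Chrome; the user-agent string and window size here are placeholder values you should match to a real browser:

```python
# Sketch: Chrome switches that mask a few of the giveaways above.
# --disable-blink-features=AutomationControlled stops Chrome from
# exposing navigator.webdriver as true; the other two pin the
# user-agent string and initial window dimensions to believable values.
def stealth_chrome_args(user_agent, width=1366, height=768):
    """Return command-line switches for a less detectable Chrome session."""
    return [
        f"--user-agent={user_agent}",
        f"--window-size={width},{height}",
        "--disable-blink-features=AutomationControlled",
    ]
```

Pass the list to your driver's Chrome options (e.g. one add_argument call per switch in Selenium).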
 
Yeah, that's true. Selenium can be multithreaded too, but it will eat all of your RAM. It doesn't scale that well.
Yep, just try running it with 100+ threads and watch your machine die if it doesn't have a ton of RAM and CPU :D
 
Yep, just try running it with 100+ threads and watch your machine die if it doesn't have a ton of RAM and CPU :D
True. Actually, I use it more for UX testing than scraping. For scraping, there are better ways.
 
Nah, that's too slow and bloated. It's better to do it with sockets, multithreaded :)

Too far the other way, IMO. Raw socket programming isn't necessary; why reinvent the wheel? HTTP request/response abstractions exist on top of sockets that give you 95% of the speed of writing your own TCP code without requiring you to write the raw headers to a stream yourself.
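To make that middle ground concrete, here is a stdlib-only sketch: no raw sockets, no browser, just an HTTP abstraction plus a bounded thread pool. The worker count and demo URLs are arbitrary; the demo runs against a throwaway local server so no network is needed:

```python
# Sketch: threaded scraping on top of an HTTP abstraction instead of
# raw sockets. urllib writes the request line, headers, and response
# handling for us; ThreadPoolExecutor bounds concurrency so memory
# stays flat, unlike one browser instance per page.
import http.server
import threading
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def fetch(url, timeout=10):
    """Return (url, status, body) for one page."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return url, resp.status, resp.read()

def fetch_all(urls, max_workers=20):
    """Fetch many URLs concurrently; result order matches the input list."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))

# Demo against a local throwaway server so the sketch is self-contained.
server = http.server.ThreadingHTTPServer(
    ("127.0.0.1", 0), http.server.SimpleHTTPRequestHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f"http://127.0.0.1:{server.server_port}/"
results = fetch_all([base] * 5)
print(len(results))  # 5
server.shutdown()
```

Swapping urllib for a session-based client would add connection reuse and cookie handling with the same structure.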


Yep, just try running it with 100+ threads and watch your machine die if it doesn't have a ton of RAM and CPU :D

That isn't how you thread Selenium. You need a system that spins up a cloud VPS, runs your code, then shuts down.

But it's about choosing the right tool for the job. If you have complex JavaScript to process or obfuscated IDs to work out (say, some random, seemingly do-nothing JS log call that actually gets you banned by its omission), then throwing Selenium at the problem will save you endless headaches and be the better tool.

HTTP requests should always be the preferred choice, but sometimes a browser is required.
 
There are several other ways, though; $cdc_ is just one of them. There's window.navigator.webdriver, for example (or something like that; I'm on mobile right now, so I'm too lazy to Google it). Plus the browser's user-agent string, initial cursor position, initial window dimensions, and much more. Google these and fix them before you try automating with Selenium, or you might lose your precious accounts.

Also, if you're running multiple accounts on the same setup, browser fingerprinting can ID you, but you can simply add some plugins to handle that.
 
Too far the other way, IMO. Raw socket programming isn't necessary; why reinvent the wheel? HTTP request/response abstractions exist on top of sockets that give you 95% of the speed of writing your own TCP code without requiring you to write the raw headers to a stream yourself.

That isn't how you thread Selenium. You need a system that spins up a cloud VPS, runs your code, then shuts down.

But it's about choosing the right tool for the job. If you have complex JavaScript to process or obfuscated IDs to work out (say, some random, seemingly do-nothing JS log call that actually gets you banned by its omission), then throwing Selenium at the problem will save you endless headaches and be the better tool.

HTTP requests should always be the preferred choice, but sometimes a browser is required.
Raw sockets, because they're the fastest and lightest on resources for scraping, which in most cases works just fine. Of course, using HTTP requests is also good.
As for Selenium, it's way too bloated if you plan on running something multithreaded and fast, such as a web spider that scrapes data from hundreds of thousands or even millions of domains.
If you have to execute JavaScript, there are sandboxed solutions where you can execute the JS, get the data you need, and return it in an HTTP request.
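For anyone curious what "doing it with sockets" actually looks like, here is a minimal sketch of an HTTP/1.0 GET over a raw TCP socket: you write the request line and headers yourself and read bytes until the server closes the connection (host, port, and path are whatever you target):

```python
import socket

def raw_get(host, port, path="/"):
    """Minimal HTTP/1.0 GET over a raw TCP socket: the request line
    and headers are written by hand, then we read until the server
    closes the connection (Connection: close guarantees it will)."""
    request = (f"GET {path} HTTP/1.0\r\n"
               f"Host: {host}\r\n"
               "Connection: close\r\n\r\n").encode("ascii")
    with socket.create_connection((host, port), timeout=10) as sock:
        sock.sendall(request)
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # peer closed the connection
                break
            chunks.append(data)
    return b"".join(chunks)  # status line + headers + body, unparsed
```

This is the layer the HTTP libraries wrap: tiny and fast, but you get no TLS, redirects, or cookie handling until you build them yourself, which is the trade-off discussed above.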
 