Scraping all articles from a website?

Is there any tool (software/website) available that I can use to get the content (only the text) of a website?
I don't want to use WinHTTrack and download the complete website.

The tool should parse the articles and store them in one or multiple txt/doc files from the URLs given.

Additional feature: it may also be able to fetch the URLs in the article and get their data too (same domain).

I tried using Expired Article Hunter on my list of domains (not expired ones), but it is not generating any results. I think that's because it uses web.archive.org links.

Something similar I found >> https://lateral.io/docs/article-extractor
But it takes 1 URL at a time, and I need to copy the text manually. Also, it does not parse data from tables.
 
Look for UBot Studio or WinAutomation and you'll be able to create any scraper.
 
Mirror the site with wget

It's free. Then process the files on your local disk (extract content).
I don't want to mirror a website using wget/WinHTTrack or anything similar. I once wrote a program in Java in 2012 that extracted the HTML, parsed the data and saved the result as JSON. Its main purpose was to extract email addresses.
But I don't have it any more. I'm trying to find something similar, but one that can extract the article.
 
Should be a simple bot to write.
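
Something like this would do it, as a rough sketch, assuming Python with the requests and beautifulsoup4 packages; the URLs are placeholders and the paragraph/table-cell heuristic is a crude stand-in for a real article extractor:

Code:
import requests
from bs4 import BeautifulSoup  # pip install requests beautifulsoup4

# URLs to pull article text from (placeholders)
urls = [
    "https://example.com/post-1",
    "https://example.com/post-2",
]

for i, url in enumerate(urls, start=1):
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    # crude heuristic: keep paragraph and table-cell text, which also covers tables
    parts = [el.get_text(" ", strip=True) for el in soup.find_all(["p", "td", "th"])]
    text = "\n".join(p for p in parts if p)
    with open(f"article_{i}.txt", "w", encoding="utf-8") as out:
        out.write(text)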
 
As it turns out, the article extractor you linked is still very useful even if it can only extract 1 URL at a time, thanks for sharing!

I also need such a bot, hope you finally find a solution!

(I think your approach is right: maybe first get all the sitemap URLs of a website, then use "something" to parse each URL into a separate text file, and...)
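
For the sitemap step, a rough sketch of that idea, assuming Python with requests and that the site exposes a standard /sitemap.xml (many sites split it into sub-sitemaps, which this doesn't handle):

Code:
import requests
import xml.etree.ElementTree as ET

SITE = "https://example.com"  # target site (placeholder)
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# 1) Pull every <loc> entry from the sitemap
xml_text = requests.get(f"{SITE}/sitemap.xml", timeout=30).text
root = ET.fromstring(xml_text)
urls = [loc.text for loc in root.findall(".//sm:loc", NS)]
print(f"Found {len(urls)} URLs")

# 2) Feed each URL into whatever extractor you end up using,
#    writing one text file per URL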
 
If the site you mentioned gives you an okayish result, you can try to extract the data with iMacros. If you use it in a loop, you don't have to enter or copy anything: you define a list of URLs, and the macro will go through them one by one and extract the content into a .csv in an orderly manner.
 
Thanks @HoNeYBiRD and @terrycody for the inspiration...
I found another gem >> https://boilerpipe-web.appspot.com/
This is a web-based extractor. Just go to the website, enter the URL and choose Plain Text as the Output mode, if that's what you are after. Click Extract, and BOOM, you get the article in plain text. Just copy and save it.

Following HoNeYBiRD's directions, I went on to create an iMacros script to automate this process.
Requirements:
Firefox 55 or below - iMacros does not support Firefox 56 and above
iMacros plugin for Firefox
A csv file with the URLs you want to extract articles from. URLs must start from cell A1. The filename must be file.csv (or you need to change it in the iMacros script below). A quick way to generate this file is sketched right after this list.
Save the csv file at C:\Users\USERNAME\Documents\iMacros\Datasources
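
If you need to build file.csv from a plain list of URLs, a few lines of Python will do it; this is just a sketch, so replace USERNAME and the placeholder URLs with your own:

Code:
import csv

# your URLs (placeholders)
urls = [
    "https://example.com/post-1",
    "https://example.com/post-2",
]

# one URL per row, starting at cell A1, no header row
path = r"C:\Users\USERNAME\Documents\iMacros\Datasources\file.csv"
with open(path, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    for url in urls:
        writer.writerow([url])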

Here's the code:

Code:
VERSION  BUILD=9030808 RECORDER=FX
' Loops over a CSV of URLs and extracts the article text from boilerpipe-web for each one
TAB T=1   
TAB CLOSEALLOTHERS
' Specify the input file (if !COL variables are used, iMacros automatically assumes a CSV format for the input file)
' CSV = Comma Separated Values in each line of the file
SET !DATASOURCE file.csv
SET !EXTRACT_TEST_POPUP NO
'SET !DATASOURCE_COLUMNS 1
'Start at line 1 (the file has no header row; URLs begin at cell A1)
SET !LOOP 1
'Increase the current position in the file with each loop
SET !DATASOURCE_LINE {{!LOOP}}
' Fill web form   

SET !VAR1 EVAL("var randomNumber=Math.floor(Math.random()*10 + 1); randomNumber;")
URL GOTO=https://boilerpipe-web.appspot.com/
' waits 1 to 10 seconds
WAIT SECONDS={{!VAR1}}

TAG POS=1 TYPE=INPUT:TEXT FORM=NAME:extractForm ATTR=NAME:url CONTENT={{!COL1}}
TAG POS=1 TYPE=SELECT FORM=NAME:extractForm ATTR=ID:output CONTENT=%text
TAG POS=1 TYPE=INPUT:SUBMIT FORM=NAME:extractForm ATTR=*


' Grab the extracted plain text from the <pre> element and append it to aa.txt in the iMacros Downloads folder
TAG POS=1 TYPE=pre ATTR=TXT:* EXTRACT=TXT
SAVEAS TYPE=EXTRACT FOLDER=* FILE=aa.txt

 SET !VAR1 EVAL("var randomNumber=Math.floor(Math.random()*10 + 1); randomNumber;")
 ' waits 1 to 10 seconds
WAIT SECONDS={{!VAR1}}


Save the above code to > C:\Users\USERNAME\Documents\iMacros\Macros
File name > YOURFILENAME.iim

Open iMacros in Firefox, set the Max: option under Repeat Macro to the number of URLs in your csv file, then click Play (Loop).
You're done.
You'll find your output file here > C:\Users\USERNAME\Documents\iMacros\Downloads

NOTE:
If you get a timeout problem, you can increase the wait seconds.
Currently, I'm getting Error: InsufficientQuotaException. Will try to figure out a solution for this. I think it can be solved by using proxies in Firefox.
The file locations above contain the word USERNAME; don't forget to replace it with your computer's user name.

Please let me know how this can be improved.. Thanks :)
 
No idea how to use it lol
It's not that hard to use. I just started using iMacros like 2 hours ago and built this.
Textise is good, but the only problem is that it includes many unnecessary things, including comments and footer text. But if you want those, then yes, it is good.
 
This was solved by using a free elite SSL proxy in Firefox.
 
Either use proxies or increase the delays (wait seconds) to the point where you don't get the error anymore. I see you're using random delays, which is good, but you might want to increase the minimum to avoid the rate-limit issue.
Free proxies have a downside: they aren't reliable. One moment they work, the next they don't, so when you try to scrape the articles it can happen that the proxy won't work and you get an empty result for that loop. Moreover, if you don't set errorignore to yes, the macro will stop because of a timeout. If it's not a time-sensitive job, I'd simply increase the delays or use some cheaper rotating proxies.
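
If this ever gets ported from iMacros to a script, the same idea (random delays plus tolerating failed requests) carries over; a minimal Python sketch under that assumption, with placeholder URLs and example delay values:

Code:
import random
import time
import requests

urls = ["https://example.com/post-1", "https://example.com/post-2"]  # placeholders

for url in urls:
    # random delay; raise the lower bound if you keep hitting rate limits
    time.sleep(random.uniform(3, 10))
    try:
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()
    except requests.RequestException as exc:
        # the script equivalent of errorignore: log the failure and keep looping
        print(f"Skipping {url}: {exc}")
        continue
    # ... extract and save resp.text here ...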
 
I can build a bot for this purpose, so give me the website and I will show you how.
 
Thanks for the input. The free proxies work fine for around 50 URLs, which is great.
And yes, I will set errorignore to yes. I was trying to find something like that. :)
 