Scraping all articles from a website?

Is there any tool (software/website) available that I can use to get the content (only the text) of a website?
I don't want to use WinHTTrack and download the complete website.

The tool should parse the articles and store them in one or multiple txt/doc files from the URLs given.

Additional feature: it may also be able to fetch the URLs in the article and get their data too (same domain).

I tried using Expired Article Hunter on my list of domains (not expired ones), but it is not generating any results. I think that's because it uses web.archive.org links.

Something similar I found >> https://lateral.io/docs/article-extractor
But it takes 1 URL at a time, and I need to copy the text manually. Also, it does not parse data from tables.
 
Look for UBot Studio or WinAutomation and you'll be able to create any scraper.
 
Mirror the site with wget

It's free. Then process the files on your local disk (extract content).
I don't want to mirror a website using wget/WinHTTrack or anything similar. I once wrote a program in Java in 2012 that extracted the HTML, parsed the data and saved the result as JSON. Its main purpose was to extract email addresses.
But I don't have it any more. I'm trying to find something similar, but one that can extract the article.
 
Should be a simple bot to write.
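
Something like this would do it, as a rough sketch, assuming Python with the requests and beautifulsoup4 packages; the URLs are placeholders and the paragraph/table-cell heuristic is a crude stand-in for a real article extractor:

Code:
import requests
from bs4 import BeautifulSoup  # pip install requests beautifulsoup4

# URLs to pull article text from (placeholders)
urls = [
    "https://example.com/post-1",
    "https://example.com/post-2",
]

for i, url in enumerate(urls, start=1):
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    # crude heuristic: keep paragraph and table-cell text, which also covers tables
    parts = [el.get_text(" ", strip=True) for el in soup.find_all(["p", "td", "th"])]
    text = "\n".join(p for p in parts if p)
    with open(f"article_{i}.txt", "w", encoding="utf-8") as out:
        out.write(text)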
 
As it turns out, the article extractor you linked is still very useful even if it can only extract 1 URL at a time, thanks for sharing!

I also need such a bot, hope you finally find a solution!

(I think your approach is right: maybe first get all the sitemap URLs of a website, then use "something" to parse each URL into a separate text file, and...)
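
For the sitemap step, a rough sketch of that idea, assuming Python with requests and that the site exposes a standard /sitemap.xml (many sites split it into sub-sitemaps, which this doesn't handle):

Code:
import requests
import xml.etree.ElementTree as ET

SITE = "https://example.com"  # target site (placeholder)
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# 1) Pull every <loc> entry from the sitemap
xml_text = requests.get(f"{SITE}/sitemap.xml", timeout=30).text
root = ET.fromstring(xml_text)
urls = [loc.text for loc in root.findall(".//sm:loc", NS)]
print(f"Found {len(urls)} URLs")

# 2) Feed each URL into whatever extractor you end up using,
#    writing one text file per URL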
 
If the site you mentioned gives you an okayish result, you can try to extract the data with iMacros. If you use it in a loop, you don't have to enter or copy anything: you define a list of URLs, and the macro will go through them one by one and extract the content into a .csv in an orderly manner.
 
Thanks @HoNeYBiRD and @terrycody for the inspiration...
I found another gem >> https://boilerpipe-web.appspot.com/
This is a web-based extractor. Just go to the website, enter the URL and choose Plain Text as the Output mode, if that's what you are after. Click Extract, and BOOM, you get the article in plain text. Just copy and save it.

Following HoNeYBiRD's directions, I went on to create an iMacros script to automate this process.
Requirements:
Firefox 55 or below - iMacros does not support Firefox 56 and above
iMacros plugin for Firefox
A csv file with the URLs you want to extract articles from. URLs must start from cell A1. The filename must be file.csv (or you need to change it in the iMacros script below). A quick way to generate this file is sketched right after this list.
Save the csv file at C:\Users\USERNAME\Documents\iMacros\Datasources
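
If you need to build file.csv from a plain list of URLs, a few lines of Python will do it; this is just a sketch, so replace USERNAME and the placeholder URLs with your own:

Code:
import csv

# your URLs (placeholders)
urls = [
    "https://example.com/post-1",
    "https://example.com/post-2",
]

# one URL per row, starting at cell A1, no header row
path = r"C:\Users\USERNAME\Documents\iMacros\Datasources\file.csv"
with open(path, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    for url in urls:
        writer.writerow([url])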

Here's the code:

Code:
VERSION  BUILD=9030808 RECORDER=FX
' Loops over a CSV of URLs and extracts the article text from boilerpipe-web for each one
TAB T=1   
TAB CLOSEALLOTHERS
' Specify the input file (if !COL variables are used, iMacros automatically assumes a CSV format for the input file)
' CSV = Comma Separated Values in each line of the file
SET !DATASOURCE file.csv
SET !EXTRACT_TEST_POPUP NO
'SET !DATASOURCE_COLUMNS 1
'Start at line 1 (the file has no header row; URLs begin at cell A1)
SET !LOOP 1
'Increase the current position in the file with each loop
SET !DATASOURCE_LINE {{!LOOP}}
' Fill web form   

SET !VAR1 EVAL("var randomNumber=Math.floor(Math.random()*10 + 1); randomNumber;")
URL GOTO=https://boilerpipe-web.appspot.com/
' waits 1 to 10 seconds
WAIT SECONDS={{!VAR1}}

TAG POS=1 TYPE=INPUT:TEXT FORM=NAME:extractForm ATTR=NAME:url CONTENT={{!COL1}}
TAG POS=1 TYPE=SELECT FORM=NAME:extractForm ATTR=ID:output CONTENT=%text
TAG POS=1 TYPE=INPUT:SUBMIT FORM=NAME:extractForm ATTR=*


' Grab the extracted plain text from the <pre> element and append it to aa.txt in the iMacros Downloads folder
TAG POS=1 TYPE=pre ATTR=TXT:* EXTRACT=TXT
SAVEAS TYPE=EXTRACT FOLDER=* FILE=aa.txt

 SET !VAR1 EVAL("var randomNumber=Math.floor(Math.random()*10 + 1); randomNumber;")
 ' waits 1 to 10 seconds
WAIT SECONDS={{!VAR1}}


Save the above code to > C:\Users\USERNAME\Documents\iMacros\Macros
File name > YOURFILENAME.iim

Open iMacros in Firefox, set the Max: option under Repeat Macro to the number of URLs in your csv file, then click Play (Loop).
You're done.
You'll find your output file here > C:\Users\USERNAME\Documents\iMacros\Downloads

NOTE:
If you get a timeout problem, you can increase the wait seconds.
Currently, I'm getting Error: InsufficientQuotaException. Will try to figure out a solution for this. I think it can be solved by using proxies in Firefox.
The file locations above contain the word USERNAME; don't forget to replace it with your computer's user name.

Please let me know how this can be improved.. Thanks :)
 
No idea how to use it lol
It's not that hard to use. I just started using iMacros like 2 hours ago and built this.
Textise is good, but the only problem is that it includes many unnecessary things, including comments and footer text. But if you want those, then yes, it is good.
 
This was solved by using a free elite SSL proxy in Firefox.
 
Either use proxies or increase the delays (wait seconds) to the point where you don't get the error anymore. I see you're using random delays, which is good, but you might want to increase the minimum to avoid the rate-limit issue.
Free proxies have a downside: they aren't reliable. One moment they work, the next they don't, so when you try to scrape the articles it can happen that the proxy won't work and you get an empty result for that loop. Moreover, if you don't set errorignore to yes, the macro will stop because of a timeout. If it's not a time-sensitive job, I'd simply increase the delays or use some cheaper rotating proxies.
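
If this ever gets ported from iMacros to a script, the same idea (random delays plus tolerating failed requests) carries over; a minimal Python sketch under that assumption, with placeholder URLs and example delay values:

Code:
import random
import time
import requests

urls = ["https://example.com/post-1", "https://example.com/post-2"]  # placeholders

for url in urls:
    # random delay; raise the lower bound if you keep hitting rate limits
    time.sleep(random.uniform(3, 10))
    try:
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()
    except requests.RequestException as exc:
        # the script equivalent of errorignore: log the failure and keep looping
        print(f"Skipping {url}: {exc}")
        continue
    # ... extract and save resp.text here ...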
 
I can build a bot for this purpose, so give me the website and I will show you how.
 
Thanks for the input. The free proxies work fine for around 50 URLs, which is great.
And yes, I will set errorignore to yes. I was trying to find something like that. :)
 