ScrapeBox Article Scraper [Markers]

Roger Stoned

Newbie
Joined
May 9, 2020
Messages
2
Reaction score
0
I own the Mac version of Scrapebox (SB) and am thinking about buying another license for my VPS. I am wondering if SB+Article Scraper Plugin can do this:

Goal:
Import a list of mostly WordPress URLs and scrape all the [Posts].

Problems:

1. Markers

Can I use the markers <title>, <h1> ... <p>?
Are those specific enough to extract the text from the articles? In the tutorial videos @loopline shows how to find markers for a directory site but I am looking for a way to scrape articles from lots of smaller sites. So spending time finding markers for a blog with 10 articles isn't a wise use of my time.

2. Only Scrape Posts

If I put in a root URL and ask SB to scrape every <title>, <h1> and <p>, ScrapeBox cannot differentiate between pages and posts. Scraping only pages with 'article' or 'blog' in the URL would work, but that would skip articles that use different permalinks. I might be asking too much of the software with this one, but is there a way I can do this (other than just collecting the exact URL for each blog)?


PS: Loopline, if you are reading this you should consider replacing the old video on the article scraping plug-in page. In the old video from 2013 you could only scrape 4 directory sites but since then it's been updated (and you already have the video showing off the fancy new features on the updated plugin).
 

arpitagarwal82

Jr. VIP
Joined
Feb 20, 2008
Messages
969
Reaction score
705
I haven't checked this practically yet, but after reading your question, this would be my approach.
1) Scrape URLs of WordPress blogs.
2) Now, for each examplesite.com, generate post URLs like examplesite.com/?p=1, examplesite.com/?p=2, ... , examplesite.com/?p=1000
3) Bulk check the HTTP status of all generated links.
4) Remove the 404s.
5) Now you have a list of all post links from all the scraped WordPress blogs.
Do whatever you like with this list.
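
For anyone who wants to try steps 2 to 4 outside ScrapeBox, here is a rough Python sketch. It assumes the default WordPress query form `/?p=N`. The function names (`candidate_urls`, `http_status`, `keep_live`), the `https://examplesite.com` base and the ID ceiling are placeholders for illustration, not something ScrapeBox provides.

```python
from typing import Callable, Iterable, List
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen


def candidate_urls(site: str, max_id: int) -> List[str]:
    """Step 2: build site/?p=1 .. site/?p=max_id for one site."""
    base = site.rstrip("/")
    return [f"{base}/?p={n}" for n in range(1, max_id + 1)]


def http_status(url: str, timeout: float = 10.0) -> int:
    """Step 3: return the final HTTP status for url, or 0 if the request fails."""
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    try:
        # urlopen follows redirects, so a ?p=N that redirects to a
        # pretty permalink still reports the status of the final page.
        with urlopen(req, timeout=timeout) as resp:
            return resp.status
    except HTTPError as err:
        return err.code  # e.g. 404 for an ID that does not exist
    except URLError:
        return 0  # DNS failure, refused connection, timeout, ...


def keep_live(urls: Iterable[str],
              status_of: Callable[[str], int] = http_status) -> List[str]:
    """Step 4: drop 404s (and anything else that is not a plain 200)."""
    return [u for u in urls if status_of(u) == 200]


# Hypothetical usage (makes real requests, so go easy on max_id):
# live = keep_live(candidate_urls("https://examplesite.com", 1000))
```

Passing the status check in as `status_of` is just a convenience so the filter can be tried with canned results before hitting real sites. A live `?p=N` URL only shows that some content exists at that ID, so the resulting list still needs a posts-only filter if pages must be excluded.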

If you fail to find a solution, please let me know. I will try to run a small project and work out a procedure to do what you want.

EDIT: Didn't notice this was your post... Welcome to BHW
 

arpitagarwal82

Jr. VIP
Joined
Feb 20, 2008
Messages
969
Reaction score
705
Sitemap scraper addon + class="entry-content"
This is a better approach.

I don't know where my mind was. The p= permalink structure doesn't differentiate between posts and pages.
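
To make the sitemap + class="entry-content" idea concrete, here is a minimal Python sketch of the same two steps done by hand: pull the `<loc>` URLs out of a sitemap, then keep only the text inside the element whose class list contains `entry-content`. Everything here is an assumption for illustration: the helper names are made up, the sitemap file-name check only covers two common WordPress naming schemes (`post-sitemap*.xml` and `wp-sitemap-posts-post-*.xml`), not every theme wraps posts in `entry-content`, and the depth counting assumes reasonably well-formed HTML with closed tags.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from typing import List

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
BLOCK_TAGS = {"p", "div", "li", "blockquote",
              "h1", "h2", "h3", "h4", "h5", "h6"}


def sitemap_locs(xml_text: str) -> List[str]:
    """Return every <loc> URL from a sitemap or a sitemap index."""
    root = ET.fromstring(xml_text)
    return [el.text.strip() for el in root.iter(SITEMAP_NS + "loc") if el.text]


def is_post_sitemap(url: str) -> bool:
    """Heuristic only: file names commonly used for posts-only sitemaps."""
    name = url.rsplit("/", 1)[-1].lower()
    return name.startswith(("post-sitemap", "wp-sitemap-posts-post"))


class EntryContentParser(HTMLParser):
    """Collect text that sits inside the first-level 'marker' element."""

    def __init__(self, marker: str = "entry-content") -> None:
        super().__init__(convert_charrefs=True)
        self.marker = marker
        self.depth = 0          # > 0 while inside the marked element
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:    # void tags never get a closing tag
            return
        if self.depth:
            self.depth += 1
        elif self.marker in (dict(attrs).get("class") or "").split():
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        if tag in BLOCK_TAGS:   # keep paragraphs on separate lines
            self.chunks.append("\n")
        self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def entry_content_text(html: str, marker: str = "entry-content") -> str:
    """Return the marked element's text, one block per line."""
    parser = EntryContentParser(marker)
    parser.feed(html)
    parser.close()
    lines = (" ".join(l.split()) for l in "".join(parser.chunks).splitlines())
    return "\n".join(l for l in lines if l)
```

Filtering the sitemap index with something like `is_post_sitemap` before collecting URLs is one way to get posts without pages, but only on sites that split their sitemaps that way.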
 

loopline

Jr. VIP
Joined
Jan 25, 2009
Messages
5,755
Reaction score
3,227
Website
contactformmarketing.com
The thing I have found is that even though there are HTML standards, all sorts of different sites do it differently. So basically you can try this, but results may vary if it's on lots of little sites.
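
To show what "results may vary" looks like with generic markers, here is a small Python sketch (not how the Article Scraper plugin works internally, just the same idea by hand). It grabs whatever sits between `<title>`, `<h1>` and `<p>` tags. The class and function names are made up for the example.

```python
from html.parser import HTMLParser
from typing import Dict, List


class GenericMarkerParser(HTMLParser):
    """Collect the text found between <title>, <h1> and <p> tags."""

    MARKERS = ("title", "h1", "p")

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.found: Dict[str, List[str]] = {m: [] for m in self.MARKERS}
        self._open = None       # marker tag we are currently inside
        self._buf: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.MARKERS:
            self.flush()        # covers a previous <p> that was never closed
            self._open = tag

    def handle_endtag(self, tag):
        if tag == self._open:
            self.flush()

    def handle_data(self, data):
        if self._open:
            self._buf.append(data)

    def flush(self) -> None:
        """Store the buffered text under the open marker, if any."""
        if self._open:
            text = " ".join("".join(self._buf).split())
            if text:
                self.found[self._open].append(text)
        self._open = None
        self._buf = []


def scrape_generic(html: str) -> Dict[str, List[str]]:
    parser = GenericMarkerParser()
    parser.feed(html)
    parser.close()
    parser.flush()
    return parser.found
```

On a page that has a newsletter widget or footer built out of `<p>` tags, the `p` list comes back with that text mixed in with the article, which is why a site-specific marker is usually cleaner when you can afford to find one.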

#2 - Just using the best markers you can find is best if there are plenty of articles to be had. If you need total accuracy, then you will have to look at the site structure of any sites that do not do it in a common way, and figure out how to collect those URLs on those sites specifically.

Thanks, but I don't work for ScrapeBox. I know they are working on the website anyway, so I'm sure they will replace it, or I will record an entirely new video, or they will just put up both, or I don't know, lol.
 

dandan594594

Senior Member
Joined
Jan 31, 2013
Messages
1,088
Reaction score
703
Do you think I am going to have a good time using those markers to extract the article content?
Yes, this parameter will scrape the post content, but it will also pull in things like data tables, code and so on. It's pretty easy to clean up that kind of thing, and to weed out pages, afterwards.
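
As a rough idea of what that clean-up could look like if the scraped content is still HTML, here is a minimal Python sketch that throws away everything inside `<table>`, `<pre>`, `<code>`, `<script>` and `<style>` and keeps the rest as plain text. The tag list and the names `CleanTextParser` and `clean_text` are my own assumptions, so adjust them to whatever junk actually shows up in your scrape.

```python
from html.parser import HTMLParser
from typing import List


class CleanTextParser(HTMLParser):
    """Keep text, but skip anything nested inside the SKIP tags."""

    SKIP = {"table", "pre", "code", "script", "style"}
    # Tags after which a space is inserted so blocks do not run together.
    BREAKS = {"p", "div", "li", "table", "pre",
              "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0     # > 0 while inside a skipped element
        self.parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1
        if tag in self.BREAKS and not self.skip_depth:
            self.parts.append(" ")

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def clean_text(html: str) -> str:
    """Return the HTML's text with tables and code blocks removed."""
    parser = CleanTextParser()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())
```

For example, `clean_text('<p>Intro.</p><table><tr><td>1</td></tr></table><p>Outro.</p>')` should leave just the two sentences.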
 