  #1 (permalink)  
Old 07-06-2009, 04:47 PM
MichiganManiac's Avatar
Junior Member
 
Join Date: Feb 2009
Posts: 190
Thanks: 62
Thanked 164 Times in 45 Posts
Reputation: 16
iTrader: (1)
Need an URL Extractor for RSS Feeds

I'm looking for a tool that lets me plug in an RSS feed and extracts all the URLs from it.

The idea is to gather links from blogs that I can then plug into Parameter. The built-in URL extractor in Parameter pulls everything, and most of the time about two thirds of that ends up being "comments", "tags", or other junk URLs that are not content and are not commentable.

All of the actual posts in the RSS feed, however, ARE commentable. So it would make sense to run data only on those URLs and leave everything else out.
  #2 (permalink)  
Old 07-06-2009, 07:21 PM
Rick4691's Avatar
Registered Member
 
Join Date: Feb 2008
Location: Oceania
Posts: 70
Thanks: 55
Thanked 30 Times in 24 Posts
Reputation: 10
iTrader: (0)
Re: Need an URL Extractor for RSS Feeds

Put your feeds into a plain text file called rss_list.txt (one feed URL per line), save the script below as something like "url_extractor.sh", put it into cron, and voila! Your new URLs will end up in a file called "new_urls.lst".

Code:
#!/bin/sh
# URL Extractor - extracts URLs from each of the feeds listed in rss_list.txt

rm -f urls.lst
touch urls.lst
# comm fails if master_urls.lst does not exist yet (first run)
touch master_urls.lst

# One feed URL per line; quote the variable so odd characters survive
while read -r CURRENT_FEED
do
  curl -s "$CURRENT_FEED" | grep 'a href' | \
  cut -f 2 -d \" >> urls.lst
done < rss_list.txt

# comm needs the whole file sorted, not just each feed's chunk
sort -u urls.lst > temp_urls.lst
mv temp_urls.lst urls.lst

# Make sure we only process new URLs. Already processed URLs should
# be listed in master_urls.lst; new ones will go to new_urls.lst
comm -13 master_urls.lst urls.lst > new_urls.lst

# Remember to append new_urls.lst to master_urls.lst for next time
cat new_urls.lst >> master_urls.lst

# For the comm command to work, the contents of both files involved
# must be sorted and have no duplicate entries
sort -u master_urls.lst > temp_urls.lst
mv temp_urls.lst master_urls.lst

exit 0
You'll probably want to add some sort of filter to the loop to get rid of unwanted URLs. Also examine your feeds' source and adapt the script to fit; depending on how your feeds are formatted, some other tweaking might be required.
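
As a rough sketch of such a filter: the function below (the name filter_urls and every pattern in it are assumptions for illustration, not part of the original script) keeps only http/https URLs and drops ones that look like tag, category, comment, or feed pages. Check the patterns against your own feeds before relying on them.

```shell
#!/bin/sh
# filter_urls: read URLs on stdin, print only the ones that look like posts.
# Hypothetical helper; the "junk" patterns are guesses at common blog layouts.
filter_urls() {
  # Keep only real web URLs (drops mailto:, javascript:, relative paths)
  grep -E '^https?://' | \
  # Drop tag/category/comment/feed pages and comment anchors
  grep -v -E '/(tag|tags|category|comments|feed)(/|$)|#comments?$|[?&]replytocom=' | \
  sort -u
}

# Example run on a few made-up URLs
printf '%s\n' \
  'http://example.com/2009/07/some-post/' \
  'http://example.com/2009/07/some-post/#comments' \
  'http://example.com/tag/seo/' \
  'http://example.com/comments/feed/' \
  'mailto:someone@example.com' | filter_urls
```

Of the five sample lines, only http://example.com/2009/07/some-post/ comes out the other end. In the script, the function could sit in the pipeline right after the cut command.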

See this post for a more complete version of the script, but geared toward a different purpose (without iterating through the list of RSS feeds):

http://www.blackhatworld.com/blackha...tml#post891064
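
One possible adaptation, in case your feeds don't contain 'a href' at all: plain RSS 2.0 feeds usually carry each post URL inside a <link> element. The sketch below assumes that layout; the function name extract_links is made up, and it will not handle Atom feeds (which use <link href="..."/>) or links wrapped in CDATA.

```shell
#!/bin/sh
# extract_links: print the text of every <link>...</link> element on stdin.
# Hypothetical helper, assuming RSS 2.0 style feeds.
extract_links() {
  # Break the input at every '<' so each tag starts its own line, then keep
  # the run of non-space characters that directly follows "link>".
  tr '<' '\n' | sed -n 's/^link>\([^[:space:]]\{1,\}\).*$/\1/p'
}

# Example run on a tiny made-up feed
printf '%s\n' \
  '<rss><channel><link>http://example.com/</link>' \
  '<item><title>Post</title><link>http://example.com/2009/07/some-post/</link></item>' \
  '</channel></rss>' | extract_links
```

This prints http://example.com/ followed by http://example.com/2009/07/some-post/. Note that the channel-level <link> (the blog's home page) comes through as well, so you may still want to filter the result. In the loop, extract_links would take the place of the grep 'a href' and cut commands.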
__________________
Code:
signature = "Insert smart-assed observation here.";
System.out.println(signature);
  #3 (permalink)  
Old 07-06-2009, 08:11 PM
MichiganManiac's Avatar
Junior Member
 
Join Date: Feb 2009
Posts: 190
Thanks: 62
Thanked 164 Times in 45 Posts
Reputation: 16
iTrader: (1)
Re: Need an URL Extractor for RSS Feeds

So this is going to sound like a noob question...but what do you mean by "put it in cron"?

Is there software called "Cron"?
  #4 (permalink)  
Old 07-06-2009, 08:13 PM
Superlative Stuff(ing)
 
Join Date: Apr 2009
Posts: 301
Thanks: 202
Thanked 68 Times in 37 Posts
Reputation: 12
iTrader: (0)
Re: Need an URL Extractor for RSS Feeds

Quote:
Originally Posted by MichiganManiac View Post
So this is going to sound like a noob question...but what do you mean by "put it in cron"?

Is there software called "Cron"?
Google "cron jobs". Basically, cron tells the server to run the script automatically on a schedule (x times a day, month, year, etc.).
  #5 (permalink)  
Old 07-06-2009, 08:27 PM
Rick4691's Avatar
Registered Member
 
Join Date: Feb 2008
Location: Oceania
Posts: 70
Thanks: 55
Thanked 30 Times in 24 Posts
Reputation: 10
iTrader: (0)
Re: Need an URL Extractor for RSS Feeds

If you're working from the command line, enter "man cron" and it will give you more information than you need.

When you're done reading that, enter "man crontab" for more information.

Be patient, it's worth knowing.

"Cron" is the Unix/Linux scheduler. It allows you to run programs at set times --- every five minutes, once a day, only on February 29, etc.

You put something into cron by entering "crontab -e" on the command-line and then entering something like this:

Code:
# Execute url_extractor.sh
25 6 * * * /usr/bin/url_extractor.sh
The space-delimited fields are:
  • minute
  • hour
  • day-of-the-month
  • month
  • day-of-the-week
  • shell command (the script we want to run)

Stars in the time fields mean "every".

So, in this case I've set the script to run every day at 6:25 AM.
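
To see how a crontab line breaks down into those fields, here is a small sketch. The function describe_cron is made up for illustration and is not a cron tool; it just splits a line on whitespace and labels the pieces, and it assumes the line has at least five time fields.

```shell
#!/bin/sh
# describe_cron: label the fields of a single crontab line.
# Hypothetical helper for illustration only.
describe_cron() {
  # set -f stops the shell from expanding the * fields as filename globs
  set -f
  # Deliberately unquoted so the line is split on whitespace
  set -- $1
  set +f
  minute=$1 hour=$2 dom=$3 month=$4 dow=$5
  # Whatever remains after the five time fields is the command
  shift 5
  printf 'minute=%s hour=%s day-of-month=%s month=%s day-of-week=%s command=%s\n' \
    "$minute" "$hour" "$dom" "$month" "$dow" "$*"
}

describe_cron '25 6 * * * /usr/bin/url_extractor.sh'
```

This prints: minute=25 hour=6 day-of-month=* month=* day-of-week=* command=/usr/bin/url_extractor.sh. By the same layout, "0 0 29 2 *" would mean midnight on February 29, and in common cron implementations "*/5 * * * *" means every five minutes.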

---

I know that for people working through Fantastico, there is a supposedly easier graphic interface for working with cron (you'll have to look for it if you want to use it...), but I'm half Dutch so I like to do things the hard way.
Copyright © 2005 - 2012 BlackHatWorld.com All rights reserved.