  1. #1
    MichiganManiac's Avatar
    MichiganManiac is offline Junior Member
    Join Date
    Feb 2009
    Posts
    195
    Reputation
    16
    Thanks
    63
    Thanked 167 Times in 45 Posts

    Default Need an URL Extractor for RSS Feeds

    I'm looking for a tool that lets me plug in an RSS feed and extracts all the URLs from it.

    The idea is to gather up links from blogs that I can then plug into Parameter. The built-in URL extractor in Parameter pulls everything, and most of the time 2/3 of that ends up being "comments", "tags", or other junk URLs that are not content and are not commentable.

    All of the actual posts in the RSS feed, however, ARE commentable. So it would make sense to process just those URLs and leave everything else out.

  2. #2
    Rick4691's Avatar
    Rick4691 is offline Registered Member
    Join Date
    Feb 2008
    Location
    Oceania
    Posts
    70
    Reputation
    10
    Thanks
    55
    Thanked 30 Times in 24 Posts

    Default Re: Need an URL Extractor for RSS Feeds

    Put your feeds into a plain text file called rss_list.txt, put the script below (call it something like "url_extractor.sh") into cron and voila! Your URLs will end up in a file called "new_urls.lst".

    Code:
    #!/bin/sh
    # URL Extractor - extracts URLs from each of the feeds listed in rss_list.txt
    
    # Start with an empty urls.lst, and make sure master_urls.lst
    # exists so comm has something to read on the first run
    rm -f urls.lst
    touch urls.lst master_urls.lst
    
    while read -r CURRENT_FEED
    do
      # Quote the feed URL so characters like ? and * in it
      # are not expanded by the shell
      curl "$CURRENT_FEED" 2>/dev/null | grep 'a href' | \
      cut -f 2 -d \" >> urls.lst
    done < rss_list.txt
    
    # For the comm command to work, the contents of both 
    # files involved must be sorted and have no duplicate
    # entries, so sort the combined list once, after the loop
    sort -u urls.lst > temp_urls.lst
    mv temp_urls.lst urls.lst
    
    # Make sure we only process new URLs --- already 
    # processed URLs should be listed in master_urls.lst;
    # new ones will go to new_urls.lst
    comm -13 master_urls.lst urls.lst > new_urls.lst
    
    # Remember to append the new_urls.lst to the 
    # master_urls.lst for next time, keeping it sorted
    cat new_urls.lst >> master_urls.lst
    sort -u master_urls.lst > temp_urls.lst
    mv temp_urls.lst master_urls.lst
    
    exit 0
    You'll probably want to add some sort of filter to the loop in order to get rid of unwanted URLs. And examine your feeds' source in order to adapt the script to fit --- depending on how your feeds are formatted there might be some other tweaking required.
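
    As a rough sketch of the kind of filter and feed-specific tweak mentioned above: the two helper functions below are hypothetical (the names extract_links and filter_urls are not from the original script). The first assumes an RSS 2.0 feed where each <link>...</link> element sits on a single line, which is common but not guaranteed. The second drops URLs that look like comment, tag, category, or feed pages; the patterns are guesses and will need adjusting per blog.

```shell
#!/bin/sh
# Sketch only: helper names and URL patterns are assumptions, not part of
# the original script.

extract_links() {
  # Print the text between <link> and </link> on each input line.
  # Assumes one <link> element per line; the channel-level <link>
  # (usually the blog's home page) is printed as well.
  sed -n 's:.*<link>\(.*\)</link>.*:\1:p'
}

filter_urls() {
  # Drop URLs that look like comment, tag, category, or feed pages.
  grep -v -E '/(comments?|tags?|category)/|#comments?|/feed/?$'
}
```

    Inside the loop, these could stand in for the grep/cut stage, for example: curl "$CURRENT_FEED" 2>/dev/null | extract_links | filter_urls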

    See this post for a more complete version of the script, but geared toward a different purpose (without iterating through the list of RSS feeds):

    http://www.blackhatworld.com/blackha...tml#post891064
    Code:
    signature = "Insert smart-assed observation here.";
    System.out.println(signature);

  3. #3
    MichiganManiac's Avatar
    MichiganManiac is offline Junior Member
    Join Date
    Feb 2009
    Posts
    195
    Reputation
    16
    Thanks
    63
    Thanked 167 Times in 45 Posts

    Default Re: Need an URL Extractor for RSS Feeds

    So this is going to sound like a noob question...but what do you mean by "put it in cron"?

    Is there software called "Cron"?

  4. #4
    shadowpwner is offline Superlative Stuff(ing)
    Join Date
    Apr 2009
    Posts
    301
    Reputation
    12
    Thanks
    202
    Thanked 69 Times in 37 Posts

    Default Re: Need an URL Extractor for RSS Feeds

    Quote Originally Posted by MichiganManiac View Post
    So this is going to sound like a noob question...but what do you mean by "put it in cron"?

    Is there software called "Cron"?
    Google "cron jobs". Basically, cron tells the server to run the script automatically on a schedule (x times a day, month, year, etc.).

  5. #5
    Rick4691's Avatar
    Rick4691 is offline Registered Member
    Join Date
    Feb 2008
    Location
    Oceania
    Posts
    70
    Reputation
    10
    Thanks
    55
    Thanked 30 Times in 24 Posts

    Default Re: Need an URL Extractor for RSS Feeds

    If you're working from the command line, enter "man cron" and it will give you more information than you need.

    When you're done reading that, enter "man crontab" for more information.

    Be patient, it's worth knowing.

    "Cron" is the Unix/Linux scheduler. It allows you to run programs at set times --- every five minutes, once a day, only on February 29, etc.

    You put something into cron by entering "crontab -e" at the command line and then entering something like this:

    Code:
    # Execute url_extractor.sh
    25 6 * * * /usr/bin/url_extractor.sh
    The space-delimited fields are:
    • minute
    • hour
    • day-of-the-month
    • month
    • day-of-the-week
    • shell command (the script we want to run)


    Stars in the time fields mean "every".

    So, in this case I've set the script to run every day at 6:25 AM.
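
    To make the field order concrete, here is a small sketch that assembles crontab lines from the six fields. The cron_line function and the WORK_DIR path are made up for illustration. The "cd" is there because cron typically starts jobs in your home directory, while the script reads and writes its files by relative name, so it needs to run from wherever rss_list.txt lives.

```shell
#!/bin/sh
# Sketch only: cron_line and WORK_DIR are hypothetical, for illustration.
WORK_DIR="/home/you/feeds"   # assumed location of rss_list.txt

cron_line() {
  # $1=minute $2=hour $3=day-of-the-month $4=month $5=day-of-the-week
  # $6=shell command
  printf '%s %s %s %s %s %s\n' "$1" "$2" "$3" "$4" "$5" "$6"
}

# Every day at 6:25 AM, changing into the working directory first
cron_line 25 6 '*' '*' '*' "cd $WORK_DIR && /usr/bin/url_extractor.sh"

# Only on February 29, at midnight
cron_line 0 0 29 2 '*' "cd $WORK_DIR && /usr/bin/url_extractor.sh"
```

    The printed lines are what you would paste in after "crontab -e". The stars are quoted here only so the shell does not expand them as file names.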

    ---

    I know that for people working through Fantastico, there is a supposedly easier graphic interface for working with cron (you'll have to look for it if you want to use it...), but I'm half Dutch so I like to do things the hard way.
