
Need a URL Extractor for RSS Feeds

Discussion in 'Black Hat SEO Tools' started by MichiganManiac, Jul 6, 2009.

  1. MichiganManiac

    MichiganManiac Regular Member

    Joined:
    Feb 2, 2009
    Messages:
    204
    Likes Received:
    168
    I'm looking for a tool that will allow me to plug in the RSS feed and it will extract all the URLs from it.

    The idea is to gather up links from blogs that I can then plug into Parameter. The built-in URL extractor in Parameter pulls everything, and most of the time two-thirds of it ends up being "comments", "tags", or other junk URLs that are not content and are not commentable.

    All of the actual posts in the RSS feed, however, ARE commentable. So it would make sense to run the data on just those URLs and leave everything else out.
     
  2. Rick4691

    Rick4691 Registered Member Premium Member

    Joined:
    Feb 19, 2008
    Messages:
    70
    Likes Received:
    30
    Occupation:
    Programmer
    Location:
    Oceania
    Put your feeds into a plain text file called rss_list.txt, put the script below (call it something like "url_extractor.sh") into cron and voila! Your URLs will end up in a file called "new_urls.lst".

    Code:
    #!/bin/sh
    # URL Extractor - extracts URLs from each of the feeds listed in rss_list.txt
    
    rm -f urls.lst
    touch urls.lst
    
    # master_urls.lst has to exist before comm runs below,
    # so create it if this is the first run
    touch master_urls.lst
    
    while read -r CURRENT_FEED
    do
      curl "$CURRENT_FEED" 2>/dev/null | grep 'a href' | \
      cut -f 2 -d \" >> urls.lst
    done < rss_list.txt
    
    # For the comm command to work, the contents of both
    # files involved must be sorted and have no duplicate
    # entries, so dedupe the whole list in one pass
    sort -u urls.lst > temp_urls.lst
    mv temp_urls.lst urls.lst
    
    # Make sure we only process new URLs --- already
    # processed URLs should be listed in master_urls.lst;
    # new ones will go to new_urls.lst
    comm -13 master_urls.lst urls.lst > new_urls.lst
    
    # Remember to append the new_urls.lst to the
    # master_urls.lst for next time
    cat new_urls.lst >> master_urls.lst
    sort -u master_urls.lst > temp_urls.lst
    mv temp_urls.lst master_urls.lst
    
    exit 0
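    
    In case the comm flags look cryptic: comm compares two sorted files and prints three columns. Column 1 is lines only in the first file, column 2 is lines only in the second, and column 3 is lines in both. The -13 suppresses columns 1 and 3, leaving exactly the lines that are new. A quick illustration (the file names here are made up just for the demo):
    
    Code:
    $ printf 'a\nb\n' > old.lst
    $ printf 'b\nc\n' > new.lst
    $ comm -13 old.lst new.lst
    c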
    
    You'll probably want to add some sort of filter to the loop to get rid of unwanted URLs (see the sketch below). Also examine your feeds' source and adapt the script to fit; depending on how your feeds are formatted, some other tweaking may be required.
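    For example, a crude blacklist spliced into the pipeline might look like this. The patterns are only guesses based on typical WordPress-style junk URLs; swap in whatever garbage your feeds actually produce:
    
    Code:
    curl "$CURRENT_FEED" 2>/dev/null | grep 'a href' | \
    cut -f 2 -d \" | \
    grep -v -e '/comments' -e '/tag/' -e '/category/' -e '/feed' >> urls.lst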

    See this post for a more complete version of the script, but geared toward a different purpose (without iterating through the list of RSS feeds):

    http://www.blackhatworld.com/blackh...ions-about-grabbing-rss-feeds.html#post891064
     
  3. MichiganManiac

    MichiganManiac Regular Member

    Joined:
    Feb 2, 2009
    Messages:
    204
    Likes Received:
    168
    So this is going to sound like a noob question...but what do you mean by "put it in cron"?

    Is there software called "Cron"?
     
  4. shadowpwner

    shadowpwner Regular Member

    Joined:
    Apr 19, 2009
    Messages:
    300
    Likes Received:
    73
    Google "cron jobs". Basically, cron tells the server to run the script however many times a day (or month, year, etc.) you want, automatically.
     
  5. Rick4691

    Rick4691 Registered Member Premium Member

    Joined:
    Feb 19, 2008
    Messages:
    70
    Likes Received:
    30
    Occupation:
    Programmer
    Location:
    Oceania
    If you're working from the command line, enter "man cron" and it will give you more information than you need.

    When you're done reading that, enter "man crontab" for more information.

    Be patient, it's worth knowing.

    "Cron" is the Unix/Linux scheduler. It allows you to run programs at set times --- every five minutes, once a day, only on February 29, etc.

    You put something into cron by entering "crontab -e" on the command-line and then entering something like this:

    Code:
    # Execute url_extractor.sh
    25 6 * * * /usr/bin/url_extractor.sh
    The space-delimited fields are:
    • minute
    • hour
    • day-of-the-month
    • month
    • day-of-the-week
    • shell command (the script we want to run)

    Stars in the time fields mean "every".

    So, in this case I've set the script to run every day at 6:25 AM.
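    If 6:25 AM isn't what you want, the same five fields cover just about any schedule. A couple of made-up examples (the */N "step" syntax works in the Vixie cron that ships with most Linux distros):
    
    Code:
    # Every hour, on the hour
    0 * * * * /usr/bin/url_extractor.sh
    
    # Every 15 minutes
    */15 * * * * /usr/bin/url_extractor.sh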

    ---

    I know that for people working through Fantastico, there is a supposedly easier graphic interface for working with cron (you'll have to look for it if you want to use it...), but I'm half Dutch so I like to do things the hard way. ;)