How would you scrape content from these sites?

Discussion in 'General Scripting Chat' started by Frankie4Fingers, Mar 27, 2010.

  1. Frankie4Fingers

    Frankie4Fingers Power Member

    Joined:
    Jan 8, 2009
    Messages:
    681
    Likes Received:
    214
    Basically I'd like to scrape the results of these custom search engines and put them on one of my WordPress sites:

    1) http://www.raiway.rai.it/index.php?lang=IT (click on the map of Italy, then on the city, then on the town to see the results page)

    2) http://www.mediasetpremium.mediaset.it/informazione/copertura/copertura.shtml


    Looking around the forum, I found this library suggested for a similar problem:

    http://simplehtmldom.sourceforge.net/

    Do you think it could work for my case or would you suggest another solution?

    Thank you. :)
     
  2. c0ntenth|ef

    c0ntenth|ef Power Member

    Joined:
    May 20, 2009
    Messages:
    788
    Likes Received:
    118
    Location:
    california
    Do they have RSS feeds? Just fetch their RSS and put it on your own site.
     
  3. Frankie4Fingers

    Frankie4Fingers Power Member

    Joined:
    Jan 8, 2009
    Messages:
    681
    Likes Received:
    214
    If they had feeds, I wouldn't have asked this question ;)
     
  4. Deprecated

    Deprecated Registered Member

    Joined:
    May 19, 2009
    Messages:
    78
    Likes Received:
    25
    Unfortunately that's a job for Perl. When I have to script a scraper for something like that, I use Perl's LWP library and some regular expressions to get the job done. This might be something to hire a coder for.
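
    The same fetch-then-regex idea can be sketched with Python's standard library instead of Perl. This is only a rough sketch: the names `fetch`, `extract_rows` and `SAMPLE_HTML` are made up for illustration, and the table markup is an assumption, not the real structure of either site's results page.

```python
import re
import urllib.request


def fetch(url, timeout=10):
    """Download a page and return its body as text."""
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


# Assumed layout: results sit in plain <tr>/<td> table rows.
ROW_RE = re.compile(r"<tr[^>]*>(.*?)</tr>", re.S | re.I)
CELL_RE = re.compile(r"<td[^>]*>(.*?)</td>", re.S | re.I)
TAG_RE = re.compile(r"<[^>]+>")


def extract_rows(html):
    """Return each table row as a list of cell texts with inner tags stripped."""
    rows = []
    for row in ROW_RE.findall(html):
        cells = [TAG_RE.sub("", cell).strip() for cell in CELL_RE.findall(row)]
        if cells:  # header rows made only of <th> cells yield nothing and are skipped
            rows.append(cells)
    return rows


# Hypothetical sample standing in for a downloaded results page.
SAMPLE_HTML = """
<table>
<tr><th>Town</th><th>Channel</th></tr>
<tr><td>Roma</td><td><b>RAI 1</b></td></tr>
<tr><td>Milano</td><td>RAI 2</td></tr>
</table>
"""

if __name__ == "__main__":
    for cells in extract_rows(SAMPLE_HTML):
        print(cells)
```

    In real use you would pass the output of `fetch(...)` to `extract_rows`. Regular expressions like these break easily on nested tables or markup changes, so a proper HTML parser is usually the safer choice once the page structure is known.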
     
  5. Frankie4Fingers

    Frankie4Fingers Power Member

    Joined:
    Jan 8, 2009
    Messages:
    681
    Likes Received:
    214
    Do you have any idea how much that would cost? And done this way, would it work like RSS feed parsing? For example, I want only content related to a specific town on my page: I enter the name of that town in the script as a keyword, and the related information gets scraped and put on my page.
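
    To make the keyword idea concrete, here is a minimal sketch of the filtering step in Python. It assumes the scraper has already turned the results page into rows of text; `filter_rows`, `render_list` and `SAMPLE_ROWS` are hypothetical names, and the sample data is invented.

```python
from html import escape


def filter_rows(rows, keyword):
    """Keep only the rows in which some cell contains the keyword (case-insensitive)."""
    needle = keyword.casefold()
    return [row for row in rows if any(needle in cell.casefold() for cell in row)]


def render_list(rows):
    """Turn rows into an HTML list fragment that could be pasted into a page."""
    items = "".join("<li>" + escape(" - ".join(row)) + "</li>" for row in rows)
    return "<ul>" + items + "</ul>"


# Invented rows standing in for scraped results.
SAMPLE_ROWS = [["Roma", "RAI 1"], ["Milano", "RAI 2"], ["Roma", "RAI 3"]]

if __name__ == "__main__":
    print(render_list(filter_rows(SAMPLE_ROWS, "roma")))
```

    Unlike an RSS feed, nothing here is pushed to the site: the script would have to be re-run (for example on a schedule) to pick up changes on the source pages.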