
Extracting / parsing articles from scraped websites?

Discussion in 'General Programming Chat' started by madblacker, May 22, 2010.

  1. madblacker

    madblacker Regular Member

    Joined:
    Nov 2, 2009
    Messages:
    268
    Likes Received:
    19
    I want to extract articles from websites I am scraping. The only easy way I see to do this is by finding full-length RSS feeds, which are hard to find, and then I'm missing out on all the sites that don't have them.

    I have programmers working for me, but I'm not sure how to approach this, since the articles come from different websites. You can't build just one extractor app based on layout, because the layout changes from site to site. One idea I've had: take each area of unbroken text (meaning text that appears within a single DIV or table), measure its length, and treat the longest text area as the article. Anyway, just wondering if anyone has made anything like this before?
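
A rough sketch of the longest-text-area idea, using only Python's standard library. The names `LongestBlockFinder` and `longest_block` are made up for illustration, and treating `div` and `td` as the "text areas" is an assumption about what counts as unbroken text. Text inside a nested container is credited to the innermost one.

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"div", "td"}       # assumed containers for a "text area"
SKIP_TAGS = {"script", "style"}  # their contents never count as article text


class LongestBlockFinder(HTMLParser):
    """Collects the text of each div/td so the longest one can be picked."""

    def __init__(self):
        super().__init__()
        self.stack = [[]]    # one text buffer per open container (plus a root)
        self.blocks = []     # finished, whitespace-normalized text areas
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.stack.append([])

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1
        elif tag in BLOCK_TAGS and len(self.stack) > 1:
            self._finish(self.stack.pop())

    def handle_data(self, data):
        if not self.skip_depth:
            self.stack[-1].append(data)

    def _finish(self, parts):
        text = " ".join(" ".join(parts).split())
        if text:
            self.blocks.append(text)

    def close(self):
        super().close()
        # Flush containers that were never closed, and the root buffer.
        while self.stack:
            self._finish(self.stack.pop())


def longest_block(html):
    """Return the longest text area found in the page, or '' if none."""
    finder = LongestBlockFinder()
    finder.feed(html)
    finder.close()
    return max(finder.blocks, key=len, default="")
```

Calling `longest_block(page_html)` on a fetched page returns the candidate article text. A long comment thread or sidebar can still beat a short article, so this is a starting heuristic rather than a finished extractor.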
     
  2. gesmaster

    gesmaster Newbie

    Joined:
    Jan 27, 2009
    Messages:
    32
    Likes Received:
    9
    I run 2 scripts for my autoblogs that can extract the full articles.
    I've tried a lot of solutions, and my advice is not to rely on layout but to base the script on text density.
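
The post doesn't say how text density is measured, so the following is only one possible reading: for each line of HTML, compare the characters left after stripping tags with the line's total length, and keep lines that are mostly text. The function names and the `min_density` / `min_chars` thresholds are assumptions for illustration, not the poster's actual scripts.

```python
import re

# Crude tag matcher; good enough for a sketch, not a full HTML parser.
TAG_RE = re.compile(r"<[^>]*>")


def line_density(line):
    """Share of a line's characters that remain after removing tags."""
    stripped = line.strip()
    if not stripped:
        return 0.0
    text = TAG_RE.sub("", stripped).strip()
    return len(text) / len(stripped)


def dense_lines(html, min_density=0.6, min_chars=40):
    """Return the text of lines that are long enough and mostly text."""
    kept = []
    for line in html.splitlines():
        text = TAG_RE.sub("", line).strip()
        if len(text) >= min_chars and line_density(line) >= min_density:
            kept.append(text)
    return kept
```

Navigation and footer lines are mostly markup with a few words of link text, so they score low, while article paragraphs score high. This assumes the HTML has line breaks roughly per element; minified pages would need to be split on block tags first, and the thresholds tuned per batch of sites.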
     
  3. madblacker

    madblacker Regular Member

    Joined:
    Nov 2, 2009
    Messages:
    268
    Likes Received:
    19