
[HELP] Finding content to

Discussion in 'General Scripting Chat' started by woofoo, Nov 25, 2011.

  1. woofoo

    woofoo Junior Member

    Joined:
    Oct 19, 2011
    Messages:
    123
    Likes Received:
    17
    Hi, guys!
    I'm writing autoblogging software and I've hit a problem: how do I find content? I mean, how do I scrape the article from an arbitrary page?
     
  2. johndea

    johndea Regular Member

    Joined:
    Jun 23, 2011
    Messages:
    308
    Likes Received:
    35
    Load a list of keywords.
    Search google for the keywords.
    Pick a random result page.
    Strip HTML from the result.
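The last step above can be sketched roughly as follows. This is only an illustration using Python's standard library; the names `TextExtractor`, `strip_html` and `pick_random` are made up for this example, and the search step is left out because it depends on whichever search service is used.

```python
import random
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text nodes, skipping <script> and <style> contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def strip_html(html):
    """Return the page text with all tags removed (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def pick_random(items):
    """Pick one entry, e.g. one keyword or one result URL."""
    return random.choice(items)
```

Note that this keeps every piece of text on the page, navigation included, and throws away all markup.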
     
  3. woofoo

    As a result I'll get menus, footers, headers and a lot of other stuff. I'll also lose the images and markup of the article. That's a problem.
     
  4. infTee

    infTee Junior Member

    Joined:
    Mar 2, 2010
    Messages:
    101
    Likes Received:
    97
    Location:
    Ireland
    You could pick a few large article directories and scrape each one according to its site design. Or examine how articles are contained within popular WP themes for article directories. Then scrape Google using your keywords plus footprints from those themes; that way you have a better chance of getting cleaner content.
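Combining keywords with footprints might look something like this. It is a minimal sketch: `build_queries` is a made-up name, and the footprint string in the usage lines is a placeholder, not a footprint taken from any real theme.

```python
from itertools import product


def build_queries(keywords, footprints):
    """Pair every keyword with every footprint, quoting the footprint
    so the search engine treats it as an exact phrase."""
    return [f'{kw} "{fp}"' for kw, fp in product(keywords, footprints)]


# Placeholder values for illustration only.
queries = build_queries(["dog training"], ["example theme footer text"])
print(queries)
```

The resulting pages should then share a layout, so one extraction rule per footprint is usually enough.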
     
  5. sockpuppet

    sockpuppet Junior Member

    Joined:
    Nov 7, 2011
    Messages:
    155
    Likes Received:
    145
    look for a div element with "content" in its class attribute, and from there extract all child p tags (p = paragraph). if you don't find such a div, just look for p tags with lots of text.

    how to do it? the best way is to use an html parser. if you are using python, beautiful soup is the way to go. with java i would go with htmlcleaner and xpath. for nodejs there is a complete dom implementation, and you can use sizzle, the jquery CSS selector engine, on it. if you use another language, google it...
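Here is a rough sketch of that heuristic. To keep it runnable without installing anything it uses the standard library's `html.parser` rather than Beautiful Soup, and it accepts any p tag nested inside a matching div, not only direct children. The names `ParagraphCollector` and `extract_article`, and the `min_chars=80` cutoff for "lots of text", are assumptions made for this example.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Records the text of every <p> and whether it sits inside a <div>
    whose class attribute contains "content"."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.div_stack = []   # one bool per open <div>: is it a content div?
        self.in_p = False
        self.current = []
        self.paragraphs = []  # list of (inside_content_div, text)

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            cls = dict(attrs).get("class") or ""
            self.div_stack.append("content" in cls.lower())
        elif tag == "p":
            self.flush_paragraph()  # an unclosed <p> ends where the next starts
            self.in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.flush_paragraph()
        elif tag == "div":
            self.flush_paragraph()
            if self.div_stack:
                self.div_stack.pop()

    def handle_data(self, data):
        if self.in_p:
            self.current.append(data)

    def flush_paragraph(self):
        if self.in_p:
            text = " ".join("".join(self.current).split())
            if text:
                self.paragraphs.append((any(self.div_stack), text))
        self.in_p = False
        self.current = []


def extract_article(html, min_chars=80):
    """Return paragraphs from a "content" div, else any long paragraphs."""
    collector = ParagraphCollector()
    collector.feed(html)
    collector.close()
    collector.flush_paragraph()
    in_content = [text for inside, text in collector.paragraphs if inside]
    if in_content:
        return in_content
    # Fallback: no content div found, keep paragraphs with lots of text.
    return [text for _, text in collector.paragraphs if len(text) >= min_chars]
```

A class such as "sidebar-content" would also match, so the result still needs a sanity check on real pages.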
     
  6. weedsmoker

    weedsmoker Junior Member

    Joined:
    May 2, 2011
    Messages:
    190
    Likes Received:
    79
    You must customize the extraction process for each site you scrape. I use two methods. The first is XPath, as the previous poster wrote:

    title = xpathFind(html,'//h1[@id="title"]')
    article = xpathFind(html,'//div[@id="article"]')

    The second method is finding content by its surrounding strings:
    title = stringFind(html,'<h1 id="title">','</h1>')
    article = stringFind(html,'<div id="article">','</div>')

    Use XPath first; when it fails, fall back to finding content by string.
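The two helpers and the fallback order could be sketched like this in Python. These are not the poster's actual `xpathFind` and `stringFind`; `xpath_find`, `string_find` and `extract` are stand-ins written for this example. `xpath_find` uses the standard library's ElementTree, which supports only a small XPath subset (paths start with `.//` rather than `//`) and only parses well-formed markup, so a full HTML parser would be needed for most real pages.

```python
import xml.etree.ElementTree as ET


def string_find(html, start, end):
    """Return the text between the first `start` marker and the next
    `end` marker, or None if either marker is missing."""
    i = html.find(start)
    if i == -1:
        return None
    i += len(start)
    j = html.find(end, i)
    if j == -1:
        return None
    return html[i:j]


def xpath_find(html, path):
    """Return the text of the first element matching `path`, or None.

    Assumption: the markup is well-formed enough for ElementTree;
    anything else is treated as a failed lookup.
    """
    try:
        root = ET.fromstring(html)
    except ET.ParseError:
        return None
    node = root.find(path)
    if node is None:
        return None
    return "".join(node.itertext()).strip()


def extract(html, path, start, end):
    """Try the XPath lookup first, then fall back to string markers."""
    result = xpath_find(html, path)
    if result is None:
        result = string_find(html, start, end)
    return result


page = '<html><body><br><h1 id="title">Hello</h1></body></html>'
print(extract(page, ".//h1[@id='title']", '<h1 id="title">', "</h1>"))
```

One caveat with the string method: for a div that contains nested divs, the first `</div>` found closes an inner element, so the match is cut short.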
     
  7. johndea
