[HELP] Finding content to scrape

woofoo

Junior Member
Hi, guys!
I'm writing software for autoblogging and I've run into a problem: how do I find content? I mean, how do I scrape it from a random page?
 
Load a list of keywords.
Search google for the keywords.
Pick a random result page.
Strip HTML from the result.
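Roughly, in Python (just a sketch; it assumes you already have the result URLs from whatever Google scraper or API you use, and Beautiful Soup does the tag stripping):

import random
import requests
from bs4 import BeautifulSoup

def scrape_random_result(result_urls):
    # result_urls: whatever list of URLs your Google scraper / API returned
    page = requests.get(random.choice(result_urls), timeout=10)
    # get_text() strips every tag and returns the page as plain text
    return BeautifulSoup(page.text, "html.parser").get_text(separator="\n")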
 
As a result I'll get menus, footers, headers and a lot of other stuff. Also, I'll lose the images and the markup for the article. That's the problem.
 
You could pick a few large article directories and scrape according to each site's design. Or examine how articles are contained within popular themes for WP article directories, then scrape google using your keywords plus footprints from those themes; that way you have a better chance of getting cleaner content.
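For example (the keyword and footprint strings below are only placeholders; swap in your real keyword list and the footprints you pull from the themes):

keywords = ["keyword one", "keyword two"]
footprints = ['"example theme footprint"', 'inurl:example-directory']

# feed each of these to your Google scraper instead of the bare keyword
queries = ['%s %s' % (kw, fp) for kw in keywords for fp in footprints]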
 
Look for a div element with "content" in its class attribute, and from there extract all child p tags (p = paragraph). If you don't find such a div, just look for p tags with lots of text.

How to do it? The best way is to use an HTML parser. If you are using Python, Beautiful Soup is the way to go. With Java I would go with HtmlCleaner and XPath. And for Node.js there is a complete DOM implementation you can run the jQuery CSS selector engine Sizzle on. If you use another language, google...
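A rough Beautiful Soup version of the heuristic above; the 200-character cutoff for "lots of text" is just a guess you'd tune:

from bs4 import BeautifulSoup

def extract_article(html):
    soup = BeautifulSoup(html, "html.parser")
    # first try: a div whose class attribute mentions "content"
    content_div = None
    for div in soup.find_all("div"):
        if "content" in " ".join(div.get("class", [])):
            content_div = div
            break
    if content_div is not None:
        paragraphs = content_div.find_all("p")
    else:
        # fallback: any p tags with a decent amount of text
        paragraphs = [p for p in soup.find_all("p")
                      if len(p.get_text(strip=True)) > 200]
    return "\n\n".join(p.get_text(" ", strip=True) for p in paragraphs)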
 

You must customize the extraction process for each site you scrape. I use two methods; the first is XPath, like the previous poster wrote:

title = xpathFind(html,'//h1[@id="title"]')
article = xpathFind(html,'//div[@id="article"]')

The second method is finding content by string:
title = stringFind(html,'<h1 id="title">','</h1>')
article = stringFind(html,'<div id="article">','</div>')

Use XPath first; when it fails, fall back to finding content by string.
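A quick Python sketch of that fallback logic, using lxml for the XPath part; the selectors are just the ones from the example above, and string_find is my own crude stand-in for stringFind:

from lxml import html as lxml_html

def string_find(html, start_tag, end_tag):
    # crude string fallback: grab whatever sits between the two markers
    # (note: it stops at the first end_tag, so nested divs can cut it short)
    start = html.find(start_tag)
    if start == -1:
        return None
    start += len(start_tag)
    end = html.find(end_tag, start)
    return html[start:end] if end != -1 else None

def extract_article_text(html):
    # method 1: xpath
    nodes = lxml_html.fromstring(html).xpath('//div[@id="article"]')
    if nodes:
        return nodes[0].text_content()
    # method 2: fall back to plain string matching
    return string_find(html, '<div id="article">', '</div>')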
 