Recognize sentences

Discussion in 'General Programming Chat' started by Pfuesch, Sep 2, 2009.

  1. Pfuesch

    Pfuesch Newbie

    Joined:
    Jul 22, 2009
    Messages:
    10
    Likes Received:
    1
    Hey Guys!

    I'm trying to build a content scraper (in PHP) that scrapes keyword-relevant websites, puts different sentences together and posts the result to a blog. Nothing new so far...

    What I have done so far:
    1. Get the Google results for the keyword.
    2. Fetch each website.
    3. Strip all the HTML tags. Now there are just words. Some of them form sentences, others are just jumbled words from the navigation, header, footer and so on...

    Now here's my problem: how can I get only the sentences? I don't want all the navigation words in there...

    Any ideas?

    EDIT:
    Here's an example of the actual output. The parts are split on the delimiter '.'. Parts one and two are junk, but I like parts three, four and five... How can I recognize that they're sentences?

    EXAMPLE:
    SENTENCE?? loadingElement { width: 100%; height: 100%; position: absolute; left: 0; top: 0; background-color: #000; background-repeat: no-repeat; background-position: center center; background-image: url('wp-content/plugins/featured-content-gallery/css/img/loading-bar-black
    SENTENCE?? { 2 comments } September 2009 M T W T F S S « Feb 123456 78910111213 14151617181920 21222324252627 282930 Blogroll Development Blog Documentation Plugins Suggest Ideas Support Forum Themes WordPress Planet Tags Recent Comments ekedigusih on Hello world
    SENTENCE?? You can edit the content that appears here by visiting your Widgets panel and modifying the current widgets in Sidebar 2
    SENTENCE?? Or, if you want to be a true ninja, you can add your own content to this sidebar by using the appropriate hooks
    SENTENCE?? Get smart with the Thesis WordPress Theme from DIY Themes
     
    Last edited: Sep 2, 2009
  2. heiska

    heiska Junior Member

    Joined:
    Dec 5, 2008
    Messages:
    138
    Likes Received:
    169
    Compare the contents of each (div/table) tag against the search query used to locate the site. If a match is found, you have found the div tag that should contain the content. Also remember to strip things like JavaScript (to keep Google ads and other unrelated content out of your article).

    Not a bulletproof solution, but the best I could think of in a minute.
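
A minimal sketch of that idea, written in Python for illustration (the original scraper is in PHP). The names `BlockCollector` and `blocks_matching` are hypothetical, and "match" is assumed here to mean that every word of the query appears somewhere in the block's text. Text inside script and style tags is skipped.

```python
from html.parser import HTMLParser


class BlockCollector(HTMLParser):
    """Collect the visible text of every <div> and <table> element."""

    BLOCK_TAGS = {"div", "table"}
    SKIP_TAGS = {"script", "style"}  # their contents are never article text

    def __init__(self):
        super().__init__()
        self.open_blocks = []  # one text buffer per block that is still open
        self.blocks = []       # text of finished blocks, in closing order
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self.skip_depth += 1
        elif tag in self.BLOCK_TAGS:
            self.open_blocks.append([])

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1
        elif tag in self.BLOCK_TAGS and self.open_blocks:
            # Join the pieces and collapse runs of whitespace.
            text = " ".join(" ".join(self.open_blocks.pop()).split())
            if text:
                self.blocks.append(text)

    def handle_data(self, data):
        if self.skip_depth == 0:
            # Nested blocks: the text belongs to every enclosing block.
            for buffer in self.open_blocks:
                buffer.append(data)


def blocks_matching(html, query):
    """Return the text of each div/table that contains every query word."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    words = query.lower().split()
    return [b for b in parser.blocks if all(w in b.lower() for w in words)]
```

Because the text of a nested block also counts toward its parents, outer wrapper divs tend to match as well. One possible refinement is to prefer the shortest matching block.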
     
  3. Pfuesch

    Thanks for the advice, heiska!

    What I came up with yesterday:

    1. Only allow the characters a-z A-Z 0-9 , ! . ? -
    If a part contains any other character, it's not a sentence! This also filters out some correct sentences, but it works quite well...

    2. Check the length and the number of spaces in it.

    3. Only grab content between <p> HTML tags!

    The results are pretty good now...
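
The three checks above could look roughly like this. It is a Python sketch for illustration only (the scraper itself is in PHP). The names `looks_like_sentence`, `ParagraphCollector` and `extract_sentences` are hypothetical, the thresholds `min_chars` and `min_spaces` are made-up values to tune, and the space character is assumed to be part of the allowed set.

```python
import re
from html.parser import HTMLParser

# Check 1: allowed characters only (space is assumed to be allowed too).
ALLOWED = re.compile(r"[A-Za-z0-9 ,!.?-]+")


def looks_like_sentence(text, min_chars=20, min_spaces=3):
    """Checks 1 and 2: character whitelist, minimum length, minimum spaces."""
    text = text.strip()
    return (
        ALLOWED.fullmatch(text) is not None
        and len(text) >= min_chars
        and text.count(" ") >= min_spaces
    )


class ParagraphCollector(HTMLParser):
    """Check 3: keep only the text found between <p> and </p>."""

    def __init__(self):
        super().__init__()
        self.in_p = False
        self.current = []
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()  # an unclosed <p> ends where the next <p> starts
            self.in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            self.current.append(data)

    def _flush(self):
        text = " ".join("".join(self.current).split())
        if text:
            self.paragraphs.append(text)
        self.current = []


def extract_sentences(html):
    """Split each paragraph after . ! or ? and keep the plausible sentences."""
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    parser._flush()  # keep a trailing paragraph that was never closed
    sentences = []
    for paragraph in parser.paragraphs:
        for part in re.split(r"(?<=[.!?])\s+", paragraph):
            if looks_like_sentence(part):
                sentences.append(part.strip())
    return sentences
```

With these assumptions the CSS fragment and the calendar widget from the example in the first post are rejected (they contain characters such as { and «), while the Thesis sidebar sentences pass.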