
What kind of software am I looking for if I want to extract title, body, and comments from HTML?

Discussion in 'General Programming Chat' started by jasom, Jan 8, 2013.

  1. jasom

    jasom Newbie

    Joined:
    Jan 15, 2011
    Messages:
    2
    Likes Received:
    3
    Hello,
    What kind of software am I looking for if I want to extract the title, body, and comments from HTML pages?

    This is the scenario:
    1. I have a list of URLs, obtained for example from sitemap.xml.
    2. I want to open the pages one by one and extract the title, the body text, and each comment's author, time, body, and email into a database on my localhost or web server.

    What am I looking for? What keywords should I try googling? I can work with PHP.
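
    For step 1 of the scenario, a minimal sketch of pulling the URL list out of a sitemap could look like this. It is written in Python with only the standard library; the same idea carries over to PHP. The function name `urls_from_sitemap` and the sample URLs are made up for illustration, and it assumes a plain `<urlset>` sitemap rather than a sitemap index.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol for <urlset>/<url>/<loc>.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text):
    """Return the <loc> values of a sitemap document as a list of strings."""
    root = ET.fromstring(xml_text)
    locs = root.findall("sm:url/sm:loc", SITEMAP_NS)
    return [loc.text.strip() for loc in locs if loc.text]


# Hypothetical sample document, standing in for a downloaded sitemap.xml.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/post-1</loc></url>
  <url><loc>http://example.com/post-2</loc></url>
</urlset>"""

if __name__ == "__main__":
    for url in urls_from_sitemap(SAMPLE):
        print(url)
```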
     
  2. s0ap

    s0ap Executive VIP Jr. VIP Premium Member

    Joined:
    Sep 23, 2008
    Messages:
    230
    Likes Received:
    822
    Occupation:
    :] guess
    Location:
    Congo/DRC
    If you already have an enumerated list of URLs, I would load them up into an array and let it loose with as many threads/processes as you are comfortable with.

    HTML is a plain-text markup language, so you should be able to hack together a regular expression to extract the data you are looking for. You could probably even do it with sed and a bash script if you are working on Linux.
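
    A rough sketch of that idea, in Python with only the standard library: load the URLs into a list, fetch them in a thread pool, and run a regex over each page. The names `extract_title`, `fetch`, and `scrape_titles` are hypothetical, and the regex only targets the `<title>` tag.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

# Matches the contents of the first <title> element, case-insensitively.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def extract_title(html):
    """Return the text of the first <title> element, or None if absent."""
    match = TITLE_RE.search(html)
    return match.group(1).strip() if match else None


def fetch(url):
    """Download a page and decode it, replacing undecodable bytes."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def scrape_titles(urls, fetcher=fetch, workers=8):
    """Fetch every URL in a thread pool and map each URL to its title."""
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pages = pool.map(fetcher, urls)  # results come back in input order
        return {url: extract_title(page) for url, page in zip(urls, pages)}
```

    The `fetcher` parameter is there so the download step can be swapped out, for example with a cached copy of the pages. A regex like this is fine for one well-defined tag, but it tends to get brittle for nested markup such as comment threads.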
     
  3. Zapdos

    Zapdos Power Member

    Joined:
    Oct 22, 2011
    Messages:
    597
    Likes Received:
    708
    Location:
    Eastern North Carolina
    PHP Simple HTML DOM Parser
    http://simplehtmldom.sourceforge.net/

    It has selectors like jQuery's, so you can easily pull out values. Just write a simple program that scans the entire page for links and records them, then pulls out the information needed. Once done, it opens the recorded pages and repeats until it runs out of URLs. You could use curl or fopen to retrieve content, and MySQL to store links and data.

    I have it integrated into a site search engine that handles a few thousand pages, and I haven't had a problem.
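
    To illustrate the extract-and-store part of that workflow, here is a minimal sketch in Python using only the standard library. It is not the Simple HTML DOM API: `html.parser` stands in for the DOM parser and SQLite stands in for MySQL. The class names `comment-author` and `comment-body`, the table layout, and the helper `store_page` are all assumptions made for the example.

```python
import sqlite3
from html.parser import HTMLParser

# Hypothetical class names; real pages will use their own markup.
FIELD_CLASSES = {"comment-author": "author", "comment-body": "body"}


class PageExtractor(HTMLParser):
    """Collect the page title and an (author, body) pair per comment.

    Simplifying assumptions: each field holds plain text with no nested
    tags, and the author element precedes the body in every comment.
    """

    def __init__(self):
        super().__init__()
        self.title = ""
        self.comments = []   # list of {"author": ..., "body": ...}
        self._field = None   # name of the field currently being captured

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._field = "title"
            return
        for cls in (dict(attrs).get("class") or "").split():
            if cls in FIELD_CLASSES:
                self._field = FIELD_CLASSES[cls]
                if self._field == "author":  # an author starts a new comment
                    self.comments.append({"author": "", "body": ""})

    def handle_endtag(self, tag):
        self._field = None   # any closing tag ends the current capture

    def handle_data(self, data):
        if self._field == "title":
            self.title += data.strip()
        elif self._field and self.comments:
            self.comments[-1][self._field] += data.strip()


def store_page(conn, url, html):
    """Parse one page and write its title and comments to the database."""
    parser = PageExtractor()
    parser.feed(html)
    conn.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, title TEXT)")
    conn.execute("CREATE TABLE IF NOT EXISTS comments (url TEXT, author TEXT, body TEXT)")
    conn.execute("INSERT OR REPLACE INTO pages VALUES (?, ?)", (url, parser.title))
    conn.executemany(
        "INSERT INTO comments VALUES (?, ?, ?)",
        [(url, c["author"], c["body"]) for c in parser.comments],
    )
    conn.commit()
    return parser
```

    Calling `store_page(sqlite3.connect("scrape.db"), url, html)` once per fetched page would fill both tables; the file name `scrape.db` is just a placeholder.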
     
    Last edited: Jan 9, 2013
  4. Question

    Question Registered Member

    Joined:
    Aug 14, 2011
    Messages:
    51
    Likes Received:
    32
  5. botrockets

    botrockets Regular Member

    Joined:
    Mar 16, 2013
    Messages:
    355
    Likes Received:
    551
    Gender:
    Male
    Occupation:
    Entrepreneur
    Location:
    BotRockets
    You can do that with JavaScript and PhantomJS.