
How to extract articles from html files?

Discussion in 'Black Hat SEO Tools' started by mareks, Jan 25, 2015.

  1. mareks

    mareks Regular Member

    Joined:
    Sep 23, 2013
    Messages:
    404
    Likes Received:
    169
    Home Page:
    Hi!

    I have been looking for over half a day for a method or software that can extract articles from custom HTML files. I have a couple of sites in HTML and I want to put their articles on new WordPress sites. So I just need to get the articles from each page in .txt format and get rid of any HTML tags...

    Any ideas, or maybe there is software for that? Oh, and yes, the HTML files are on my computer...

    Thank You!
     
    Last edited: Jan 25, 2015
  2. zagard

    zagard Jr. VIP Jr. VIP

    Joined:
    Oct 26, 2014
    Messages:
    194
    Likes Received:
    171
    Occupation:
    popoom pompomer
    Location:
    poompoom land
    Or just double-click the HTML files so they open in a browser, then copy and paste.
     
  3. Zwielicht

    Zwielicht Moderator Staff Member Moderator Jr. VIP

    Joined:
    Aug 31, 2013
    Messages:
    6,566
    Likes Received:
    11,708
    Gender:
    Male
    Occupation:
    Private Investigator
    Location:
    Riverside, California
    Home Page:
    And here I was trying to overcomplicate it. :lmao:
     
  4. Tobbe co

    Tobbe co Junior Member

    Joined:
    Sep 29, 2014
    Messages:
    171
    Likes Received:
    139
    You can mass edit those articles in Notepad++. Use a regex to replace everything from "<" to ">".
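The same find-and-replace can be scripted outside Notepad++. This is only a minimal sketch: it assumes the pattern `<[^>]+>` is a fair stand-in for "everything from < to >", and the helper name `strip_tags` is made up for illustration. Note that a plain regex like this leaves the contents of script and style blocks in the text.

```python
import html
import re

# Assumption: any run that starts with "<" and ends at the next ">" is a tag.
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags(markup: str) -> str:
    """Delete everything that looks like <...>, then decode entities such as &amp;."""
    without_tags = TAG_RE.sub("", markup)
    return html.unescape(without_tags)


if __name__ == "__main__":
    print(strip_tags("<p>Hello <b>world</b> &amp; all</p>"))
```

In Notepad++ itself, the equivalent would be the same pattern in the Replace dialog with regular-expression search mode enabled and an empty replacement.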
    Just like any web browser then? xD
     
  5. Zwielicht

    Zwielicht Moderator Staff Member Moderator Jr. VIP

    Yes, I read "software" and "HTML" and thought of Dreamweaver before opening the HTML file in the browser.
     
  6. mareks

    mareks Regular Member

    Yeah, I have used Notepad++ for mass editing, but I am looking for a more automated method :) ;)

    I need a smart extractor that can give me .txt files with just the title and body of each article, so I can mass post to WP sites...
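One way such an extractor could look, using only the Python standard library. This is a sketch under several assumptions, not a finished tool: the article title is taken from the page's `<title>` element, the body is all visible text inside `<body>`, the files end in `.html` and are UTF-8, and every name here (`TitleBodyParser`, `extract_title_and_body`, `convert_folder`) is hypothetical.

```python
from html.parser import HTMLParser
from pathlib import Path

BLOCK_TAGS = {"p", "div", "li", "h1", "h2", "h3", "h4", "h5", "h6"}
SKIP_TAGS = {"script", "style"}


class TitleBodyParser(HTMLParser):
    """Collect the <title> text and the visible text inside <body>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.title_parts = []
        self.body_parts = []
        self.in_title = False
        self.in_body = False
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "body":
            self.in_body = True
        elif tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "br" and self.in_body:
            self.body_parts.append("\n")

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False
        elif tag == "body":
            self.in_body = False
        elif tag in SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1
        elif tag in BLOCK_TAGS and self.in_body:
            # End of a block element: start a new line in the output.
            self.body_parts.append("\n")

    def handle_data(self, data):
        if self.in_title:
            self.title_parts.append(data)
        elif self.in_body and not self.skip_depth:
            self.body_parts.append(data)


def extract_title_and_body(markup):
    parser = TitleBodyParser()
    parser.feed(markup)
    parser.close()
    title = " ".join("".join(parser.title_parts).split())
    lines = (" ".join(line.split()) for line in "".join(parser.body_parts).splitlines())
    body = "\n".join(line for line in lines if line)
    return title, body


def convert_folder(src_dir, dst_dir):
    """Write one .txt per .html file: title, blank line, then the body text."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for page in sorted(Path(src_dir).glob("*.html")):
        markup = page.read_text(encoding="utf-8", errors="replace")
        title, body = extract_title_and_body(markup)
        out = dst / (page.stem + ".txt")
        out.write_text(title + "\n\n" + body + "\n", encoding="utf-8")
        written.append(out)
    return written
```

Calling `convert_folder("site_html", "site_txt")` (both folder names are placeholders) would process every page in one run. Menus, sidebars, and footers inside `<body>` would still end up in the text, so this only covers the "strip tags in bulk" part of the problem.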
     
    Last edited: Jan 26, 2015
  7. Zwielicht

    Zwielicht Moderator Staff Member Moderator Jr. VIP

    I figured there was more to your question than what you mentioned in your original post.

    There's a program called BoilerPipe that extracts the articles from the website, although I don't believe they have a bulk option.

    Another program called Purifyr does something similar, although it may not have a bulk option either.

    I do not know of any programs which can scrape mass articles from the pages and then create text documents for them, although I'll let you know if I find any.
     
  8. handmadebots

    handmadebots Senior Member

    Joined:
    Nov 8, 2012
    Messages:
    960
    Likes Received:
    216
    Home Page:
    I can make you software that extracts the articles from HTML files.
    Grab all the text (text only, no HTML tags) between <body> and </body>, or inside a <div>, depending on what the website's source looks like :)
    If you're still looking, let me know.
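For the `<div>` variant, a rough sketch of how the text inside one container could be pulled out with Python's standard `html.parser`. It assumes the article sits in the first `<div>` whose class list contains `article`; that class name, like `ContainerText` and `text_inside`, is a hypothetical placeholder that would need to match the real page source.

```python
from html.parser import HTMLParser


class ContainerText(HTMLParser):
    """Collect text inside the first <tag> whose class list contains css_class."""

    def __init__(self, tag, css_class):
        super().__init__(convert_charrefs=True)
        self.tag = tag
        self.css_class = css_class
        self.depth = 0      # nesting depth of `tag` inside the matched container
        self.done = False   # True once the container has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done or tag != self.tag:
            return
        if self.depth:
            self.depth += 1  # a nested element with the same tag name
        elif self.css_class in (dict(attrs).get("class") or "").split():
            self.depth = 1   # entered the container

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def text_inside(markup, tag="div", css_class="article"):
    parser = ContainerText(tag, css_class)
    parser.feed(markup)
    parser.close()
    # Collapse all runs of whitespace into single spaces.
    return " ".join("".join(parser.parts).split())
```

Counting only elements with the same tag name keeps the nesting logic simple, since unclosed tags such as `<br>` or `<img>` never affect the depth. Text from adjacent elements is joined exactly as it appears in the source, so blocks with no whitespace between them will run together.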
     
  9. Asif WILSON Khan

    Asif WILSON Khan Executive VIP Jr. VIP

    Joined:
    Nov 10, 2012
    Messages:
    11,448
    Likes Received:
    32,366
    Gender:
    Male
    Occupation:
    Fun Lovin' Criminal
    Location:
    London
    Home Page:
  10. mareks

    mareks Regular Member

    Thank you very much! Yeah, I didn't say what I had tried, sorry :D

    Thank You very much! ;) I will test it. :)


    -------------------------------------------------------

    Right now the best result I got was with Notepad++ and CMD commands. But I will post if I get better results with any other method or software. ;)
     
  11. mareks

    mareks Regular Member

    Okay, update:

    Tried an HTML-to-TXT converter. Good idea, but it needs more custom visual code analysis, such as checking which tags to keep, so that the tags used around the articles are kept with the articles. Then it would work, but I didn't find any software that can do that.
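One possible reading of "checking which tags to keep" is a whitelist: keep a chosen set of tags and drop every other tag while leaving its text in place. A small sketch of that idea in Python, where the default whitelist (`h1`, `p`) and the names `KeepTags` and `keep_only` are assumptions made for illustration:

```python
from html import escape
from html.parser import HTMLParser

SKIP_TAGS = {"script", "style"}


class KeepTags(HTMLParser):
    """Re-emit only whitelisted tags (without attributes) plus all visible text."""

    def __init__(self, keep):
        super().__init__(convert_charrefs=True)
        self.keep = set(keep)
        self.skip_depth = 0
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in self.keep and not self.skip_depth:
            self.out.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            if self.skip_depth:
                self.skip_depth -= 1
        elif tag in self.keep and not self.skip_depth:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            # Re-escape the text so the output is still valid HTML.
            self.out.append(escape(data, quote=False))


def keep_only(markup, keep=("h1", "p")):
    parser = KeepTags(keep)
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)
```

With the article tags left in the output, the title and paragraphs can still be told apart afterwards, for example by a later find-and-replace pass.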

    Checked Purifyr and Boilerpipe; they don't solve this either. So Notepad++ and CMD commands are the best way to go, or making custom software... But there is no point in custom software right now, at least for me.

    If you know a good converter that can understand which tags to keep in the HTML, that would solve everything, but please only one that handles multiple files, I have a lot of pages... ;)

    Thank You all!
     
  12. Asif WILSON Khan

    Asif WILSON Khan Executive VIP Jr. VIP

  13. mareks

    mareks Regular Member

    Thank You buddy!

    sourceforge .net/projects/html2text/ - not working
    jafsoft .com/detagger/ - works, but it just strips tags, leaving unnecessary text.
    softinterface .com/Convert-Doc/Features/Convert-HTML-To-TEXT.htm - just converts.

    I think the only way to automate it and keep it simple is to make software... :) I will get a coder to make it when there is a bigger need; I will be fine with Notepad++ for now... ;)

    Thank You very much for help!
     
  14. therecipe

    therecipe Newbie

    Joined:
    Dec 18, 2014
    Messages:
    11
    Likes Received:
    3
    Take a look at this site:
    Code:
    templates.mailchimp.cXm/resources/html-to-text/
     
  15. mareks

    mareks Regular Member


    Thanks, but that can't do the job. :)
     
  16. Atomic76

    Atomic76 Registered Member

    Joined:
    May 24, 2014
    Messages:
    67
    Likes Received:
    37
    You might be able to do this with the SEOTools plugin for Excel, by Niels Bosma. The plugin is free, and can extract content from web pages rather easily into an Excel cell, with the functions XPathOnURL and HTMLFirst. I've not tried it on local files, but I would imagine it should work, since you would just be using the path to the file on your hard drive in place of a URL. If it works, the only other thing you would probably need is a small freeware tool that can give you a list of all the files in a folder, so you can paste that list into Excel.
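The file list itself does not need a separate tool; a few lines of Python can produce it. This is a sketch only: the function name `list_html_files` is made up, and whether the Excel plugin accepts plain local paths is untested here, as the post itself notes.

```python
from pathlib import Path


def list_html_files(folder):
    """Return absolute paths of every .html/.htm file directly inside folder, sorted."""
    pages = sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in {".html", ".htm"}
    )
    return [str(p.resolve()) for p in pages]


if __name__ == "__main__":
    # Prints one path per line, ready to paste into a spreadsheet column.
    for path in list_html_files("."):
        print(path)
```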

    The other one I might look into is WebHarvy, which might work on local files as well. It's not free but there are pirated copies out there. In fact, there may be some copies shared on these forums if you look around.
     
  17. Ambitious12

    Ambitious12 Elite Member

    Joined:
    Jun 26, 2014
    Messages:
    3,097
    Likes Received:
    608
    Occupation:
    No Occupation
    Location:
    Among the Stars
    The best I can suggest is to use Purifyr :) As mentioned above, it will work best for you.
    Good Luck :)
     
  18. mareks

    mareks Regular Member


    Thanks, WebHarvy seems able to do the job, but not for files on a computer. Notepad++ seems the best; I have already converted both sites into .txt article files. :)

    Thanks, but there is only an API and the site is down. I have already done the work... :)
     
  19. spiritfly

    spiritfly Regular Member

    Joined:
    Apr 30, 2011
    Messages:
    265
    Likes Received:
    116
    You can try this plugin for wordpress: https://wordpress.org/plugins/import-html-pages/

    I've used it a few times before with success. If your HTML files are local, maybe you could install WAMP and run WordPress on localhost so you can do this. It looks complicated, but it works well if you set the plugin up right.
     
  20. coolndre

    coolndre Newbie

    Joined:
    Jan 8, 2015
    Messages:
    16
    Likes Received:
    0
    Yeah, I'm looking for software like that too.