Extract text from a PDF file

Discussion in 'Black Hat SEO Tools' started by Viltedali, Apr 7, 2013.

  1. Viltedali

    Viltedali Regular Member

    Joined:
    Feb 10, 2008
    Messages:
    305
    Likes Received:
    32
    Location:
    Midwest-US
    I am wondering what would be the best way to extract text from a PDF file.

    The PDF files are forms with fields where someone enters text (typed on a computer, not handwritten), and I want to extract the text that they enter into the fields.

    Preferably something that would run on a Linux server, if possible.

    I looked around and found iTextSharp. Has anyone used it, and would it do the job?

    If so, I guess you can not install that on a Linux server?

    Thanks in advance for any replies
     
  2. rogerebert

    rogerebert Registered Member

    Joined:
    Feb 11, 2010
    Messages:
    78
    Likes Received:
    14
    I've used the desktop version, but they make server-based apps as well.

    Able2Extract Server
    http://www.investintech.com/purchase/server/
     
  3. Viltedali

    Viltedali Regular Member

    Joined:
    Feb 10, 2008
    Messages:
    305
    Likes Received:
    32
    Location:
    Midwest-US
    Wow, it looks like the server version is $5,000, way out of my range.
     
  4. arronlee

    arronlee Newbie

    Joined:
    May 2, 2013
    Messages:
    1
    Likes Received:
    0
    More precisely, if your PDF is an all-graphics file (you can tell because it is impossible to highlight any text), you will need to read it into optical character recognition (OCR) software.

    The results, of course, depend on your OCR software and the settings you apply before recognition.

    In any case, the procedure is likely to involve a lot of work and only pays off if the text contains lots of repetitions and you can use CAT software afterwards. Otherwise, just use a printout and type the translation into Word.
     
  5. blackhat777

    blackhat777 Elite Member

    Joined:
    Jun 25, 2011
    Messages:
    1,779
    Likes Received:
    653
    Send the file to me, I will see if I can do something.
     
  6. Plousrt

    Plousrt Newbie

    Joined:
    May 2, 2013
    Messages:
    15
    Likes Received:
    4
    Try Able2Extract.
     
  7. JustUs

    JustUs Power Member

    Joined:
    May 6, 2012
    Messages:
    609
    Likes Received:
    451
    The tool you want is Adobe Acrobat, though Acrobat does not run on Linux. I have also read that PDF Studio Pro may do what you want, and it runs on Linux, Mac, and Windows. PDF Studio does not have OCR.
     
  8. SwiffJustus

    SwiffJustus Junior Member

    Joined:
    Feb 22, 2013
    Messages:
    136
    Likes Received:
    97
    Occupation:
    $$$$$-MAKER
    Location:
    THE LAB
    At the risk of looking foolish, couldn't you just use something like this?
    Again, I'm probably waaayy off, lol.

    This is an online tool.
    hxxp://mmm(dot)pdfonline(dot)com/pdf-to-word-converter/default-b.aspx?utm_expid=127285-38

    You have to make the obvious adjustments, as I can't post links.
     
    Last edited: May 3, 2013
  9. mogaz

    mogaz Jr. VIP Jr. VIP

    Joined:
    Apr 23, 2013
    Messages:
    968
    Likes Received:
    56
    Can't you just edit the PDF file and take the text? That might be much easier.
     
  10. SwiffJustus

    SwiffJustus Junior Member

    Joined:
    Feb 22, 2013
    Messages:
    136
    Likes Received:
    97
    Occupation:
    $$$$$-MAKER
    Location:
    THE LAB
    You can also Google "PDFEdit"; it's an open source editor that runs on Linux.
    Just trying to throw a few options out there, is all.
     
    Last edited: May 3, 2013
  11. Sjergsen

    Sjergsen Newbie

    Joined:
    Nov 13, 2013
    Messages:
    1
    Likes Received:
    0
    Hey man, I am currently creating a PDF to text converter myself, which should look something like when it's done.
    I'm still looking for something special to implement, maybe a keyword finder.
    Anyway, if you have suggestions, I'm open to them :D
     
  12. ButcherBoy

    ButcherBoy Regular Member

    Joined:
    Apr 3, 2009
    Messages:
    390
    Likes Received:
    79
    Location:
    Planet E.
    If the PDF is not a scanned image, you could simply use any PDF-to-DOC converter.

    As for a Linux version, I had great success with the pdftotext tool.
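
A minimal sketch of how the pdftotext command-line tool (shipped with Poppler and Xpdf) could be called from a script on a Linux server. The function names and the file name `form.pdf` are illustrative assumptions, not anything from this thread. Whether text typed into form fields appears in the output depends on how the PDF was saved, so it is worth checking against a sample form first.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path, keep_layout=True):
    """Return the argument list for pdftotext (hypothetical helper).

    A trailing "-" tells pdftotext to write the text to stdout
    instead of creating a .txt file next to the PDF.
    """
    command = ["pdftotext"]
    if keep_layout:
        # -layout keeps the physical layout, which helps with forms
        command.append("-layout")
    command += ["-enc", "UTF-8", str(pdf_path), "-"]
    return command


def extract_text(pdf_path, keep_layout=True):
    """Run pdftotext and return the extracted text as a string."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    result = subprocess.run(
        build_pdftotext_command(pdf_path, keep_layout),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Calling `extract_text("form.pdf")` would return the text of the whole document; on Debian-based systems the tool usually comes from the `poppler-utils` package.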
     
    Last edited: Nov 13, 2013
  13. mickyfu

    mickyfu Jr. VIP Jr. VIP Premium Member

    Joined:
    Dec 14, 2011
    Messages:
    5,278
    Likes Received:
    15,454
    Location:
    Jennifers Office.
    Nitro PDF is one of the best for editing and stripping PDFs. You can find it for free.
     
  14. RedMango

    RedMango Power Member

    Joined:
    Jul 15, 2010
    Messages:
    518
    Likes Received:
    201
    Location:
    UK
    http://www.cometdocs.com/
     
  15. ija61

    ija61 Senior Member

    Joined:
    Mar 2, 2011
    Messages:
    960
    Likes Received:
    634
    Gender:
    Male
    Occupation:
    The first SEO economist:)
    Location:
    Romania
    Hi.

    Before I start making suggestions, I would like to ask you for more details.

    Do you need to run this on autopilot, or do you just need a few files converted?
     
  16. mortenb

    mortenb Newbie

    Joined:
    May 21, 2009
    Messages:
    9
    Likes Received:
    0
    Apache Tika can extract text from PDFs and loads of other file formats, and it is open source. It is written in Java, so it can run on pretty much any platform.
    I can't post links yet, but do a Google search for Apache Tika and it should show up as the first result.
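
One way to use Tika from a script is through its standalone server, which accepts a document over HTTP and returns the extracted text. The sketch below assumes a Tika server is already running locally on its default port 9998; the helper names and the URL constant are illustrative assumptions, and only the Python standard library is used.

```python
import urllib.request

# Assumption: a tika-server instance is listening on the default port.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(pdf_bytes, url=TIKA_URL):
    """Build a PUT request that asks Tika for plain text (hypothetical helper)."""
    return urllib.request.Request(
        url,
        data=pdf_bytes,
        method="PUT",
        headers={
            "Content-Type": "application/pdf",
            "Accept": "text/plain",  # request plain text rather than XHTML
        },
    )


def extract_text(pdf_path, url=TIKA_URL):
    """Send a PDF file to the Tika server and return the extracted text."""
    with open(pdf_path, "rb") as handle:
        request = build_tika_request(handle.read(), url)
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.read().decode("utf-8", errors="replace")
```

As with any extractor, it is worth testing on one filled-in form to confirm that the field values, and not just the printed labels, come through.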