PDFs to TXTs in Ubuntu


    Obtaining pdftotext

    • Install the package that provides the "pdftotext" command from the Ubuntu repositories with the command:

      sudo apt-get install poppler-utils

      Ensure that the package installs correctly before attempting to use it.
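
      One way to confirm the installation is to check that the command is on your PATH. The following is a minimal sketch; "have_cmd" is a hypothetical helper name chosen for illustration, not part of poppler-utils.

```shell
# have_cmd (hypothetical helper): succeeds if the named command is on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

if have_cmd pdftotext; then
    # -v asks pdftotext to print its version information.
    pdftotext -v
else
    echo "pdftotext not found; install poppler-utils first" >&2
fi
```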

    pdftotext Man Page

    • Learn how the pdftotext command works and familiarize yourself with its command-line options. To read the man page, type "man pdftotext" at the shell prompt and press "Enter". Each option consists of a dash followed by one or more letters, such as "-l", and each performs a different function.

      The basic form of the command is "pdftotext <pdffile> <textfile>" (without quotes), where <pdffile> is the name of the PDF file to extract text from, such as "report.pdf", and <textfile> is the name of the text output file, such as "report.txt". You can choose any name for the output file.
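
      To illustrate how options combine with the basic form, here is a small sketch that extracts only a range of pages. The options "-f" (first page), "-l" (last page) and "-layout" (keep the original physical layout) are described in the man page; the wrapper name "extract_pages" is hypothetical.

```shell
# extract_pages (hypothetical wrapper, not part of poppler-utils).
# Usage: extract_pages FIRST LAST PDF TXT
extract_pages() {
    # -f and -l select the first and last page to convert;
    # -layout tries to preserve the physical layout of the text.
    pdftotext -f "$1" -l "$2" -layout "$3" "$4"
}
```

      For example, "extract_pages 1 3 report.pdf report.txt" would convert pages 1 to 3 of "report.pdf".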

    Batch PDF Conversion

    • Test the command on a few individual PDF files first. If the output looks correct, you can automate the conversion of many PDF files with a shell script. A typical script is shown below:

      for i in *.pdf
      do
        pdftotext "$i" "$i.txt"
      done

      This script converts every PDF file in the current directory to a text file named after the original, so "report.pdf" becomes "report.pdf.txt".
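
      If you would rather end up with "report.txt" than "report.pdf.txt", a variant along the following lines may help. It is only a sketch: "txt_name" and "convert_all" are hypothetical helper names, and the code assumes the files end in lowercase ".pdf".

```shell
# txt_name (hypothetical helper): "report.pdf" -> "report.txt".
txt_name() {
    # ${1%.pdf} strips a trailing ".pdf" from the first argument.
    printf '%s\n' "${1%.pdf}.txt"
}

# convert_all (hypothetical helper): convert every *.pdf in a directory.
# Usage: convert_all [DIRECTORY]   (defaults to the current directory)
convert_all() {
    dir=${1:-.}
    for pdf in "$dir"/*.pdf; do
        # If no file matches, the pattern is left as-is; skip it.
        [ -e "$pdf" ] || continue
        # Quoting keeps file names with spaces in one piece.
        pdftotext "$pdf" "$(txt_name "$pdf")"
    done
}
```

      Calling "convert_all" with no argument processes the current directory; "convert_all ~/Documents" processes that directory instead.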

    Protected PDF Files

    • Some PDFs are protected, either with a password or with settings that prevent text from being exported. This protection is usually intended to safeguard copyright, so consider whether converting such a file is legally permissible. If you have the password for a PDF file, you can supply it through the command-line options of "pdftotext".
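
      As a rough sketch of passing a password, pdftotext accepts "-upw" for the user password and "-opw" for the owner password. The wrapper name "extract_protected" below is hypothetical. Keep in mind that a password given on the command line may be visible to other users of the machine through the process list.

```shell
# extract_protected (hypothetical wrapper, not part of poppler-utils).
# Usage: extract_protected PASSWORD PDF TXT
extract_protected() {
    # -upw supplies the user password; use -opw for the owner password.
    pdftotext -upw "$1" "$2" "$3"
}
```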
