PDFs to TXTs in Ubuntu
- Install the "poppler-utils" package, which provides the "pdftotext" command, from the Ubuntu repositories with the command:
sudo apt-get install poppler-utils
Ensure that the package installs correctly before attempting to use it.
- Learn how the pdftotext command works and familiarize yourself with its command line options. To read the man page, type "man pdftotext" at the shell prompt and press "Enter". Each option is a short name prefixed by a dash, such as "-l", and each one provides a different function.
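As an illustration of how the options combine, the sketch below wraps three options from the man page in a small helper: "-f" (first page to convert), "-l" (last page to convert) and "-layout" (keep the original physical layout where possible). The function name extract_pages and the file names are hypothetical, chosen only for this example.

```shell
# extract_pages is a hypothetical helper, not part of poppler-utils.
# Arguments: $1 = first page, $2 = last page, $3 = input PDF, $4 = output text file
extract_pages() {
    # -f and -l limit the page range; -layout preserves the page layout
    pdftotext -f "$1" -l "$2" -layout "$3" "$4"
}

# Example call (assumes report.pdf exists in the current directory):
# extract_pages 2 5 report.pdf pages.txt
```

The arguments are quoted so that file names containing spaces reach pdftotext intact.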
The standard form of the command is "pdftotext <pdffile> <textfile>" (without quotes), where <pdffile> is the name of the PDF file to extract, such as "report.pdf", and <textfile> is the name of the text output file, such as "report.txt". You can choose any name for the output file.
- Test the command by trying it on a few PDF files individually. If the results look right, you may want to run it over a number of PDF files from a shell script to automate the process. An example of a typical script is shown below:
for i in *.pdf
do
    # Quote "$i" so that a file name containing spaces is passed as one argument
    pdftotext "$i" "$i.txt"
done
This script converts every PDF file in the current directory and writes the text to a file named after the original, so "report.pdf" becomes "report.pdf.txt".
- Some PDFs are protected, either with a password or with settings that prevent text from being exported from the document. This is usually an attempt to protect copyright, and if that is the case you should reconsider the conversion from a legal perspective. If you have the password for a PDF file, you can pass it to "pdftotext" with the "-upw" (user password) or "-opw" (owner password) option.
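A possible variation on the batch script is sketched below. It names the output "report.txt" rather than "report.pdf.txt" and, if a password is supplied, passes it to pdftotext with "-upw". The function names txt_name and convert_all are hypothetical, and the sketch assumes that every PDF in the directory shares the same user password.

```shell
# txt_name and convert_all are hypothetical helper names used for illustration.

# Print the output name for a PDF: "report.pdf" -> "report.txt"
txt_name() {
    printf '%s\n' "${1%.pdf}.txt"
}

# Convert every PDF in the current directory.
# Optional $1: user password, passed to pdftotext with -upw.
convert_all() {
    for pdf in *.pdf
    do
        # If no PDF exists, the pattern stays as the literal "*.pdf"; skip it
        [ -e "$pdf" ] || continue
        if [ -n "${1:-}" ]
        then
            pdftotext -upw "$1" "$pdf" "$(txt_name "$pdf")"
        else
            pdftotext "$pdf" "$(txt_name "$pdf")"
        fi
    done
}
```

Run "convert_all" for unprotected files, or "convert_all <password>" for protected ones. The "${1%.pdf}" expansion removes a trailing ".pdf" from the name before ".txt" is appended.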