[Python-au] PDF's and pycon-au

chris_g at netspace.net.au chris_g at netspace.net.au
Mon Jul 25 03:52:20 UTC 2011


Hi Azerith,

I had a very similar problem recently.
I converted the PDF to HTML with pdftohtml* and then did all the parsing in
Python with the help of BeautifulSoup.
Admittedly this isn't a pure-Python solution, but pdftohtml is easy enough to
call from Python.

* http://pdftohtml.sourceforge.net/ - it is prepackaged for a lot of Linux
distributions, and Windows versions are also available.
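
A rough sketch of that approach, in case it helps. The command-line flags
(-i, -noframes, -stdout) are assumptions about your pdftohtml build, so check
`pdftohtml -h` first. The function names are made up for illustration, and the
standard-library fallback is only there so the snippet still runs without
BeautifulSoup installed.

```python
import subprocess

try:
    from bs4 import BeautifulSoup  # third-party package: beautifulsoup4
except ImportError:
    # Fall back to the standard library parser if BeautifulSoup is missing.
    BeautifulSoup = None
    from html.parser import HTMLParser


def pdf_to_html(pdf_path):
    """Run the external pdftohtml tool and return its HTML output as text."""
    # Assumed flags: -i skips images, -noframes produces a single HTML
    # document, -stdout writes the result to standard output.
    result = subprocess.run(
        ["pdftohtml", "-i", "-noframes", "-stdout", pdf_path],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def html_to_lines(html):
    """Return the non-empty text fragments of an HTML document, in order."""
    if BeautifulSoup is not None:
        soup = BeautifulSoup(html, "html.parser")
        return list(soup.stripped_strings)

    class _TextCollector(HTMLParser):
        def __init__(self):
            super().__init__()
            self.lines = []

        def handle_data(self, data):
            text = data.strip()
            if text:
                self.lines.append(text)

    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return collector.lines
```

With a file on disk (the name here is hypothetical) this would be used as
html_to_lines(pdf_to_html("job_notes.pdf")), after which picking out the
sections you care about is ordinary list and string handling.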

Cheers,
Chris Guest

Quoting Azerith <azerith at gmail.com>:

> Hi all, is there anyone on here who has experience dealing with PDFs in
> Python? Specifically, extracting text from rather badly formatted PDFs.
> 
> If so, yay, could I pick your brains at some point?
> 
> Also, if you are going to pycon-au, could we grab a coffee?
> 
> The long and short is that I am trying to automate the extraction of part
> of a PDF doc based on the type of job note. After that I want to spell check
> it, and one day I'd like to use NLTK to summarise the notes.
> 
> Ambitious much? I'd be grateful for any suggestions.
> 




