[Python-au] PDF's and pycon-au
chris_g at netspace.net.au
Mon Jul 25 03:52:20 UTC 2011
Hi Azerith,
Recently I had a very similar problem.
I turned the PDF into HTML using pdftohtml* and then did all the parsing in
Python with the help of BeautifulSoup.
Admittedly this isn't a pure Python solution, but pdftohtml is easy enough to
call from Python.
* http://pdftohtml.sourceforge.net/ - this is prepackaged for a lot of Linux
distributions, and Windows versions are also available.
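
A minimal sketch of that workflow, assuming pdftohtml is on the PATH and
BeautifulSoup 4 (the bs4 package) is installed. The function names are
hypothetical, and the -i, -noframes and -stdout flags are assumed from common
pdftohtml builds, so check `pdftohtml -h` on your system before relying on them.

```python
import subprocess


def build_pdftohtml_command(pdf_path):
    """Build the pdftohtml command line (flags are assumptions, see above)."""
    # -i: skip images, -noframes: one HTML document, -stdout: write HTML to stdout
    return ["pdftohtml", "-i", "-noframes", "-stdout", pdf_path]


def pdf_to_html(pdf_path):
    """Run pdftohtml on pdf_path and return the generated HTML as text."""
    result = subprocess.run(
        build_pdftohtml_command(pdf_path),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftohtml exits with an error
    )
    # Badly formatted PDFs can yield odd bytes, so replace what cannot be decoded.
    return result.stdout.decode("utf-8", errors="replace")


def extract_text_lines(html):
    """Return the non-empty text fragments of an HTML document, in order."""
    # Imported here so the helpers above still work without bs4 installed.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    return list(soup.stripped_strings)
```

Calling extract_text_lines(pdf_to_html("notes.pdf")) would then give a list of
text fragments to filter by job note type; "notes.pdf" is only a placeholder
file name.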
Cheers,
Chris Guest
Quoting Azerith <azerith at gmail.com>:
> Hi all, is there anyone on here who has experience dealing with PDFs in
> Python? Specifically, extracting text from rather badly formatted PDFs.
>
> If so, yay, could I pick your brains at some point?
>
> Also if you are going to pycon-au could we grab a coffee?
>
> The long and short is I am trying to automate the extraction of part of a
> pdf doc based on the type of job note. After that I want to spell check it
> and one day I'd like to use NLTK to summarise the notes.
>
> Ambitious much? Any suggestions I'd be grateful.
>