[triangle-zpug] Regular expressions hanging or just taking a
philip at semanchuk.com
Fri Aug 25 23:54:19 CEST 2006
On Aug 25, 2006, at 10:03 AM, Edmund Moseley wrote:
> Hi all,
> I am writing a method to take a WordPerfect file and use RE to parse
> out the data. I have also got a very simple unittest which tries it
> out on a few different files.
From the Thinking-Outside-The-Box Dept: OpenOffice can read
WordPerfect files and save them to just about anything you like,
including the XML-based OOo format.
> The regex is pretty long and basically looks for a field name, then
> captures everything after it, until the next field name. Sample:
> pattern = re.compile(r"""
>     NAME:            # look for name label
>     (?P<name>.*?)    # capture name
>     AGE:             # look for age label
>     (?P<age>.*?)     # capture age
>     RACE:            # look for race label
>     (?P<race>.*?)    # capture race
>     """, re.VERBOSE | re.DOTALL)
> The actual pattern is much longer and as I develop it, if I make
> slight mistakes it seems to cause it to hang.
When you say it causes "it" to hang, is "it" the compilation (re.compile)
or the execution (the match against your file)?
> However, ctrl-C or ctrl-D won't break out of it. A few web searches
> suggested that it is not hung, but instead just taking a really long
> time. I've tried waiting for it over lunch, but nothing happens. I
> must quit the terminal and start again.
> So, I was wondering: Would it be advisable for me to add a time limit
> to my test? If so, how?
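On the time-limit question itself: my understanding is that CPython
only reacts to ctrl-C between bytecode instructions, and a regex match
is one long call into C code, so the interrupt doesn't get a look in
until the match returns. If that's right, a timer inside the same
process won't help either, and the dependable way to enforce a limit is
to run the match in a child process and kill it. A minimal sketch,
assuming a modern Python 3 (the helper name search_with_timeout and the
JSON hand-off are my own invention, not anything from your code):

```python
import json
import subprocess
import sys

# Script run in the child process. It reads {"pattern": ..., "text": ...}
# from stdin and writes the match's groupdict (or null) to stdout as JSON.
# The flags mirror the sample pattern: re.VERBOSE | re.DOTALL.
_CHILD = r"""
import json, re, sys
job = json.load(sys.stdin)
match = re.search(job["pattern"], job["text"], re.VERBOSE | re.DOTALL)
json.dump(match.groupdict() if match else None, sys.stdout)
"""


def search_with_timeout(pattern, text, seconds):
    """Return the match's groupdict, or None if there is no match.

    Raises TimeoutError if the search takes longer than `seconds`.
    (Hypothetical helper, for illustration only.)
    """
    job = json.dumps({"pattern": pattern, "text": text})
    try:
        done = subprocess.run(
            [sys.executable, "-c", _CHILD],
            input=job,
            capture_output=True,
            text=True,
            timeout=seconds,  # subprocess.run kills the child on timeout
            check=True,       # a crash in the child raises CalledProcessError
        )
    except subprocess.TimeoutExpired:
        raise TimeoutError("regex search exceeded %s seconds" % seconds) from None
    return json.loads(done.stdout)
```

In a unittest you would call this inside the test method and let the
TimeoutError fail the test. That only stops the test from wedging your
terminal, though; it doesn't tell you why the match is slow.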
I'd start by limiting the input. Whack your input file down to 1% of
its original size and see if you get the same behavior. If not, then
maybe start doubling it: 2% of the original, 4%, 8%, etc. and see if
your "hang time" grows along with the file size. If the RE executes
speedily up to 4% and then zooms to infinity at 8%, then perhaps
there's a byte pattern in the 4-8% range that's giving you fits.
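The doubling experiment could be scripted roughly like this. It's only
a sketch: time_prefixes is a name I made up, and the sample pattern and
text in the demo are stand-ins for your real ones.

```python
import re
import time


def time_prefixes(pattern, text, percents=(1, 2, 4, 8, 16)):
    """Time pattern.search() on growing prefixes of `text`.

    Returns a list of (prefix_length, seconds) tuples, one per percentage.
    """
    results = []
    for pct in percents:
        # Integer arithmetic keeps the prefix lengths exact.
        chunk = text[: max(1, len(text) * pct // 100)]
        start = time.perf_counter()
        pattern.search(chunk)
        results.append((len(chunk), time.perf_counter() - start))
    return results


if __name__ == "__main__":
    # Stand-in data; substitute the compiled pattern and the file contents.
    sample_pattern = re.compile(r"NAME:(?P<name>.*?)AGE:", re.DOTALL)
    sample_text = "NAME: x AGE: 1 " * 1000
    for n_chars, seconds in time_prefixes(sample_pattern, sample_text):
        print("%8d chars  %.6f s" % (n_chars, seconds))
```

If one of the prefixes does trigger the slow case, this loop will sit
there just like your test does, but at least the last line printed
tells you which size was still fine.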
Or work from the other end -- dramatically simplify your regex, and
then add to it bit by bit and watch the performance as you go.
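Here is one way the bit-by-bit approach might look, using the three
labels from your sample. My guess at why small mistakes hurt so much:
with a chain of lazy .*? groups, a label that never appears in the
input (a typo, say) forces the engine to try every way of placing the
earlier labels before it gives up, and that gets expensive fast as
groups are added. The helper names build_pattern and time_incrementally
are mine, purely for illustration.

```python
import re
import time

# Labels and group names copied from the sample pattern in this thread.
FIELDS = [("NAME:", "name"), ("AGE:", "age"), ("RACE:", "race")]


def build_pattern(fields):
    """Build label + lazy-capture pairs, in the style of the sample pattern."""
    parts = ["%s(?P<%s>.*?)" % (re.escape(label), group)
             for label, group in fields]
    return re.compile("".join(parts), re.DOTALL)


def time_incrementally(fields, text):
    """Search with the first 1, 2, ... fields and time each attempt.

    Returns a list of (field_count, matched, seconds) tuples.
    """
    results = []
    for count in range(1, len(fields) + 1):
        pattern = build_pattern(fields[:count])
        start = time.perf_counter()
        matched = pattern.search(text) is not None
        results.append((count, matched, time.perf_counter() - start))
    return results


if __name__ == "__main__":
    sample_text = "NAME: Ann AGE: 30 RACE: n/a"  # stand-in for the real input
    for count, matched, seconds in time_incrementally(FIELDS, sample_text):
        print("%d field(s): matched=%s in %.6f s" % (count, matched, seconds))
```

One thing this makes visible: in the sample as posted, the last group,
(?P<race>.*?), has nothing after it, so being lazy it matches the empty
string. Presumably the real pattern has another label or an
end-of-input anchor there.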
> Am I doing something rather wrong with my regex?
I am a Grade A regex novice, so my advice is guaranteed only to be
worth what you've paid for it. RE syntax is a programming language all
its own, and computer programs (especially ones written in cryptic
syntax like RE syntax) can do very unexpected things. There's certainly
something "wrong" in that you are not satisfied with your results, but
it is impossible to tell at this point if the problem is a syntax
error, a logical flaw or simply unreasonable expectations (perhaps you
forgot to mention that your input file is 1 terabyte =) ).