[triangle-zpug] Regular expressions hanging or just taking a looonng time

Philip Semanchuk philip at semanchuk.com
Fri Aug 25 23:54:19 CEST 2006

On Aug 25, 2006, at 10:03 AM, Edmund Moseley wrote:

> Hi all,
> I am writing a method to take a WordPerfect file and use RE to parse 
> out the data.  I have also got a very simple unittest which tries it 
> out on a few different files.

Hi Edmund,
From the Thinking-Outside-The-Box Dept: OpenOffice can read 
WordPerfect files and save them to just about anything you like, 
including the XML-based OOo format.

> The regex is pretty long and basically looks for a field name, then 
> captures everything after it, until the next field name. Sample:
> pattern = re.compile(r"""
>      NAME:              # look for name label
>      (?P<name>.*?)      # capture name
>      AGE:               # look for age label
>      (?P<age>.*?)       # capture age
>      RACE:              # look for race label
>      (?P<race>.*?)      # capture race
>      .
>      .
>      .
>      """, re.VERBOSE | re.DOTALL)
> The actual pattern is much longer and as I develop it, if I make 
> slight mistakes it seems to cause it to hang.

When you say cause "it" to hang, do you mean compilation or execution?
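
One way to tell the two apart is to time them separately. What follows is only a minimal sketch: the two-field pattern, the trailing `$`, the sample text and the helper name time_compile_and_search are all made up for illustration, not taken from your real code.

```python
import re
import time

# Hypothetical two-field version of the pattern; the real one is much longer.
# The trailing $ is an assumption so that the last lazy group has something
# to stop at.
PATTERN_SOURCE = r"""
    NAME:              # look for name label
    (?P<name>.*?)      # capture name
    AGE:               # look for age label
    (?P<age>.*?)       # capture age
    $                  # stop at end of input
"""


def time_compile_and_search(source, text):
    """Return (compile_seconds, search_seconds, match) for one pattern and input."""
    start = time.perf_counter()
    pattern = re.compile(source, re.VERBOSE | re.DOTALL)
    compile_seconds = time.perf_counter() - start

    start = time.perf_counter()
    match = pattern.search(text)
    search_seconds = time.perf_counter() - start
    return compile_seconds, search_seconds, match


sample = "NAME: Pat Example AGE: 42"   # made-up input
compiled_in, searched_in, match = time_compile_and_search(PATTERN_SOURCE, sample)
print("compile: %.6f s  search: %.6f s" % (compiled_in, searched_in))
```

If the real pattern never gets past the re.compile() line, the problem is in compilation; if it stalls in search(), it is in execution.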

>   However, ctrl-C or ctrl-D won't break out of it. A few web searches 
> suggested that it is not hung, but instead just taking a really long 
> time. I've tried waiting for it over lunch, but nothing happens. I 
> must quit the terminal and start again.
> So, I was wondering: Would it be adviseable for me to add a time limit 
> to my test? If so, how?

I'd start by limiting the input. Whack your input file down to 1% of 
its original size and see if you get the same behavior. If not, then 
maybe start doubling it: 2% of the original, 4%, 8%, etc. and see if 
your "hang time" grows along with the file size. If the RE executes 
speedily up to 4% and then zooms to infinity at 8%, then perhaps 
there's a byte pattern in the 4-8% range that's giving you fits.
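
The doubling idea could be sketched roughly like this. The pattern and the text here are made-up stand-ins for the real ones, and time_on_prefixes is a hypothetical helper name. It yields each timing as soon as it is measured, so if one prefix size stalls you can at least see the last size that finished.

```python
import re
import time


def time_on_prefixes(pattern, text, percents=(1, 2, 4, 8)):
    """Search growing prefixes of text; yield (chars, seconds) one at a time."""
    for percent in percents:
        size = max(1, len(text) * percent // 100)
        start = time.perf_counter()
        pattern.search(text[:size])
        yield size, time.perf_counter() - start


# Made-up stand-ins for the real pattern and the real file contents.
pattern = re.compile(r"NAME:(?P<name>.*?)AGE:(?P<age>.*?)RACE:", re.DOTALL)
text = "NAME: Pat Example AGE: 42 RACE: n/a\n" * 1000

for size, seconds in time_on_prefixes(pattern, text):
    print("%8d chars  %.6f s" % (size, seconds))
```

Timings that roughly double with the size are nothing to worry about; a sudden jump between two sizes points at the text in between.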

Or work from the other end -- dramatically simplify your regex, and 
then add to it bit by bit and watch the performance as you go.
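
Growing the regex a field at a time might look something like the sketch below. It assumes every field has the same "LABEL: value" shape as in the sample; the `\Z` anchor, the sample text and the helper names build_pattern and time_growing_patterns are my own inventions for illustration.

```python
import re
import time

# Field labels in the order they appear in the document. Only these three
# come from the sample pattern; the real list is longer.
LABELS = ["NAME", "AGE", "RACE"]


def build_pattern(labels):
    """Join 'LABEL:(?P<label>.*?)' for each label, anchored at end of input."""
    parts = ["%s:(?P<%s>.*?)" % (re.escape(label), label.lower())
             for label in labels]
    # \Z is an assumption: it lets the last lazy group run to the end of the text.
    return re.compile("".join(parts) + r"\Z", re.DOTALL)


def time_growing_patterns(labels, text):
    """Yield (field_count, seconds, match) as fields are added one at a time."""
    for count in range(1, len(labels) + 1):
        pattern = build_pattern(labels[:count])
        start = time.perf_counter()
        match = pattern.search(text)
        yield count, time.perf_counter() - start, match


sample = "NAME: Pat Example AGE: 42 RACE: n/a"   # made-up input
for count, seconds, match in time_growing_patterns(LABELS, sample):
    print("%d field(s): %.6f s, matched=%s" % (count, seconds, match is not None))
```

The field whose addition makes the time jump (or the match disappear) is the one to look at first.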

> Am I doing something rather wrong with my regex?

I am a Grade A regex novice, so my advice is guaranteed only to be 
worth what you've paid for it. RE syntax is a programming language all 
its own, and computer programs (especially ones written in cryptic 
syntax like RE syntax) can do very unexpected things. There's certainly 
something "wrong" in that you are not satisfied with your results, but 
it is impossible to tell at this point if the problem is a syntax 
error, a logical flaw or simply unreasonable expectations (perhaps you 
forgot to mention that your input file is 1 terabyte =)  ).
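
On the time limit question, one possible approach is to run the search in a separate Python process and kill it if it overruns, since Ctrl-C reportedly doesn't get through. This is a sketch only, not something tried against your files: search_with_timeout is a hypothetical helper, the flags are copied from your sample, and handing the pattern and text to the child as JSON is just one convenient choice.

```python
import json
import subprocess
import sys

# Code run by the child process: read the job from stdin, search, and
# write the named groups (or null for no match) to stdout as JSON.
_CHILD_CODE = r"""
import json, re, sys
job = json.load(sys.stdin)
match = re.search(job["pattern"], job["text"], re.VERBOSE | re.DOTALL)
json.dump(match.groupdict() if match else None, sys.stdout)
"""


def search_with_timeout(pattern, text, seconds):
    """Search in a child process; return the groupdict, or None for no match.

    Raises TimeoutError if the child has not finished within `seconds`.
    """
    job = json.dumps({"pattern": pattern, "text": text})
    try:
        done = subprocess.run(
            [sys.executable, "-c", _CHILD_CODE],
            input=job, capture_output=True, text=True,
            timeout=seconds, check=True,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run() has already killed the child at this point.
        raise TimeoutError("regex search exceeded %s seconds" % seconds)
    return json.loads(done.stdout)


# Made-up example: a small pattern that finishes well within the limit.
print(search_with_timeout(r"NAME:(?P<name>.*?)AGE:", "NAME: Pat AGE: 42", 30))
```

In a unittest, a TimeoutError would then show up as an ordinary test error instead of a frozen terminal. The cost is one extra interpreter start-up per search, so it suits a test suite better than a tight loop.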


