[triangle-zpug] Regular expressions hanging or just taking a looonng time

Edmund Moseley edmund at unc.edu
Mon Aug 28 15:49:23 CEST 2006


Thanks a lot for the advice, Philip and Adam.
I will try tinkering with these ideas.

Thanks again,

Edmund

Quoting Philip Semanchuk <philip at semanchuk.com>:

>
> On Aug 25, 2006, at 10:03 AM, Edmund Moseley wrote:
>
>> Hi all,
>>
>> I am writing a method to take a WordPerfect file and use regular
>> expressions to parse out the data.  I also have a very simple
>> unittest which tries it out on a few different files.
>
> Hi Edmund,
> From the Thinking-Outside-The-Box Dept: OpenOffice can read
> WordPerfect files and save them to just about anything you like,
> including the XML-based OOo format.
>
>> The regex is pretty long and basically looks for a field name, then 
>> captures everything after it, until the next field name. Sample:
>>
>> pattern = re.compile(r"""
>>      NAME:              # look for name label
>>      (?P<name>.*?)      # capture name
>>      AGE:               # look for age label
>>      (?P<age>.*?)       # capture age
>>      RACE:              # look for race label
>>      (?P<race>.*?)      # capture race
>>      .
>>      .
>>      .
>>      """, re.VERBOSE | re.DOTALL)
>>
>> The actual pattern is much longer and as I develop it, if I make 
>> slight mistakes it seems to cause it to hang.
>
> When you say cause "it" to hang, do you mean compilation or execution?
>
>
>>   However, ctrl-C or ctrl-D won't break out of it. A few web 
>> searches suggested that it is not hung, but instead just taking a 
>> really long time. I've tried waiting for it over lunch, but nothing 
>> happens. I must quit the terminal and start again.
>> So, I was wondering: Would it be advisable for me to add a time
>> limit to my test? If so, how?
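
On the "how" of a time limit: one option is to run the match in a child
process and kill it if it takes too long, since a match that is stuck
inside the regex engine may not respond to ctrl-C. A rough sketch follows,
assuming Python 3.7 or later. The helper name search_with_timeout and the
5 second default are made up for illustration, and it only handles
patterns that use named groups, like the sample above.

```python
import json
import subprocess
import sys

# Script run by the child interpreter: it reads the job as JSON on stdin
# and writes the named groups (or null if there was no match) to stdout.
_CHILD = """
import json, re, sys
job = json.load(sys.stdin)
m = re.search(job["pattern"], job["text"], job["flags"])
json.dump(m.groupdict() if m else None, sys.stdout)
"""

def search_with_timeout(pattern, text, flags=0, timeout=5.0):
    """Run re.search in a child process; raise TimeoutError if it is too slow.

    Hypothetical helper: returns the match's groupdict, or None for no match.
    """
    job = json.dumps({"pattern": pattern, "flags": int(flags), "text": text})
    try:
        done = subprocess.run(
            [sys.executable, "-c", _CHILD],
            input=job, capture_output=True, text=True,
            timeout=timeout, check=True,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run has already killed the child at this point.
        raise TimeoutError("regex did not finish within %s s" % timeout)
    return json.loads(done.stdout)
```

In a unittest, you could call this in place of pattern.search() and let
the TimeoutError fail the test instead of freezing the terminal.
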
>
> I'd start by limiting the input. Whack your input file down to 1% of 
> its original size and see if you get the same behavior. If not, then 
> maybe start doubling it: 2% of the original, 4%, 8%, etc. and see if 
> your "hang time" grows along with the filesize. If the RE executes 
> speedily up to 4% and then zooms to infinite at 8%, then perhaps 
> there's a byte pattern in the 4-8% range that's giving you fits.
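
The doubling idea above could be scripted roughly like this. It is only a
sketch: time_prefixes is a made-up name, and it assumes the whole file has
already been read into one string. If a given prefix triggers the slow
behaviour, the loop will simply stall on that prefix, which at least tells
you which range to look at.

```python
import re
import time

def time_prefixes(pattern, text, start_fraction=0.01):
    """Time pattern.search() on growing prefixes of text: 1%, 2%, 4%, ...

    Returns a list of (prefix_length, seconds) pairs, ending with the
    full text.
    """
    results = []
    fraction = start_fraction
    while True:
        fraction = min(fraction, 1.0)
        size = max(1, int(len(text) * fraction))
        began = time.perf_counter()
        pattern.search(text[:size])
        results.append((size, time.perf_counter() - began))
        if fraction >= 1.0:
            break
        fraction *= 2  # double the share of the file for the next round
    return results
```

Printing each pair as it is produced, rather than collecting them, would
show the last size that finished before a stall.
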
>
> Or work from the other end -- dramatically simplify your regex, and 
> then add to it bit by bit and watch the performance as you go.
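
One way to take "dramatically simplify" to its conclusion is to let a
small regex find only the field labels, then slice out the text between
them. A long chain of (?P<x>.*?) groups may end up retrying many
combinations of label positions when one label is missing or misspelled,
whereas this version scans the text once. This is a sketch, not a drop-in
replacement: the three labels come from the sample pattern, split_fields
is a made-up name, and it assumes each label appears literally, followed
by a colon.

```python
import re

# Labels taken from the sample pattern; the real list is longer.
LABELS = ["NAME", "AGE", "RACE"]

# \b keeps AGE: from matching inside a longer word such as LANGUAGE:
LABEL_RE = re.compile(r"\b(%s):" % "|".join(map(re.escape, LABELS)))

def split_fields(text):
    """Return {lowercased label: text between it and the next label}."""
    hits = list(LABEL_RE.finditer(text))
    fields = {}
    for here, nxt in zip(hits, hits[1:] + [None]):
        end = nxt.start() if nxt else len(text)
        fields[here.group(1).lower()] = text[here.end():end]
    return fields
```

A missing label then shows up as a missing key instead of a long wait. If
a label occurs more than once, the last occurrence wins in this sketch.
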
>
>
>> Am I doing something rather wrong with my regex?
>
> I am a Grade A regex novice, so my advice is guaranteed only to be 
> worth what you've paid for it. RE syntax is a programming language 
> all its own, and computer programs (especially ones written in 
> cryptic syntax like RE syntax) can do very unexpected things. There's 
> certainly something "wrong" in that you are not satisfied with your 
> results, but it is impossible to tell at this point if the problem is 
> a syntax error, a logical flaw or simply unreasonable expectations 
> (perhaps you forgot to mention that your input file is 1 terabyte =)  
> ).
>
> HTH
> Philip
>
>
> _______________________________________________
> triangle-zpug mailing list
> triangle-zpug at starship.python.net
> http://starship.python.net/mailman/listinfo/triangle-zpug
>


