buggy regexp  
Author Message
ilyabo





PostPosted: 2007-5-31 22:34:00 Top

java-programmer, buggy regexp Hello all!

The following line hangs in Java:

Pattern.matches("((<BR>)|([^<>]+))*", "aaaaaaaaaaaaaaaaaaaaaaaa
<BR><Bx>")

What's wrong with the regular expression?

Thanks in advance
Ilya

 
Robert Klemme





PostPosted: 2007-5-31 22:41:00 Top

java-programmer >> buggy regexp On 31.05.2007 16:33, email***@***.com wrote:
> Hello all!
>
> The following line hangs in Java:
>
> Pattern.matches("((<BR>)|([^<>]+))*", "aaaaaaaaaaaaaaaaaaaaaaaa
> <BR><Bx>")
>
> What's wrong with the regular expression?

You nest "+" and "*" which can lead to bad backtracking (which you seem
to experience). If you wait long enough you'll see the result.

If you just want to test for the presence of "<BR>" you can do this:

boolean match = str.indexOf("<BR>") != -1;

Or as regexp

Pattern p = Pattern.compile("<BR>");
boolean match = p.matcher(s).find();

Kind regards

robert
 
ilyabo





PostPosted: 2007-6-1 1:43:00 Top

java-programmer >> buggy regexp Hello Robert,

> You nest "+" and "*" which can lead to bad backtracking (which you seem
> to experience). If you wait long enough you'll see the result.

You are right it doesn't hang, it just works very slowly - it takes
about two minutes to finish this match. And the more "a"-letters
before the <BR>s are in the input string, the longer it takes.


> If you just want to test for the presence of "<BR>" you can do this:
>
> boolean match = str.indexOf("<BR>") != -1;
>
> Or as regexp
>
> Pattern p = Pattern.compile("<BR>");
> boolean match = p.matcher(s).find();

I want actually to check the validness of an HTML string where only
the cerain tags are allowed. The original expression was:
(?ui)(([^<>]+)|(<\\/?(p|br|em|strong|strike|i|b|ul|ol|li)\\s*\\/?\
\s*>))*
but I reduced it to make the problem clearer.


Interestingly, the same regular expression match works instantly in
Perl:
$_ = "aaaaaaaaaaaaaaaaaaaaaaaa <BR><Bx>";
print "yes" if (m/^((<BR>)|([^<>]+))*$/);

And if I move <Bx> to the beginning of the string or change it to <BR>
then it works instantly also in Java.

Ilya

 
 
ilyabo





PostPosted: 2007-6-1 2:15:00 Top

java-programmer >> buggy regexp And thanks, Robert, removing "+" helps alot, at least, for this input
string. But if I try a longer one, it still takes ages to match it.

Regards
Ilya


> > Pattern.matches("((<BR>)|([^<>]+))*", "aaaaaaaaaaaaaaaaaaaaaaaa
> > <BR><Bx>")
>
> > What's wrong with the regular expression?
>
> You nest "+" and "*" which can lead to bad backtracking (which you seem
> to experience). If you wait long enough you'll see the result.
>

 
 
Oliver Wong





PostPosted: 2007-6-1 5:25:00 Top

java-programmer >> buggy regexp
<email***@***.com> wrote in message
news:email***@***.com...
[snip regexp stuff]
>
> I want actually to check the validness of an HTML string where only
> the cerain tags are allowed. The original expression was:
> (?ui)(([^<>]+)|(<\\/?(p|br|em|strong|strike|i|b|ul|ol|li)\\s*\\/?\
> \s*>))*
> but I reduced it to make the problem clearer.

Note that it's impossible to check the validity of HTML using
"traditional" regular expressions, as HTML is not a regular language. Perl
regular expressions are actually more powerful than traditional regular
expressions; I'm not sure if Perl regular expressions are powerful enough
to recognize context-free languages or not. HTML is a context-free
language.

I think you'll have better luck using a pre-written HTML parser, or at
worst, writing your own full blown parser for checking your strings for
validity.

- Oliver


 
 
ilyabo





PostPosted: 2007-6-1 15:41:00 Top

java-programmer >> buggy regexp Hi Oliver,

in fact I don't want to really check the validity of HTML, but just
ensure that only allowed tags are used. So the logic of the regular
expression should be very simple: if there is something between <>
then it must be one of the predefined tags.

This is surely not impossible: the regular expression which I cited
works in Java too - it just takes too much time in Java, that's the
problem.

Regards
Ilya

> Note that it's impossible to check the validity of HTML using
> "traditional" regular expressions, as HTML is not a regular language. Perl
> regular expressions are actually more powerful than traditional regular
> expressions; I'm not sure if Perl regular expressions are powerful enough
> to recognize context-free languages or not. HTML is a context-free
> language.

 
 
Ingo Menger





PostPosted: 2007-6-1 16:39:00 Top

java-programmer >> buggy regexp On 31 Mai, 20:15, email***@***.com wrote:
> And thanks, Robert, removing "+" helps alot, at least, for this input
> string. But if I try a longer one, it still takes ages to match it.

This is because your RE is a bit silly.
You try to match the whole document, which is going to take long.
I'd do it like this:

p = Pattern.compile("<(\\w+)\\b[^>]*>"); // matches also <A
href="foo">
m = p.matcher(s);
correct = true;
while (correct && m.find()) {
correct = allowed(m.group(1));
}


 
 
ilyabo





PostPosted: 2007-6-1 19:06:00 Top

java-programmer >> buggy regexp Thanks a lot, that's a good hint!

Regards
Ilya

On Jun 1, 10:39 am, Ingo Menger <email***@***.com> wrote:
> p = Pattern.compile("<(\\w+)\\b[^>]*>"); // matches also <A
> href="foo">
> m = p.matcher(s);
> correct = true;
> while (correct && m.find()) {
> correct = allowed(m.group(1));
> }


 
 
Robert Klemme





PostPosted: 2007-6-1 19:35:00 Top

java-programmer >> buggy regexp On 01.06.2007 09:41, email***@***.com wrote:
> in fact I don't want to really check the validity of HTML, but just
> ensure that only allowed tags are used. So the logic of the regular
> expression should be very simple: if there is something between <>
> then it must be one of the predefined tags.

Actually that logic does not work because it won't handle <> in comments
and attributes properly. Using a full blown HTML parser (of which there
are plenty around) is certainly the safer choice.

> This is surely not impossible: the regular expression which I cited
> works in Java too - it just takes too much time in Java, that's the
> problem.

Often, changing approaches is more efficient. Why don't you do

// this regexp should be improved!
private static final Pattern TAG = Pattern.compile("<(\\w+)[^>]*>");
private static final Set valid = createValidTags();

static boolean isValid(CharSequence s) {
Matcher m = TAG.matcher(s);

while ( m.find() ) {
if ( !valid.contains(m.group(1).lowerCase()) ) {
return false;
}
}

return true;
}

Kind regards

robert
 
 
ilyabo





PostPosted: 2007-6-1 20:22:00 Top

java-programmer >> buggy regexp Hello Robert,

> Actually that logic does not work because it won't handle <> in comments
> and attributes properly.

It doesn't have to.

> Often, changing approaches is more efficient. Why don't you do
[code omitted]

Yes, that's just the same as what Ingo suggested and it works well
also on very long input strings.

Thank you!

Best regards
Ilya