Regular Expression extract all links in a page.  
Author Message
smartestdesign
Posted: 2006-8-7 8:12:00

I am trying to extract all URLs from a particular page, but without
success.

java.util.regex.Pattern p = Pattern.compile("<a href=\"http://(.*)\">", Pattern.MULTILINE);
java.util.regex.Matcher m = p.matcher(strhtmpage);
while ( m.find() )
{
System.out.println( "LINKS: " + m.group(1) );
}

 
lordy
Posted: 2006-8-7 9:20:00

On 2006-08-07, email***@***.com <email***@***.com> wrote:
> I am trying to extract all URLs from a particular page, but without
> success.
>
> java.util.regex.Pattern p = Pattern.compile("<a href=\"http://(.*)\">", Pattern.MULTILINE);
> java.util.regex.Matcher m = p.matcher(strhtmpage);
> while ( m.find() )
> {
> System.out.println( "LINKS: " + m.group(1) );
> }
>

Your ".*" is greedy by default, so it runs to the last "> on the line
rather than the first. You want a reluctant quantifier (".*?"), or use
something like [^"]* instead, which will be more efficient.

Read the Javadoc for java.util.regex.Pattern or perlre to understand
greedy regexps and all will become clear.
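
To make the difference concrete, here is a minimal sketch comparing the three quantifier styles on a one-line sample. The class name LinkExtractor, the helper extract, and the sample HTML are made up for illustration, and the patterns assume the href is the only attribute and is double-quoted, as in the original code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractor {

    // Greedy: (.*) grabs as much as it can, then backtracks to the LAST "> on the line.
    static final Pattern GREEDY = Pattern.compile("<a href=\"http://(.*)\">");

    // Reluctant: (.*?) grows one character at a time and stops at the FIRST ">.
    static final Pattern RELUCTANT = Pattern.compile("<a href=\"http://(.*?)\">");

    // Negated class: ([^\"]*) can never run past the closing double quote.
    static final Pattern NEGATED = Pattern.compile("<a href=\"http://([^\"]*)\">");

    // Collects capture group 1 of every match of p in html.
    static List<String> extract(Pattern p, String html) {
        List<String> links = new ArrayList<>();
        Matcher m = p.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        // Hypothetical sample: two links on the same line.
        String html = "<a href=\"http://a.example/\">A</a> "
                    + "<a href=\"http://b.example/x\">B</a>";

        System.out.println("greedy:    " + extract(GREEDY, html));
        System.out.println("reluctant: " + extract(RELUCTANT, html));
        System.out.println("negated:   " + extract(NEGATED, html));
    }
}
```

With that sample, the greedy pattern yields a single match that swallows both links, while the reluctant and negated-class patterns each yield a.example/ and b.example/x separately.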

Lordy