Searching and replacing multiple strings in text  
Author Message
mvppetlab





PostPosted: 2003-10-10 6:37:00 Top

java-programmer, Searching and replacing multiple strings in text Suppose I have a String like:

"The quick brown fox jumped over the lazy dog's back."

Suppose I wanted to make the following multiple replacements in the
String:

"quick" -> "slow"
"brown" -> "red"
"jump" -> "walk"
"over" -> "under"
"back" -> "stomach"

After making the replacements, the String becomes:

"The slow red fox walked under the lazy dog's stomach."

It would be easy to just scan the string N times if we're replacing N
words but I'd like a way that scans the string only once regardless of
how many words we're replacing. I'm sure tons of programmers have had
need for this and I'm wondering if there's already some existing code
on the net for it.

If not, then what are the best data structures and classes to use for
this? For example, should the collection of words to replace just be a
String[] and the collection of replacements just be a String[], so
that e.g. wordsToReplace[i] is replaced by replacement[i], and then
you do the search and replace algorithm something loosely along the
idea of:

0. assume an input String, an output StringBuffer
1. set c = current input text character
2. if c is the first character of any of the words in wordsToReplace[]
then
2a. check if the next input character is the next character of
any of the words in our input string and so forth....
if the input word is a match, append the replacement
to the output StringBuffer.
otherwise, append the character c from step 1 to the
StringBuffer and
go back to step 1. with the input character
advanced to the next character.

I'm being very sloppy and crude here and only trying to illustrate the
idea, because actually programming the above function while making
sure you handle all the out-of-bounds cases, etc., would be tricky and
time-consuming, and I want to post here for a pre-canned solution or
something easy I haven't thought of yet before going to all the
trouble.

Thanks very much for any replies.

Chris
 
Daniel Sj鯾lom





PostPosted: 2003-10-10 23:51:00 Top

java-programmer >> Searching and replacing multiple strings in text Chris wrote:

> It would be easy to just scan the string N times if we're replacing N
> words but I'd like a way that scans the string only once regardless of
> how many words we're replacing. I'm sure tons of programmers have had
> need for this and I'm wondering if there's already some existing code
> on the net for it.
>
> If not, then what are the best data structures and classes to use for
> this? For example, should the collection of words to replace just be a
> String[] and the collection of replacements just be a String[], so
> that e.g. wordsToReplace[i] is replaced by replacement[i], and then
> you do the search and replace algorithm something loosely along the
> idea of:
>
> 0. assume an input String, an output StringBuffer
> 1. set c = current input text character
> 2. if c is the first character of any of the words in wordsToReplace[]
> then
> 2a. check if the next input character is the next character of
> any of the words in our input string and so forth....
> if the input word is a match, append the replacement
> to the output StringBuffer.
> otherwise, append the character c from step 1 to the
> StringBuffer and
> go back to step 1. with the input character
> advanced to the next character.
>

This is a horribly inefficient algorithm, almost as bad as looping
through the string multiple times. You should use an algorithm that is
designed to find multiple strings in one pass, instead of trying to
modify a brute-force algorithm to do what it is not supposed to do. A
finite state machine will find multiple strings at the cost of one. You
need to build the FSM of course, which takes some time, but you only
need to do it once. The searching is done in linear time. Try googling
for Aho Corasick (or Aho Korasick) bibliographic search algorithm for
more information.

--
Daniel Sj鯾lom

 
Eric Sosman





PostPosted: 2003-10-11 3:51:00 Top

java-programmer >> Searching and replacing multiple strings in text Roedy Green wrote:
>
> On 9 Oct 2003 15:37:28 -0700, email***@***.com (Chris) wrote or
> quoted :
>
> >Suppose I wanted to make the following multiple replacements in the
> >String:
> >
> >"quick" -> "slow"
> >"brown" -> "red"
> >"jump" -> "walk"
> >"over" -> "under"
> >"back" -> "stomach"
>
> We had this question a few weeks back and many different solutions
> were posted. I suggested creating a Hashtable of the word pairs to be
> replaced.
>
> Then use a regex to split the sentence into words. Look up each word,
> and if you find it, append the transform to a StringBuffer. If not
> append the original word (and a space).

This wouldn't quite work in the present instance, because
of the O.P.'s illustrative transformation:

> "The quick brown fox jumped over the lazy dog's back."
> "The slow red fox walked under the lazy dog's stomach."

Note the change of "jumped" to "walked" even though
this pair of words is not present in the replacement list.

--
email***@***.com
 
 
christopher_r_barry_nospam





PostPosted: 2003-10-11 13:12:00 Top

java-programmer >> Searching and replacing multiple strings in text Daniel Sj鯾lom <email***@***.com> wrote in message news:<3f86d5e5$0$16834$email***@***.com>...

> This is a horribly inefficient algorithm, almost as bad as looping
> through the string multiple times. You should use an algorithm that is
> designed to find multiple strings in one pass, instead of trying to
> modify a brute-force algorithm to do what it is not supposed to do. A
> finite state machine will find multiple strings at the cost of one. You
> need to build the FSM of course, which takes some time, but you only
> need to do it once. The searching is done in linear time. Try googling
> for Aho Corasick (or Aho Korasick) bibliographic search algorithm for
> more information.

I did actually find mention of the Aho-Corasick algorithm last night
and also the Commentz-Walter algorithm which achieves the same thing.

I have found Java implementations of neither though. Commentz-Walter
is supposed to be substantially faster than Aho-Corasick in practice.
There's a C implementation of Commentz-Walter in GNU fgrep 2.0 and
above that I could use as a starting point but it still would be a
waste of time if there's already a Java implementation floating around
the net.

So my original question about not reinventing the wheel and going to
all the effort to get an efficient multiple-search and replace still
stands.

---Chris