Extract all dates from text document  
Author Message
christryp





Posted: 2004-8-10 17:20:00

java-programmer, Extract all dates from text document

Hi,

1) I have a bunch of documents, some of which may contain dates.
2) For each document, I would like to identify all existing dates.
3) The dates exist in multiple formats.
Example: (May 5, 1932), (5,5,1932), (5/5/1932), (5-5-1942) etc.
4) I would like to process a single document very quickly. In other
words, I'm looking for a time-efficient solution.

Are there any existing parsers that take in a generic text string and
return a collection of dates found in the text string?

Are there any existing flex/cup files that perform this task?

Thanks.
 
Jacob





Posted: 2004-8-10 18:18:00

Chris Tryp wrote:

> 1) I have a bunch of documents, some of which may contain dates.
> 2) For each document, I would like to identify all existing dates.
> 3) The dates exist in multiple formats.
> Example: (May 5, 1932), (5,5,1932), (5/5/1932), (5-5-1942) etc.
> 4) I would like to process a single document very quickly. In other
> words, i'm looking for a time efficient solution.


The quickest and simplest solution would most probably
be a sed, awk, perl or python script to scan your
files and produce the requested output.

There is no need to do this in Java, unless you
already parse these files or need the result
internally in a program. In that case you will need
to define a suitable "token", and a set of possible
date formats (SimpleDateFormat). Read through your
file with a tokenizer and apply all your date formats.
If any of them can parse the token without an exception
you've found a date and transformed it into a
java.util.Date.
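
A minimal sketch of the approach described above, assuming the tokens
have already been isolated. The class name, method name and pattern
list are hypothetical and only cover the four example formats from the
question; a real pattern set would depend on the documents.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

class TokenDateParser {
    // Assumed pattern list: one entry per example format in the question.
    private static final String[] PATTERNS = {
        "MMMM d, yyyy", // May 5, 1932
        "M,d,yyyy",     // 5,5,1932
        "M/d/yyyy",     // 5/5/1932
        "M-d-yyyy"      // 5-5-1932
    };

    /** Returns the first successful parse of the token, or null if no pattern fits. */
    static Date parseToken(String token) {
        for (String pattern : PATTERNS) {
            SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.ENGLISH);
            format.setLenient(false); // reject impossible values such as month 13
            try {
                return format.parse(token);
            } catch (ParseException e) {
                // Not this format; fall through to the next pattern.
            }
        }
        return null;
    }
}
```

Calling TokenDateParser.parseToken("5/5/1932") yields a java.util.Date,
while a token such as "hello" yields null. Note that parse(String) only
requires the pattern to match the start of the token, so trailing text
is not rejected by this sketch.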

 
VisionSet





Posted: 2004-8-10 18:31:00

"Chris Tryp" <email***@***.com> wrote in message
news:email***@***.com...
> Hi,
>
> 1) I have a bunch of documents, some of which may contain dates.
> 2) For each document, I would like to identify all existing dates.
> 3) The dates exist in multiple formats.
> Example: (May 5, 1932), (5,5,1932), (5/5/1932), (5-5-1942) etc.
> 4) I would like to process a single document very quickly. In other
> words, i'm looking for a time efficient solution.
>
> Are there any existing parsers that take in a generic text string and
> return a collection of dates found in the text string?
>
> Are there any existing flex/cup files that perform this task?

java.text.SimpleDateFormat

--
Mike W


 
 
William Brogden





Posted: 2004-8-10 20:29:00

On 10 Aug 2004 02:20:15 -0700, Chris Tryp <email***@***.com> wrote:

> Hi,
>
> 1) I have a bunch of documents, some of which may contain dates.
> 2) For each document, I would like to identify all existing dates.
> 3) The dates exist in multiple formats.
> Example: (May 5, 1932), (5,5,1932), (5/5/1932), (5-5-1942) etc.
> 4) I would like to process a single document very quickly. In other
> words, i'm looking for a time efficient solution.
>
> Are there any existing parsers that take in a generic text string and
> return a collection of dates found in the text string?
>
> Are there any existing flex/cup files that perform this task?
>
> Thanks.

Actually, I did something like this for legal documents - quite a
while ago, and (as I recall) in C++.

If I had to program this now I would do a preliminary scan looking for
potential dates - probably by looking for digits. This should let you
skip large chunks of text. Then apply a tokenizer and look for
patterns that could be a date.

In other words, use progressive refinement of the search; brute
force application of SimpleDateFormat would be a waste of time.
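
One way the progressive refinement could look in Java is sketched
below. This is only an illustration, not the C++ code mentioned above:
the class name, the line-by-line digit check and the regular expression
are all assumptions, and the expression only recognises the four shapes
from the original question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DateCandidateScanner {
    // Assumed shapes: "May 5, 1932", or 5/5/1932, 5-5-1932, 5,5,1932
    // (the same separator must be used twice in the numeric form).
    private static final Pattern CANDIDATE = Pattern.compile(
        "\\b(?:[A-Z][a-z]{2,8}\\.? \\d{1,2}, \\d{4}"
        + "|\\d{1,2}([/,-])\\d{1,2}\\1\\d{2,4})\\b");

    /** Cheap first pass: a line without any digit cannot hold one of these dates. */
    static boolean containsDigit(CharSequence text) {
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c >= '0' && c <= '9') {
                return true;
            }
        }
        return false;
    }

    /** Second pass: run the more expensive pattern only on lines that passed. */
    static List<String> findCandidates(String document) {
        List<String> found = new ArrayList<>();
        for (String line : document.split("\\r?\\n")) {
            if (!containsDigit(line)) {
                continue; // skip text that cannot contain a date
            }
            Matcher matcher = CANDIDATE.matcher(line);
            while (matcher.find()) {
                found.add(matcher.group());
            }
        }
        return found;
    }
}
```

The strings returned by findCandidates are still only candidates; each
one would need a final validation step (for example a date parser)
before being treated as a real date.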


Bill
http://www.wbrogden.com/
 
 
Rhino





Posted: 2004-8-11 0:34:00

"Chris Tryp" <email***@***.com> wrote in message
news:email***@***.com...
> Hi,
>
> 1) I have a bunch of documents, some of which may contain dates.
> 2) For each document, I would like to identify all existing dates.
> 3) The dates exist in multiple formats.
> Example: (May 5, 1932), (5,5,1932), (5/5/1932), (5-5-1942) etc.
> 4) I would like to process a single document very quickly. In other
> words, i'm looking for a time efficient solution.
>
Are the dates really going to be useful without knowing the context? In
other words, don't you need to know the *meaning* of the dates, such as that
the first one is the birth date of Joe Blow and the second one is the date
that he retired? Without context, the dates don't have a lot of meaning.

I'm reminded of a joke where a comedian, pretending to be a news announcer,
said "And now a list of partial scores from tonight's games: 7, 18, 3 and
45."

Rhino


 
 
christryp





Posted: 2004-8-11 1:25:00

Jacob <email***@***.com> wrote in message news:<email***@***.com>...
> Chris Tryp wrote:
>
> > 1) I have a bunch of documents, some of which may contain dates.
> > 2) For each document, I would like to identify all existing dates.
> > 3) The dates exist in multiple formats.
> > Example: (May 5, 1932), (5,5,1932), (5/5/1932), (5-5-1942) etc.
> > 4) I would like to process a single document very quickly. In other
> > words, i'm looking for a time efficient solution.
>
>
> The quickest/simplest solution would most probably
> be a sed, awk, perl or python script to scan your
> files and produce the requested output.
>
> There is no need to do this in Java, unless you
> already parse these files or need the result
> internally in a program.

Yes, I need the results in a program.

> In that case you will need
> to define a suitable "token", and a set of possible
> date formats (SimpleDateFormat). Read through your
> file with a tokenizer and apply all your date formats.
> If any of them can parse the token without an exception
> you've found a date and transformed it into a
> java.util.Date.

Writing an appropriate tokenizer is exactly what I'm trying to avoid.
I will do it if I have to, but if it has already been done I don't
want to do it again.
I don't want to use SimpleDateFormat because:
1) It only takes date strings as input, not generic text.
2) Even if I properly tokenize the text string and run SimpleDateFormat
on the tokens, handling exceptions for invalid formats is slow.

If it turns out that I have to write my own program, is there
a reason why I shouldn't use flex and cup?

thanks.
 
 
Andrew Thompson





Posted: 2004-8-11 1:40:00

On 10 Aug 2004 10:24:36 -0700, Chris Tryp wrote:

> 2) Even if i properly tokenize the text string and run SimpleDateFormat
> on the tokens, handling exceptions for invalid formats is slow.

How slow?

--
Andrew Thompson
http://www.PhySci.org/ Open-source software suite
http://www.PhySci.org/codes/ Web & IT Help
http://www.1point1C.org/ Science & Technology
 
 
Jacob





Posted: 2004-8-11 15:17:00

Chris Tryp wrote:

> 1) It only takes date strings as input not generic text.

You need to use a tokenizer to identify *candidates* for
dates. Then you apply your SimpleDateFormat objects.

> 2) Even if i properly tokenize the text string and run SimpleDateFormat
> on the tokens, handling exceptions for invalid formats is slow.

You can't get something for nothing. If you want to find dates,
you'll have to parse the text. You can reimplement the date
parser so it doesn't throw exceptions on non-dates, but do so
only if you are really, really sure you cannot live with the
extra overhead of exceptions.
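
Before reimplementing anything, it may be worth noting that
DateFormat also offers parse(String, ParsePosition), which reports a
failed parse by returning null rather than throwing ParseException to
the caller. The sketch below relies on that overload; the class and
method names are hypothetical, and the formats are created once and
reused, which assumes single-threaded use because SimpleDateFormat is
not thread-safe.

```java
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

class QuietDateParser {
    private final SimpleDateFormat[] formats;

    QuietDateParser(String... patterns) {
        formats = new SimpleDateFormat[patterns.length];
        for (int i = 0; i < patterns.length; i++) {
            formats[i] = new SimpleDateFormat(patterns[i], Locale.ENGLISH);
            formats[i].setLenient(false); // reject impossible values such as month 13
        }
    }

    /** Returns the parsed date, or null when no pattern consumes the whole token. */
    Date parseOrNull(String token) {
        for (SimpleDateFormat format : formats) {
            ParsePosition position = new ParsePosition(0);
            Date date = format.parse(token, position); // null on failure
            if (date != null && position.getIndex() == token.length()) {
                return date;
            }
        }
        return null;
    }
}
```

For example, new QuietDateParser("MMMM d, yyyy", "M/d/yyyy") could be
built once per run and parseOrNull called for every candidate token.
Whether this is measurably faster than catching ParseException would
need to be checked with a benchmark on real documents.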

I doubt this problem has been generically solved. Your
set of "possible" date strings is different from someone
else's, and the union of all is quite large, at least when
bringing locale onto the scene.

One thing you have not stated is whether you actually have a
set of predefined ways to write dates, or you must guess in
each case. The former problem is simple and is solved in
20-30 lines of code, while the other is hard, maybe even
impossible to solve completely.

 
 
P.Hill





Posted: 2004-8-12 1:19:00

Jacob wrote:
> One thing you have not stated is whether you actually have a
> set of predefined ways to write days, or you must guess in
> each case. The former problem is simple and is solved in
> 20-30 lines of code, while the other is hard, even maybe
> impossible to solve completely.

Like mm/dd/yy versus dd/mm/yy.
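
A small sketch of that ambiguity, using a hypothetical helper that
parses a token with a given pattern and prints it back in
year-month-day order. The class name, method name and sample token are
assumptions made for illustration.

```java
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

class AmbiguousDateDemo {
    /** Parses token with pattern and returns it as yyyy-MM-dd, or null if it does not fit. */
    static String toIso(String token, String pattern) {
        SimpleDateFormat input = new SimpleDateFormat(pattern, Locale.ENGLISH);
        input.setLenient(false); // month 13 must fail instead of rolling over
        Date date = input.parse(token, new ParsePosition(0));
        if (date == null) {
            return null;
        }
        return new SimpleDateFormat("yyyy-MM-dd", Locale.ENGLISH).format(date);
    }
}
```

With this helper, toIso("5/6/1932", "M/d/yyyy") gives "1932-05-06"
while toIso("5/6/1932", "d/M/yyyy") gives "1932-06-05": both parses
succeed, so the text alone cannot say which was meant. Only a token
such as "13/6/1932" resolves itself, because it fails as M/d/yyyy.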

-Paul