Properly encoding "Project Gutenburg 1913 Webster Unabridged Dictionary".  
Daniel Pitts





Posted: 2007-9-20 13:04:00

So, I've spent all day working on this. Funfun...

Back story: Project Gutenburg creates free ebooks from content that is
now in the public domain, including the "1913 Webster Unabridged
Dictionary". The problem with this particular work (pgw050*.txt) is
that it uses a very "odd" character set and an almost-XML markup (it
may be valid SGML, but I wouldn't bank on it).

It's part DOS extended ASCII, plus some proprietary character
codes.

My goal:
I'd like to get this into a form that is easily processed by a
program. I think the best way to do this is to put it into a robust
XML format. This would involve cleaning up the markup into valid
XML, as well as processing some of the character codes into
nicer forms. I've already written a program that will read the
original texts, and re-encode the files as UTF-8, using appropriate
character substitution when possible.
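
A minimal sketch of what such a re-encoding step could look like, assuming a hand-built lookup table. The class name `WebsterRecoder` is hypothetical, the three accented-letter entries come from standard DOS code page 437, and mapping code 0xD8 (documented in the dictionary's font table as a double vertical bar) to U+2016 is an assumed choice. This is not the actual program mentioned above; a real table would have to be filled in from the dictionary's own character documentation.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: maps single bytes of the source text to Unicode.
class WebsterRecoder {
    // Hypothetical substitution table; only a few sample entries are filled in.
    private static final Map<Integer, String> TABLE = new HashMap<>();
    static {
        TABLE.put(0x82, "\u00E9"); // e with acute (code page 437)
        TABLE.put(0x87, "\u00E7"); // c with cedilla (code page 437)
        TABLE.put(0x8A, "\u00E8"); // e with grave (code page 437)
        TABLE.put(0xD8, "\u2016"); // double vertical bar; U+2016 is an assumed choice
    }

    // Decode raw bytes into a Java String, one byte at a time.
    static String decode(byte[] raw) {
        StringBuilder out = new StringBuilder(raw.length);
        for (byte b : raw) {
            int code = b & 0xFF;             // treat the byte as unsigned
            if (code < 0x80) {
                out.append((char) code);     // plain ASCII passes through unchanged
            } else {
                String mapped = TABLE.get(code);
                // Unmapped codes stay visible instead of being silently dropped.
                out.append(mapped != null ? mapped : String.format("[0x%02X]", code));
            }
        }
        return out.toString();
    }

    // Re-encode the decoded text as UTF-8 bytes.
    static byte[] recode(byte[] raw) {
        return decode(raw).getBytes(StandardCharsets.UTF_8);
    }
}
```

Keeping unmapped codes visible as `[0x..]` placeholders is a design choice: it makes gaps in the table easy to find with a text search instead of producing silently corrupted output.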

At this point, I'm not sure if I'd be better off converting their
custom "entities" into the equivalent UTF-8 encoded characters, or if
it would be better to convert all entities and non-standard characters
into some sort of XML encoded entities.

Anyone have suggestions on what would be the most useful way to go?

 
Hunter Gratzner





Posted: 2007-9-20 16:15:00

On Sep 20, 7:03 am, Daniel Pitts <email***@***.com> wrote:
> So, I've spent all day working on this. Funfun...
>
> Back story: Project Gutenburg

It's Gutenberg, not Gutenburg.

> create free ebooks from content that is
> now in the public domain, including the "1913 Webster Unabridged
> Dictionary". The problem with this particular work (pgw050*.txt), is

Thanks for not providing a link to the file, so we are saved from
having to have a look at it.


 
Jeff Higgins





Posted: 2007-9-20 21:39:00

Daniel Pitts wrote:
> So, I've spent all day working on this. Funfun...
>
> Back story: Project Gutenburg create free ebooks from content that is
> now in the public domain, including the "1913 Webster Unabridged
> Dictionary". The problem with this particular work (pgw050*.txt), is
> that it uses a very "odd" character set, and an almost-xml markup (it
> may be valid SGML, but I wouldn't bank on it)
>
> Its part DOS extended ascii, and then some proprietary character
> codes.
>
> My goal:
> I'd like to get this into a form that is easily processed by a
> program. I think the best way to do this is to put it into a robust
> XML formal. This would involved cleaning up the markup to be more
> valid XML, as well as processing some of the character codes into
> nicer forms. I've already written a program that will read the
> original texts, and re-encode the files as UTF-8, using appropriate
> character substitution when possible.
>
Whew. After a quick read of webfont.asc and tagset.web I can feel
your pain. I think the main problem here is that the typesetter's /style/
conveys so much information. For instance:

216 d8 ? <par/ double vertical bar (short length; the long
length is the graphics character 186)
This precedes words marked with a double vertical bar in
the original dictionary, signifying that the word was
adopted directly into English without modification of
the spelling.

For myself, I suppose the question would be: Do I want my
/program/ to understand and/or act upon the fact that character
code 0xd8 signifies the above, or is it strictly for a /human/ reader's
consumption? If the former, an XML tag would probably be appropriate;
if the latter, maybe an appropriate glyph is sufficient.

<http://www.gutenberg.org/dirs/etext96/pgw050ab.txt>

> At this point, I'm not sure if I'd be better off converting their
> custom "entities" into the equivalent UTF-8 encoded characters, or if
> it would be better to convert all entities and non-standard characters
> into some sort of XML encoded entities.
>
> Anyone have suggestions on what would be the most useful way to go?
>


 
 
Daniel Pitts





Posted: 2007-9-20 22:25:00

On Sep 20, 1:14 am, Hunter Gratzner <email***@***.com> wrote:
> On Sep 20, 7:03 am, Daniel Pitts <email***@***.com> wrote:
>
> > So, I've spent all day working on this. Funfun...
>
> > Back story: Project Gutenburg
>
> It's Gutenberg, not Gutenburg.
I actually knew that, but my fingers decided to do what they wanted,
not what I wanted :-)

>
> > create free ebooks from content that is
> > now in the public domain, including the "1913 Webster Unabridged
> > Dictionary". The problem with this particular work (pgw050*.txt), is
>
> Thanks for not providing a link to the file, so we are saved from
> having to have a look at it.

Ah, indeed.

Thanks for the constructive response.

Jeff Higgins provided the link in a reply:
<http://www.gutenberg.org/dirs/etext96/pgw050ab.txt>
Thanks Jeff!



Thanks,
Daniel.

 
 
Roedy Green





Posted: 2007-9-21 1:24:00

On Thu, 20 Sep 2007 05:03:36 -0000, Daniel Pitts
<email***@***.com> wrote, quoted or indirectly quoted
someone who said :

>At this point, I'm not sure if I'd be better off converting their
>custom "entities" into the equivalent UTF-8 encoded characters, or if
>it would be better to convert all entities and non-standard characters
>into some sort of XML encoded entities.

Perhaps the way to go is to devise a font that renders these odd
characters correctly. Then the text could be easily manipulated
programmatically with tiny mods to existing software. Then you could
even publish it as a PDF document.

Your problem then becomes political, talking some skilled type
designer into donating her skills in return for some exposure.
--
Roedy Green Canadian Mind Products
The Java Glossary
http://mindprod.com
 
 
Roedy Green





Posted: 2007-9-21 1:30:00

On Thu, 20 Sep 2007 17:24:26 GMT, Roedy Green
<email***@***.com> wrote, quoted or indirectly quoted
someone who said :

>
>Your problem then becomes political, talking some skilled type
>designer into donating her skills in return for some exposure.

If you have some high res scans of the original text, your job is not
designing a font, but the much easier job of "stealing" the font from
the original samples. I looked into a similar problem circa 1990 to
"steal" Chinese fonts from hand painted fonts on mechanical optical
typesetters. The tools were primitive -- interactively defining
Bezier curves with Adobe tools.

There are people who will create you a font from a sample of your
handwriting or printing for a nominal charge. Perhaps one of them has
the tools and skills to solve your problem.
--
Roedy Green Canadian Mind Products
The Java Glossary
http://mindprod.com
 
 
Jeff Higgins





Posted: 2007-9-21 4:37:00

Jeff Higgins wrote:
>
> Daniel Pitts wrote:
>> So, I've spent all day working on this. Funfun...
>>
>> Back story: Project Gutenburg create free ebooks from content that is
>> now in the public domain, including the "1913 Webster Unabridged
>> Dictionary". The problem with this particular work (pgw050*.txt), is
>> that it uses a very "odd" character set, and an almost-xml markup (it
>> may be valid SGML, but I wouldn't bank on it)
>>

Another thought strikes me. Have you looked at any of the many
"dictionary markup" languages already out there? Have you seen
the GNU CIDE?
http://www.ibiblio.org/webster/


 
 
Daniel Pitts





Posted: 2007-9-21 4:43:00

On Sep 20, 6:39 am, "Jeff Higgins" <email***@***.com> wrote:
> Daniel Pitts wrote:
> > So, I've spent all day working on this. Funfun...
>
> > Back story: Project Gutenburg create free ebooks from content that is
> > now in the public domain, including the "1913 Webster Unabridged
> > Dictionary". The problem with this particular work (pgw050*.txt), is
> > that it uses a very "odd" character set, and an almost-xml markup (it
> > may be valid SGML, but I wouldn't bank on it)
>
> > Its part DOS extended ascii, and then some proprietary character
> > codes.
>
> > My goal:
> > I'd like to get this into a form that is easily processed by a
> > program. I think the best way to do this is to put it into a robust
> > XML formal. This would involved cleaning up the markup to be more
> > valid XML, as well as processing some of the character codes into
> > nicer forms. I've already written a program that will read the
> > original texts, and re-encode the files as UTF-8, using appropriate
> > character substitution when possible.
>
> Whew. After a quick read of webfont.asc and tagset.web I can feel
> your pain. I think the main problem here is that the typesetters /style/
> conveys so much information. For instance:
>
> 216 d8 ? <par/ double vertical bar (short length; the long
> length is the graphics character 186)
> This precedes words marked with a double vertical bar in
> the original dictionary, signifying that the word was
> adopted directly into English without modification of
> the spelling.
>
> For myself, I suppose the question would be: Do I want my
> /program/ to understand and/or act upon the fact that a character
> code 0xd8 signifies the above or is it strictly for a /human/ readers'
> consumption? If the former probably an XML tag would be appropriate,
> if the latter maybe an appropriate glyph is sufficient.

Thanks for the reply. My main goal is to retain as much semantic
meaning as possible for the program to understand. So if I understand
your point, I should convert it to XML tags to maintain that
information...

This brings up a related point. In XML, can "&blah;" entities have
semantic meaning associated with them? Or are they only replacements
for otherwise difficult-to-represent characters? That makes a
difference between using &directlyAdopted; and <directly-adopted/>
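
One way to see the difference, sketched with the JDK's built-in DOM parser: an internal entity is replacement text that the parser expands before the program sees it, while an empty element stays in the tree as a node that a program can query. The entity name, the element name, the `hw` wrapper, and the placeholder headword below are all hypothetical, and the sketch assumes the parser's default settings (DOCTYPE allowed, entity references expanded).

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Illustrative sketch: the same marker written once as an entity, once as an element.
class EntityVsElement {
    static final String XML =
          "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
        + "<!DOCTYPE entry [ <!ENTITY directlyAdopted \"&#x2016;\"> ]>"
        + "<entry>"
        + "<hw>&directlyAdopted;headword</hw>"      // entity: expands to a character
        + "<hw><directly-adopted/>headword</hw>"    // element: stays a node in the tree
        + "</entry>";

    // Parse the sample document with default parser settings.
    static Document parse() {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(XML.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Text of the first <hw>: the entity has already been replaced by its value.
    static String firstText(Document doc) {
        return doc.getElementsByTagName("hw").item(0).getTextContent();
    }

    // Number of marker elements a program can find by name.
    static int markerElements(Document doc) {
        return doc.getElementsByTagName("directly-adopted").getLength();
    }
}
```

After parsing, the first `hw` contains only the character U+2016 followed by the headword, so the name `directlyAdopted` is no longer available to the program, whereas the second `hw` still carries a `directly-adopted` node.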


>
> <http://www.gutenberg.org/dirs/etext96/pgw050ab.txt>
>
> > At this point, I'm not sure if I'd be better off converting their
> > custom "entities" into the equivalent UTF-8 encoded characters, or if
> > it would be better to convert all entities and non-standard characters
> > into some sort of XML encoded entities.
>
> > Anyone have suggestions on what would be the most useful way to go?


Thanks,
Daniel.

 
 
Jeff Higgins





Posted: 2007-9-21 5:11:00

Daniel Pitts wrote:
On Sep 20, 6:39 am, "Jeff Higgins" <email***@***.com> wrote:
> Daniel Pitts wrote:
> > So, I've spent all day working on this. Funfun...
>
> > Back story: Project Gutenburg create free ebooks from content that is
> > now in the public domain, including the "1913 Webster Unabridged
> > Dictionary". The problem with this particular work (pgw050*.txt), is
> > that it uses a very "odd" character set, and an almost-xml markup (it
> > may be valid SGML, but I wouldn't bank on it)
>
> > Its part DOS extended ascii, and then some proprietary character
> > codes.
>
> > My goal:
> > I'd like to get this into a form that is easily processed by a
> > program. I think the best way to do this is to put it into a robust
> > XML formal. This would involved cleaning up the markup to be more
> > valid XML, as well as processing some of the character codes into
> > nicer forms. I've already written a program that will read the
> > original texts, and re-encode the files as UTF-8, using appropriate
> > character substitution when possible.
>
> Whew. After a quick read of webfont.asc and tagset.web I can feel
> your pain. I think the main problem here is that the typesetters /style/
> conveys so much information. For instance:
>
> 216 d8 ? <par/ double vertical bar (short length; the long
> length is the graphics character 186)
> This precedes words marked with a double vertical bar in
> the original dictionary, signifying that the word was
> adopted directly into English without modification of
> the spelling.
>
> For myself, I suppose the question would be: Do I want my
> /program/ to understand and/or act upon the fact that a character
> code 0xd8 signifies the above or is it strictly for a /human/ readers'
> consumption? If the former probably an XML tag would be appropriate,
> if the latter maybe an appropriate glyph is sufficient.

Thanks for the reply. My main goal is to retain as much semantic
meaning as possible for the program to understand. So if I understand
your point, I should convert it to XML tags to maintain that
information...

This brings up a related point. In XML, can "&blah;" entities have
semantic meaning associated with them? Or are they only replacements
for otherwise difficult-to-represent characters? That makes a
difference between using &directlyAdopted; and <directly-adopted/>

Well, if you're asking me personally, I'd have to say I'm no XML expert
and that the best I could do is to point you to the appropriate part
of the spec, sorry.

<http://www.w3.org/TR/2006/REC-xml-20060816/#sec-physical-struct>

>
> <http://www.gutenberg.org/dirs/etext96/pgw050ab.txt>
>
> > At this point, I'm not sure if I'd be better off converting their
> > custom "entities" into the equivalent UTF-8 encoded characters, or if
> > it would be better to convert all entities and non-standard characters
> > into some sort of XML encoded entities.
>
> > Anyone have suggestions on what would be the most useful way to go?


Thanks,
Daniel.


 
 
Daniel Pitts





Posted: 2007-9-21 5:23:00

On Sep 20, 1:36 pm, "Jeff Higgins" <email***@***.com> wrote:
> Jeff Higgins wrote:
>
> > Daniel Pitts wrote:
> >> So, I've spent all day working on this. Funfun...
>
> >> Back story: Project Gutenburg create free ebooks from content that is
> >> now in the public domain, including the "1913 Webster Unabridged
> >> Dictionary". The problem with this particular work (pgw050*.txt), is
> >> that it uses a very "odd" character set, and an almost-xml markup (it
> >> may be valid SGML, but I wouldn't bank on it)
>
> Another thought strikes me. Have you looked any of the many
> "dictionary markup" languages already out there? Have you seen
> the GNU CIDE?
> http://www.ibiblio.org/webster/

Heh, same source material, but it looks like more care was taken in
the translation to *machine readable* format. I'll check it out.
Thanks for the pointer. (Searching for Public Domain Dictionary
doesn't turn up as many relevant hits as it should :-) )

 
 
RedGrittyBrick





Posted: 2007-9-21 17:43:00

Roedy Green wrote:
> On Thu, 20 Sep 2007 05:03:36 -0000, Daniel Pitts
> <email***@***.com> wrote, quoted or indirectly quoted
> someone who said :
>
>
>>At this point, I'm not sure if I'd be better off converting their
>>custom "entities" into the equivalent UTF-8 encoded characters, or if
>>it would be better to convert all entities and non-standard characters
>>into some sort of XML encoded entities.
>
>
> Perhaps the way to go is to devise a font that renders these odd
> characters correctly. Then the text could be easily manipulated
> programmatically with tiny mods to existing software. Then you could
> even publish it as a PDF document.
>
> Your problem then becomes political, talking some skilled type
> designer into donating her skills in return for some exposure.

The purpose of a dictionary is semantic. The actual glyphs are
comparatively unimportant. The intellectual accomplishment does not lie
mainly in the choice of symbols.

If you want to reproduce the beautiful typography of the original, use
high quality image scans.

Otherwise I'd translate the glyphs to something semantically or visually
close in the Unicode character set.

I think I'd try for a purely semantic markup in XML. Then create a
stylesheet that would render it in XHTML (say) and which would introduce
glyphs and fonts as close to the original as possible. That way, if
Unicode ever gets extended to include some of the odd characters used in
the original, you only have to amend the stylesheet.

So I'd represent the "double vertical bar" as an attribute of a tag.
e.g. <word spelling="adopted"> The stylesheet could insert a glyph
visually close to "double vertical bar".
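
A rough sketch of that idea using the XSLT support bundled with the JDK (`javax.xml.transform`). The `word` element and `spelling="adopted"` attribute follow the example above; the class name, the plain-text output (instead of XHTML, to keep the sketch short), and the choice of U+2016 as the stand-in glyph are assumptions.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Illustrative sketch: the glyph lives only in the stylesheet, not in the data.
class AdoptedWordRenderer {
    static final String XSLT =
          "<xsl:stylesheet version=\"1.0\""
        + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:output method=\"text\"/>"
        // Adopted words get the marker glyph prepended at render time.
        + "<xsl:template match=\"word[@spelling='adopted']\">"
        + "&#x2016;<xsl:value-of select=\".\"/></xsl:template>"
        // All other words are rendered as plain text.
        + "<xsl:template match=\"word\"><xsl:value-of select=\".\"/></xsl:template>"
        + "</xsl:stylesheet>";

    // Apply the stylesheet to a small XML fragment and return the rendered text.
    static String render(String xml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(XSLT)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

If a better glyph becomes available later, only the `&#x2016;` in the stylesheet changes; the semantic markup stays untouched.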

In particular, I'd translate markup like "<universbold>" into
<exposition> or <shape-description> or something. I'm pretty sure
Webster didn't compose his dictionary with LaserJet fonts in mind :-)

 
 
Daniel Pitts





Posted: 2007-9-21 23:31:00

On Sep 21, 2:43 am, RedGrittyBrick <email***@***.com>
wrote:
> Roedy Green wrote:
> > On Thu, 20 Sep 2007 05:03:36 -0000, Daniel Pitts
> > <email***@***.com> wrote, quoted or indirectly quoted
> > someone who said :
>
> >>At this point, I'm not sure if I'd be better off converting their
> >>custom "entities" into the equivalent UTF-8 encoded characters, or if
> >>it would be better to convert all entities and non-standard characters
> >>into some sort of XML encoded entities.
>
> > Perhaps the way to go is to devise a font that renders these odd
> > characters correctly. Then the text could be easily manipulated
> > programmatically with tiny mods to existing software. Then you could
> > even publish it as a PDF document.
>
> > Your problem then becomes political, talking some skilled type
> > designer into donating her skills in return for some exposure.
>
> The purpose of a dictionary is semantic. The actual glyphs are
> comparatively unimportant. The intellectual accomplishment does not lie
> mainly in the choice of symbols.
>
> If you want to reproduce the beautiful typography of the original, use
> high quality image scans.
>
> Otherwise I'd translate the glyphs to something semantically or visually
> close in the unicode character set.
>
> I think I'd try for a purely semantic markup in XML. Then create a
> stylesheet that would render it in XHTML (say) and which would introduce
> glyphs and fonts as close to the original as possible. That way, if
> unicode ever gets extended to include some of the odd characters used in
> the original, you only have to amend the stylesheet.
>
> So I'd represent the "double vertical bar" as an attribute of a tag.
> e.g. <word spelling="adopted"> The stylesheet could insert a glyph
> visually close to "double vertical bar".
>
> In particular, I'd translate markup like "<universbold>" into
> <exposition> or <shape-description> or something. I'm pretty sure
> Webster didn't compose his dictionary with LaserJet fonts in mind :-)

Heh. He probably was using a BubbleJet :-)

But seriously. I'd like to keep the original intent (the
transcriber's, not necessarily Webster's), and then in a later stage
of the processing, convert it to the more semantic meaning, and
probably ignore the rendering of that information. My personal use
case actually only cares about the relationships between words, and
the part of speech. For instance, I'd like to be able to recognize
Ran, Run, and Runs as different tenses of the same word, and
Leaf/Leaves as different inflections of the same word.
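
As a minimal sketch of that use case, assuming the inflected forms have already been extracted from the dictionary entries: a lookup from each surface form to its base word. The class name `LemmaIndex` and its methods are hypothetical.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Illustrative sketch: maps inflected forms back to a shared base word.
class LemmaIndex {
    private final Map<String, String> formToLemma = new HashMap<>();

    // Register a base word together with its inflected forms.
    void add(String lemma, String... forms) {
        formToLemma.put(key(lemma), lemma);
        for (String form : forms) {
            formToLemma.put(key(form), lemma);
        }
    }

    // Returns the base word for a form, or null if the form is unknown.
    String lemmaOf(String word) {
        return formToLemma.get(key(word));
    }

    // True when both words are known and share the same base word.
    boolean sameWord(String a, String b) {
        String lemma = lemmaOf(a);
        return lemma != null && lemma.equals(lemmaOf(b));
    }

    // Lookups ignore case so that "Ran" and "ran" are treated alike.
    private static String key(String word) {
        return word.toLowerCase(Locale.ROOT);
    }
}
```

With `add("run", "ran", "runs")` and `add("leaf", "leaves")` registered, `sameWord("Ran", "Runs")` is true. A one-to-one map is a simplification: a form such as "leaves" can also belong to the verb "leave", so a fuller version would map each form to a set of candidates.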

Actually, that's not quite my "ultimate" goal. The ultimate goal is to
create an English Imperative Sentence parser to use in a text
adventure game. I just figured I might as well do something useful
for the community while I'm at it (in this case, semanticize the
dictionary). Although it appears that gcide_xml may have done what I
wanted to do already.

 
 
John W. Kennedy





Posted: 2007-9-22 11:10:00

Daniel Pitts wrote:
> Actually, thats not quite my "ultimate" goal. The ultimate goal is to
> create an English Imperative Sentence parser to use in a text
> adventure game.

I cannot find that you have ever participated in rec.arts.int-fiction.
Assuming this to be true, then it is highly likely you have no idea of
what you are getting into. Most fundamentally, you can't do a useful I-F
parser (assuming that, by "parser", you mean more than a mere lexer)
unless it is integrated with the world model. And you're also going to
have to create a descriptive language and a compiler for it.

Please study Inform 6, Inform 7 (they are completely different), TADS 2,
TADS 3, Hugo, and Adrift, and then see if A) you really have anything
new to contribute to the state of the art, and B) you have the time to
produce it. I would estimate that any new system offering a significant
improvement on existing tools should take about ten man-years to do from
scratch. You'll also probably need at least two collaborators, a test
writer, and a documentation writer. At a minimum, don't try to create
your own tests; you need a dedicated adversary, because this problem
domain is rife with edge and corner cases.

--
John W. Kennedy
"The whole modern world has divided itself into Conservatives and
Progressives. The business of Progressives is to go on making mistakes.
The business of the Conservatives is to prevent the mistakes from being
corrected."
-- G. K. Chesterton
 
 
Daniel Pitts





Posted: 2007-9-22 12:26:00

On Sep 21, 8:10 pm, "John W. Kennedy" <email***@***.com> wrote:
> Daniel Pitts wrote:
> > Actually, thats not quite my "ultimate" goal. The ultimate goal is to
> > create an English Imperative Sentence parser to use in a text
> > adventure game.
>
> I cannot find that you have ever participated in rec.arts.int-fiction.
Indeed, I have not.
> Assuming this to be true, then it is highly likely you have no idea of
> what you are getting into. Most fundamentally, you can't do a useful I-F
> parser (assuming that, by "parser", you mean more than a mere lexer)
> unless it is integrated with the world model. And you're also going to
> have to create a descriptive language and a compiler for it.
Actually, my plan is to describe the world model with Java objects
(hence this being a Java group)
>
> Please study Inform 6, Inform 7 (they are completely different), TADS 2,
> TADS 3, Hugo, and Adrift, and then see if A) you really have anything
> new to contribute to the state of the art, and B) you have the time to
> produce it.
A) If I don't have anything worthwhile to contribute, at least I'll
have gained knowledge. This isn't about bettering existing tools and
platforms, but about bettering myself. I will take a look at those
you suggested, but I'll probably continue on with my project anyway.
I do have *some* experience working on a Lima M.U.D.

> I would estimate that any new system offering a significant
> improvement on existing tools should take about ten man-years to do from
> scratch. You'll also probably need at least two collaborators, a test
> writer, and a documentation writer. At a minimum, don't try to create
> your own tests; you need a dedicated adversary, because this problem
> domain is rife with edge and corner cases.
Agreed. The part that I find the most difficult to model, parse, and
query is the complex relationships that can occur amongst several
objects. It's easy enough to say that a bowl is on a table, but what
about an apple between the banana and the orange in the bowl on the
wooden table?

Every journey starts with but a footstep. It may take 10 man-years to
complete, but if I don't start on my own, I'll never know. I'm 26, so
if this is a project that takes me until I'm 36, I'll still be young
enough to enjoy the results. In any case, if this DOES get to a
point where I think it might become something useful to the community,
I'm sure I will be able to find plenty of collaborators.

Thanks for the pointers both to the existing projects, and to the raif
group. I'm sure I will find it invaluable as I go on.

Cheers,
Daniel.




 
 
Patricia Shanahan





Posted: 2007-9-22 13:10:00

Daniel Pitts wrote:
...
> Agreed. The part that I find the most difficult to model, parse, and
> query is the complex relationships that can occur amongst several
> objects. It's easy enough to say that a bowl in on a table, but what
> about an apple between the banana and the orange in the bowl on the
> wooden table.

I think there are far more basic issues. Here's a classic example of the
context-sensitivity of the English language: "Time flies like an arrow."

If it is advice from a senior researcher to a junior researcher in an
entomology lab, "time" is a verb, "flies" is a noun, and "like an arrow"
modifies how to go about timing flies.

If it is a comment on how fast time seems to go by, "time" is a noun,
"flies" is a verb, and "like an arrow" modifies how time flies.

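The two readings above differ already at the word level. A small sketch that enumerates every part-of-speech assignment for the sentence, assuming a tiny hand-written lexicon (the tag names and the class `TagAmbiguity` are hypothetical, and no grammar is applied):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch: lists all tag sequences a tiny lexicon allows.
class TagAmbiguity {
    // Hand-written sample lexicon; a real one would come from the dictionary.
    static final Map<String, List<String>> LEXICON = new LinkedHashMap<>();
    static {
        LEXICON.put("time", Arrays.asList("NOUN", "VERB"));
        LEXICON.put("flies", Arrays.asList("NOUN", "VERB"));
        LEXICON.put("like", Arrays.asList("VERB", "PREPOSITION"));
        LEXICON.put("an", Arrays.asList("DETERMINER"));
        LEXICON.put("arrow", Arrays.asList("NOUN"));
    }

    // Returns every combination of tags, one tag per word, in sentence order.
    static List<List<String>> taggings(String sentence) {
        List<List<String>> results = new ArrayList<List<String>>();
        results.add(new ArrayList<String>());
        for (String word : sentence.toLowerCase().split("\\s+")) {
            List<String> tags = LEXICON.get(word);
            if (tags == null) {
                return new ArrayList<List<String>>(); // unknown word: no taggings
            }
            List<List<String>> next = new ArrayList<List<String>>();
            for (List<String> prefix : results) {
                for (String tag : tags) {
                    List<String> extended = new ArrayList<String>(prefix);
                    extended.add(tag);
                    next.add(extended);
                }
            }
            results = next;
        }
        return results;
    }
}
```

For "Time flies like an arrow" this lexicon allows 2 x 2 x 2 = 8 tag sequences. Most of them are not grammatical, so a parser still needs grammar rules and context to narrow them down to the readings a person would accept.
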
Patricia
 
 
RedGrittyBrick





Posted: 2007-9-22 19:57:00

Patricia Shanahan wrote:
> Daniel Pitts wrote:
> ...
>> Agreed. The part that I find the most difficult to model, parse, and
>> query is the complex relationships that can occur amongst several
>> objects. It's easy enough to say that a bowl in on a table, but what
>> about an apple between the banana and the orange in the bowl on the
>> wooden table.
>
> I think there are far more basic issues. Here's a classic example of the
> context-sensitivity of the English language: "Time flies like an arrow.".
>
> If it is advice from a senior researcher to a junior researcher in an
> entymology lab, "time" is a verb, "flies" is a noun, and "like an arrow"
> modifies how to go about timing flies.
>
> If it is a comment on how fast time seems to go by, "time" is a noun,
> "flies" is a verb, and "like an arrow" modifies how time flies.
>

Time flies like an arrow.
Fruit flies like a banana.
- Groucho Marx


--
RGB
 
 
Daniel Pitts





Posted: 2007-9-23 1:33:00

On Sep 21, 10:10 pm, Patricia Shanahan <email***@***.com> wrote:
> Daniel Pitts wrote:
>
> ...
>
> > Agreed. The part that I find the most difficult to model, parse, and
> > query is the complex relationships that can occur amongst several
> > objects. It's easy enough to say that a bowl in on a table, but what
> > about an apple between the banana and the orange in the bowl on the
> > wooden table.
>
> I think there are far more basic issues. Here's a classic example of the
> context-sensitivity of the English language: "Time flies like an arrow.".
>
> If it is advice from a senior researcher to a junior researcher in an
> entymology lab, "time" is a verb, "flies" is a noun, and "like an arrow"
> modifies how to go about timing flies.
>
> If it is a comment on how fast time seems to go by, "time" is a noun,
> "flies" is a verb, and "like an arrow" modifies how time flies.
>
> Patricia

I actually have a plan on how to handle context, but that particular
sentence is not imperative in the second sense that you provided.
Since I'm narrowing the scope of sentence types down to imperative,
that helps eliminate _some_ ambiguous situations. Indeed, most
languages (including programming) are somewhat sensitive to context.

For example, the Java "sentence":
s+=10;

could mean "Increase the int 's' by 10.", or "append '10' to the
String 's'". It could even be an error if "s" isn't numeric or a
String.
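
Spelled out as a small sketch (the class and method names are hypothetical), the same statement does two different things depending only on the declared type of `s`:

```java
// Illustrative sketch: identical statement, different meaning per declared type.
class PlusEqualsContext {
    static int onInt() {
        int s = 5;
        s += 10;          // arithmetic addition
        return s;
    }

    static String onString() {
        String s = "5";
        s += 10;          // string concatenation: 10 is converted to "10"
        return s;
    }

    // With a declaration such as `boolean s`, the statement `s += 10;`
    // is rejected by the compiler.
}
```

Here `onInt()` returns 15 and `onString()` returns the text 510.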

The only reason that isn't considered a problem in Java is that it's
"easy" to determine the context of a statement (scoping rules are
specific and well-defined). On the other hand, "Get the other key"
depends on context that would be harder to model in a computer.
Especially after a few interactions...

"You see a red key and a blue key."
Look at the red key
"The key is red."
Look at the other key
"The other key is blue."
Get the other key. <-- Does other point to the other other key, or to
the original other key?
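
A small sketch of one possible rule, assuming "other" always means "the visible object that was not referred to last". The class `KeyContext` and its method are hypothetical, and the rule itself is just one of the two interpretations asked about above.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: tracks the last referenced object to resolve "other".
class KeyContext {
    private final List<String> visible = new ArrayList<String>();
    private String lastReferenced; // most recently mentioned object, or null

    KeyContext(String... objects) {
        for (String o : objects) {
            visible.add(o);
        }
    }

    // Resolves "red", "blue", ... by name, and "other" relative to the
    // last reference. Returns null when nothing or more than one thing matches.
    String resolve(String adjective) {
        String found = null;
        for (String o : visible) {
            boolean match = adjective.equals("other")
                    ? !o.equals(lastReferenced)
                    : o.startsWith(adjective + " ");
            if (match) {
                if (found != null) {
                    return null; // ambiguous: more than one candidate
                }
                found = o;
            }
        }
        if (found != null) {
            lastReferenced = found;
        }
        return found;
    }
}
```

With a red key and a blue key visible, resolving "red", then "other", then "other" again yields the red key, the blue key, and the red key. Under this rule the final "Get the other key" flips back to the red key; keeping "other" bound to the blue key would need a different rule, which is exactly the design decision in question.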

It's been my experience with interactive fiction that the sentence
interpreters tend to need you to be very specific. I'm sure there are
some out there that have forms of context handling, but I want to
experiment on my own to see how I would go about it.

Initially, I think contextual information will have to be provided by
the world-view designer, with a little help about the "obvious"
context. Eventually, if the imperative sentence parser becomes good
enough, I would consider expanding the scope of it so that the parser
understood other types of sentences, and could glean information about
the current context simply by the descriptions involved.



 
 
John W. Kennedy





Posted: 2007-9-23 3:43:00

Daniel Pitts wrote:
> Agreed. The part that I find the most difficult to model, parse, and
> query is the complex relationships that can occur amongst several
> objects. It's easy enough to say that a bowl in on a table, but what
> about an apple between the banana and the orange in the bowl on the
> wooden table.

You're still looking at the purely linguistic problems. But there's more
to it than that. For example, what about a cabinet with a closed door,
but which also has a flat surface on top? What if the door is made of
glass? What if it's made of smoky glass, but there's a switch that can
turn on an interior light? All these things have to be handled by the
world model, but -- they also drag in your parser's disambiguator.
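
A minimal sketch of how such a cabinet might look as a plain Java object, assuming the world model exposes separate "visible" and "reachable" sets for the parser's disambiguator to consult. All names here (`Cabinet`, `Door`, the two query methods) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: a cabinet with a door, an interior light and a top surface.
class Cabinet {
    enum Door { OPAQUE, CLEAR_GLASS, SMOKY_GLASS }

    final Door door;
    boolean open;          // is the door open?
    boolean interiorLit;   // has the interior light been switched on?
    final List<String> inside = new ArrayList<String>();
    final List<String> onTop = new ArrayList<String>();

    Cabinet(Door door) {
        this.door = door;
    }

    // Can the player see into the cabinet right now?
    boolean interiorVisible() {
        if (open) {
            return true;
        }
        switch (door) {
            case CLEAR_GLASS: return true;
            case SMOKY_GLASS: return interiorLit;
            default:          return false;
        }
    }

    // Objects that "look at ..." may refer to.
    List<String> visibleObjects() {
        List<String> result = new ArrayList<String>(onTop);
        if (interiorVisible()) {
            result.addAll(inside);
        }
        return result;
    }

    // Objects that "take ..." may refer to: the inside needs an open door.
    List<String> reachableObjects() {
        List<String> result = new ArrayList<String>(onTop);
        if (open) {
            result.addAll(inside);
        }
        return result;
    }
}
```

Even in this toy version, the same noun phrase can be valid for one verb and invalid for another (visible through lit smoky glass, but not reachable), which is why the parser cannot disambiguate without asking the world model.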

--
John W. Kennedy
"Sweet, was Christ crucified to create this chat?"
-- Charles Williams. "Judgement at Chelmsford"
 
 
John W. Kennedy





Posted: 2007-9-23 3:58:00

Patricia Shanahan wrote:
> Daniel Pitts wrote:
> ....
>> Agreed. The part that I find the most difficult to model, parse, and
>> query is the complex relationships that can occur amongst several
>> objects. It's easy enough to say that a bowl in on a table, but what
>> about an apple between the banana and the orange in the bowl on the
>> wooden table.

> I think there are far more basic issues. Here's a classic example of the
> context-sensitivity of the English language: "Time flies like an arrow.".

> If it is advice from a senior researcher to a junior researcher in an
> entymology lab, "time" is a verb, "flies" is a noun, and "like an arrow"
> modifies how to go about timing flies.

> If it is a comment on how fast time seems to go by, "time" is a noun,
> "flies" is a verb, and "like an arrow" modifies how time flies.

And if it is an observation by a surrealist, "time" is an adjective,
"flies" is a noun, "like" is a verb, and "an arrow" is the direct object.

Here's a worse one: "It's a pretty little girls school". I count six
parsings.

--
John W. Kennedy
"I want everybody to be smart. As smart as they can be. A world of
ignorant people is too dangerous to live in."
-- Garson Kanin. "Born Yesterday"
 
 
ram





Posted: 2007-9-23 4:23:00

"John W. Kennedy" <email***@***.com> writes:
>And if it is an observation by an surrealist, "time" is an adjective,
>"flies" is a noun, "like" is a verb, and "an arrow" is the direct object.

»[I]n an analysis of a set of 891 sentences
ranging in length from 1 to 25 words, a team led by
Kathryn Baker found an average of 27 possible ways to
parse each sentence.«
http://scienceblogs.com/cognitivedaily/2006/12/machine_translation_taking_a_q.php

»"Time flies like an arrow" --

1. Time proceeds as quickly as an arrow proceeds.
(the intended reading)

2. Measure the speed of flies in the same way that
you measure the speed of an arrow.

3. Measure the speed of flies in the same way that
an arrow measures the speed of flies.

4. Measure the speed of flies that resemble an arrow.

5. Flies of a particular kind, time-flies,
are fond of an arrow.«
»The Language Instinct«, Steven Pinker