Cody Boisclair, November–December 2007
One feature of the NLPUtils Morphological Parser is that it allows one to define additional irregular forms via an XML file specified by MorphParser.IrregularsFilePath.
Here is a brief summary of the format expected in that XML file:
The root element is <IrregularForms>. It contains a collection of <token> elements representing tokens whose forms are irregular.
Each <token> element has one required attribute, spell, which identifies the spelling of the token. Each <token> element may contain any number of <parse> elements, each representing a particular morphological parsing. A <token> element may also be empty, indicating that the token is only to be parsed as a regular form (this may be used to negate irregular forms in the built-in lexicon).
Each <parse> element contains any number of <morph> elements, each of which specifies a morph in that particular parsing.
Each <morph> element has one required attribute, spell, identifying the spelling of the morph. It also accepts the following two optional attributes, which take one of the values specified below:
type = "stem", "feature"
cat = "noun", "verb", "adj", "adv", "unknown"
Where type is unspecified, the first morph is assumed to be the stem and the remaining morphs are assumed to be features in the morphological analysis. For morphs where cat is unspecified, the 'unknown' category is assumed.
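As a small illustration of these defaults, the sketch below writes an entry for "paparazzi" without any type attributes. Going by the rules just described, the first morph should be read as the stem and the second as a feature. This is an illustrative fragment based only on the description in this document; it has not been checked against the parser itself.

```xml
<IrregularForms>
  <token spell="paparazzi">
    <parse>
      <!-- type omitted: as the first morph, this is assumed to be the stem -->
      <morph spell="paparazzo" cat="noun" />
      <!-- type omitted: as a later morph, this is assumed to be a feature -->
      <morph spell="s" cat="noun" />
    </parse>
  </token>
</IrregularForms>
```

The cat attributes are kept here because leaving them out would put both morphs in the 'unknown' category rather than 'noun'.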
An example XML document that defines the irregular noun plurals "otaku" and "paparazzi":
<IrregularForms>
  <token spell="otaku">
    <parse>
      <morph spell="otaku" cat="noun" type="stem" />
    </parse>
    <parse>
      <morph spell="otaku" cat="noun" type="stem" />
      <morph spell="s" cat="noun" type="feature" />
    </parse>
  </token>
  <token spell="paparazzi">
    <parse>
      <morph spell="paparazzo" cat="noun" type="stem" />
      <morph spell="s" cat="noun" type="feature" />
    </parse>
  </token>
</IrregularForms>
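The example above only adds irregular forms. The opposite case, an empty <token> element that negates an irregular form, might look like the sketch below. The word "dice" is purely hypothetical: whether the built-in lexicon actually contains an irregular entry for it is an assumption made for illustration.

```xml
<IrregularForms>
  <!-- Empty token: "dice" (a hypothetical choice) is to be parsed
       only as a regular form, overriding any built-in irregular entry -->
  <token spell="dice" />
</IrregularForms>
```

Because the element has no <parse> children, it supplies no irregular parsing of its own and leaves the token to the regular rules.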
If you have any comments, suggestions, or potential improvements, don't hesitate to drop me a line at codemanb@uga.edu.