edu.stanford.nlp.parser.lexparser
Class ChineseUnknownWordModel

java.lang.Object
  extended by edu.stanford.nlp.parser.lexparser.BaseUnknownWordModel
      extended by edu.stanford.nlp.parser.lexparser.ChineseUnknownWordModel
All Implemented Interfaces:
UnknownWordModel, java.io.Serializable

public class ChineseUnknownWordModel
extends BaseUnknownWordModel

Stores, trains, and scores with an unknown word model. A small set of filters deterministically forces rewrites for certain proper nouns, dates, and cardinal and ordinal numbers. When no filter applies, the model uses either the distribution of terminals that share the word's first character or Good-Turing smoothing. Although the model was developed for Chinese, its training and storage methods could be used cross-linguistically.
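
The first-character fallback described above can be illustrated with a small standalone sketch. It is not the class's actual implementation: the class name FirstCharTagModel and all of its methods are hypothetical, and it assumes the fallback amounts to estimating P(first character | tag) from training counts, which is only one plausible reading of the description.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy sketch (hypothetical names): tag-conditioned counts keyed by a word's first character. */
class FirstCharTagModel {
    // first character -> (tag -> count of training words starting with that character)
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();
    // tag -> total number of training words seen with that tag
    private final Map<String, Integer> tagTotals = new HashMap<>();

    /** Returns the first code point of the word as a String, or "" for an empty word. */
    static String firstChar(String word) {
        if (word.isEmpty()) {
            return "";
        }
        // offsetByCodePoints keeps supplementary characters (surrogate pairs) intact.
        return word.substring(0, word.offsetByCodePoints(0, 1));
    }

    /** Records one (word, tag) training observation. */
    void train(String word, String tag) {
        counts.computeIfAbsent(firstChar(word), k -> new HashMap<>())
              .merge(tag, 1, Integer::sum);
        tagTotals.merge(tag, 1, Integer::sum);
    }

    /** Returns log P(first character of word | tag), or negative infinity if never observed. */
    double logProbFirstCharGivenTag(String word, String tag) {
        Map<String, Integer> byTag = counts.get(firstChar(word));
        int c = (byTag == null) ? 0 : byTag.getOrDefault(tag, 0);
        int total = tagTotals.getOrDefault(tag, 0);
        if (c == 0 || total == 0) {
            return Double.NEGATIVE_INFINITY;
        }
        return Math.log((double) c / total);
    }
}
```

In this sketch an unknown word is scored only through its first character, so two unseen words that begin with the same character receive the same score for a given tag. A pair that was never observed scores negative infinity, which is the kind of gap a smoothing method such as Good-Turing is meant to fill.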

Author:
Roger Levy
See Also:
Serialized Form

Field Summary
 
Fields inherited from class edu.stanford.nlp.parser.lexparser.BaseUnknownWordModel
NULL_ITW, nullTag, nullWord, tagHash, tagIndex, trainOptions, unknown, unknownLevel, unSeenCounter, useFirst, useGT, VERBOSE, wordIndex
 
Constructor Summary
ChineseUnknownWordModel(Options op, Lexicon lex, Index<java.lang.String> wordIndex, Index<java.lang.String> tagIndex)
          This constructor creates an unknown word model (UWM) with empty data structures.
ChineseUnknownWordModel(Options op, Lexicon lex, Index<java.lang.String> wordIndex, Index<java.lang.String> tagIndex, ClassicCounter<IntTaggedWord> unSeenCounter, java.util.HashMap<Label,ClassicCounter<java.lang.String>> tagHash, java.util.HashMap<java.lang.String,java.lang.Float> unknownGT, boolean useGT, java.util.Set<java.lang.String> seenFirst)
           
 
Method Summary
 java.lang.String getSignature(java.lang.String word, int loc)
          Signature for a specific word; loc parameter is ignored.
static void main(java.lang.String[] args)
           
 float score(IntTaggedWord itw, java.lang.String word)
           
 
Methods inherited from class edu.stanford.nlp.parser.lexparser.BaseUnknownWordModel
addTagging, getLexicon, getSignatureIndex, getUnknownLevel, score, scoreGT, scoreProbTagGivenWordSignature, setUnknownLevel, unSeenCounter
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

ChineseUnknownWordModel

public ChineseUnknownWordModel(Options op,
                               Lexicon lex,
                               Index<java.lang.String> wordIndex,
                               Index<java.lang.String> tagIndex,
                               ClassicCounter<IntTaggedWord> unSeenCounter,
                               java.util.HashMap<Label,ClassicCounter<java.lang.String>> tagHash,
                               java.util.HashMap<java.lang.String,java.lang.Float> unknownGT,
                               boolean useGT,
                               java.util.Set<java.lang.String> seenFirst)
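
This page does not document the contents of unknownGT; its name and the useGT flag suggest that it holds Good-Turing estimates, but that is an assumption. As a generic illustration of the Good-Turing idea only, the sketch below estimates the total probability mass of unseen words as N1 / N, where N1 is the number of word types seen exactly once and N is the total number of observed tokens. The class GoodTuringSketch and its method are hypothetical and are not part of the parser.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy sketch (hypothetical names): Good-Turing estimate of the unseen probability mass. */
class GoodTuringSketch {

    /**
     * Estimates the total probability of all unseen word types as N1 / N,
     * where N1 counts the types observed exactly once and N counts all tokens.
     */
    static double unseenMass(Map<String, Integer> wordCounts) {
        long singletons = 0; // N1
        long tokens = 0;     // N
        for (int c : wordCounts.values()) {
            if (c == 1) {
                singletons++;
            }
            tokens += c;
        }
        if (tokens == 0) {
            throw new IllegalArgumentException("no observations");
        }
        return (double) singletons / tokens;
    }
}
```

A per-tag variant would keep one such count table for each tag and compute the estimate separately per tag.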

ChineseUnknownWordModel

public ChineseUnknownWordModel(Options op,
                               Lexicon lex,
                               Index<java.lang.String> wordIndex,
                               Index<java.lang.String> tagIndex)
This constructor creates an unknown word model (UWM) with empty data structures. Use it only when the data is loaded separately, for example by reading text lines that contain the data. TODO: useGT would need to be set correctly when a model that was saved with useGT is recovered from text.

Method Detail

score

public float score(IntTaggedWord itw,
                   java.lang.String word)
Overrides:
score in class BaseUnknownWordModel

main

public static void main(java.lang.String[] args)

getSignature

public java.lang.String getSignature(java.lang.String word,
                                     int loc)
Description copied from class: BaseUnknownWordModel
Signature for a specific word; loc parameter is ignored.

Specified by:
getSignature in interface UnknownWordModel
Overrides:
getSignature in class BaseUnknownWordModel
Parameters:
word - The word
loc - Its sentence position
Returns:
A "signature" (which represents an equivalence class of Strings), e.g., a suffix of the string


Stanford NLP Group