edu.stanford.nlp.parser.lexparser
Class FactoredLexicon
java.lang.Object
edu.stanford.nlp.parser.lexparser.BaseLexicon
edu.stanford.nlp.parser.lexparser.FactoredLexicon
- All Implemented Interfaces:
- Lexicon, java.io.Serializable
public class FactoredLexicon
- extends BaseLexicon
- Author:
- Spence Green
- See Also:
- Serialized Form
Fields inherited from class edu.stanford.nlp.parser.lexparser.BaseLexicon |
DEBUG_LEXICON, DEBUG_LEXICON_SCORE, flexiTag, NULL_ITW, nullTag, nullWord, op, rulesWithWord, seenCounter, smartMutation, smoothInUnknownsThreshold, tagIndex, tags, testOptions, trainOptions, useSignatureForKnownSmoothing, uwModel, uwModelTrainer, uwModelTrainerClass, wordIndex, words |
Method Summary |
protected void |
initRulesWithWord()
Rule table is lemmas! |
static void |
main(java.lang.String[] args)
|
java.util.Iterator<IntTaggedWord> |
ruleIteratorByWord(int word,
int loc,
java.lang.String featureSpec)
Rule table is lemmas. |
float |
score(IntTaggedWord iTW,
int loc,
java.lang.String word,
java.lang.String featureSpec)
Get the score of this word with this tag (as an IntTaggedWord) at this
location. |
void |
train(java.util.Collection<Tree> trees,
java.util.Collection<Tree> rawTrees)
This method should populate wordIndex, tagIndex, and morphIndex. |
Methods inherited from class edu.stanford.nlp.parser.lexparser.BaseLexicon |
addAll, addAll, addTagging, evaluateCoverage, examineIntersection, finishTraining, getBaseTag, getUnknownWordModel, incrementTreesRead, initializeTraining, isKnown, isKnown, listToEvents, numRules, printLexStats, readData, ruleIteratorByWord, ruleIteratorByWord, setUnknownWordModel, train, train, train, train, train, trainUnannotated, trainWithExpansion, treeToEvents, tune, writeData |
Methods inherited from class java.lang.Object |
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
FactoredLexicon
public FactoredLexicon(MorphoFeatureSpecification morphoSpec,
Index<java.lang.String> wordIndex,
Index<java.lang.String> tagIndex)
FactoredLexicon
public FactoredLexicon(Options op,
MorphoFeatureSpecification morphoSpec,
Index<java.lang.String> wordIndex,
Index<java.lang.String> tagIndex)
ruleIteratorByWord
public java.util.Iterator<IntTaggedWord> ruleIteratorByWord(int word,
int loc,
java.lang.String featureSpec)
- The rule table holds lemmas, so isKnown() is slightly trickier.
- Specified by:
ruleIteratorByWord
in interface Lexicon
- Overrides:
ruleIteratorByWord
in class BaseLexicon
- Parameters:
word
- The word (as an int)
loc
- Its index in the sentence (usually only relevant for unknown words)
featureSpec
- Additional word features like morphosyntactic information.
- Returns:
- An iterator over the possible taggings
score
public float score(IntTaggedWord iTW,
int loc,
java.lang.String word,
java.lang.String featureSpec)
- Description copied from class:
BaseLexicon
- Get the score of this word with this tag (as an IntTaggedWord) at this
location. (Presumably an estimate of P(word | tag).)
Implementation documentation:
Seen:
c_W = count(W)
c_TW = count(T,W)
c_T = count(T)
c_Tunseen = count(T) among new words in 2nd half
total = count(seen words)
totalUnseen = count("unseen" words)
p_T_U = Pmle(T|"unseen")
pb_T_W = P(T|W). If (c_W > smoothInUnknownsThreshold), pb_T_W = c_TW/c_W
Else (if not smart mutation) pb_T_W = Bayes prior smooth[1] with p_T_U
p_T = Pmle(T)
p_W = Pmle(W)
pb_W_T = log(pb_T_W * p_W / p_T) [Bayes rule]
Note that this doesn't really properly reserve mass to unknowns.
Unseen:
c_TS = count(T,Sig|Unseen)
c_S = count(Sig)
c_T = count(T|Unseen)
c_U = totalUnseen above
p_T_U = Pmle(T|Unseen)
pb_T_S = Bayes smooth of Pmle(T|S) with P(T|Unseen) [smooth[0]]
pb_W_T = log(P(W|T)) inverted
- Specified by:
score
in interface Lexicon
- Overrides:
score
in class BaseLexicon
- Parameters:
iTW
- An IntTaggedWord pairing a word and POS tag
loc
- The position in the sentence. In the default implementation
this is used only for unknown words to change their probability
distribution when sentence initial
word
- The word itself; useful so we don't have to look it
up in an index
featureSpec
- TODO
- Returns:
- A float score, usually log P(word|tag)
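The "Seen" branch of the implementation notes above can be read as a short computation. The following is a minimal sketch of that reading, not the library's code: the class name SeenWordScoreSketch, the method name logProbWordGivenTag, and the plain double parameters are all hypothetical. The additive form used for the "bayes prior smooth[1]" step, (c_TW + smooth[1] * p_T_U) / (c_W + smooth[1]), is an assumption, and the smart-mutation case is left out.

```java
final class SeenWordScoreSketch {

    /**
     * Returns log P(W|T) for a seen word, following the "Seen" recipe in the notes.
     * ASSUMPTION: the Bayes prior smoothing is additive, weighted by smooth1.
     */
    static double logProbWordGivenTag(double cW, double cTW, double cT,
                                      double cTunseen, double total, double totalUnseen,
                                      double smoothInUnknownsThreshold, double smooth1) {
        // p_T_U = Pmle(T | "unseen")
        double pTU = cTunseen / totalUnseen;

        // pb_T_W = P(T|W): plain relative frequency for frequent words,
        // otherwise smoothed toward the unseen-word tag distribution.
        double pbTW;
        if (cW > smoothInUnknownsThreshold) {
            pbTW = cTW / cW;
        } else {
            pbTW = (cTW + smooth1 * pTU) / (cW + smooth1); // assumed smoothing form
        }

        double pT = cT / total; // p_T = Pmle(T)
        double pW = cW / total; // p_W = Pmle(W)

        // Bayes rule: P(W|T) = P(T|W) * P(W) / P(T), returned in log space.
        return Math.log(pbTW * pW / pT);
    }

    public static void main(String[] args) {
        // Illustrative counts only; they are not taken from any real treebank.
        double frequent = logProbWordGivenTag(200, 150, 1000, 50, 10000, 500, 100, 1.0);
        double rare = logProbWordGivenTag(2, 1, 1000, 50, 10000, 500, 100, 1.0);
        System.out.println("frequent word: " + frequent);
        System.out.println("rare word: " + rare);
    }
}
```

For a word above the threshold the expression reduces to log(c_TW / c_T), since p_W / p_T = c_W / c_T. Below the threshold, the prior term pulls P(T|W) toward the tag distribution observed for unseen words.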
train
public void train(java.util.Collection<Tree> trees,
java.util.Collection<Tree> rawTrees)
- This method should populate wordIndex, tagIndex, and morphIndex.
- Specified by:
train
in interface Lexicon
- Overrides:
train
in class BaseLexicon
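As a rough illustration of what populating wordIndex, tagIndex, and morphIndex might involve, the sketch below assigns a dense integer id to each distinct word, tag, and morphological analysis it encounters. Everything in it is hypothetical: SimpleIndex is only a stand-in for an index type, the {word, tag, morph} triples replace the Tree collections that train actually receives, and the example strings are invented.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class IndexPopulationSketch {

    /** Hypothetical stand-in for an index: maps each distinct string to a dense int id. */
    static final class SimpleIndex {
        private final Map<String, Integer> ids = new HashMap<>();
        private final List<String> items = new ArrayList<>();

        /** Adds the string if it is new and returns its id either way. */
        int addToIndex(String s) {
            Integer id = ids.get(s);
            if (id == null) {
                id = items.size();
                ids.put(s, id);
                items.add(s);
            }
            return id;
        }

        /** Returns the id of the string, or -1 if it has not been added. */
        int indexOf(String s) {
            return ids.getOrDefault(s, -1);
        }

        int size() {
            return items.size();
        }
    }

    final SimpleIndex wordIndex = new SimpleIndex();
    final SimpleIndex tagIndex = new SimpleIndex();
    final SimpleIndex morphIndex = new SimpleIndex();

    /** Each observation is a hypothetical {word, tag, morph} triple. */
    void train(List<String[]> observations) {
        for (String[] obs : observations) {
            wordIndex.addToIndex(obs[0]);
            tagIndex.addToIndex(obs[1]);
            morphIndex.addToIndex(obs[2]);
        }
    }

    public static void main(String[] args) {
        IndexPopulationSketch lex = new IndexPopulationSketch();
        List<String[]> data = new ArrayList<>();
        data.add(new String[] {"dogs", "NNS", "Number=Plur"});
        data.add(new String[] {"dog", "NN", "Number=Sing"});
        data.add(new String[] {"cats", "NNS", "Number=Plur"});
        lex.train(data);
        System.out.println(lex.wordIndex.size() + " words, "
            + lex.tagIndex.size() + " tags, "
            + lex.morphIndex.size() + " morph analyses");
    }
}
```

Keeping three separate indices lets a word, its tag, and its morphological analysis each be referred to by a compact int, which matches the int-based signatures of ruleIteratorByWord and IntTaggedWord.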
initRulesWithWord
protected void initRulesWithWord()
- Rule table is lemmas!
- Overrides:
initRulesWithWord
in class BaseLexicon
main
public static void main(java.lang.String[] args)
- Parameters:
args
-
Stanford NLP Group