[[Research topics]]

*Statistical Estimation of Vocabulary Size Including "Unseen" Words. [#z777c4ac]

The number of unique words in children's speech is one of the most basic statistics indicating their language development. It is difficult, however, to accurately evaluate the number of unique words in a child's growing corpus over time when the sample size is limited. This study proposes a novel technique for estimating the latent number of words from a series of words uttered by children. The technique exploits statistical properties of the number of types as a function of the number of sampled tokens. We tested the practical effectiveness of the proposed method in empirical analyses of cross-sectional and longitudinal samples. The converging empirical evidence suggests that the proposed estimator improves the accuracy of vocabulary size estimation over naive type-counting estimators. Using this efficient estimator, we propose a new sampling scheme for vocabulary assessment that has lower cost and higher accuracy than existing methods.
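
To illustrate the general idea of "types as a function of sampled tokens", the following minimal sketch computes a type-token growth curve and compares naive type counting with an estimate that allows for unseen types. It is not the estimator proposed in this study: as a stand-in it uses the bias-corrected Chao1 estimator, a standard lower-bound estimator from the species-richness literature. The function names and the toy sample are hypothetical.

```python
from collections import Counter


def type_token_curve(tokens):
    """Return the number of distinct types seen after each sampled token."""
    seen = set()
    curve = []
    for tok in tokens:
        seen.add(tok)
        curve.append(len(seen))
    return curve


def chao1_estimate(tokens):
    """Bias-corrected Chao1 estimate of the total number of types.

    Stand-in technique (assumption): uses only the counts of types seen
    exactly once (f1) and exactly twice (f2) to correct the observed count.
    """
    freq = Counter(tokens)
    s_obs = len(freq)                               # observed types
    f1 = sum(1 for c in freq.values() if c == 1)    # singletons
    f2 = sum(1 for c in freq.values() if c == 2)    # doubletons
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))


# Toy sample of uttered words (hypothetical data, for illustration only).
sample = "mama ball dog ball mama more dog juice ball mama".split()

curve = type_token_curve(sample)    # types observed as tokens accumulate
naive = len(set(sample))            # naive type-counting estimate
estimate = chao1_estimate(sample)   # estimate including unseen types
```

When many types occur only once, the corrected estimate rises above the naive count, reflecting words the child probably knows but did not happen to utter in the sample.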

*** Keywords [#rbf3acc1]
Vocabulary growth; Small sample size; Number of latent types; Type–token ratio;

*A Method for Estimating Vocabulary Size Including Unobserved Words [#f2383283]

//#ref(VocabGrowth/visualization_model.png,50%)

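//Figure description: (a) Event X is observed stochastically at a constant rate (top), and a word is acquired when the cumulative number of observations reaches four (middle, N=4). The age of acquisition in months follows a gamma distribution (bottom). (b) The observation frequency of event X increases with age in months (top, D>1), and a word is acquired after a single observation (middle). The age of acquisition in months follows a Weibull distribution (bottom).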

**Related papers (see also [[other publications>Publications]]) [#j8db609c]

#todo('',%VocabSizeEstimation%)

