[[Research topics]]

*Statistical Estimation of Vocabulary Size Including "Unseen" Words. [#z777c4ac]

The number of unique words in children’s speech is one of the most basic statistics indicating their language development. It is difficult, however, to accurately evaluate the number of unique words in a child’s growing corpus over time when the sample size is limited. This study proposes a novel technique to estimate the latent number of words from a series of words uttered by children. The technique utilizes statistical properties of the number of types as a function of the number of sampled tokens. We tested the practical effectiveness of the proposed method in empirical analyses of cross-sectional and longitudinal samples. The converging empirical evidence suggests that the proposed estimator improves the accuracy of vocabulary size estimation over naïve type-counting estimators. Utilizing this efficient estimator, we propose a new sampling scheme for vocabulary assessment that has lower cost and higher accuracy than existing methods.
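
To make the quantities above concrete, the following is a minimal sketch of the number of types as a function of the number of sampled tokens, together with the naïve type-counting estimator. It is not the proposed estimator. The function names and the Zipf-like word frequencies used to simulate speech are assumptions made purely for illustration.

```python
import random


def type_token_curve(tokens):
    """Return the number of distinct types observed after each token."""
    seen = set()
    curve = []
    for tok in tokens:
        seen.add(tok)
        curve.append(len(seen))
    return curve


def naive_type_count(tokens):
    """Naive estimator: count only the types that appear in the sample."""
    return len(set(tokens))


def simulate_sample(vocab_size, n_tokens, seed=0):
    """Draw tokens from a hypothetical vocabulary.

    Assumption for illustration only: word frequencies are proportional
    to 1 / rank (Zipf-like). Real child speech may differ.
    """
    rng = random.Random(seed)
    words = list(range(1, vocab_size + 1))
    weights = [1.0 / rank for rank in words]
    return rng.choices(words, weights=weights, k=n_tokens)


# Hypothetical setting: a latent vocabulary of 1000 words, 500 sampled tokens.
sample = simulate_sample(vocab_size=1000, n_tokens=500)
curve = type_token_curve(sample)
print("types after 100 tokens:", curve[99])
print("types after 500 tokens:", curve[-1])
print("naive estimate:", naive_type_count(sample), "of 1000 latent types")
```

Because a small sample cannot contain every word, the naïve count can only undershoot the latent vocabulary size, which is the gap an estimator that accounts for "unseen" words aims to close.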

*** Keywords [#rbf3acc1]
Vocabulary growth; Small sample size; Number of latent types; Type–token ratio


(Hidaka, 2014, Biometrika).

These findings were published in the Journal of Child Language (Hidaka, S., accepted).


**Related papers (see also [[other publications>Publications]]) [#j8db609c]

