VocabSizeEstimation

Statistical Estimation of Vocabulary Size Including "Unseen" Words.

The number of unique words in children’s speech is one of the most basic statistics indicating their language development. With a limited sample size, however, it is difficult to accurately evaluate the number of unique words in a child’s growing corpus over time. This study proposes a novel technique to estimate the latent number of words from a series of words uttered by children. The technique utilizes statistical properties of the number of types as a function of the number of sampled tokens. We tested the practical effectiveness of the proposed method in empirical analyses of cross-sectional and longitudinal samples. The converging empirical evidence suggests that the proposed estimator improves the accuracy of vocabulary size estimation over naïve type-counting estimators. Utilizing this efficient estimator, we propose a new sampling scheme for vocabulary assessment that has lower cost and higher accuracy than existing methods.
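
The gap between the observed and the latent number of types can be illustrated with a small simulation. The following is a minimal sketch, not the estimator proposed in the study: it assumes a hypothetical Zipf-like word distribution over a known vocabulary (`zipf_probs` and `type_token_curve` are illustrative names) and records the naïve type count as tokens are sampled.

```python
import random


def zipf_probs(num_types, exponent=1.0):
    """Hypothetical Zipf-like word probabilities (an assumption of this sketch)."""
    weights = [1.0 / (rank ** exponent) for rank in range(1, num_types + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def type_token_curve(probs, num_tokens, seed=0):
    """Number of distinct types observed after each sampled token.

    Tokens are drawn independently from `probs`; entry n-1 of the returned
    list is the naive type count after n tokens.
    """
    rng = random.Random(seed)
    tokens = rng.choices(range(len(probs)), weights=probs, k=num_tokens)
    seen = set()
    curve = []
    for token in tokens:
        seen.add(token)
        curve.append(len(seen))
    return curve


if __name__ == "__main__":
    latent_vocab = 500  # true number of types, known only in simulation
    curve = type_token_curve(zipf_probs(latent_vocab), num_tokens=2000)
    for n in (100, 500, 1000, 2000):
        print(f"tokens={n:5d}  observed types={curve[n - 1]:4d}  latent types={latent_vocab}")
```

Under these assumptions the observed type count keeps growing with the sample size and stays below the latent vocabulary size, which is the bias that a type-counting estimator inherits from a small sample.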

Keywords

Vocabulary growth; Small sample size; Number of latent types; Type–token ratio

Estimating Vocabulary Size Including Unobserved Words

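We proved that, when words are sampled according to an arbitrary probability distribution, the distribution of the number of word types given the number of sampled words asymptotically follows a Poisson binomial distribution (Hidaka, 2014, Biometrika). With this result, data on the number of word types as a function of the number of sampled words can be used to statistically estimate how many unobserved word types latently exist. Applying it therefore makes it possible to estimate the number of unknown types more accurately in a variety of fields, for example the number of words acquired by young children during language acquisition, the number of words in corpus data, and the number of species in an ecosystem.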
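
The Poisson binomial approximation can be checked numerically. The following is a minimal sketch under stated assumptions: tokens are drawn independently, and the success probabilities of the Poisson binomial are taken to be the per-type occurrence probabilities q_i = 1 - (1 - p_i)^n. That parameterization is an assumption of this sketch and is not stated on this page, and all function names are illustrative.

```python
import random


def occurrence_probs(probs, num_tokens):
    """q_i = 1 - (1 - p_i)^n: chance that type i appears at least once in n draws."""
    return [1.0 - (1.0 - p) ** num_tokens for p in probs]


def poisson_binomial_mean_var(q):
    """Mean and variance of a sum of independent Bernoulli(q_i) variables.

    The mean equals the exact expected type count (linearity of expectation);
    treating the indicators as independent is the approximation.
    """
    mean = sum(q)
    var = sum(qi * (1.0 - qi) for qi in q)
    return mean, var


def poisson_binomial_pmf(q):
    """Exact pmf of the Poisson binomial, built one Bernoulli at a time."""
    pmf = [1.0]
    for qi in q:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1.0 - qi)   # type i not observed
            nxt[k + 1] += mass * qi       # type i observed
        pmf = nxt
    return pmf


def simulate_type_counts(probs, num_tokens, trials, seed=0):
    """Monte Carlo type counts for comparison with the approximation."""
    rng = random.Random(seed)
    ids = range(len(probs))
    return [len(set(rng.choices(ids, weights=probs, k=num_tokens)))
            for _ in range(trials)]


if __name__ == "__main__":
    probs = [1.0 / 50] * 50  # hypothetical uniform vocabulary of 50 types
    mean, var = poisson_binomial_mean_var(occurrence_probs(probs, 100))
    counts = simulate_type_counts(probs, 100, trials=2000)
    print("approximate mean:", mean, "simulated mean:", sum(counts) / len(counts))
```

Because the mean of the type count depends on the latent distribution and the number of sampled tokens, an observed pair (tokens, types) carries information about the types that were never sampled, which is what makes estimating the latent vocabulary size possible.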

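Building on this theoretical result, we analyzed the vocabulary of actual young children and found that vocabulary size can be estimated more accurately than with the conventional approach of simply counting the observed words. This work was published in the Journal of Child Language (Hidaka, 2016).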

Hidaka, S. (2014). General type-token distribution. Biometrika, 101(4), 999-1002. doi: 10.1093/biomet/asu035. (First published online: August 17, 2014)
Hidaka, S. (2016). Estimating the latent number of types in growing corpora with reduced cost–accuracy trade-off. Journal of Child Language, 43, 107-134.

Related papers (see also other publications)

Hidaka, S. (2013). General Type Token Distribution. eprint arXiv:1305.0328.
Proof of the theoretical relationship between the number of sampled words and the number of word types acquired by young children: estimating latent vocabulary size from the tip of the iceberg. Press release, Japan Advanced Institute of Science and Technology (September 10, 2014).
Measuring young children's vocabulary with statistics. Hokkoku Shimbun (September 11, 2014).
Hidaka, S. (2009). A Sample-size-invariant Estimation of Lexical Diversity. In Proceedings of the Thirty-First Annual Meeting of the Cognitive Science Society.
