Statistical Estimation of Vocabulary Size Including "Unseen" Words

The number of unique words in children’s speech is one of the most basic statistics indicating their language development. It is difficult, however, to accurately evaluate the number of unique words in a child’s growing corpus over time from a limited sample. This study proposes a novel technique to estimate the latent number of words from a series of words uttered by children. The technique exploits statistical properties of the number of types as a function of the number of sampled tokens. We tested the practical effectiveness of the proposed method in empirical analyses of cross-sectional and longitudinal samples. The converging empirical evidence suggests that the proposed estimator improves the accuracy of vocabulary size estimation over a naïve type-counting estimator. Using this efficient estimator, we propose a new sampling scheme for vocabulary assessment that has lower cost and higher accuracy than existing methods.

Vocabulary growth; Small sample size; Number of latent types; Type–token ratio


