"Scaling to Very Very Large Corpora forNatural Language Disambiguation", Michele Banko and Eric Brill, Microsoft Research, 2001 paper
This NLP paper shows that as the quantity of high-quality training data increases, test accuracy improves significantly for every learner evaluated, simple models as well as complex ones. To get high-quality results from data in NLP, the authors argue that training corpora should grow well beyond the roughly one million words typical of the state of the art at the time, since "hundreds of billions of words" are readily available on the internet and that volume keeps growing. A minimal learning-curve sketch of this kind of experiment is given below.
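The sketch below illustrates the scaling experiment in spirit only: train a classifier on increasingly large slices of a corpus and watch held-out accuracy on a confusion-set disambiguation task. The confusion set ({then, than}), the context-window size, the logistic-regression learner, and the `corpus.txt` file are all illustrative assumptions, not the paper's actual learners, feature sets, or data.

```python
# Learning-curve sketch for confusion-set disambiguation (assumptions noted above).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

CONFUSION_SET = {"then", "than"}   # example confusion set (assumed, not from the paper)
WINDOW = 3                         # words of context kept on each side (assumed)

def make_examples(sentences):
    """Turn raw sentences into (context, label) pairs.

    Each occurrence of a confusion-set word becomes one example: the label is
    the word actually written, the features are the surrounding words. This is
    why such data is effectively "free" -- no manual annotation is needed.
    """
    contexts, labels = [], []
    for sent in sentences:
        tokens = sent.lower().split()
        for i, tok in enumerate(tokens):
            if tok in CONFUSION_SET:
                left = tokens[max(0, i - WINDOW):i]
                right = tokens[i + 1:i + 1 + WINDOW]
                contexts.append(" ".join(left + right))
                labels.append(tok)
    return contexts, labels

def learning_curve(sentences, train_sizes):
    """Train on increasingly large slices and report held-out accuracy."""
    contexts, labels = make_examples(sentences)
    X_txt_train, X_txt_test, y_train, y_test = train_test_split(
        contexts, labels, test_size=0.2, random_state=0)
    results = []
    for n in train_sizes:
        vec = CountVectorizer()
        X_train = vec.fit_transform(X_txt_train[:n])   # bag-of-words over the context window
        X_test = vec.transform(X_txt_test)
        clf = LogisticRegression(max_iter=1000).fit(X_train, y_train[:n])
        results.append((n, clf.score(X_test, y_test)))
    return results

if __name__ == "__main__":
    # `corpus.txt` is a placeholder: one sentence per line of any large text dump.
    with open("corpus.txt", encoding="utf8") as f:
        sentences = f.read().splitlines()
    for n, acc in learning_curve(sentences, [1_000, 10_000, 100_000]):
        print(f"train examples: {n:>7}  accuracy: {acc:.3f}")
```

With a large enough corpus, the expectation (mirroring the paper's finding) is that accuracy keeps climbing as the training slice grows, even for a simple learner like this one.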