Sunday, October 17, 2021

Summary of the findings of Michele Banko and Eric Brill's 2001 Microsoft Research paper

"Scaling to Very Very Large Corpora for Natural Language Disambiguation", Michele Banko and Eric Brill, Microsoft Research, 2001


This NLP paper seems to show that as the quantity of high-quality training data increases, the test accuracy of all the learners studied improves significantly, for complex models and simple models alike. The implication is that, to get high-quality results from data in NLP, the size of the training corpus (total words, not unique vocabulary) should go well beyond the roughly one million words that were the norm at the time, given that "hundreds of billions of words" are readily available on the internet and that amount continues to grow.
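
To make the idea of a learning curve concrete, here is a minimal sketch in Python. The paper's task is confusion set disambiguation (choosing between easily confused words such as "then" and "than"), but everything else below is an assumption for illustration only: the toy sentences, the single confusion set, and the simple count-based voting learner are hypothetical and are not the learners or the data used by Banko and Brill. The sketch trains on growing prefixes of a tiny corpus and reports test accuracy at each size.

```python
from collections import Counter, defaultdict

# Hypothetical confusion set, chosen only for illustration.
CONFUSION_SET = ("then", "than")


def extract_examples(sentences, confusion_set=CONFUSION_SET):
    """Return (previous_word, next_word, label) for each confusable word."""
    examples = []
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            if token in confusion_set:
                prev_word = tokens[i - 1] if i > 0 else "<s>"
                next_word = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
                examples.append((prev_word, next_word, token))
    return examples


def train(examples):
    """Count how often each label co-occurs with each neighboring word."""
    context_counts = defaultdict(Counter)
    label_counts = Counter()
    for prev_word, next_word, label in examples:
        context_counts[("prev", prev_word)][label] += 1
        context_counts[("next", next_word)][label] += 1
        label_counts[label] += 1
    return context_counts, label_counts


def predict(model, prev_word, next_word):
    """Vote using neighbor counts; fall back to the most frequent label."""
    context_counts, label_counts = model
    votes = Counter()
    votes.update(context_counts.get(("prev", prev_word), {}))
    votes.update(context_counts.get(("next", next_word), {}))
    if not votes:
        votes = label_counts
    if not votes:
        return CONFUSION_SET[0]  # nothing seen in training at all
    return votes.most_common(1)[0][0]


def learning_curve(train_sentences, test_sentences, sizes):
    """Train on the first n sentences for each n and measure test accuracy."""
    test_examples = extract_examples(test_sentences)
    curve = []
    for n in sizes:
        model = train(extract_examples(train_sentences[:n]))
        correct = sum(
            predict(model, prev_word, next_word) == label
            for prev_word, next_word, label in test_examples
        )
        curve.append((n, correct / len(test_examples)))
    return curve


# Toy data, made up for this sketch (not from the paper).
TRAIN_SENTENCES = [
    "she is taller than him",
    "we ate and then we left",
    "more than enough",
    "first this then that",
    "better than before",
    "and then it rained",
]
TEST_SENTENCES = ["he is older than me", "and then we slept"]

if __name__ == "__main__":
    for size, accuracy in learning_curve(TRAIN_SENTENCES, TEST_SENTENCES, [1, 3, 6]):
        print(size, accuracy)
```

A corpus this small says nothing about real accuracy; the point is only the shape of the experiment: hold the test set fixed, grow the training set, and plot accuracy against training size for each learner being compared.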
