Our EMNLP presentation for:
Michelle Yuan, Hsuan-Tien Lin, and Jordan Boyd-Graber. Cold-start Active Learning through Self-Supervised Language Modeling. Empirical Methods in Natural Language Processing, 2020.
Featuring Michelle Yuan
Labeling data is a fundamental bottleneck in machine learning, especially for NLP, because annotation is costly and time-consuming. For medical text, obtaining labeled data is even harder because of privacy concerns and a shortage of expertise. Active learning addresses this by identifying the most relevant examples and querying their labels from an oracle. However, developing a strategy for selecting which examples to label is non-trivial. Active learning is especially difficult in the cold-start setting: all examples confuse the model because it has not yet trained on enough data. Fortunately, modern NLP offers an additional source of information: pre-trained language models. In our paper, we propose an active learning strategy called ALPS that finds sentences that perplex the language model. We evaluate our approach on sentence classification datasets spanning different domains. Results show that ALPS is an efficient active learning strategy that is competitive with state-of-the-art approaches.
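To make the core idea concrete, here is a minimal sketch of surprisal-based cold-start selection: rank unlabeled sentences by how much they "perplex" a language model and send the most surprising ones to the annotator. ALPS itself scores sentences with the masked language modeling loss of a pre-trained model like BERT; to keep this sketch self-contained and runnable, a toy add-one-smoothed unigram model stands in for the pre-trained LM, and all names here are illustrative, not from the paper's code.

```python
import math
from collections import Counter

def unigram_surprisal(sentence, counts, total, vocab_size):
    # Average per-token surprisal under an add-one-smoothed unigram model.
    # This is a toy stand-in for the masked LM loss that ALPS uses.
    tokens = sentence.lower().split()
    return sum(
        -math.log((counts[t] + 1) / (total + vocab_size)) for t in tokens
    ) / len(tokens)

def select_cold_start(pool, k):
    # Score every unlabeled sentence, then return the k most "perplexing"
    # ones as the first batch to label, before any task model exists.
    counts = Counter(t for s in pool for t in s.lower().split())
    total = sum(counts.values())
    ranked = sorted(
        pool,
        key=lambda s: unigram_surprisal(s, counts, total, len(counts)),
        reverse=True,
    )
    return ranked[:k]

pool = [
    "the movie was good",
    "the movie was bad",
    "quantum entanglement puzzled the committee",
]
# The sentence with the rarest tokens gets the highest surprisal.
print(select_cold_start(pool, 1))
```

Swapping the toy scorer for a real masked-LM loss (and, as in the paper, clustering the resulting surprisal representations rather than just taking the top-k) recovers something closer to the actual ALPS strategy.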