Will It Unblend? An EMNLP 2020 Paper


"Will It Unblend" is research conducted by Yuval Pinter, Cassandra Jacobs, Jacob Eisenstein at Georgia Tech and University of Wisconsin. Abstract: Natural language processing systems often struggle with out-of-vocabulary (OOV) terms, which do not appear in training data. Blends, such as "innoventor", are one particularly challenging class of OOV, as they are formed by fusing together two or more bases that relate to the intended meaning in unpredictable manners and degrees. In this work, we run experiments on a novel dataset of English OOV blends to quantify the difficulty of interpreting the meanings of blends by large-scale contextual language models such as BERT. We first show that BERT's processing of these blends does not fully access the component meanings, leaving their contextual representations semantically impoverished. We find this is mostly due to the loss of characters resulting from blend formation. Then, we assess how easily different models can recognize the structure and recover the origin of blends, and find that context-aware embedding systems outperform character-level and context-free embeddings, although their results are still far from satisfactory. This work was accepted to the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Pinter is affiliated with the Machine Learning Center at Georgia Tech. About ML@GT The Machine Learning Center was founded in 2016 as an interdisciplinary research center (IRC) at the Georgia Institute of Technology. Since then, we have grown to include over 190 affiliated faculty members and 60 Ph.D. students, all publishing at world-renowned conferences. The center aims to research and develop innovative and sustainable technologies using machine learning and artificial intelligence (AI) that serve our community in socially and ethically responsible ways. www.ml.gatech.edu