Abstract: With the ever-growing amounts of textual data from a large variety of languages, domains and genres, it has become standard to evaluate NLP algorithms on multiple datasets in order to ensure consistent performance across heterogeneous setups. However, such multiple comparisons pose significant challenges to traditional statistical analysis methods in NLP and can lead to erroneous conclusions. In this paper we propose a Replicability Analysis framework for a statistically sound analysis of multiple comparisons between algorithms for NLP tasks. We discuss the theoretical advantages of this framework over the current, statistically unjustified, practice in the NLP literature, and demonstrate its empirical value across four applications: multi-domain dependency parsing, multilingual POS tagging, cross-domain sentiment classification and word similarity prediction.
Authors: Rotem Dror, Gili Baumer, Marina Bogomolov, Roi Reichart (IIT)
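To illustrate the multiple-comparisons problem the abstract refers to, the sketch below applies a standard family-wise error-rate correction (Holm-Bonferroni) to a set of p-values, such as those obtained from comparing two algorithms across several datasets. This is a generic, well-known adjustment used here only as an illustration; it is an assumption for exposition and not necessarily the exact replicability-analysis procedure the paper proposes, and the p-values are hypothetical.

```python
# Minimal sketch of a multiple-comparisons correction (Holm-Bonferroni).
# Hypothetical example: p-values from comparing two algorithms on five
# datasets; without correction, testing each at alpha = 0.05 inflates the
# chance of at least one false rejection.

def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans (True = null rejected) controlling the
    family-wise error rate at level alpha via Holm's step-down procedure."""
    m = len(p_values)
    # Sort p-values in ascending order, remembering original positions.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # Holm's step-down threshold for the (rank+1)-th smallest p-value.
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject

# Hypothetical p-values from five per-dataset comparisons.
pvals = [0.001, 0.012, 0.049, 0.02, 0.3]
print(holm_bonferroni(pvals))  # → [True, True, False, False, False]
```

Note that 0.049 would pass an uncorrected 0.05 threshold but is rejected as non-significant once the number of comparisons is accounted for, which is exactly the kind of erroneous conclusion the abstract warns about.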