Beyond Accuracy: Behavioral Testing of NLP Models with CheckList (Best Paper ACL 2020)



May 20, 2021
#checklist #evaluation #software

Is accuracy enough to test NLP models? Clearly NOT, and this paper addresses exactly that. Both commercial and research state-of-the-art NLP models show serious limitations when tested with this framework. Watch the video to know more :) This paper also won the Best Paper Award at the ACL 2020 conference.

⏩ Abstract: Although measuring held-out accuracy has been the primary approach to evaluate generalization, it often overestimates the performance of NLP models, while alternative approaches for evaluating models either focus on individual tasks or on specific behaviors. Inspired by principles of behavioral testing in software engineering, we introduce CheckList, a task-agnostic methodology for testing NLP models. CheckList includes a matrix of general linguistic capabilities and test types that facilitate comprehensive test ideation, as well as a software tool to generate a large and diverse number of test cases quickly. We illustrate the utility of CheckList with tests for three tasks, identifying critical failures in both commercial and state-of-the-art models. In a user study, a team responsible for a commercial sentiment analysis model found new and actionable bugs in an extensively tested model. In another user study, NLP practitioners with CheckList created twice as many tests, and found almost three times as many bugs as users without it.
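To give a feel for the "software tool" the abstract mentions: CheckList generates many test cases from a single template by filling placeholder slots with lexicon entries. Below is a minimal self-contained sketch of that template-expansion idea in plain Python, not the actual `checklist` library API; the function name `expand_template` and the example lexicons are my own illustrative choices.

```python
from itertools import product

def expand_template(template, **lexicons):
    """Fill every '{slot}' in the template with all combinations
    of the given lexicon entries, yielding one test case each."""
    keys = list(lexicons)
    return [
        template.format(**dict(zip(keys, values)))
        for values in product(*(lexicons[k] for k in keys))
    ]

# A Minimum Functionality Test (MFT) for sentiment: every filled-in
# sentence should be predicted 'negative' by the model under test.
cases = expand_template(
    "I {neg} the {thing}.",
    neg=["hate", "dislike", "regret"],
    thing=["flight", "food"],
)
print(len(cases))  # 3 verbs x 2 nouns = 6 test cases
```

Each generated sentence would then be fed to the model, and any non-negative prediction counts as a failure for that capability, which is how the paper surfaces bugs that aggregate held-out accuracy hides.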
Please feel free to share the content and subscribe to my channel :)
⏩ Subscribe - https://youtube.com/channel/UCoz8NrwgL7U9535VNc0mRPA?sub_confirmation=1

⏩ OUTLINE:
0:00 - Abstract and Methodology
4:57 - Sentiment Classification SOTA models performance
11:06 - Quora Question Pair Classification SOTA models performance
13:08 - My thoughts

⏩ Paper Title: Beyond Accuracy: Behavioral Testing of NLP Models with CheckList
⏩ Paper: https://www.aclweb.org/anthology/2020.acl-main.442.pdf
⏩ Code: https://github.com/marcotcr/checklist
⏩ Authors: Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, Sameer Singh
⏩ Organisations: Microsoft Research, University of Washington, University of California, Irvine

⏩ IMPORTANT LINKS
Blog: https://link.medium.com/2foaMWiyHdb
Evaluating Text Generation Systems: https://www.youtube.com/watch?v=-CIlz-5um7U&list=PLsAqq9lZFOtXlzg5RNyV00ueE89PwnCbu

*********************************************
If you want to support me financially, which is totally optional and voluntary :)
❤️ You can consider buying me chai (because I don't drink coffee :)) at https://www.buymeacoffee.com/TechvizCoffee
*********************************************

⏩ Youtube - https://www.youtube.com/c/TechVizTheDataScienceGuy
⏩ Blog - https://prakhartechviz.blogspot.com
⏩ LinkedIn - https://linkedin.com/in/prakhar21
⏩ Medium - https://medium.com/@prakhar.mishra
⏩ GitHub - https://github.com/prakhar21
⏩ Twitter - https://twitter.com/rattller

*********************************************
Tools I use for making videos :)
⏩ iPad - https://tinyurl.com/y39p6pwc
⏩ Apple Pencil - https://tinyurl.com/y5rk8txn
⏩ GoodNotes - https://tinyurl.com/y627cfsa

#techviz #datascienceguy #ai #researchpaper #naturallanguageprocessing #accuracy
