Not Your Grandfather's Test Set: Reducing Labeling Effort for Testing