Beyond Accuracy: Grounding Evaluation Metrics for Human-Machine Learning Systems