Why benchmarks are crucial for progress in AI and how to design a good one for ES