HardML: A Benchmark for Evaluating Data Science and Machine Learning Knowledge and Reasoning in AI

Authors

Tudor-Vlad Pricope

DOI:

https://doi.org/10.24193/subbi.2024.2.04

Keywords:

Large Language Models, Machine Learning Education, Multiple Choice Benchmark, NLP Benchmarks, Evaluation of AI Systems

Abstract

We present HardML, a benchmark designed to evaluate knowledge and reasoning abilities in the fields of data science and machine learning. HardML comprises a diverse set of 100 challenging multiple-choice questions, handcrafted over a period of six months, covering the most popular and modern branches of data science and machine learning. These questions are difficult even for a typical Senior Machine Learning Engineer to answer correctly. To minimize the risk of data contamination, HardML consists mostly of original content devised by the author. Current state-of-the-art AI models achieve a 30% error rate on this benchmark, roughly three times higher than the error rate achieved on the comparable, well-known MMLU-ML. While HardML is limited in scope and does not aim to push the frontier, primarily due to its multiple-choice format, it serves as a rigorous and modern testbed for quantifying and tracking the progress of top AI models. Although plenty of benchmarks and LLM evaluation work exist in other STEM fields such as mathematics, physics, and chemistry, the sub-fields of data science and machine learning remain fairly underexplored.
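
For readers unfamiliar with how a multiple-choice benchmark of this kind is typically scored, the sketch below shows one way the headline numbers could be reproduced from per-question results. This is a hypothetical illustration, not the authors' evaluation harness: the JSONL format, the field names (predicted, correct), and the file paths are all assumptions.

```python
# Hypothetical scoring sketch for a 100-question multiple-choice benchmark
# such as HardML. Assumes a JSONL file in which each line records a model's
# predicted option letter and the gold option; this layout is an assumption,
# not the paper's actual data format.
import json


def error_rate(results_path: str) -> float:
    """Return the fraction of questions answered incorrectly."""
    total, wrong = 0, 0
    with open(results_path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)  # e.g. {"predicted": "B", "correct": "C"}
            total += 1
            wrong += record["predicted"] != record["correct"]
    return wrong / total if total else 0.0


if __name__ == "__main__":
    # Illustrative comparison mirroring the abstract: a ~30% error rate on
    # HardML versus a ~10% error rate on MMLU-ML would be roughly a 3x gap.
    hardml_err = error_rate("hardml_results.jsonl")    # hypothetical path
    mmlu_ml_err = error_rate("mmlu_ml_results.jsonl")  # hypothetical path
    print(f"HardML error rate:  {hardml_err:.1%}")
    print(f"MMLU-ML error rate: {mmlu_ml_err:.1%}")
    print(f"Relative gap:       {hardml_err / mmlu_ml_err:.1f}x")
```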

Received by editors: 22 January 2025.

2020 Mathematics Subject Classification. 68T50, 68T07, 68T05, 68T20.

References

1. Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of NAACL-HLT. Association for Computational Linguistics.

2. Brown, T. B., Mann, B., Ryder, N., et al. (2020). Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems, 33, 1877–1901.

3. Wang, A., Singh, A., Michael, J., et al. (2018). GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. In Proceedings of the EMNLP Workshop. Association for Computational Linguistics.

4. Raffel, C., Shazeer, N., Roberts, A., et al. (2020). Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research, 21(140), 1–67.

5. Hendrycks, D., Burns, C., Basart, S., et al. (2021). Measuring Massive Multitask Language Understanding. In Proceedings of the International Conference on Learning Representations (ICLR).

6. Hendrycks, D., Burns, C., Kadavath, S., et al. (2021). Measuring Mathematical Problem Solving with the MATH Dataset. In Advances in Neural Information Processing Systems.

7. Saxton, D., Grefenstette, E., Hill, F., & Kohli, P. (2019). Analysing Mathematical Reasoning Abilities of Neural Models. In International Conference on Learning Representations (ICLR).

8. Huang, K., Altosaar, J., & Ranganath, R. (2020). ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission. arXiv preprint arXiv:1904.05342.

9. Provost, F., & Fawcett, T. (2013). Data Science and its Relationship to Big Data and Data-Driven Decision Making. Big Data, 1(1), 51–59.

10. Jordan, M. I., & Mitchell, T. M. (2015). Machine Learning: Trends, Perspectives, and Prospects. Science, 349(6245), 255–260.

11. Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.

12. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer.

13. Dodge, J., Ilharco, G., Schwartz, R., et al. (2021). Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus. In Proceedings of the 2021 EMNLP Workshop on Datasets and Benchmarks. Association for Computational Linguistics.

14. Jia, Y., Weiss, R. J., Biadsy, F., et al. (2019). Direct Speech-to-Speech Translation with a Sequence-to-Sequence Model (Translatotron). arXiv preprint arXiv:1904.06037.

15. Glazer, E., Erdil, E., Besiroglu, T., Chicharro, D., Chen, E., Gunning, A., Olsson, C. F., Denain, J.-S., Ho, A., de Oliveira Santos, E., Järviniemi, O., Barnett, M., Sandler, R., Vrzala, M., Sevilla, J., Ren, Q., Pratt, E., Levine, L., Barkley, G., Stewart, N., Grechuk, B., Grechuk, T., Enugandla, S. V. V., & Wildon, M. (2024). FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI. arXiv preprint arXiv:2411.04872. Retrieved from https://doi.org/10.48550/arXiv.2411.04872.

16. Chan, J. S., Chowdhury, N., Jaffe, O., et al. (2024). MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering. arXiv preprint arXiv:2410.07095.

17. OpenAI. (2024). GPT-4o System Card. Retrieved from https://cdn.openai.com/gpt-4o-system-card.pdf

18. Anthropic. (2024). Introducing Claude. Retrieved from https://www.anthropic.com/news/introducing-claude

19. OpenAI. (2024). Hello GPT-4o. Retrieved from https://openai.com/index/hello-gpt-4o/

20. OpenAI. (2024). Introducing OpenAI o1. Retrieved from https://openai.com/index/introducing-openai-o1-preview/

21. OpenAI. (2024). GPT-4o Mini: Advancing Cost-Efficient Intelligence. Retrieved from https://openai.com/blog/gpt-4o-mini-advancing-cost-efficient-intelligence

22. Meta AI. (2024). Introducing Meta Llama 3: The most capable openly available LLM to date. Retrieved from https://ai.meta.com/blog/llama-3/

23. Lu, P., et al. (2024). MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts. arXiv preprint arXiv:2310.02255. Retrieved from https://arxiv.org/abs/2310.02255.

24. OpenAI. (2024). Learning to Reason with LLMs. Retrieved from https://openai.com/index/learning-to-reason-with-llms/

25. Drori, I., Zhang, S. J., Shuttleworth, R., et al. (2022). From Human Days to Machine Seconds: Automatically Answering and Generating Machine Learning Final Exams. arXiv preprint arXiv:2206.05442. Retrieved from https://doi.org/10.48550/arXiv.2206.05442.

26. Pfister, R., & Jud, H. (2025). Understanding and Benchmarking Artificial Intelligence: OpenAI’s o3 Is Not AGI. arXiv preprint arXiv:2501.07458. Retrieved from https://arxiv.org/abs/2501.07458.

Published

2025-03-16

How to Cite

PRICOPE, T.-V. (2025). HardML: A Benchmark for Evaluating Data Science and Machine Learning Knowledge and Reasoning in AI. Studia Universitatis Babeș-Bolyai Informatica, 69(2), 59–76. https://doi.org/10.24193/subbi.2024.2.04

Section

Articles
