Pydsbuilder – A Dataset Builder Written in Python Django
DOI:
https://doi.org/10.24193/subbi.2024.2.01
Keywords:
Data Mining, Software Quality Analysis Tools, Software Quality, Datasets, Dataset Builder, GitHub Mining
Abstract
Data mining and the analysis of open-source projects have become crucial in recent research, driven by the vast availability of data across multiple programming domains. This paper has two main objectives: first, to present an experience report on designing a software quality data mining tool, and second, to provide an open-source solution, PyDs, that facilitates the creation of datasets specifically aimed at analyzing software quality attributes. Built with Python and the Django framework, PyDs gives researchers a comprehensive solution that covers data extraction from repositories, the application of software analysis tools, and the consolidation of results into a coherent format suited to in-depth experimentation and analysis. The tool addresses the need for effective data mining capabilities in evaluating software quality, allowing the research community to make fuller use of the resources offered by open-source software projects.
Received by editors: 13 September 2024
2010 Mathematics Subject Classification. 68N99.
1998 CR Categories and Descriptors. D2.0 [Software Engineering]: General – Standards; D2.9 [Software Engineering]: Management – Software Quality Assurance
License
Copyright (c) 2025 Studia Universitatis Babeș-Bolyai Informatica

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.