BERT-BASED AUTHORSHIP ATTRIBUTION ON THE ROMANIAN DATASET CALLED ROST
DOI:
https://doi.org/10.24193/subbi.2025.03

Keywords:
authorship attribution, BERT, ROST

Abstract
Although it has been studied for decades, authorship attribution remains an active research problem. Among the more recent instruments are pre-trained language models, the most prevalent being BERT. Here we used such a model to identify the authors of texts written in Romanian. The dataset is highly heterogeneous and unbalanced: the texts differ significantly in the number of samples per author, the sources from which they were collected, the period in which the authors lived and wrote, the medium for which they were intended (paper or online), and the type of writing (stories, short stories, fairy tales, novels, literary articles, and sketches). The results are better than expected, sometimes exceeding 87% macro-accuracy.
2010 Mathematics Subject Classification. 68T50.
1998 CR Categories and Descriptors. I.2.7 [Artificial Intelligence]: Natural Language Processing – Text Analysis; I.2.6 [Artificial Intelligence]: Learning – Induction.
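To make the pipeline described in the abstract concrete, the sketch below fine-tunes a publicly available Romanian BERT checkpoint for multi-class author classification and scores it with macro-accuracy, i.e., per-author accuracy averaged over authors, which for single-label classification equals macro-averaged recall. This is a minimal illustration, not the authors' exact code: the checkpoint name, hyperparameters, and toy corpus are assumptions.

```python
# Minimal sketch (not the authors' exact setup): fine-tune a Romanian BERT
# checkpoint for authorship attribution and report macro-accuracy.
# Model name, hyperparameters, and the toy corpus are assumptions.
import torch
from torch.optim import AdamW
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from sklearn.metrics import recall_score

MODEL_NAME = "dumitrescustefan/bert-base-romanian-cased-v1"  # assumed checkpoint

# Toy stand-ins for ROST samples: (text, author-id) pairs.
train_texts = ["Primul text, scris de primul autor.",
               "Al doilea text, scris de al doilea autor."]
train_labels = [0, 1]
test_texts = ["Un text de test, cu autor necunoscut."]
test_labels = [0]

num_authors = len(set(train_labels))
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=num_authors)

# Tokenize with truncation: BERT accepts at most 512 subword tokens per text.
enc = tokenizer(train_texts, truncation=True, padding=True,
                max_length=512, return_tensors="pt")
labels = torch.tensor(train_labels)

optimizer = AdamW(model.parameters(), lr=2e-5)  # assumed learning rate
model.train()
for _ in range(3):  # a few passes over the toy batch
    loss = model(**enc, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Predict an author id for each held-out text.
model.eval()
with torch.no_grad():
    test_enc = tokenizer(test_texts, truncation=True, padding=True,
                         max_length=512, return_tensors="pt")
    preds = model(**test_enc).logits.argmax(dim=-1).tolist()

# Macro-accuracy: per-author accuracy averaged over authors; for single-label
# classification this equals macro-averaged recall.
print("macro-accuracy:", recall_score(test_labels, preds, average="macro",
                                      zero_division=0))
```

On the real dataset one would additionally split the corpus by author, batch the training data with a DataLoader, and reconsider the 512-token truncation, since longer forms such as novels exceed a single BERT input window.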
License
Copyright (c) 2025 Studia Universitatis Babeș-Bolyai Informatica

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.