WORD AND PUNCTUATION N-GRAM FEATURES IN ROMANIAN AUTHORSHIP ATTRIBUTION
DOI:
https://doi.org/10.24193/subbi.2025.04
Keywords:
authorship attribution, machine learning, N-gram features
Abstract
This study addresses the problem of authorship attribution (AA) for Romanian texts, focusing on N-gram features with an emphasis on semantically independent representations. While character N-grams have been studied previously, this work extends the exploration to word and part-of-speech (POS) N-grams, as well as combinations involving punctuation, closed-class words, and filtered content words. Using the ROST corpus, we evaluate six supervised learning algorithms, with results averaged over multiple runs to ensure robustness. Our experiments show that Artificial Neural Networks (ANN) consistently achieve the highest performance, with word unigrams enhanced by punctuation reaching an average macro-accuracy of 0.93. Importantly, semantically independent features, such as closed-class words and POS replacements for nouns and verbs, yield small further improvements. These findings highlight the effectiveness of carefully designed N-gram features for Romanian AA and suggest that semantically independent representations can complement traditional lexical approaches.
2010 Mathematics Subject Classification. 68T50.
1998 CR Categories and Descriptors. code [Artificial Intelligence]: Natural Language Processing – Text Analysis; code [Artificial Intelligence]: Learning – Induction.
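As a rough illustration of the feature families named in the abstract (word N-grams that keep punctuation as tokens, and a semantically independent variant that retains only closed-class words and punctuation), the following minimal Python sketch may help. It is not the authors' implementation. The closed-class word list, the `<w>` placeholder, and all function names are hypothetical. Masking open-class words with a single placeholder is a simplification swapped in for the POS replacement described in the abstract, which would require a Romanian POS tagger.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny closed-class list, for illustration only.
# A real Romanian function-word list would be far larger.
CLOSED_CLASS = {"și", "în", "de", "la", "cu", "pe", "un", "o",
                "care", "nu", "dar", "eu", "el"}

# A token is either a run of word characters or one punctuation character.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    return TOKEN_RE.findall(text.lower())


def is_punctuation(token):
    """True if the token does not start with a word character."""
    return re.match(r"\w", token) is None


def ngram_counts(tokens, n=1):
    """Count contiguous n-grams (as tuples) over a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def mask_open_class(tokens, placeholder="<w>"):
    """Keep punctuation and closed-class words; mask every other word.

    Assumption: this stands in for POS replacement, which needs a tagger.
    """
    return [t if is_punctuation(t) or t in CLOSED_CLASS else placeholder
            for t in tokens]


if __name__ == "__main__":
    sample = "Nu vin, dar vin cu el!"
    tokens = tokenize(sample)
    print(ngram_counts(tokens, n=1))                   # word + punctuation unigrams
    print(ngram_counts(mask_open_class(tokens), n=2))  # masked bigrams
```

The resulting counts could be normalized per document and passed to any supervised classifier. Keeping punctuation as ordinary tokens lets the same counting routine cover both the lexical and the masked variants.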
License
Copyright (c) 2025 Studia Universitatis Babeș-Bolyai Informatica

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.