OLDIES BUT GOLDIES: THE POTENTIAL OF CHARACTER N-GRAMS FOR ROMANIAN TEXTS

Authors

  • Dana LUPȘA, Department of Computer Science, Babeş-Bolyai University, Cluj-Napoca, Romania, dana.lupsa@ubbcluj.ro
  • Sanda-Maria AVRAM, Department of Computer Science, Babeş-Bolyai University, Cluj-Napoca, Romania, sanda.avram@ubbcluj.ro, https://orcid.org/0000-0002-2007-1661
  • Radu LUPȘA, Department of Computer Science, Babeş-Bolyai University, Cluj-Napoca, Romania, radu.lupsa@ubbcluj.ro

DOI:

https://doi.org/10.24193/subbi.2025.02

Keywords:

authorship attribution, machine learning, character N-gram

Abstract

This study addresses authorship attribution for Romanian texts using the ROST corpus, a standard benchmark in the field. We systematically evaluate six machine learning techniques with character n-gram features: Support Vector Machines (SVM), Logistic Regression (LR), k-Nearest Neighbors (k-NN), Decision Trees (DT), Random Forests (RF), and Artificial Neural Networks (ANN). Among these, the ANN model achieved the highest performance, including perfect classification in four out of fifteen runs when using 5-gram features. These results demonstrate that lightweight, interpretable character n-gram approaches can deliver state-of-the-art accuracy for Romanian authorship attribution, rivaling more complex methods. Our findings highlight the potential of simple stylometric features in resource-constrained or under-studied language settings.
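
As a rough illustration of what a character n-gram feature is, the sketch below counts overlapping character 5-grams in a short string and turns the counts into relative frequencies. It is a minimal example using only the Python standard library, not the authors' pipeline: the helper names `char_ngrams` and `relative_frequencies` and the sample sentence are assumptions made for illustration, and the abstract does not specify how features were weighted or normalized.

```python
from collections import Counter


def char_ngrams(text: str, n: int = 5) -> Counter:
    """Count overlapping character n-grams (hypothetical helper, not from the paper)."""
    if n <= 0:
        raise ValueError("n must be positive")
    # Slide a window of width n over the text; spaces and punctuation are kept,
    # so the window can span word boundaries.
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def relative_frequencies(counts: Counter) -> dict:
    """Normalize raw counts so that the values sum to 1 (assumed weighting)."""
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}


# Illustrative Romanian sample; any text would do.
sample = "a fost odată ca niciodată"
profile = relative_frequencies(char_ngrams(sample, n=5))

# Show the three most frequent 5-grams of the sample.
for gram, freq in sorted(profile.items(), key=lambda kv: -kv[1])[:3]:
    print(repr(gram), round(freq, 3))
```

Vectors of this kind, one per document over a shared n-gram vocabulary, are the sort of input that classifiers such as those listed above can be trained on.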

2010 Mathematics Subject Classification. 68T50.
1998 CR Categories and Descriptors. I.2.7 [Artificial Intelligence]: Natural Language Processing – Text Analysis; I.2.6 [Artificial Intelligence]: Learning – Induction.

Published

2025-07-21

How to Cite

LUPȘA, D., AVRAM, S.-M., & LUPȘA, R. (2025). OLDIES BUT GOLDIES: THE POTENTIAL OF CHARACTER N-GRAMS FOR ROMANIAN TEXTS. Studia Universitatis Babeș-Bolyai Informatica, 70(1-2), 25–42. https://doi.org/10.24193/subbi.2025.02
