BERTWEETRO: PRE-TRAINED LANGUAGE MODELS FOR ROMANIAN SOCIAL MEDIA CONTENT

Authors

Neagu, D. C.
DOI:

https://doi.org/10.2478/subboec-2025-0005

Keywords:

machine learning, natural language processing, language models, transformers, text classification, under-resourced languages

Abstract

The introduction of Transformer-based models, such as BERT and RoBERTa, has revolutionized NLP due to their ability to better “understand” the meaning of texts. These models are created (pre-trained) in a self-supervised manner on large-scale data to predict words in a sentence, but can be adjusted (fine-tuned) for specific downstream NLP applications. Initially, these models were built on literary texts, but the need to process social media content quickly emerged. Social media texts have some problematic characteristics (they are short, informal, filled with typos, etc.), which means that a traditional BERT model struggles with this type of input. For this reason, dedicated models need to be pre-trained on microblogging content, and many such models have been developed for popular languages such as English and Spanish. For under-represented languages like Romanian, this is more difficult to achieve due to the lack of open-source resources. In this paper we present our efforts in pre-training from scratch 8 BERTweetRO models, based on the RoBERTa architecture, with the help of a corpus of Romanian tweets. To evaluate our models, we fine-tune them on 2 downstream tasks, Sentiment Analysis (with 3 classes) and Topic Classification (with 26 classes), and compare them against Multilingual BERT as well as a number of other popular classical and deep learning models. We also include a commercial solution in this comparison and show that some BERTweetRO variants, and almost all models trained on the translated data, achieve better accuracy than the commercial solution. Our best-performing BERTweetRO variants place second after Multilingual BERT in most of our experiments, which is a good result considering that the Romanian corpus used for pre-training is relatively small, containing around 51,000 texts.
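As the abstract only outlines the pre-training and fine-tuning workflow, the sketch below illustrates the general recipe using the Hugging Face transformers, tokenizers, and datasets libraries: a byte-level BPE tokenizer and a small RoBERTa-style encoder are pre-trained from scratch on a tweet corpus with masked language modeling, and the resulting encoder is then fine-tuned for 3-class sentiment classification. This is a minimal illustration, not the authors' code; the corpus file name (ro_tweets.txt), vocabulary size, model depth, and training hyperparameters are assumptions made for the example.

```python
# Minimal sketch of the workflow described above, NOT the authors' implementation.
# Assumptions: Hugging Face transformers/tokenizers/datasets are installed, the
# corpus "ro_tweets.txt" holds one tweet per line, and the hyperparameters
# (vocabulary size, model depth, epochs, batch size) are illustrative only.
from tokenizers import ByteLevelBPETokenizer
from datasets import load_dataset
from transformers import (
    RobertaConfig, RobertaForMaskedLM, RobertaForSequenceClassification,
    RobertaTokenizerFast, DataCollatorForLanguageModeling,
    Trainer, TrainingArguments,
)

# 1) Train a byte-level BPE tokenizer on the raw tweet corpus.
bpe = ByteLevelBPETokenizer()
bpe.train(files=["ro_tweets.txt"], vocab_size=30_000, min_frequency=2,
          special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"])
bpe.save_model("bertweetro_tokenizer")
tokenizer = RobertaTokenizerFast.from_pretrained("bertweetro_tokenizer",
                                                 model_max_length=128)

# 2) Pre-train a small RoBERTa-style encoder from scratch with masked language modeling.
config = RobertaConfig(vocab_size=30_000, max_position_embeddings=130,
                       num_hidden_layers=6, num_attention_heads=12, hidden_size=768)
mlm_model = RobertaForMaskedLM(config)

tweets = load_dataset("text", data_files={"train": "ro_tweets.txt"})["train"]
tokenized = tweets.map(lambda batch: tokenizer(batch["text"], truncation=True),
                       batched=True, remove_columns=["text"])
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm=True, mlm_probability=0.15)
Trainer(
    model=mlm_model,
    args=TrainingArguments(output_dir="bertweetro_mlm", num_train_epochs=10,
                           per_device_train_batch_size=64),
    data_collator=collator,
    train_dataset=tokenized,
).train()
mlm_model.save_pretrained("bertweetro_mlm")
tokenizer.save_pretrained("bertweetro_mlm")

# 3) Fine-tune the pre-trained encoder for a downstream task, e.g. 3-class sentiment
#    analysis; a labeled dataset of (tweet, label) pairs would be tokenized the same
#    way and passed to another Trainer.
classifier = RobertaForSequenceClassification.from_pretrained("bertweetro_mlm",
                                                              num_labels=3)
```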

JEL classification: C45, C55, C88, O33

Published

2025-03-25

How to Cite

NEAGU, D. C. (2025). BERTWEETRO: PRE-TRAINED LANGUAGE MODELS FOR ROMANIAN SOCIAL MEDIA CONTENT. Studia Universitatis Babeș-Bolyai Oeconomica, 70(1), 83–111. https://doi.org/10.2478/subboec-2025-0005

