Arab World English Journal (AWEJ) Volume 11, Number 4, December 2020, Pp. 490-507
DOI: https://dx.doi.org/10.24093/awej/vol11no4.31

Towards a Stylometric Authorship Recognition Model for the Social Media Texts in Arabic

Haroon Nasser Alsager
Department of English,
College of Science and Humanities
Prince Sattam Bin Abdulaziz University
Alkharj, Saudi Arabia

Abstract:
Numerous studies have sought to develop authorship recognition systems in response to rising rates of cybercrime associated with the anonymity of social media platforms, which still allow users to conceal their true identities. Nevertheless, identifying the real authors of offensive and inappropriate social media content remains challenging. Such content is usually very short, which makes it difficult for stylometric authorship systems to assign disputed texts to their real authors on the basis of the salient and distinctive linguistic features and patterns they contain. This research introduces a new stylometric authorship system that takes into account both the shortness of the data and the particular linguistic properties of Arabic. The study is based on a corpus of 20,357 tweets from 134 Twitter users. Document clustering based on the Document Index Graph (DIG) model was used to group tweets that shared common linguistic features. For comparison, the same analysis was carried out with a Vector Space Clustering (VSC) model based on the Bag of Words (BOW) representation, which is conventionally used in authorship recognition applications. Results indicate that the proposed system is more accurate than standard authorship systems that rely mainly on vector space clustering methods. The DIG model also had the advantage of retaining complete information about the documents and the degree of overlap between every pair of documents, which was useful in determining the similarity between documents.
Keywords: Authorship recognition, cybercrime, document clustering, Document Index Graph, linguistic stylometry
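
The contrast the abstract draws between a bag-of-words vector space and a graph that records overlap between document pairs can be illustrated with a minimal sketch. It is not the author's implementation: the function names, the toy English token lists, and the use of Jaccard overlap over adjacent word pairs are all assumptions made for illustration, and a full Document Index Graph also tracks longer phrases and sentence positions.

```python
from collections import Counter
from math import sqrt


def bow_cosine(doc_a, doc_b):
    """Cosine similarity between two token lists under a bag-of-words model."""
    a, b = Counter(doc_a), Counter(doc_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_edge_index(docs):
    """Map each directed word pair (w1, w2) to the set of documents containing it.

    Simplified stand-in for a document index graph: words act as nodes,
    adjacent word pairs act as edges, and each edge records its documents.
    """
    index = {}
    for doc_id, tokens in enumerate(docs):
        for pair in zip(tokens, tokens[1:]):
            index.setdefault(pair, set()).add(doc_id)
    return index


def shared_edge_overlap(index, i, j):
    """Jaccard overlap of the adjacency edges used by documents i and j."""
    edges_i = {e for e, ids in index.items() if i in ids}
    edges_j = {e for e, ids in index.items() if j in ids}
    union = edges_i | edges_j
    return len(edges_i & edges_j) / len(union) if union else 0.0


# Hypothetical toy documents (placeholders, not data from the study).
docs = [
    "the match was not fair at all".split(),
    "at all the match was not fair".split(),  # same words, different order
    "what a lovely day today".split(),
]
index = build_edge_index(docs)

print(bow_cosine(docs[0], docs[1]))          # word order is invisible to BOW
print(shared_edge_overlap(index, 0, 1))      # edge overlap reflects word order
print(shared_edge_overlap(index, 0, 2))      # no shared word pairs
```

In this sketch the first two documents are indistinguishable under bag of words because they contain the same words, while the edge index registers that their word sequences only partly coincide. That is the kind of pairwise overlap information the abstract attributes to the DIG model.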

Cite as: Alsager, H. N. (2020). Towards a Stylometric Authorship Recognition Model for the Social Media Texts in Arabic. Arab World English Journal, 11(4), 490-507. DOI: https://dx.doi.org/10.24093/awej/vol11no4.31


Haroon Alsager finished his Ph.D. in linguistics at Arizona State University in 2017. His research
interests include syntax, historical linguistics, and computational linguistics. Currently, he is an
assistant professor of linguistics at Prince Sattam Bin Abdulaziz University.
ORCID: http://orcid.org/0000-0003-3778-5801