Annotation and Classification of Toxicity for Thai Twitter

Title: Annotation and Classification of Toxicity for Thai Twitter
Publication Type: Conference Paper
Year of Publication: 2017
Authors: Sirihattasak S, Komachi M, Ishikawa H
Conference Name: Second Workshop on Text Analytics for Cybersecurity and Online Safety (TA-COS 2018)
Publisher: European Language Resources Association (ELRA)
Conference Location: Miyazaki, Japan
ISBN Number: 979-10-95546-00-9

In this study, we present toxicity annotation for a Thai Twitter corpus as a preliminary exploration of toxicity analysis in the Thai language. We construct a Thai toxic word dictionary and select 3,300 tweets for annotation using the 44 keywords from our dictionary. Three annotators labeled the tweets, yielding 2,027 toxic tweets and 1,273 non-toxic tweets. Corpus analysis indicates that tweets containing toxic words are not always toxic. Further, a tweet is more likely to be toxic if the toxic words it contains are used in their original meaning. Moreover, disagreements in annotation are primarily due to sarcasm, unclear targets, and word sense ambiguity. Finally, we conducted supervised classification using our corpus as a dataset and obtained an accuracy of 0.80, which is comparable to the inter-annotator agreement of this dataset. Our dataset is available on GitHub.
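
The keyword-based selection and three-annotator labeling described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the function names are hypothetical, keywords are matched as plain substrings (Thai is written without spaces between words), the three labels are combined by majority vote, and agreement is shown as the simple fraction of unanimous items, which may differ from the agreement measure the authors report.

```python
from collections import Counter


def select_by_keywords(tweets, keywords):
    """Keep tweets that contain at least one keyword (plain substring match)."""
    return [t for t in tweets if any(k in t for k in keywords)]


def majority_label(labels):
    """Return the label chosen by most annotators.

    Assumes an odd number of annotators and binary labels, so no ties occur.
    """
    return Counter(labels).most_common(1)[0][0]


def unanimous_agreement(annotations):
    """Fraction of items on which all annotators gave the same label."""
    unanimous = sum(1 for labels in annotations if len(set(labels)) == 1)
    return unanimous / len(annotations)


# Placeholder data: hypothetical English stand-ins for tweets and keywords.
tweets = ["you are a badword", "nice day", "that rude thing"]
keywords = ["badword", "rude"]
selected = select_by_keywords(tweets, keywords)

# One row of three annotator labels per selected tweet (invented values).
annotations = [
    ["toxic", "toxic", "toxic"],
    ["toxic", "non-toxic", "toxic"],
]
final_labels = [majority_label(row) for row in annotations]
agreement = unanimous_agreement(annotations)
```

As the abstract notes, a keyword match only nominates a tweet for annotation; whether it is toxic is decided by the annotators, not by the match itself.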