Informational and Statistical Features-based Model for Removal of Stop Words in Multiple Languages
cosine similarity, data-driven, informational features, multiple languages, statistical features, stop word
Abstract
Stop word removal is critical in many Natural Language Processing (NLP) tasks, as it reduces corpus length and prepares data for downstream processing. Many languages have predefined stop word lists. However, some languages lack such lists, which complicates research in those languages. There is no standardized method to identify stop words across all languages, and it is even more challenging to identify domain-specific stop words. This gap motivated our research. Our objective was to study the underlying reasons and develop a method to identify stop words in documents of at least two different languages. We utilized the Reuters News dataset for English and the Turkish News dataset for Turkish text. Our model calculates the inverse document frequency (IDF), self-information, term frequency (TF), positive pointwise mutual information (PPMI), context, co-occurrence, and length for each word, representing each word as a vector of these calculated features and the entire document as a set of word feature vectors. Experimentally, we determined two sets of threshold values for each feature and created two label vectors, one for stop words and one for non-stop words. By comparing each word feature vector with these label vectors using cosine similarity, we assigned the label with the higher similarity value. We validated our model through three different methods and found that it accurately identified all the stop words in the NLTK library for both English and Turkish. Additionally, it was able to recognize domain-specific stop words and other relevant stop words that were missed during the preprocessing phase. These results highlight the potential of our model to be applied to other languages, paving the way for the creation of more comprehensive stop word lists for many under-researched languages.
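
The labeling step can be illustrated with a minimal Python sketch. It is not the model's implementation: it computes only four of the seven features (relative TF, IDF, self-information, and length), and the function names, the toy corpus, and the two label vectors STOP_VEC and CONTENT_VEC are assumptions made up for illustration, not the experimentally determined thresholds.

```python
import math
from collections import Counter

def word_features(docs):
    """Map each word to [relative TF, IDF, self-information, length].

    `docs` is a list of tokenized documents (lists of strings).
    Only four of the seven features named in the abstract are sketched.
    """
    n_docs = len(docs)
    counts = Counter(w for doc in docs for w in doc)
    total = sum(counts.values())
    doc_freq = Counter(w for doc in docs for w in set(doc))
    feats = {}
    for w, c in counts.items():
        p = c / total  # corpus-level relative frequency
        feats[w] = [
            p,                               # term frequency (relative)
            math.log(n_docs / doc_freq[w]),  # inverse document frequency
            -math.log2(p),                   # self-information in bits
            float(len(w)),                   # word length in characters
        ]
    return feats

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def label(vec, stop_vec, content_vec):
    """Assign the label whose vector is more similar to `vec`."""
    if cosine(vec, stop_vec) > cosine(vec, content_vec):
        return "stop"
    return "non-stop"

# Hypothetical label vectors, chosen by hand for this toy corpus only.
STOP_VEC = [0.3, 0.0, 1.5, 3.0]
CONTENT_VEC = [0.05, 1.1, 3.5, 6.0]

docs = [
    ["the", "market", "rose"],
    ["the", "bank", "fell"],
    ["the", "rate", "of", "the", "bank"],
]
feats = word_features(docs)
for word in ("the", "market"):
    print(word, label(feats[word], STOP_VEC, CONTENT_VEC))
```

Because the features are left unscaled here, larger-magnitude features such as length and self-information dominate the cosine; on a corpus this small, rare short words can be mislabeled, which is one reason the thresholds have to be tuned on real data.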

