In Python, stopword lists are typically set up to exclude common words that add no value to text analysis. Below are ways to configure stopwords with several different libraries:
Setting stopwords with spaCy

```python
import spacy

# Load the English model
nlp = spacy.load("en_core_web_sm")

# Add a single stopword
nlp.Defaults.stop_words.add("my_new_stopword")

# Or add several stopwords at once
nlp.Defaults.stop_words |= {"my_new_stopword1", "my_new_stopword2"}

# Remove a single stopword
nlp.Defaults.stop_words.remove("my_old_stopword")

# Process text and check which tokens are stopwords
sentence = nlp("the word is definitely not a stopword")
print([token.is_stop for token in sentence])
```
Setting stopwords with jieba

```python
import jieba

# Load a custom stopword list (a set gives fast membership tests)
stopwords_file = "stopwords.txt"
with open(stopwords_file, "r", encoding="utf-8") as f:
    stopwords = {line.strip() for line in f if line.strip()}

# Segment each line and filter out stopwords
filename = "gp.txt"
result = []
with open(filename, "r", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        seg_list = jieba.cut(line, cut_all=False)
        filtered_line = [word for word in seg_list
                         if word not in stopwords and word != "\t"]
        result.append(" ".join(filtered_line))

print("\n".join(result))
```
Setting stopwords with the biased-stop-words library

```python
from biasedstopwords import BiasedStopWords

# Get the list of biased stopwords
bsw = BiasedStopWords()
bias_words = bsw.get_biased_words()
print(bias_words)

# Remove biased stopwords from a piece of text
text = "Some text containing biased words"
clean_text = bsw.remove_biased_words(text)
print(clean_text)
```

Setting stopwords with a plain list

```python
# A list of common stopwords (Chinese examples here)
stopwords = ["的", "和", "是", "了", "在", "它", "这", "那"]

# Check whether a word is a stopword
def is_stopword(word):
    return word in stopwords
```
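Building on the helper above, filtering a tokenized sentence is a one-line list comprehension. A minimal sketch (the English word list and sample tokens below are illustrative; for large vocabularies, prefer a `set`, whose membership test is O(1) on average rather than O(n) for a list):

```python
# Illustrative stopword set; replace with your own list
stopwords = {"the", "is", "a", "of"}

def is_stopword(word):
    # Membership test against a set is O(1) on average
    return word in stopwords

# Filter stopwords out of a pre-tokenized sentence
tokens = ["the", "cat", "is", "a", "mammal"]
filtered = [t for t in tokens if not is_stopword(t)]
print(filtered)  # → ['cat', 'mammal']
```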
Setting stopwords with HanLP

```python
from pyhanlp import HanLP

# Load the stopword dictionary into a trie
trie = HanLP.newTrie("data/dictionary/stopwords.txt")

# Remove stopwords from a segmented term list
def remove_stopwords(termlist, trie):
    return [term.word for term in termlist if not trie.contains(term.word)]
```
Setting stopwords with WordCloud

```python
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

# Start from the built-in stopword set
stopwords = set(STOPWORDS)

# Generate the word cloud
text = "The text you want to process goes here"
wordcloud = WordCloud(stopwords=stopwords).generate(text)

# Display the word cloud image
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```
The above covers the main ways to set stopwords in Python with different libraries. Choose the library and approach that best fit your specific needs.
