The steps for counting word frequencies with Python's jieba library are as follows:
1. Install the jieba library:
pip install jieba
2. Import the jieba library and read the text file:
import jieba
# Read the text file
with open('your_text_file.txt', 'r', encoding='utf-8') as file:
    text = file.read()
3. Tokenize the text with jieba:
# Tokenize; jieba.cut returns a generator, so materialize it as a list
# so the words can be reused again in step 6
words = list(jieba.cut(text))
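Note that jieba also provides lcut, which returns a list directly, as well as a full mode that emits every dictionary word it can find; for example:
# lcut is shorthand for list(jieba.cut(...))
words = jieba.lcut(text)
# Full mode returns every word jieba's dictionary recognizes, including overlaps
words_all = jieba.lcut(text, cut_all=True)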
4. Count the word frequencies:
# Create a dictionary to store the counts
word_count = {}
for word in words:
    word_count[word] = word_count.get(word, 0) + 1
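Alternatively, the standard library's collections.Counter does the same bookkeeping in one line; a minimal equivalent:
from collections import Counter
# Counter builds the same word -> count mapping as the loop above
word_count = Counter(words)
# most_common returns (word, count) pairs sorted by descending count
print(word_count.most_common(10))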
5. Output the results:
# Print the word frequencies
for word, count in word_count.items():
    print(f'{word}: {count}')
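Dictionaries preserve insertion order rather than frequency order, so to see the most frequent words first, sort before printing; for example:
# Sort by count in descending order and show the top 20 entries
for word, count in sorted(word_count.items(), key=lambda item: item[1], reverse=True)[:20]:
    print(f'{word}: {count}')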
6. (Optional) Filter out stop words:
# Define a stop word list; the English list below is illustrative, and for
# Chinese text you would normally substitute a Chinese stop word list
# (e.g. '的', '了', '是')
stopwords = ['is', 'the', 'and', 'in', 'to', 'of', 'a', 'an', 'for', 'with', 'about', 'as', 'by', 'on', 'at', 'from', 'that', 'which', 'who', 'whom', 'whose', 'this', 'these', 'those', 'there', 'where', 'when', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', 'now']
# Remove the stop words
filtered_words = [word for word in words if word not in stopwords]
# Recount the word frequencies
word_count = {}
for word in filtered_words:
    word_count[word] = word_count.get(word, 0) + 1
# Print the filtered word frequencies
for word, count in word_count.items():
    print(f'{word}: {count}')
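In practice, stop word lists are usually kept in a file rather than hard-coded. A minimal sketch, assuming a UTF-8 file named stopwords.txt (a placeholder name) with one word per line:
# Load stop words into a set for O(1) membership tests; stopwords.txt is hypothetical
with open('stopwords.txt', 'r', encoding='utf-8') as f:
    stopwords = set(line.strip() for line in f)
# Also drop whitespace-only tokens, which jieba emits for spaces and newlines
filtered_words = [word for word in words if word.strip() and word not in stopwords]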
The steps above show how to do basic word frequency counting with the jieba library. For more advanced output, such as drawing a word cloud, you can use the wordcloud library.
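For example, a minimal word cloud sketch built from the word_count mapping above (install the library with pip install wordcloud; the font path is a placeholder, and rendering Chinese requires a font that covers Chinese glyphs):
from wordcloud import WordCloud
# generate_from_frequencies accepts the word -> count dict directly;
# font_path is a placeholder and must point to a font with Chinese glyphs
wc = WordCloud(font_path='simhei.ttf', width=800, height=600, background_color='white')
wc.generate_from_frequencies(word_count)
wc.to_file('wordcloud.png')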