1. Use the built-in string method `split()`:
text = "This is an example."
words = text.split()
print(words)  # Output: ['This', 'is', 'an', 'example.']
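As a small supplementary sketch, the snippet below shows two behaviors of `split()` that are easy to overlook: with no argument it collapses runs of whitespace, and punctuation stays attached to words unless it is stripped (here with `str.strip(string.punctuation)`). The sample strings are made up for illustration.

```python
import string

text = "This  is\tan example.\n"

# With no argument, split() treats any run of whitespace as one separator
# and ignores leading and trailing whitespace.
words = text.split()

# An explicit separator splits on every occurrence, so empty strings can appear.
fields = "a,b,,c".split(",")

# Punctuation stays attached to words; strip it from both ends of each token.
cleaned = [w.strip(string.punctuation) for w in words]

print(words)
print(fields)
print(cleaned)
```

If punctuation inside a token matters (for example `don't`), stripping only the ends as above leaves it intact.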
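2. Use the third-party library `wordninja` to split text that has no spaces (install it first with `pip install wordninja`):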
import wordninja
text = "thisisanexample"
result = wordninja.split(text)
print(result)  # Output: ['this', 'is', 'an', 'example']
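To give a feel for how a dictionary-based splitter can recover word boundaries, here is a minimal dynamic-programming sketch. It is not `wordninja`'s actual implementation: the vocabulary `WORD_FREQ`, its frequencies, and the function name `segment` are all hypothetical and chosen only for illustration. The idea is to pick the split whose words have the lowest total cost, where a word's cost is the negative log of its relative frequency.

```python
import math

# Hypothetical toy vocabulary with made-up frequencies (illustration only).
WORD_FREQ = {"this": 50, "is": 80, "an": 40, "example": 10, "a": 90, "exam": 5}
TOTAL = sum(WORD_FREQ.values())
MAX_LEN = max(len(w) for w in WORD_FREQ)


def segment(text):
    """Return the lowest-cost split of text into vocabulary words, or None."""
    n = len(text)
    # best[i] = (cost, start) for the cheapest segmentation of text[:i]
    best = [(0.0, 0)] + [(math.inf, 0)] * n
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_LEN), end):
            word = text[start:end]
            if word in WORD_FREQ and best[start][0] < math.inf:
                cost = best[start][0] - math.log(WORD_FREQ[word] / TOTAL)
                if cost < best[end][0]:
                    best[end] = (cost, start)
    if best[n][0] == math.inf:
        return None  # no way to cover the whole string with known words
    # Walk the back-pointers from the end to recover the words.
    words = []
    end = n
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]


print(segment("thisisanexample"))  # ['this', 'is', 'an', 'example']
```

A real library relies on a much larger frequency list, which is why it can handle arbitrary English input where this toy vocabulary cannot.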
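3. Use the `symspellpy` library, which supports both spelling correction and word segmentation:
Install symspellpy:
pip install symspellpy
from symspellpy import SymSpell
# Load the frequency dictionary. The file ships with symspellpy; this path assumes a copy in the working directory.
dictionary_path = "frequency_dictionary_en_82_765.txt"
sym_spell = SymSpell(max_dictionary_edit_distance=0, prefix_length=7)
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)
text = "thisisanexample"
# word_segmentation() inserts the missing spaces and returns a result whose corrected_string holds the segmented text
result = sym_spell.word_segmentation(text)
print(result.corrected_string.split())  # Output: ['this', 'is', 'an', 'example']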
4. Use a regular expression:
import re
text = "This is an example."
words = re.findall(r'\b\w+\b', text)
print(words)  # Output: ['This', 'is', 'an', 'example']
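The pattern above breaks words at apostrophes and hyphens and silently drops punctuation. As a hedged sketch of one way to adjust it, the patterns below keep internal apostrophes and hyphens together and, optionally, return punctuation as separate tokens. The sample sentence and the exact patterns are illustrative choices, not the only reasonable ones.

```python
import re

text = "Don't split state-of-the-art words, please."

# \w+ alone breaks tokens at apostrophes and hyphens.
plain = re.findall(r"\w+", text)

# Allow apostrophes or hyphens between runs of word characters.
joined = re.findall(r"\w+(?:['-]\w+)*", text)

# Also keep each punctuation mark as its own token instead of discarding it.
with_punct = re.findall(r"\w+(?:['-]\w+)*|[^\w\s]", text)

print(plain)
print(joined)
print(with_punct)
```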
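5. Use the `jieba` library. It is designed for Chinese word segmentation but also accepts English text; install it first with `pip install jieba`:
import jieba
text = "This is an example."
# jieba.cut() also yields the whitespace between words as tokens, so drop those
words = [w for w in jieba.cut(text) if w.strip()]
print(words)  # Output: ['This', 'is', 'an', 'example', '.']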