There are several ways to extract the words from a piece of text in Python. Here are some commonly used approaches:
1. Use the string `split()` method, which splits on whitespace:

       text = "This is a sentence with several words"
       words = text.split()
       print(words)
       # Output: ['This', 'is', 'a', 'sentence', 'with', 'several', 'words']
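2. Use the `findall()` function from the regular expression module `re`:

       import re

       text = "This is a sentence with several words"
       # \w+ matches runs of letters, digits and underscores
       words = re.findall(r'\b\w+\b', text)
       print(words)
       # Output: ['This', 'is', 'a', 'sentence', 'with', 'several', 'words']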
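3. Use the `nltk` library for text preprocessing and tokenization:

       import nltk

       # One-time download of tokenizer data; newer NLTK releases may ask for 'punkt_tab' instead
       nltk.download('punkt')

       text = "This is a sentence with several words"
       words = nltk.word_tokenize(text)
       print(words)
       # Output: ['This', 'is', 'a', 'sentence', 'with', 'several', 'words']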
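4. Use the `re` module to replace non-letter characters with spaces, then split:

       import re

       text = "This is a sentence with several words"
       # Everything that is not an ASCII letter becomes a space
       line = re.sub(r'[^A-Za-z]', ' ', text.strip())
       words = line.split()
       print(words)
       # Output: ['This', 'is', 'a', 'sentence', 'with', 'several', 'words']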
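5. Use the `re` module to strip HTML tags before splitting (when the text contains HTML tags):

       import re

       def strip_html(text):
           # Non-greedy match for anything that looks like a tag
           clean = re.compile('<.*?>')
           return re.sub(clean, '', text)

       text_with_html = "This is a sentence with several words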
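
A minimal, self-contained sketch of method 5 follows. The sample string `text_with_html` is hypothetical and chosen only for illustration. One deliberate change from the snippet above: tags are replaced with a space rather than an empty string, so that words separated only by a tag (for example `one</p><p>two`) do not get glued together. A regular expression is a rough tool for this; for real-world HTML, a proper parser such as the standard library's `html.parser` is likely to be more reliable.

```python
import re

# Non-greedy pattern for anything that looks like a tag
TAG_RE = re.compile(r'<.*?>')

def strip_html(text):
    """Replace tag-like spans with a space so adjacent words stay separated."""
    return TAG_RE.sub(' ', text)

# Hypothetical sample input, not taken from the original text
text_with_html = "<p>This is a <b>sentence</b> with several words</p>"

words = strip_html(text_with_html).split()
print(words)
# ['This', 'is', 'a', 'sentence', 'with', 'several', 'words']
```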

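
All of the examples use a sentence without punctuation, so the methods appear to give identical results. The sketch below, using a hypothetical sample sentence, shows how `split()`, `re.findall()` and the letters-only filter diverge once punctuation, apostrophes and hyphens appear.

```python
import re

# Hypothetical sample with punctuation, an apostrophe and a hyphen
text = "Hello, world! It's a well-known fact."

# split() only breaks on whitespace, so punctuation stays attached
split_words = text.split()
print(split_words)
# ['Hello,', 'world!', "It's", 'a', 'well-known', 'fact.']

# \w+ drops punctuation but also splits at apostrophes and hyphens
regex_words = re.findall(r'\b\w+\b', text)
print(regex_words)
# ['Hello', 'world', 'It', 's', 'a', 'well', 'known', 'fact']

# Replacing non-letters with spaces gives the same result for this input
letters_only = re.sub(r'[^A-Za-z]', ' ', text).split()
print(letters_only)
# ['Hello', 'world', 'It', 's', 'a', 'well', 'known', 'fact']
```

Which behavior is preferable depends on the task: `split()` is the simplest when tokens may keep their punctuation, while the regex variants suit cases where only the bare words matter.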