In Python, words can be extracted from text in several ways. Here are some common approaches:
1. Using the string `split()` method, which splits on whitespace:
```python
text = "This is a sentence with several words"
words = text.split()
print(words)  # Output: ['This', 'is', 'a', 'sentence', 'with', 'several', 'words']
```
2. Using the `findall()` function from the regular expression module `re`:
```python
import re
text = "This is a sentence with several words"
words = re.findall(r'\b\w+\b', text)
print(words)  # Output: ['This', 'is', 'a', 'sentence', 'with', 'several', 'words']
```
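The two approaches above give the same result on a sentence without punctuation, but they differ once punctuation appears. The following is a minimal sketch of that difference; the sample sentence and the variable names `split_words` and `regex_words` are made up for illustration and are not part of the original answer.

```python
import re

# Hypothetical sample text containing punctuation and an apostrophe
text = "Hello, world! It's a test."

# split() only breaks on whitespace, so punctuation stays attached to the words
split_words = text.split()
print(split_words)  # ['Hello,', 'world!', "It's", 'a', 'test.']

# \w+ matches runs of letters, digits and underscores, so punctuation is dropped,
# but the apostrophe also splits "It's" into two tokens
regex_words = re.findall(r'\b\w+\b', text)
print(regex_words)  # ['Hello', 'world', 'It', 's', 'a', 'test']
```

Which behavior is preferable depends on the text: `split()` is enough for clean, whitespace-separated input, while the regular expression is usually the better fit when punctuation should not end up inside the words.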
3. Using the `nltk` library for text preprocessing and tokenization:
```python
import nltk
nltk.download('punkt')  # tokenizer data; newer NLTK releases may need 'punkt_tab' instead
text = "This is a sentence with several words"
words = nltk.word_tokenize(text)
print(words)  # Output: ['This', 'is', 'a', 'sentence', 'with', 'several', 'words']
```
4. Using the `re` module to replace non-letter characters with spaces, then splitting:
```python
import re
text = "This is a sentence with several words"
line = re.sub(r'[^A-Za-z]', ' ', text.strip())
words = line.split()
print(words)  # Output: ['This', 'is', 'a', 'sentence', 'with', 'several', 'words']
```
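Because the pattern `[^A-Za-z]` replaces everything that is not an ASCII letter, digits disappear and contractions are split apart. The sketch below shows this on a made-up sentence and, as one possible variation that is not from the original answer, keeps apostrophes by adding `'` to the character class. The variable names `letters_only` and `keep_apostrophes` are illustrative.

```python
import re

# Hypothetical sample text with digits, punctuation and apostrophes
text = "Room 101: don't panic, it's 9am!"

# Original pattern: every non-letter becomes a space before splitting
letters_only = re.sub(r'[^A-Za-z]', ' ', text.strip()).split()
print(letters_only)  # ['Room', 'don', 't', 'panic', 'it', 's', 'am']

# Variation (an assumption, not the original method): also keep apostrophes
keep_apostrophes = re.sub(r"[^A-Za-z']", ' ', text.strip()).split()
print(keep_apostrophes)  # ['Room', "don't", 'panic', "it's", 'am']
```

Note that both patterns also discard non-ASCII letters, so this approach is best suited to plain English text.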
5. Using the `re` module to strip HTML tags before splitting (if the text contains HTML tags):
```python
import re

def strip_html(text):
    # Non-greedy match for anything that looks like a tag, e.g. <p> or </p>
    clean = re.compile(r'<.*?>')
    return re.sub(clean, '', text)
text_with_html = "
This is a sentence with several words