There are many ways to fetch multiple URLs in Python. Here are some commonly used ones:
1. Use the `requests` and `BeautifulSoup` libraries:
```python
import requests
from bs4 import BeautifulSoup

urls = [
    'http://www.example.com/page1',
    'http://www.example.com/page2',
    'http://www.example.com/page3',
]

for url in urls:
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.content, 'html.parser')
    # Extract the page title and body text; either tag may be missing
    title = soup.title.string if soup.title else None
    body = soup.find('body')
    content = body.get_text() if body else ''
    print('Title:', title)
    print('Body text:', content)
```
2. Use the `Scrapy` framework and call the `parse` method recursively:
```python
from scrapy.spiders import Spider


class QiubaiSpider(Spider):
    name = 'qiubai'
    # allowed_domains takes bare domain names, not URL paths
    allowed_domains = ['www.qiushibaike.com']
    start_urls = ['https://www.qiushibaike.com/text/']

    def parse(self, response):
        # Extract every link on the page and follow it with the same callback
        for link in response.css('a::attr(href)').getall():
            yield response.follow(link, self.parse)
```
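3. Use the `lxml` library with XPath expressions:
```python
from lxml import html

# html_content holds the HTML source of a page; this inline sample is illustrative
html_content = '<html><body><a href="/page1">1</a><a href="/page2">2</a></body></html>'

tree = html.fromstring(html_content)
# Select the href attribute of every <a> element
links = tree.xpath('//a/@href')
for link in links:
    print(link)
```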
4. Use the `urllib` and `BeautifulSoup` libraries:
```python
from bs4 import BeautifulSoup
import urllib.request

# Assumed to be module-level dicts; the original snippet does not define them
Upageurls = {}    # links found on the page, mapped to their HTTP status code
websiteurls = {}  # links already recorded for the whole site


def scanpage(url):
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')
    # Keep links that contain the page URL and have not been seen before
    for tag in soup.find_all('a', href=True):
        href = tag['href']
        if url in href and href not in Upageurls and href not in websiteurls:
            Upageurls[href] = 0
    # Request each collected link once and record its status code
    for link in Upageurls:
        try:
            Upageurls[link] = urllib.request.urlopen(link).getcode()
        except OSError:
            print('connect failed')
```
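The `url in href` check in `scanpage` keeps only links that contain the page URL as a substring, so relative links such as `/about` are skipped. The following is a minimal sketch of an alternative filter built on the standard library's `urllib.parse`. The helper name `same_site_links` and the sample URLs are hypothetical, and "same site" is assumed here to mean "same host".

```python
from urllib.parse import urljoin, urldefrag, urlparse


def same_site_links(page_url, hrefs):
    """Resolve hrefs against page_url; keep unique http(s) links on the same host."""
    host = urlparse(page_url).netloc
    seen = set()
    result = []
    for href in hrefs:
        # Make the link absolute, then drop any #fragment part
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parts = urlparse(absolute)
        if parts.scheme in ('http', 'https') and parts.netloc == host and absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


# Illustrative input: href values as they might appear in <a> tags
hrefs = [
    '/about',
    'contact.html#team',
    'http://www.example.com/about',
    'https://other.example.org/',
    'mailto:admin@example.com',
]
print(same_site_links('http://www.example.com/docs/index.html', hrefs))
```

The returned list could be used in place of the substring test before the status codes are recorded.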
5. Fetch Baidu search result pages in bulk:
```python
import requests

DOMAIN = 'https://www.baidu.com/s'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36',
    'Cookie': 'PSTM=; BIDUPSID=C6D409FA9EC7DBCD64A2D7581; BD_UPN=;'
}

keyword = input('Enter the search keyword: ')
pages = int(input('Enter the number of pages to crawl: '))

# The pn parameter is the result offset, 10 results per page: 0, 10, 20, ...
for offset in range(0, pages * 10, 10):
    # Passing params lets requests URL-encode the keyword
    response = requests.get(DOMAIN, params={'wd': keyword, 'pn': offset},
                            headers=headers, timeout=10)
    # Process the response content here
```
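To check which URLs the pagination loop in the Baidu example would request without touching the network, the same `wd` and `pn` parameters can be assembled with the standard library's `urllib.parse.urlencode`. This is a sketch only: `build_search_urls`, `BASE_URL`, and the `page_size` default of 10 are hypothetical names and assumptions taken from the step size used in the example.

```python
from urllib.parse import urlencode

BASE_URL = 'https://www.baidu.com/s'  # same endpoint as in the example


def build_search_urls(keyword, pages, page_size=10):
    """Return one search URL per result page (hypothetical helper)."""
    return [
        f"{BASE_URL}?{urlencode({'wd': keyword, 'pn': offset})}"
        for offset in range(0, pages * page_size, page_size)
    ]


# Preview the URLs for three pages of results
for url in build_search_urls('python 爬虫', 3):
    print(url)
```

Because `urlencode` percent-encodes the keyword, spaces and non-ASCII characters are safe to pass in directly.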
The examples above show how to fetch multiple URLs with different Python libraries and tools. Choose the method that best fits your needs.