In Python, scraping through a proxy can be set up with the following steps:
1. Install the `requests` module:
```bash
pip install requests
```
2. Import the `requests` library:
```python
import requests
```
3. Set the `proxies` parameter for `requests`:
```python
proxies = {
    'http': 'http://proxy_ip:port',    # proxy used for http:// targets
    'https': 'https://proxy_ip:port'   # proxy used for https:// targets
}
url = 'https://example.com'  # placeholder target URL
response = requests.get(url, proxies=proxies)
```
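Free or shared proxies fail often, so it helps to add a timeout and catch proxy errors instead of letting the request hang or crash. A minimal sketch, assuming the `url` and `proxies` variables from the block above; the 10-second timeout is an arbitrary choice:
```python
import requests

try:
    # the timeout applies both to connecting (through the proxy) and to reading the response
    response = requests.get(url, proxies=proxies, timeout=10)
    response.raise_for_status()  # raise for 4xx/5xx status codes
except requests.exceptions.ProxyError:
    print('The proxy refused the connection or is unreachable.')
except requests.exceptions.RequestException as exc:
    print(f'Request failed: {exc}')
```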
4. To pick a random proxy IP for each request, use the `random` module:
```python
import random

def get_random_proxy(proxy_list):
    proxy = random.choice(proxy_list)
    return {'http': proxy, 'https': proxy}

proxy_list = ['http://ip1:port1', 'http://ip2:port2', 'http://ip3:port3']  # list of proxy IPs
proxy = get_random_proxy(proxy_list)
response = requests.get(url, proxies=proxy)
```
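Building on `get_random_proxy`, a common pattern is to retry a failed request with a freshly drawn proxy. A rough sketch, assuming the `proxy_list` defined above; the retry count of 3, the timeout, and the `get_with_proxy_retry` name are illustrative choices:
```python
import requests

def get_with_proxy_retry(url, proxy_list, retries=3):
    """Try up to `retries` randomly chosen proxies before giving up."""
    for _ in range(retries):
        proxy = get_random_proxy(proxy_list)
        try:
            return requests.get(url, proxies=proxy, timeout=10)
        except requests.exceptions.RequestException:
            continue  # this proxy failed, draw another one
    raise RuntimeError('All proxy attempts failed')

response = get_with_proxy_retry('https://example.com', proxy_list)
```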
5. To use the `urllib` library instead, set the proxy by creating a `ProxyHandler`:
```python
import urllib.request

proxy_handler = urllib.request.ProxyHandler({'http': 'http://proxy_ip:port'})
opener = urllib.request.build_opener(proxy_handler)
urllib.request.install_opener(opener)  # all subsequent urlopen() calls go through the proxy
response = urllib.request.urlopen(url)
```
6. To avoid being identified as a crawler by the server, set a `User-Agent` header:
```python
from fake_useragent import UserAgent  # requires: pip install fake-useragent

headers = {
    'User-Agent': UserAgent().random  # pick a random common browser User-Agent
}
response = requests.get(url, headers=headers, proxies=proxies)
```
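`fake_useragent` is a third-party package that fetches its browser data on first use; if you prefer to avoid the extra dependency, a small hand-maintained list plus `random.choice` works as a simple fallback. A sketch with a few illustrative (not exhaustive) User-Agent strings:
```python
import random
import requests

USER_AGENTS = [
    # a small, hand-picked set of common desktop browser strings (examples only)
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0',
]

headers = {'User-Agent': random.choice(USER_AGENTS)}
response = requests.get('https://example.com', headers=headers)
```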
7. For asynchronous proxy crawling, use the `asyncio` and `aiohttp` libraries (installable with `pip install aiohttp`):
```python
import aiohttp
import asyncio

async def fetch(session, url, proxy):
    # aiohttp takes the proxy as a plain URL string, e.g. 'http://ip:port'
    async with session.get(url, proxy=proxy) as response:
        return await response.text()

async def main():
    async with aiohttp.ClientSession() as session:
        # proxy_list is the list of proxy URLs defined in step 4
        tasks = [fetch(session, 'http://example.com', proxy) for proxy in proxy_list]
        responses = await asyncio.gather(*tasks)
        for response in responses:
            print(response)

asyncio.run(main())  # start the event loop and run main() to completion
```
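With the code above, a single failing proxy makes `asyncio.gather` raise and the other results are lost. A more forgiving sketch that catches per-proxy failures, assuming the same `proxy_list`; the 10-second timeout and the `fetch_safe`/`main_safe` names are illustrative:
```python
import asyncio
import aiohttp

async def fetch_safe(session, url, proxy):
    try:
        timeout = aiohttp.ClientTimeout(total=10)
        async with session.get(url, proxy=proxy, timeout=timeout) as response:
            return await response.text()
    except (aiohttp.ClientError, asyncio.TimeoutError):
        return None  # mark this proxy as failed instead of aborting the whole batch

async def main_safe():
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_safe(session, 'http://example.com', p) for p in proxy_list]
        results = await asyncio.gather(*tasks)
        print(f'{sum(r is not None for r in results)} of {len(results)} proxies succeeded')

asyncio.run(main_safe())
```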
Choose the library and approach that fit your needs, make sure the proxy IPs are actually working, and refresh the proxy list regularly so that requests are not rejected; a minimal validity check is sketched below.
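As a starting point for checking validity, each proxy can be tested against a known endpoint and dropped if it fails; running this periodically keeps the list fresh. A minimal sketch assuming the proxy URL format used above and `https://httpbin.org/ip` as the test endpoint (any stable URL works):
```python
import requests

def filter_working_proxies(proxy_list, test_url='https://httpbin.org/ip', timeout=5):
    """Return only the proxies that can fetch `test_url` within `timeout` seconds."""
    working = []
    for proxy in proxy_list:
        try:
            resp = requests.get(test_url,
                                proxies={'http': proxy, 'https': proxy},
                                timeout=timeout)
            if resp.ok:
                working.append(proxy)
        except requests.exceptions.RequestException:
            pass  # dead or slow proxy, skip it
    return working

proxy_list = filter_working_proxies(proxy_list)
```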