Python Crawler Tips: Automatically Following and Handling Baidu Page Redirects

Published: 2024-08-07 19:46

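Redirects are common in web crawler development, especially when visiting large sites such as Baidu. A redirect can be temporary or permanent, and a crawler has to follow each hop and handle it correctly. This article looks at how to write a Python crawler that automatically follows and handles redirects on Baidu pages.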

Understanding HTTP Redirects

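An HTTP redirect is the server telling the client (a browser or a crawler) that the requested resource now lives at another URL, which it names in the Location response header. Status codes 301 (Moved Permanently) and 302 (Found, a temporary move) are the most common redirect codes.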
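To see what a redirect looks like before any library follows it, the sketch below starts a throwaway local server and queries it with http.client, which does not follow redirects by itself. The server, the `DemoHandler` class, the `inspect_redirect` helper and the `/old` and `/new` paths are all hypothetical stand-ins for illustration; they are not Baidu's actual behavior.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical demo server: /old redirects to /new, anything else returns 200."""

    def do_GET(self):
        if self.path == "/old":
            self.send_response(302)
            self.send_header("Location", "/new")
            self.send_header("Content-Length", "0")
            self.end_headers()
        else:
            body = b"new page"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence per-request logging


def inspect_redirect(path):
    """Return (status, Location header) for one request, without following it."""
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        conn = http.client.HTTPConnection("127.0.0.1", server.server_address[1], timeout=5)
        conn.request("GET", path)
        response = conn.getresponse()
        response.read()
        conn.close()
        return response.status, response.getheader("Location")
    finally:
        server.shutdown()
        server.server_close()


status, location = inspect_redirect("/old")
print(status, location)  # prints: 302 /new
```

The client receives only a status code and a Location header; deciding whether to request the new URL is up to the client.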

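301 Redirect

The resource has been moved permanently to a new URL. The crawler should update its index to use the new URL.

302 Redirect

The resource has been moved temporarily to a new URL. The crawler should follow the redirect for this request but keep the original URL for future requests.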
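The difference between the two codes can be reduced to one decision: which URL the crawler keeps in its index. The following is a minimal sketch; the `url_to_store` function and the `PERMANENT` set are hypothetical names, and status 308 (Permanent Redirect) is included as an assumption that it should be treated like 301.

```python
from urllib.parse import urljoin

# Status codes that mean "the new URL replaces the old one".
# 301 comes from the text above; treating 308 the same way is an assumption.
PERMANENT = {301, 308}


def url_to_store(original_url, status, location):
    """Return the URL a crawler should keep in its index after one response."""
    if status in PERMANENT and location:
        # Location may be relative, so resolve it against the requested URL.
        return urljoin(original_url, location)
    # Temporary redirects (such as 302) and normal responses keep the original URL.
    return original_url


print(url_to_store("http://example.com/a", 301, "/b"))  # http://example.com/b
print(url_to_store("http://example.com/a", 302, "/b"))  # http://example.com/a
```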

Handling Redirects with Python's urllib

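Python's urllib package provides tools for making HTTP requests, including automatic redirect handling. Sometimes, however, we need finer-grained control, such as limiting the number of redirects or recording the redirect history.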

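Following Redirects Automatically

urllib's urlopen function follows redirects automatically, but by default it does not expose the intermediate hops; only the final URL is available, through response.geturl(). The following example uses urllib to follow redirects automatically: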

```python
import urllib.error
import urllib.request


def fetch_url(url):
    """Fetch a page, letting urlopen follow any redirects automatically."""
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.read().decode('utf-8')
    except urllib.error.URLError as e:
        print(f"Failed to reach a server: {e.reason}")
        return None


# Usage example
content = fetch_url('http://www.baidu.com')
```
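To check whether a request was actually redirected, the final URL can be compared with the requested one. The sketch below does this against a throwaway local server so that it runs without network access; `DemoHandler`, `fetch_and_report` and the `/start` and `/final` paths are hypothetical. It uses an opener built with an empty ProxyHandler, an assumption made so that proxies configured in the environment are ignored; the opener otherwise follows redirects the same way urlopen does.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical demo server: /start redirects permanently to /final."""

    def do_GET(self):
        if self.path == "/start":
            self.send_response(301)
            self.send_header("Location", "/final")
            self.send_header("Content-Length", "0")
            self.end_headers()
        else:
            body = b"final page"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence per-request logging


def fetch_and_report(path):
    """Return (requested URL, final URL) after redirects have been followed."""
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    requested = f"http://127.0.0.1:{server.server_address[1]}{path}"
    # An empty ProxyHandler keeps the request from being sent through a proxy.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(requested, timeout=5) as response:
            return requested, response.geturl()
    finally:
        server.shutdown()
        server.server_close()


requested, final = fetch_and_report("/start")
print("redirected:", requested != final)
```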

Custom Redirect Handling

For finer-grained control, we can customize the redirect handling logic:

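```python
from urllib import error, request


class RedirectHandler(request.HTTPRedirectHandler):
    """Follow redirects, but stop after max_redirects hops and record each one."""

    def __init__(self, max_redirects=10):
        super().__init__()
        self.max_redirects = max_redirects
        self.history = []  # (status code, target URL) for every hop followed

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # The base class calls this for every redirect status it handles,
        # so 301 and 302 are both counted (overriding http_error_302 alone
        # would miss 301 responses).
        if len(self.history) >= self.max_redirects:
            raise error.HTTPError(req.full_url, code, "too many redirects", headers, fp)
        self.history.append((code, newurl))
        return super().redirect_request(req, fp, code, msg, headers, newurl)


def fetch_url_with_redirect_handling(url):
    # A fresh handler per call keeps the hop count from leaking between
    # requests, and using the opener directly avoids changing the global one.
    opener = request.build_opener(RedirectHandler())
    try:
        with opener.open(url, timeout=10) as response:
            return response.read().decode('utf-8')
    except error.HTTPError as e:
        print(f"HTTP error: {e.code}")
        return None
    except error.URLError as e:
        print(f"URL error: {e.reason}")
        return None


# Usage example
content = fetch_url_with_redirect_handling('http://www.baidu.com')
```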

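Persistent Connections

A persistent connection carries multiple HTTP requests and responses over a single TCP connection, which avoids the overhead of repeatedly opening and closing connections. Note that urllib.request does not reuse connections: it adds a Connection: close header to every request. Reusing a connection therefore requires the lower-level http.client module, which speaks HTTP/1.1 and keeps the connection open by default.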

Persistent Connections with http.client

The following example implements a persistent connection with http.client:

```python
import base64
import http.client
from urllib.parse import urlparse

# Proxy server settings (sample values; replace with your own)
proxyHost = "www.16yun.cn"
proxyPort = "5445"
proxyUser = "16QMSOML"
proxyPass = "280651"


class PersistentHTTPConnection(http.client.HTTPConnection):
    """An HTTP/1.1 connection that can be reused for several requests."""

    def __init__(self, host, port=None, timeout=10, proxy_info=None):
        self.proxy_headers = {}
        if proxy_info:
            # Connect to the proxy instead of the target host and send
            # HTTP Basic credentials as a header with every request.
            host, port, username, password = proxy_info
            token = base64.b64encode(f"{username}:{password}".encode('utf-8'))
            self.proxy_headers["Proxy-Authorization"] = "Basic " + token.decode('ascii')
        super().__init__(host, int(port) if port else None, timeout=timeout)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.close()


def fetch_with_persistent_connection(conn, url):
    """Send one GET over an already-open connection and return the decoded body."""
    parsed_url = urlparse(url)
    path = parsed_url.path or "/"
    if parsed_url.query:
        path += "?" + parsed_url.query
    # A proxy expects the absolute URL; a direct connection expects only the path.
    target = url if conn.proxy_headers else path
    conn.request("GET", target, headers=conn.proxy_headers)
    response = conn.getresponse()
    body = response.read()  # read the whole body before reusing the connection
    if response.status == 200:
        return body.decode('utf-8')
    print(f"HTTP error: {response.status}")
    return None


# Usage example: two requests share one connection to the proxy
proxy_info = (proxyHost, proxyPort, proxyUser, proxyPass)
with PersistentHTTPConnection('www.baidu.com', proxy_info=proxy_info) as conn:
    content = fetch_with_persistent_connection(conn, 'http://www.baidu.com/')
    robots = fetch_with_persistent_connection(conn, 'http://www.baidu.com/robots.txt')
```
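One way to confirm that a connection is really being reused is to look at the client's source port, which stays the same for as long as the TCP connection stays open. The sketch below checks this against a throwaway local server that echoes the port back; `KeepAliveHandler` and `client_ports` are hypothetical names, and the server is only a stand-in for a real site.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class KeepAliveHandler(BaseHTTPRequestHandler):
    """Hypothetical demo server that replies with the client's source port."""

    protocol_version = "HTTP/1.1"  # lets the demo server keep connections open

    def do_GET(self):
        body = str(self.client_address[1]).encode("ascii")
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence per-request logging


def client_ports(paths):
    """Request each path over one HTTPConnection and return the ports the server saw."""
    server = HTTPServer(("127.0.0.1", 0), KeepAliveHandler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    conn = http.client.HTTPConnection("127.0.0.1", server.server_address[1], timeout=5)
    try:
        ports = []
        for path in paths:
            conn.request("GET", path)
            response = conn.getresponse()
            ports.append(response.read().decode("ascii"))
        return ports
    finally:
        conn.close()  # close first so the server's handler loop can finish
        server.shutdown()
        server.server_close()


ports = client_ports(["/", "/again", "/once-more"])
print("connection reused:", len(set(ports)) == 1)
```

If the server had closed the connection between requests, http.client would have reconnected transparently and the reported ports would differ.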