Python Web Crawling Techniques

2025-02-20

Web crawling is the technique of using a program to fetch information from web pages automatically.
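
To make the idea concrete, here is a minimal sketch of the core crawl loop: fetch a page, extract its links, and queue the ones not seen yet. It uses only the standard library, and the "website" is a hypothetical in-memory dictionary (FAKE_SITE) so that the example runs without network access; fetch, LinkParser, and crawl are illustrative names, not part of any library discussed below.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "website" standing in for real HTTP responses.
FAKE_SITE = {
    "http://site.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://site.test/a": '<a href="/b">B</a> <a href="/">home</a>',
    "http://site.test/b": "<p>no links here</p>",
}


class LinkParser(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def fetch(url):
    """Stand-in for an HTTP GET; returns the page body or an empty string."""
    return FAKE_SITE.get(url, "")


def crawl(start_url):
    """Breadth-first crawl: fetch a page, extract links, queue unseen ones."""
    seen = {start_url}
    queue = deque([start_url])
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        parser = LinkParser()
        parser.feed(fetch(url))
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return order


print(crawl("http://site.test/"))
```

In a real crawler, fetch would issue an HTTP request and the parsing step would typically be handled by one of the libraries introduced in the following sections.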

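Introduction to the requests library

Requests is a Python library for sending HTTP requests and reading the responses. It covers the needs of today's web (see the official user guide), with features including:

Keep-Alive and connection pooling
International domain names and URLs
Sessions with persistent cookies
Browser-style SSL verification
Automatic content decoding
Basic/Digest authentication
Elegant key/value cookies
Automatic decompression
Unicode response bodies
HTTP(S) proxy support
Multipart file uploads
Streaming downloads
Connection timeouts
Chunked requests
.netrc support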

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36",
    "Accept-Language": "zh-CN,en;q=0.9",
    "Authorization": "Bearer your_token_here",
    "Accept": "*/*",
    "Accept-Encoding": "gzip"
}

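# Note: when auth= is passed, requests replaces the "Authorization" header set above
# with HTTP Basic credentials, so use either the header or auth=, not both.
r = requests.get('https://api.github.com/user', auth=('user', 'pass'), headers=headers, timeout=10)

r.headers['content-type']  # value of the Content-Type response header

r.encoding  # encoding requests will use to decode r.text

r.text  # response body as text

r.json()  # response body parsed as JSON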
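
One detail of the call above is worth checking: it supplies both an Authorization header and an auth tuple. The sketch below builds the same kind of request without sending it, so the final headers can be inspected offline. The token and credentials are the placeholder values from the example, not real ones.

```python
import requests

# Placeholder values mirroring the request above.
headers = {"Authorization": "Bearer your_token_here", "Accept": "*/*"}

# Build the request without sending it, so no network access is needed.
req = requests.Request(
    "GET",
    "https://api.github.com/user",
    headers=headers,
    auth=("user", "pass"),
)
prepared = req.prepare()

# The auth tuple is applied after the headers, so the Basic credentials win.
print(prepared.headers["Authorization"])
```

The printed value starts with "Basic", which suggests choosing one mechanism per request: either the bearer token in the header or the auth argument.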

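Advanced users can use the requests Session object to manage a session. A session persists cookies, default headers, and authentication across requests and reuses the underlying connections.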

import requests

# Persist cookies across requests:
s = requests.Session()
# Provide default data for every request made from this session
s.auth = ('user', 'pass')
s.headers.update({'x-test': 'true'})

s.get('http://httpbin.org/cookies/set/sessioncookie/123456789', timeout=5)
r = s.get("http://httpbin.org/cookies")

# Used as a context manager, the session is closed when the with block exits, even if an exception was raised.
with requests.Session() as s:
    s.get('http://httpbin.org/cookies/set/sessioncookie/123456789')
# SSL certificate verification
requests.get('https://github.com', verify=True)
requests.get('https://github.com', verify='/path/to/certfile')
# Status code lookup object
requests.codes.ok
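
How session defaults end up on each request can be seen without any network traffic by preparing a request through the session. This is a minimal sketch: the x-test2 header and the manually set cookie are illustrative values, and prepare_request is used only to inspect the merged result.

```python
import requests

with requests.Session() as s:
    # Session-level defaults, as in the example above.
    s.auth = ("user", "pass")
    s.headers.update({"x-test": "true"})
    s.cookies.set("sessioncookie", "123456789")

    # prepare_request merges the session defaults into this request without sending it.
    req = requests.Request("GET", "http://httpbin.org/cookies", headers={"x-test2": "true"})
    prepared = s.prepare_request(req)

    for name in ("x-test", "x-test2", "Authorization", "Cookie"):
        print(name, "=", prepared.headers.get(name))
```

Headers passed on the individual request are merged with the session's headers, while the session's auth and cookies are applied to every request prepared through it.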

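For client certificates, CA certificates, body content workflow, keep-alive (persistent connections), streaming uploads, chunk-encoded requests, POSTing multiple multipart-encoded files, event hooks, custom authentication, streaming requests, proxies, SOCKS, timeouts, and more, see the Advanced Usage section of the official documentation.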

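Introduction to the BeautifulSoup library

Beautiful Soup is a Python library for extracting data from HTML and XML files. It works with the parser of your choice to provide idiomatic ways of navigating, searching, and modifying the document tree, and it can save hours or even days of work.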

from bs4 import BeautifulSoup

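# Parse from an open file handle (naming the parser explicitly avoids a warning)
with open("index.html") as fp:
    soup = BeautifulSoup(fp, 'html.parser')

# Parse from a string
soup = BeautifulSoup("<html>data</html>", 'html.parser')

# html_doc is assumed to hold the HTML text to parse
soup = BeautifulSoup(html_doc, 'html.parser')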

soup.head
soup.body
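# Find all <a> tags
soup.find_all('a')
# Filter by keyword argument
soup.find_all(id='link2')
# Search by CSS class
soup.find_all("a", class_="sister")
# css_soup is assumed to be another parsed document; this matches the exact class string "body strikeout"
css_soup.find_all("p", class_="body strikeout")
# The string argument searches text content; like name, it accepts a string, a regular expression, a list, or True
soup.find_all(string="Elsie")
# The limit argument caps the number of results
soup.find_all("a", limit=2)
# Pass recursive=False to search only a tag's direct children
soup.html.find_all("title", recursive=False)
# find_all always returns a list, even with limit=1
soup.find_all('title', limit=1)
# [<title>The Dormouse's story</title>]
# find returns the first match itself
soup.find('title')
# <title>The Dormouse's story</title>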

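# CSS selectors
soup.select("#link1")
soup.select("p:nth-of-type(3)")
soup.select("p > a")
soup.select(".sister")
soup.select('a[href]')
# Return only the first matching element
soup.select_one(".sister")
# Append content to a tag (tag, a_tag and new_tag below are assumed to be existing Tag objects;
# a plain string is appended as text, not parsed as markup)
tag.append("xx")
# Insert content at a given position
tag.insert(1, "but did not endorse ")
tag.string.insert_before('xxx')
tag.insert_after('xx')
# PageElement.extract() removes the tag from the tree and returns it
tag.extract()
# decompose() removes the node from the tree and destroys it completely
tag.decompose()
# replace_with() removes content from the tree and puts a new tag or text node in its place
a_tag.replace_with(new_tag)
# prettify() returns the tree as a formatted Unicode string, with each HTML/XML tag on its own line
soup.prettify()
# get_text() returns the text contained in a tag
soup.get_text()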
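
The calls listed above refer to names such as html_doc and new_tag without defining them. The following self-contained sketch puts a few of them together on a small sample document; the HTML is an assumed example in the style of the Beautiful Soup documentation.

```python
from bs4 import BeautifulSoup

# Assumed sample document for illustration.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Searching: find() returns one element, select() returns a list.
title = soup.find("title").get_text()
links = [(a.get_text(), a["href"]) for a in soup.select("a.sister")]

# Modifying: build a new tag and swap it in for the first link.
new_tag = soup.new_tag("b")
new_tag.string = "Elsie"
soup.find(id="link1").replace_with(new_tag)
remaining = [a["id"] for a in soup.find_all("a")]

print(title)
print(links)
print(remaining)
```

After the replacement, the first link is gone from the tree, so only the other two anchors are found.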

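Introduction to the selenium library

When Selenium WebDriver is used to fetch dynamic page content, it usually has to deal with content rendered by JavaScript. Selenium is a powerful tool that can simulate browser behavior, wait for a page to finish loading, and extract dynamically generated content. See the reference guide.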

pip install selenium

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

# Path to the WebDriver executable
webdriver_path = "/path/to/chromedriver"  # Replace with your WebDriver path

# Initialize the WebDriver
service = Service(webdriver_path)
options = webdriver.ChromeOptions()
driver = webdriver.Chrome(service=service, options=options)

try:
    # Open the target page
    url = "https://example.com"
    driver.get(url)

    # Wait for the page to load (for example, until a given element appears)
    wait = WebDriverWait(driver, 10)  # wait at most 10 seconds
    element = wait.until(EC.presence_of_element_located((By.ID, "dynamic-element-id")))

    # Scroll the page to load more content (if needed)
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(3)  # give the page time to load the new content

    # Get the page source
    page_source = driver.page_source
    print(page_source)

finally:
    # Close the browser
    driver.quit()
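
The explicit wait in the example above works by polling: a condition is evaluated repeatedly until it returns a truthy value or the timeout expires. The sketch below models that loop in plain Python so the behavior can be tried without a browser. wait_until and element_present are hypothetical helpers for illustration, not Selenium's API or implementation.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value or timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Hypothetical stand-in for a page whose element appears after a short delay.
appears_at = time.monotonic() + 0.2


def element_present():
    return "dynamic-element" if time.monotonic() >= appears_at else None


element = wait_until(element_present, timeout=2.0, poll_interval=0.05)
print(element)
```

Compared with a fixed time.sleep, a polling wait returns as soon as the condition holds and fails loudly when it never does, which is why an explicit wait is usually preferable after actions such as scrolling.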

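Introduction to the HtmlUnit library

HtmlUnit is a headless (GUI-less) browser library for Java. It models HTML documents and provides an API for loading pages, filling in forms, clicking links, and so on, just as you would in a "normal" browser. It has fairly good JavaScript support (which keeps improving) and can handle even quite complex AJAX libraries, emulating Chrome, Firefox, or Internet Explorer depending on the configuration. It is typically used for testing or for retrieving information from websites. See the reference material.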

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class helloHtmlUnit{
    public static void main(String[] args) throws Exception{
        String str;
        // Create a WebClient
        WebClient webClient = new WebClient();
        // HtmlUnit's CSS and JavaScript handling can be slow or error-prone on some pages,
        // so both are disabled here; enable JavaScript when the page needs scripts to render
        webClient.getOptions().setJavaScriptEnabled(false);
        webClient.getOptions().setCssEnabled(false);
        // Fetch the page
        HtmlPage page = webClient.getPage("http://www.baidu.com/");
        // Get the page title
        str = page.getTitleText();
        System.out.println(str);
        // Get the page as XML
        str = page.asXml();
        System.out.println(str);
        // Get the page text (newer releases replace asText() with asNormalizedText())
        str = page.asText();
        System.out.println(str);
        // Close the WebClient (newer releases use close() instead of closeAllWindows())
        webClient.closeAllWindows();
    }
}

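Note: HtmlUnit can be packaged into a JAR file that Python calls to obtain the result and then parses. This way the JavaScript on a dynamic page is executed and the final static HTML is retrieved. Running a JAR file from Python relies on the subprocess module.

Example: running a JAR from Python and capturing the result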

import subprocess


def run_jar_file(jar_path, *args):
    # Build the command line
    command = ['java', '-jar', jar_path] + list(args)

    try:
        # Run the command and capture its output
        result = subprocess.run(command, check=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)

        # Return standard output and standard error
        return result.stdout, result.stderr

    except subprocess.CalledProcessError as e:
        # The JAR exited with a non-zero status: report it and return no output
        print(f"Error occurred while running JAR file: {e}")
        print(f"Standard Error: {e.stderr}")
        return None, e.stderr


if __name__ == "__main__":
    jar_path = "path/to/your/file.jar"  # Replace with the path to your JAR file
    args = ["arg1", "arg2"]  # Replace with the arguments to pass to the JAR file

    stdout, stderr = run_jar_file(jar_path, *args)
    
    if stdout:
        print("JAR Output:")
        print(stdout)
    
    if stderr:
        print("JAR Error:")
        print(stderr)

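Scrapy crawling framework

Scrapy is a fast, high-level web crawling and web scraping framework for crawling websites and extracting structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

Overall architecture diagram

Characteristics

Rich ecosystem: built-in middleware, pipelines, and extensions cover many complex requirements, and several export formats are supported (JSON, CSV, XML, and others).
High performance: built on the asynchronous networking library Twisted, it handles large numbers of concurrent requests efficiently. The downloader middleware and scheduler can be extended to support distributed crawling.
Flexibility: the architecture is very flexible and lets developers write custom middleware, pipelines, and extensions. XPath and CSS selectors are supported for parsing HTML.
Community support: a large user community and thorough documentation make it easy to find solutions to problems. Many third-party plugins and extensions exist (such as scrapy-splash for pages rendered by JavaScript).
Extensibility: Scrapy integrates with many other tools, such as Scrapyd (for deploying and managing spiders) and Frontera (for managing the crawl frontier in large-scale crawls).
Steeper learning curve: for beginners, Scrapy's architecture and concepts (Spider, Item Pipeline, Middleware, and so on) may take some time to understand.
Not a good fit for small projects: if you only need to scrape a small amount of data, Scrapy may be more machinery than the job requires.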

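Pyspider crawling framework

pyspider is a powerful Python web crawling framework with a complete web UI and script editor. It supports several database backends, priority control, distributed operation, and strong debugging tools, which makes it a useful tool for developers working on data scraping and crawlers.

See the reference material.

Features

Powerful web UI: create, monitor, edit, and debug crawlers from a web interface.
Multiple database backends: MySQL, MongoDB, SQLite, and other storage options are supported.
Result management: crawl results are displayed clearly and can be exported.
Task scheduling: a priority-based task scheduling system.
Script support: crawler behavior is defined flexibly in Python scripts.
Easy to learn: simple to get started with, and a first choice for small and medium projects.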
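
Priority-based scheduling can be pictured as a priority queue of pending URLs. The sketch below is a simplified standard-library model of that idea, assuming that a higher number means higher priority; PriorityScheduler is a hypothetical class for illustration and does not reflect pyspider's actual scheduler.

```python
import heapq
import itertools


class PriorityScheduler:
    """Illustrative scheduler: higher priority first, FIFO among equal priorities."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker that preserves insertion order
        self._seen = set()

    def add(self, url, priority=0):
        """Queue a URL once; return False if it was already scheduled."""
        if url in self._seen:
            return False
        self._seen.add(url)
        # heapq is a min-heap, so the priority is negated.
        heapq.heappush(self._heap, (-priority, next(self._counter), url))
        return True

    def pop(self):
        """Return the pending URL with the highest priority."""
        _, _, url = heapq.heappop(self._heap)
        return url

    def __len__(self):
        return len(self._heap)


scheduler = PriorityScheduler()
scheduler.add("http://example.com/archive", priority=1)
scheduler.add("http://example.com/", priority=9)
scheduler.add("http://example.com/news", priority=5)
scheduler.add("http://example.com/", priority=9)  # duplicate, ignored

order = [scheduler.pop() for _ in range(len(scheduler))]
print(order)
```

The URLs come out in order of decreasing priority, and the duplicate is scheduled only once.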

Code example

from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    crawl_config = {
    }

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('http://scrapy.org/', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        return {
            "url": response.url,
            "title": response.doc('title').text(),
        }

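Other scraping tools

The compact requests-html library

An enhanced library built on requests that can also fetch dynamically rendered pages and is lighter and more efficient than Selenium. It aims for a pleasant request experience and near-magical page parsing.

Full JavaScript support!
CSS selectors (jQuery style, thanks to PyQuery).
XPath selectors, for the faint of heart.
Custom user-agent (like a real web browser).
Automatic following of redirects.
Connection pooling and cookie persistence.

Official guide

Note: PyPI (https://pypi.org/search/?q=requests-html) lists several newer packages derived from requests-html, including requests_html2 (could not render dynamic HTML) and requests-html-playwright (rendered correctly). Code using the latter: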

from requests_html_playwright import HTMLSession
from requests_html_playwright.requests_html import HTMLResponse

with HTMLSession() as session:
    # Limitation: there is no option to run the browser in headed mode, so page execution cannot be observed
    response: HTMLResponse = session.get('http://quotes.toscrape.com/js/')
    response.html.render()
    print(response.html.html)

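Microsoft Playwright

Playwright supports all modern rendering engines, including Chromium, WebKit, and Firefox. It offers native mobile emulation of Google Chrome for Android and Mobile Safari. The same rendering engine runs on the desktop and in the cloud, headless or headed.

Resilient, no flaky tests: Playwright waits for elements to be actionable before performing actions.
Web-first assertions: Playwright assertions are built for the dynamic web, and checks are retried automatically until the necessary conditions are met.
Trusted events: hover over elements, interact with dynamic controls, and produce trusted events. Playwright uses the real browser input pipeline, indistinguishable from a real user.
Full isolation: Playwright creates a browser context for each test, giving complete test isolation with zero overhead.
Authentication reuse: save the authentication state of a context and reuse it in all tests. This avoids repeating the login in every test while keeping independent tests fully isolated.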

from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # headless=True runs without a visible browser window; slow_mo (milliseconds) slows each operation down
    browser = p.chromium.launch(headless=False, slow_mo=0)
    # To share cookies or headers, create a context first and open pages from it:
    # c = browser.new_context()
    # c.new_page()
    page = browser.new_page()
    page.goto("http://quotes.toscrape.com/js/")
    # Parse the rendered HTML with BeautifulSoup
    doc = BeautifulSoup(page.content(), 'html.parser')
    print(doc.select('div.quote'))
    browser.close()

Official guide

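Pyppeteer

A headless Chrome/Chromium browser automation library. pyppeteer is an unofficial Python port of Google's official puppeteer. See the official documentation; the project has not been updated for a long time and may contain unknown bugs.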

import asyncio
from pyppeteer import launch

async def main():
    browser = await launch()
    page = await browser.newPage()
    await page.goto('http://example.com')
    await page.screenshot({'path': 'example.png'})
    await browser.close()

# asyncio.run() creates and closes the event loop; it replaces the older get_event_loop().run_until_complete() pattern
asyncio.run(main())

Official guide

References

Thanks to the authors of the following material for their contributions:
https://blog.csdn.net/ck784101777/article/details/104468780
https://blog.csdn.net/Python966/article/details/132908807