Working with Word Documents in Python

2025-02-20

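The pypandoc library makes it easy to convert Word documents (.docx) into other formats such as HTML or Markdown. pypandoc does not parse Word content itself; it is a thin wrapper that calls the pandoc tool to perform the format conversion. To work with the content of a Word document, you typically convert it to HTML or Markdown first and then parse the result with a suitable library (such as BeautifulSoup or markdown). pypandoc also works in the other direction, converting text formats such as HTML or Markdown into Word.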
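The convert-then-parse idea can be sketched for the Markdown route as follows. This is a minimal illustration, not part of pypandoc: the helper names extract_headings and docx_headings are hypothetical, the gfm output format is chosen on the assumption that Pandoc then writes headings in the "# Heading" style, and the regular expression only covers that style.

```python
import re

# A Markdown code fence marker, built this way to keep the example easy to embed
FENCE = "`" * 3

# Matches "# Heading" style lines with 1 to 6 leading hashes
_HEADING = re.compile(r"^(#{1,6})\s+(.*?)\s*#*\s*$")


def extract_headings(markdown_text):
    """Return (level, title) pairs for every heading outside fenced code."""
    headings = []
    in_fence = False
    for line in markdown_text.splitlines():
        if line.lstrip().startswith(FENCE):
            # Toggle on every fence line so code samples are skipped
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        match = _HEADING.match(line)
        if match:
            headings.append((len(match.group(1)), match.group(2)))
    return headings


def docx_headings(path):
    """Convert a .docx file to Markdown with pypandoc, then list its headings."""
    import pypandoc  # imported lazily; requires pypandoc and Pandoc to be installed

    return extract_headings(pypandoc.convert_file(path, "gfm"))
```

Calling docx_headings('example.docx') would then return the outline of the document as a list of (level, title) tuples, provided Pandoc is available.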

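Installing pypandoc

Install the Python package itself with pip install pypandoc. Because pypandoc only wraps Pandoc, the Pandoc binary must be available as well.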

Debian/Ubuntu
```
sudo apt update
sudo apt install pandoc
```
Windows
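- Download the installer from the official Pandoc download page.
- Follow the setup wizard to complete the installation.

If you would rather not install Pandoc by hand, pypandoc.download_pandoc() can download and configure it for you. Add the following to your script: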

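```python
import pypandoc

# Download Pandoc automatically (skip this if Pandoc is already installed)
pypandoc.download_pandoc()

# Optional: to use a specific Pandoc binary instead, set the PYPANDOC_PANDOC
# environment variable before the first conversion, for example:
#   import os
#   os.environ["PYPANDOC_PANDOC"] = "/path/to/pandoc"

# Placeholder input and output; replace with your own HTML and target path
content = "<h1>Title</h1><p>Hello, Word.</p>"
out_file = "output.docx"

# Convert the content; docx output requires an outputfile
pypandoc.convert_text(content, 'docx', format='html', outputfile=out_file)
```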

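Notes:

pypandoc.download_pandoc() downloads a Pandoc release and unpacks it into a default per-user folder (or the folder passed as targetfolder) rather than installing it system-wide.
This approach suits quick tests and disposable deployment environments; in production, install Pandoc manually.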
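One way to combine both approaches is to prefer an installed Pandoc and download only as a fallback. The sketch below assumes a hypothetical helper named ensure_pandoc; the which and download parameters exist only so the logic can be exercised without touching the network.

```python
import shutil


def ensure_pandoc(which=shutil.which, download=None):
    """Use an installed pandoc when there is one; otherwise download it.

    Returns "system" if pandoc was found on PATH and "downloaded" if the
    fallback download was triggered.
    """
    if which("pandoc"):
        return "system"
    if download is None:
        # Imported lazily so the PATH check works even without pypandoc
        import pypandoc

        download = pypandoc.download_pandoc
    download()
    return "downloaded"
```

Calling ensure_pandoc() once at startup keeps manual installations in control and still lets a fresh test machine run the script.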

Converting between HTML and Word

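With the pypandoc and beautifulsoup4 modules you can convert HTML to Word and back.

1. Fetch HTML and convert it to Word. The following example fetches a page from the web, extracts the part that is needed, and saves it as a Word document.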

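```python
import pypandoc
import requests
from bs4 import BeautifulSoup

news_url = "http://news.sina.com.cn/xxx/xx/xxx.html"
# Example header; some sites reject requests without a browser-like User-Agent
headers = {"User-Agent": "Mozilla/5.0"}
response = requests.get(news_url, headers=headers, timeout=10)

if response.status_code == requests.codes.forbidden:
    msg = f'==>ERROR: access denied! response status: {response.status_code}, url: {news_url}'
    raise Exception(msg)

bs_doc = BeautifulSoup(response.content, 'html.parser')
content_html = bs_doc.select_one('div[class="article-content"]')
if content_html is None:
    raise ValueError(f'article content not found, url: {news_url}')

# Remove <img> tags (optional)
for img in content_html.find_all('img'):
    img.decompose()

# Append the source URL. Build the <br> as a real tag: appending the string
# "<br> ..." would insert escaped text rather than a line break.
content_html.append(bs_doc.new_tag('br'))
content_html.append(f"Source: {news_url}")

# Convert the HTML to Word; convert_text expects a string, not a Tag
pypandoc.convert_text(str(content_html), 'docx', format='html', outputfile='/data/xx.docx')
```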
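The extracted fragment can also be wrapped in a complete HTML document before conversion, which gives Pandoc an explicit charset and a title to work with. This is a minimal sketch using only the standard library; the helper name wrap_fragment and the layout of the source line are assumptions for illustration, not something pypandoc requires.

```python
from html import escape


def wrap_fragment(fragment_html, title, source_url):
    """Wrap an HTML fragment in a full document with a title and a source link.

    The fragment is inserted as-is; title and URL are escaped because they
    are plain text, not markup.
    """
    safe_url = escape(source_url, quote=True)
    return (
        "<!DOCTYPE html>\n"
        '<html><head><meta charset="utf-8">'
        f"<title>{escape(title)}</title></head>\n"
        f"<body>\n{fragment_html}\n"
        f'<p>Source: <a href="{safe_url}">{safe_url}</a></p>\n'
        "</body></html>"
    )
```

The result of wrap_fragment(str(content_html), "Article title", news_url) can be passed to pypandoc.convert_text in place of the bare fragment.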

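2. Parse Word content and convert it to HTML. pypandoc converts the Word file to HTML but offers no API for inspecting the content itself, so the converted HTML is parsed with BeautifulSoup, which can then rebuild whatever HTML structure you need.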

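```python
import html

import pypandoc
from bs4 import BeautifulSoup

input_file = 'example.docx'
output_file = 'output.html'

# With outputfile set, convert_file writes to disk and returns an empty string
pypandoc.convert_file(input_file, 'html', outputfile=output_file, encoding='utf-8')

# Read the generated HTML file
with open(output_file, 'r', encoding='utf-8') as file:
    html_content = file.read()

# Parse the HTML content with BeautifulSoup
soup = BeautifulSoup(html_content, 'html.parser')

# Example: extract every paragraph and rebuild a minimal HTML document
paragraphs = soup.find_all('p')
html_content = '<html><body>'
for paragraph in paragraphs:
    # Escape the text so characters such as "<" and "&" stay valid HTML
    html_content += f"<p>{html.escape(paragraph.get_text(strip=True))}</p>"
html_content += '</body></html>'

# Find nodes that contain a specific text
# search_text = "specific text"
# matching_nodes = soup.find_all(string=lambda text: text and search_text in text)

# Write the rebuilt HTML back to the output file
with open(output_file, 'w', encoding='utf-8') as file:
    file.write(html_content)
```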
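The text search hinted at in the commented-out lines can also be done without BeautifulSoup. The sketch below swaps in the standard library's html.parser to collect paragraph text and filter it; the names ParagraphCollector and find_paragraphs are hypothetical, and it assumes that only the text of p elements matters.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text of every <p> element, ignoring inline markup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []
        self._in_paragraph = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_paragraph = True
            self._parts = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_paragraph:
            self.paragraphs.append("".join(self._parts).strip())
            self._in_paragraph = False

    def handle_data(self, data):
        if self._in_paragraph:
            self._parts.append(data)


def find_paragraphs(html_text, search_text):
    """Return the text of all paragraphs that contain search_text."""
    parser = ParagraphCollector()
    parser.feed(html_text)
    parser.close()
    return [text for text in parser.paragraphs if search_text in text]
```

Feeding it the file written by pypandoc.convert_file, for example find_paragraphs(open('output.html', encoding='utf-8').read(), 'keyword'), returns the matching paragraphs as plain strings.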