# pdf_auto_marker

**Repository Path**: starphin/pdf_auto_marker

## Basic Information

- **Project Name**: pdf_auto_marker
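- **Description**: An LLM-based tool that automatically detects and masks sensitive information in PDFs; it supports parallel processing and can handle very large files of several thousand pages.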
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2026-03-04
- **Last Updated**: 2026-03-04

## Categories & Tags

**Categories**: Uncategorized

**Tags**: AI, information security, pdf, data masking, redaction

## README

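# PDF Sensitive Information Auto-Masking Tool

An LLM-based tool that automatically detects and masks sensitive information in PDFs. It supports streaming and parallel processing and can handle very large files of several thousand pages.

## Features

- Intelligent detection of sensitive information: ID card numbers, phone numbers, home addresses, unified social credit codes, bank card numbers, dates of birth, and more; other types can be supported by editing the prompt.
- Detects sensitive fields that wrap across lines (for example, a name split over two lines) and addresses that span multiple lines
- Parallel processing: multithreaded page processing, suited to files with thousands of pages
- In-memory streaming: the PDF is not split into temporary files, which reduces disk I/O
- Page-range debugging: process only a given page range, which is convenient for testing and debugging

## Masking Result
![Masking result](images/打码效果1.png)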

## Architecture

```
Input PDF → stream pages → render page to image (in memory) → detect sensitive information (LLM / Baidu OCR / hybrid) → mask image (in memory) → stream into output PDF → output PDF
```

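### Core Modules

- `text_detector.py` - abstract interface for text detectors
- `model_text_detector.py` - LLM-based detector implementation
- `baidu_ocr_detector.py` - Baidu OCR detector implementation
- `hybrid_detector.py` - hybrid detection implementation
- `pdf_stream_processor.py` - core class for streaming PDF processing
- `parallel_processor.py` - parallel processing framework
- `main.py` - main entry point
- `config.json` - configuration file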
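
The module list above suggests a shared detector interface with interchangeable implementations. The following is a minimal sketch of that idea, not the project's actual code: `Box`, `TextDetector`, `FixedDetector`, and `UnionDetector` are hypothetical names, and combining detectors by taking the union of their boxes is an assumption (the README does not say how `hybrid_detector.py` merges results).

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in image pixel coordinates (hypothetical type)."""
    left: int
    top: int
    right: int
    bottom: int


class TextDetector(ABC):
    """Hypothetical common interface: image bytes in, boxes to mask out."""

    @abstractmethod
    def detect(self, image_bytes: bytes) -> List[Box]:
        ...


class FixedDetector(TextDetector):
    """Stand-in detector that returns preset boxes (for illustration only)."""

    def __init__(self, boxes: List[Box]):
        self._boxes = list(boxes)

    def detect(self, image_bytes: bytes) -> List[Box]:
        return list(self._boxes)


class UnionDetector(TextDetector):
    """Assumed hybrid strategy: union of all detectors' boxes."""

    def __init__(self, detectors: List[TextDetector]):
        self._detectors = list(detectors)

    def detect(self, image_bytes: bytes) -> List[Box]:
        seen = set()
        merged = []
        for detector in self._detectors:
            for box in detector.detect(image_bytes):
                if box not in seen:  # drop exact duplicates, keep first-seen order
                    seen.add(box)
                    merged.append(box)
        return merged
```

With an interface like this, the rest of the pipeline only depends on `detect`, so swapping the LLM detector for the OCR detector would not touch the masking code.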

## Installation

### System Requirements

- Python 3.8+
- poppler (used to render PDF pages to images)

### Installing poppler

**macOS:**
```bash
brew install poppler
```

**Ubuntu/Debian:**
```bash
sudo apt-get install poppler-utils
```

**Windows:**
Download and install [poppler for Windows](http://blog.alivate.com.au/poppler-windows/)

### Installing Python Dependencies

```bash
pip install -r requirements.txt
```

## Configuration

Create a `config.json` file to set the parameters:

```json
{
  "detector_type": "hybrid",
  "model": {
    "api_key": "your-model-api-key",
    "base_url": "https://your-model-api-endpoint",
    "model_name": "deepseek-v3.2"
  },
  "baidu_ocr": {
    "api_key": "your-baidu-api-key",
    "secret_key": "your-baidu-secret-key"
  },
  "processing": {
    "parallel": false,
    "workers": 4,
    "dpi": 300,
    "max_image_size": 1024,
    "mask_color": "#000000"
  }
}
```
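
As an illustration of how such a file might be consumed, the sketch below parses the JSON, fills in missing `processing` options, and converts `mask_color` to an RGB tuple. The function names `parse_config`, `load_config`, and `hex_to_rgb` are hypothetical, and using the sample values as fallback defaults is an assumption; the project's own loader may behave differently.

```python
import json
from pathlib import Path

# Values copied from the sample config; treating them as defaults is an assumption.
DEFAULT_PROCESSING = {
    "parallel": False,
    "workers": 4,
    "dpi": 300,
    "max_image_size": 1024,
    "mask_color": "#000000",
}


def parse_config(text):
    """Parse config JSON text and fill in missing processing options."""
    config = json.loads(text)
    processing = {**DEFAULT_PROCESSING, **config.get("processing", {})}
    if processing["workers"] < 1:
        raise ValueError("processing.workers must be at least 1")
    if processing["dpi"] <= 0:
        raise ValueError("processing.dpi must be positive")
    config["processing"] = processing
    return config


def load_config(path="config.json"):
    """Read a config file from disk and parse it."""
    return parse_config(Path(path).read_text(encoding="utf-8"))


def hex_to_rgb(color):
    """Convert a '#RRGGBB' string such as mask_color into an (R, G, B) tuple."""
    value = color.lstrip("#")
    if len(value) != 6:
        raise ValueError(f"expected #RRGGBB, got {color!r}")
    return tuple(int(value[i:i + 2], 16) for i in (0, 2, 4))
```

Validating `workers` and `dpi` at load time surfaces a bad config before any API call is made.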

## Usage

### Debug mode (quick test)

```bash
python main.py \
  --input input.pdf \
  --output output.pdf \
  --debug
```

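### Using the LLM detector

The command-line options listed below include no detector option; the detector is chosen by the `detector_type` field in `config.json`.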

```bash
python main.py \
  --input input.pdf \
  --output output.pdf \
  --config config.json
```

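### Using the Baidu OCR detector

The command is the same as for the LLM detector; set `detector_type` in `config.json` to select the Baidu OCR detector and fill in the `baidu_ocr` keys.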

```bash
python main.py \
  --input input.pdf \
  --output output.pdf \
  --config config.json
```

### Processing a page range

```bash
python main.py \
  --input input.pdf \
  --output output.pdf \
  --config config.json \
  --page-range 1-5
```
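
For illustration, a `start-end` value could be turned into page indices as sketched below. This assumes page numbers are 1-based and the range is inclusive, which the README does not state explicitly; `parse_page_range` is a hypothetical helper, not a function from this repository.

```python
def parse_page_range(spec, total_pages):
    """Turn a 1-based inclusive 'start-end' string into 0-based page indices."""
    start_text, sep, end_text = spec.partition("-")
    if not sep:
        raise ValueError(f"expected 'start-end', got {spec!r}")
    start, end = int(start_text), int(end_text)
    if start < 1 or end < start:
        raise ValueError(f"invalid page range: {spec!r}")
    end = min(end, total_pages)  # clamp to the document length
    return list(range(start - 1, end))
```

Under these assumptions, `parse_page_range("1-5", 10)` selects the first five pages, and a range that runs past the end of the document is clamped rather than rejected.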

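### Command-line Options

| Option | Description | Required | Default |
|------|------|------|--------|
| `--input` | Path to the input PDF file | Yes | - |
| `--output` | Path to the output PDF file | Yes | - |
| `--config` | Path to the configuration file | No | config.json |
| `--debug` | Debug mode: process the PDF directly and output the result | No | False |
| `--page-range` | Page range to process, format: "start-end" | No | All pages |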

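## Notes

1. **API rate limiting**: A large number of concurrent requests may trigger API rate limits; keep the number of concurrent workers moderate.
2. **Memory usage**: Streaming processing reduces memory usage, but pages are still rendered and masked in memory, so make sure enough memory is available.
3. **Large files**: For files with thousands of pages, parallel mode is recommended (the `parallel` option under `processing`).
4. **DPI setting**: A higher DPI gives better detection accuracy but slower processing.
5. **Configuration file**: Make sure the API keys in config.json are correct.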
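
The trade-off between parallelism, rate limits, and memory in the notes above can be sketched with the standard library. This is a minimal illustration rather than the code in `parallel_processor.py`: `process_pages`, `process_in_batches`, and the `batch_size` parameter are hypothetical, and `handle_page` stands for whatever per-page work (render, detect, mask) the tool performs.

```python
from concurrent.futures import ThreadPoolExecutor


def process_pages(pages, handle_page, workers=4):
    """Run handle_page on every page with at most `workers` threads.

    executor.map yields results in input order, so output pages stay in
    sequence even when later pages finish first.
    """
    with ThreadPoolExecutor(max_workers=workers) as executor:
        return list(executor.map(handle_page, pages))


def process_in_batches(pages, handle_page, workers=4, batch_size=16):
    """Process a sequence of pages batch by batch to bound memory use.

    Only one batch of results is held at a time; each result is yielded
    in page order so it can be appended to the output immediately.
    """
    for start in range(0, len(pages), batch_size):
        batch = pages[start:start + batch_size]
        yield from process_pages(batch, handle_page, workers)
```

Capping `workers` limits how many API requests are in flight at once, while batching keeps the number of rendered pages held in memory proportional to `batch_size` rather than to the document length.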

## Troubleshooting

### Common Issues

**Q: poppler cannot be found**
```bash
# macOS
brew install poppler

# Ubuntu
sudo apt-get install poppler-utils
```

**Q: API calls fail**
- Check that the API keys in the configuration file are correct
- Check the network connection
- Check the API quota
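
When failures are transient (for example, rate limiting), retrying with exponential backoff is a common remedy. The sketch below is a generic pattern, not something this project is known to implement; `call_with_retry` and its parameters are hypothetical.

```python
import time


def call_with_retry(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**n seconds and retry.

    `sleep` is injectable so the waiting can be replaced in tests.
    """
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
```

A wrong API key will fail on every attempt, so retries help only with temporary errors; in practice one would usually catch the specific exception types raised by the API client rather than `Exception`.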

**Q: Detection accuracy is low**
- LLM: increase the DPI setting
- Baidu OCR: make sure the image is sharp
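
To reason about how `dpi` and `max_image_size` might interact, the arithmetic can be sketched as follows. This assumes `max_image_size` caps the longer side (in pixels) of the image sent to the detector, which the README does not confirm; all three helper functions are hypothetical.

```python
def page_pixels(width_pt, height_pt, dpi):
    """Pixel size of a page rendered at `dpi` (PDF units are 1/72 inch)."""
    return round(width_pt * dpi / 72), round(height_pt * dpi / 72)


def fit_scale(width, height, max_side):
    """Scale factor that makes the longer side at most max_side (never upscales)."""
    return min(1.0, max_side / max(width, height))


def box_to_full_res(box, scale):
    """Map a (left, top, right, bottom) box from the downscaled image back
    to full-resolution pixel coordinates."""
    return tuple(round(v / scale) for v in box)


# An A4 page is 595 x 842 points.
width, height = page_pixels(595, 842, 300)   # (2479, 3508)
scale = fit_scale(width, height, 1024)       # about 0.29
full_box = box_to_full_res((100, 200, 300, 250), scale)
```

Under this assumption, a 300 DPI A4 page is shrunk to roughly 29% of its size before detection, so raising DPI alone would not give the detector more detail unless `max_image_size` is raised as well.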

## Development Guide
### Switching to another model

Edit the `model` section in `config.json`:

```json
{
  "model": {
    "api_key": "your-api-key",
    "base_url": "https://your-api-endpoint",
    "model_name": "your-new-model"
  }
}
```

## License

Apache 2.0 License


## Contributing

Issues and Pull Requests are welcome!

## Contact

For questions or suggestions, please open an Issue.