# clawee_python

**Repository Path**: lcyll/clawee_python

## Basic Information

- **Project Name**: clawee_python
- **Description**: clawee_python
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2026-03-16
- **Last Updated**: 2026-03-20

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# Crawlee Crawler API Service

A crawler service built on the Crawlee framework. It scrapes data from HTML pages and iframes and exposes the results through a RESTful API.

## Features

- HTML scraping (two engines: BeautifulSoup and Playwright)
- iframe scraping (including nested iframes)
- Multiple authentication methods for target sites (Session Token, Playwright login)
- JWT token authentication (compatible with tokens generated in C#)
- Proxy support
- Firefox browser support
- Complete API documentation
- Containerized deployment with Docker
- Health check endpoint

## Tech Stack

- Python 3.12+
- FastAPI - web framework
- Crawlee - crawling framework
- Playwright - browser automation (Firefox)
- BeautifulSoup4 - HTML parsing
- PyJWT - JWT authentication
- Docker - containerized deployment

## Project Structure

```
clawee-python-my/
├── base_model.py       # Data models and shared utility functions
├── craw_html.py        # HTML and iframe scraping module
├── jwt_helper.py       # JWT authentication helper module
├── main.py             # FastAPI service
├── requirements.txt    # Python dependencies
├── .env                # Environment variable configuration
├── Dockerfile          # Docker build file
├── .dockerignore       # Docker ignore file
└── README.md           # Project documentation
```

## Local Development

### Requirements

- Python 3.12+
- pip

### Install dependencies

```bash
pip install -r requirements.txt
```

### Configure environment variables

Create a `.env` file:

```env
JwtAudience="Audience"
JwtIssuer="Issuer"
JwtExpMinutes="30"
JwtSecurityKey=""
```

### Start the service

```bash
python main.py
```

The service starts at http://localhost:7080.

### Access the API documentation

Once the service is running, the API documentation is available at:

- Swagger UI: http://localhost:7080/docs
- ReDoc: http://localhost:7080/redoc

## API Endpoints

### Authentication

Every endpoint that requires authentication expects a JWT token in the request header:

```
Authorization: Bearer <token>
```

### 1. Log in to obtain a JWT token

**Endpoint:** `POST /api/auth/login`

**Request parameters:**

```json
{
  "username": "admin",
  "password": "admin123"
}
```

**Example response:**

```json
{
  "code": 200,
  "msg": "登录成功",
  "data": {
    "token": "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...",
    "expires_in": 1800
  }
}
```

### 2. Scrape iframe data

**Endpoint:** `POST /api/crawl/iframe`

**Request parameters:**

```json
{
  "target_url": "https://example.com",
  "iframe_selector": "iframe",
  "nested_iframe_selector": null,
  "data_selectors": ["div.content", "h1"],
  "proxy_url": null,
  "headless": true,
  "timeout": 10000
}
```

**Example response:**

```json
{
  "code": 200,
  "msg": "iframe数据抓取成功",
  "data": {
    "url": "https://example.com",
    "main_page_title": "Example Page",
    "main_page_data": {},
    "iframe_data": {
      "div.content": ["Content 1", "Content 2"]
    },
    "nested_iframe_data": null,
    "crawl_time": "2024-01-01T12:00:00.000000"
  }
}
```

### 3. Scrape HTML data

**Endpoint:** `POST /api/crawl/html`

**Request parameters:**

```json
{
  "target_url": "https://example.com",
  "data_selectors": ["div.content", "h1", "article"],
  "proxy_url": null,
  "use_playwright": false,
  "headless": true,
  "timeout": 30000
}
```

**Example response:**

```json
{
  "code": 200,
  "msg": "HTML数据抓取成功",
  "data": {
    "url": "https://example.com",
    "data": {
      "div.content": ["Content 1", "Content 2"],
      "h1": ["Title 1"]
    },
    "crawl_time": "2024-01-01T12:00:00.000000",
    "title": "Example Page",
    "h1": ["Title 1"],
    "h2": [],
    "h3": [],
    "paragraphs": [],
    "links": [],
    "images": [],
    "meta_description": null,
    "meta_keywords": null,
    "raw_html_length": 12345
  }
}
```

### 4. Scrape HTML data with Session Token authentication

**Endpoint:** `POST /api/crawl/html/session_token`

**Request parameters:**

```json
{
  "target_url": "https://example.com/api/data",
  "session_token": "your_session_token_here",
  "token_header_name": "Authorization",
  "token_prefix": "Bearer ",
  "data_selectors": ["div.content"],
  "proxy_url": null,
  "use_playwright": false,
  "headless": true,
  "timeout": 30000,
  "headers": null
}
```

**Example response:**

```json
{
  "code": 200,
  "msg": "Session Token认证后HTML数据抓取成功",
  "data": {
    "url": "https://example.com/api/data",
    "data": {
      "div.content": ["Protected Content"]
    },
    "crawl_time": "2024-01-01T12:00:00.000000",
    "auth_info": {
      "token_header_name": "Authorization",
      "token_prefix": "Bearer ",
      "token": "your_session_token_her..."
    }
  }
}
```

### 5. Scrape HTML data after a Playwright login

**Endpoint:** `POST /api/crawl/html/playwright_login`

**Request parameters:**

```json
{
  "login_url": "https://example.com/login",
  "username": "user",
  "password": "pass",
  "target_url": "https://example.com/dashboard",
  "username_selector": "input[placeholder=\"用户名/邮箱\"]",
  "password_selector": "input[placeholder=\"密码\"]",
  "submit_selector": "button[type=\"submit\"]",
  "login_verify_selector": ".ant-pro-layout-container",
  "data_selectors": ["div.content"],
  "proxy_url": null,
  "headless": false,
  "timeout": 10000
}
```

**Example response:**

```json
{
  "code": 200,
  "msg": "Playwright浏览器登录后HTML数据抓取成功",
  "data": {
    "url": "https://example.com/dashboard",
    "data": {
      "div.content": ["Dashboard Content"]
    },
    "crawl_time": "2024-01-01T12:00:00.000000",
    "auth_info": {
      "login_url": "https://example.com/login",
      "username": "user",
      "password": "*****"
    }
  }
}
```

### 6. Scrape iframe data after a Playwright login

**Endpoint:** `POST /api/crawl/iframe/playwright_login`

**Request parameters:**

```json
{
  "login_url": "https://example.com/login",
  "username": "user",
  "password": "pass",
  "target_url": "https://example.com/dashboard",
  "iframe_selector": "iframe",
  "nested_iframe_selector": null,
  "stop_time": 3,
  "username_selector": "input[placeholder=\"用户名/邮箱\"]",
  "password_selector": "input[placeholder=\"密码\"]",
  "submit_selector": "button[type=\"submit\"]",
  "login_verify_selector": ".ant-pro-layout-container",
  "data_selectors": ["div.content"],
  "proxy_url": null,
  "headless": false,
  "timeout": 10000
}
```

**Example response:**

```json
{
  "code": 200,
  "msg": "Playwright浏览器登录后iframe数据抓取成功",
  "data": {
    "url": "https://example.com/dashboard",
    "main_page_title": "Dashboard",
    "main_page_data": {},
    "iframe_data": {
      "div.content": ["Iframe Content"]
    },
    "nested_iframe_data": null,
    "crawl_time": "2024-01-01T12:00:00.000000",
    "auth_info": {
      "login_url": "https://example.com/login",
      "username": "user",
      "password": "*****"
    }
  }
}
```

### 7. Health check

**Endpoint:** `GET /health`

**Example response:**

```json
{
  "status": "healthy",
  "service": "Crawlee Crawler API",
  "version": "2.0"
}
```

### 8. API documentation entry point

**Endpoint:** `GET /`

Returns a list of links to the API documentation.

## Docker Deployment

### Build the Docker image

```bash
docker build -t crawlee-crawler:latest .
```

### Run the Docker container

```bash
docker run -d \
  --name crawlee-crawler \
  -p 7080:7080 \
  -e JwtSecurityKey="your-secret-key-here" \
  crawlee-crawler:latest
```

### Use Docker Compose

Create a `docker-compose.yml` file:

```yaml
version: '3.8'

services:
  crawler:
    build: .
    container_name: crawlee-crawler
    ports:
      - "7080:7080"
    environment:
      - JwtAudience=Audience
      - JwtIssuer=Issuer
      - JwtExpMinutes=30
      - JwtSecurityKey=**************
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:7080/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 5s
```

Start the service:

```bash
docker-compose up -d
```

View the logs:

```bash
docker-compose logs -f
```

Stop the service:

```bash
docker-compose down
```

## Configuration

### Environment variables

| Variable | Description | Default |
|--------|------|--------|
| JwtAudience | JWT audience | Audience |
| JwtIssuer | JWT issuer | Issuer |
| JwtExpMinutes | Token lifetime (minutes) | 30 |
| JwtSecurityKey | JWT signing key | *************** |

### Proxy configuration

Specify the proxy URL in the request parameters:

```json
{
  "proxy_url": "http://username:password@proxy-host:proxy-port"
}
```

Supported proxy formats:

- HTTP proxy: `http://user:pass@host:port`
- HTTPS proxy: `https://user:pass@host:port`
- SOCKS5 proxy: `socks5://user:pass@host:port`

## JWT Authentication

### Generating a token from a C# client

If you generate the JWT token in C#, make sure that:

1. **The algorithm matches**: use HMACSHA256 (HS256).
2. **The key is identical**: C# and Python must use the same key.
3. **The key is long enough**: at least 32 bytes, as HS256 requires.

C# example:

```csharp
using System;
using System.Text;
using Jose;

var secret = "***************";
var payload = new
{
    username = "admin",
    // exp and iat are Unix timestamps in seconds
    exp = DateTimeOffset.UtcNow.AddMinutes(30).ToUnixTimeSeconds(),
    iat = DateTimeOffset.UtcNow.ToUnixTimeSeconds()
};
var token = Jose.JWT.Encode(payload, Encoding.UTF8.GetBytes(secret), JwsAlgorithm.HS256);
```

### Generating a token from a Python client

```python
import jwt
from datetime import datetime, timedelta, timezone

secret = "***************"
# Use a timezone-aware UTC timestamp; datetime.utcnow() is deprecated since Python 3.12.
now = datetime.now(timezone.utc)
payload = {
    "username": "admin",
    "exp": now + timedelta(minutes=30),
    "iat": now,
}
token = jwt.encode(payload, secret, algorithm="HS256")
```

## Troubleshooting

### 1. JWT token validation fails

**Error message**: `Token 签名无效,请检查密钥是否正确` ("Invalid token signature; check that the key is correct")

**Solution**:

- Check that C# and Python use exactly the same key
- Make sure the key is at least 32 bytes long
- Confirm that both sides use the same algorithm (HS256)

### 2. Port already in use

```bash
# Find the process using the port
netstat -ano | findstr :7080  # Windows
lsof -i :7080                 # Linux/macOS

# Change the port
python main.py --port=8001
```

### 3. Playwright browser not installed

```bash
# Install the Playwright browser
playwright install firefox
```

### 4. Container fails to start

```bash
# View the container logs
docker logs crawlee-crawler

# Open a shell in the container for debugging
docker exec -it crawlee-crawler bash
```

## Notes

1. **JWT key security**: use a strong key in production; do not use the default key.
2. **Browser resources**: Playwright uses a significant amount of memory, so adjust the container resource limits accordingly.
3. **Timeouts**: tune the timeout to the response speed of the target site.
4. **Proxies**: when using a proxy, make sure the proxy service is stable and reliable.
5. **Firefox**: all Playwright crawlers use Firefox by default.

## License

MIT License

## Contributing

Issues and pull requests are welcome!
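
## Client Sketch

The login and HTML scraping endpoints described in this README can be called from any HTTP client. The following is a minimal sketch using only the Python standard library, not code from this repository. The helper names `build_login_request`, `build_html_crawl_request`, and `send` are hypothetical, and the sketch assumes the service accepts JSON request bodies at the default address http://localhost:7080.

```python
import json
import urllib.request

# Default address from the local development section of this README.
BASE_URL = "http://localhost:7080"


def build_login_request(username: str, password: str) -> urllib.request.Request:
    """Build (without sending) the POST /api/auth/login request."""
    body = json.dumps({"username": username, "password": password}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/api/auth/login",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_html_crawl_request(
    token: str,
    target_url: str,
    data_selectors: list[str],
    timeout: int = 30000,
) -> urllib.request.Request:
    """Build (without sending) the POST /api/crawl/html request.

    The field names mirror the request example in this README; the JWT is
    passed in the Authorization header as a Bearer token.
    """
    payload = {
        "target_url": target_url,
        "data_selectors": data_selectors,
        "proxy_url": None,
        "use_playwright": False,
        "headless": True,
        "timeout": timeout,
    }
    return urllib.request.Request(
        f"{BASE_URL}/api/crawl/html",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send a prepared request and decode the JSON response body."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the service running, a typical flow would be to call `send(build_login_request("admin", "admin123"))`, read `data.token` from the response, and pass that token to `build_html_crawl_request`. Building and sending are kept separate so the request contents can be inspected without a running server.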