Python in Practice: Extracting Tables (with Incomplete Borders) from PDFs

雷大虾 · Posted on 2024-9-7 15:10:34

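This Python tutorial column aims to give beginners a systematic, comprehensive way into Python programming. It explains the basics of the language and of programming logic step by step, with hands-on cases, so that complete newcomers can follow along.

Contents
1. Introduction
2. About camelot-py
3. Installing camelot-py
4. Using camelot-py
5. Other useful camelot-py parameters
6. Closing remarks
7. Related reading

The original article runs to 7,015 characters, roughly an 18-minute read. Corrections are welcome!

1. Introduction

Fellow social scientists have surely had to collect and tidy data at some point. Sometimes the raw data sits in a large number of PDF files, for example the various indicators published in listed companies' announcements and reports, and pulling those tables out of many PDFs quickly is a real problem. In an earlier article we showed how to read tables from PDFs with Python's pdfplumber library (see "How to read and process tables in PDFs with Python"). After using it for a long time, however, I noticed that with its default settings pdfplumber is very demanding of the table: it works reliably only when all of the table's ruling lines are present. If the table you want to read has incomplete borders, rows or columns are easily lost. I later found a library that still parses well when the borders are incomplete, and I share its usage and code here.

2. About camelot-py

camelot-py is a library (with Ghostscript among its dependencies) that extracts table data from PDF files. It offers two parsing methods: Lattice, which relies on the table's ruling lines, and Stream, which infers the table from the approximate alignment of the text. Stream is what makes it possible to parse tables with no borders or with incomplete ones. The parsed result can be converted directly to a DataFrame and then saved as an Excel file.

3. Installing camelot-py

The installation commands for camelot are:

    pip install camelot-py        # standard installation
    pip install camelot-py[cv]    # if calls fail after the standard installation, uninstall and use this instead;
                                  # it installs camelot together with its other dependencies

When calling the library I found that camelot depends on a specific version of PyPDF2. My PyPDF2 version is 2.2.2, which runs without problems.

4. Using camelot-py

I found the PDF of a listed company's annual report that contains a table with only a few ruling lines. The table is on page 91, as shown below:

[Figure: the source table. Source: 《深圳中航地产股份有限公司二○○七年年度报告》, "Other equity investments accounted for under the cost method"]

The Python code that reads this table with camelot is:

    # pandas need not be imported separately: importing camelot imports pandas automatically
    import camelot.io as camelot

    # Parse the table
    result = camelot.read_pdf(
        filepath="001914_2007-12-31_2007.pdf",
        pages='94',        # page number counted within the PDF file (from 1), given as a string
        flavor='stream',
        edge_tol=200,
    )

    # The result may contain several tables; convert the first one to a DataFrame.
    # If the result contains no table, this line raises an error.
    df = result[0].df
    df

[Figure: the parsed result]

Judging from the result, the extraction works very well. The only problem is the header, which needs a second cleaning pass, but overall the outcome is already very encouraging.

5. Other useful camelot-py parameters

Happily, camelot is not a library whose results are left to luck: it provides many parameters for intervening in or tuning the parse. The essential and most useful ones are listed in the table below.

Parameter | Values | Description
filepath | string | Path to the PDF file.
pages | string, e.g. "91", "1,2,3", "91-end", "all" | Pages are counted from 1 and the value must be a string. Several pages can be parsed at once, e.g. '1,2,3', '91-end' (page 91 through the last page), or 'all' (every page).
flavor | 'lattice' or 'stream'; default 'lattice' | Selects the parsing method for the type of table. 'lattice' (grid parsing) suits tables with complete ruling lines; 'stream' (stream parsing) is commonly used for tables with incomplete borders.
edge_tol | number; default 50 | Edge tolerance used when detecting the extent of a table (stream parsing only). If a slightly larger gap between two rows causes one table to be parsed as several, increase this value so that the table is read whole; conversely, decrease it so that several tables separated by only moderate gaps can be told apart.
split_text | True or False; default False | Whether text that spans multiple cells should be split across those cells.
strip_text | string; default '' | Characters to strip from each cell. The default '' means no cleaning. To remove several unwanted characters, combine them into a single string and pass that.

camelot has other useful parameters as well. If you are interested, have a look at the source code; the parameter documentation from the source is attached below:

    """Read PDF and return extracted tables.

    Note: kwargs annotated with ^ can only be used with flavor='stream'
    and kwargs annotated with * can only be used with flavor='lattice'.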

    Parameters
    ----------
    filepath : str
        Filepath or URL of the PDF file.
    pages : str, optional (default: '1')
        Comma-separated page numbers.
        Example: '1,3,4' or '1,4-end' or 'all'.
    password : str, optional (default: None)
        Password for decryption.
    flavor : str (default: 'lattice')
        The parsing method to use ('lattice' or 'stream').
        Lattice is used by default.
    suppress_stdout : bool, optional (default: True)
        Print all logs and warnings.
    layout_kwargs : dict, optional (default: {})
        A dict of `pdfminer.layout.LAParams` kwargs.
    table_areas : list, optional (default: None)
        List of table area strings of the form x1,y1,x2,y2
        where (x1, y1) -> left-top and (x2, y2) -> right-bottom
        in PDF coordinate space.
    columns^ : list, optional (default: None)
        List of column x-coordinates strings where the coordinates
        are comma-separated.
    split_text : bool, optional (default: False)
        Split text that spans across multiple cells.
    flag_size : bool, optional (default: False)
        Flag text based on font size. Useful to detect
        super/subscripts. Adds around flagged text.
    strip_text : str, optional (default: '')
        Characters that should be stripped from a string before
        assigning it to a cell.
    row_tol^ : int, optional (default: 2)
        Tolerance parameter used to combine text vertically,
        to generate rows.
    column_tol^ : int, optional (default: 0)
        Tolerance parameter used to combine text horizontally,
        to generate columns.
    process_background* : bool, optional (default: False)
        Process background lines.
    line_scale* : int, optional (default: 15)
        Line size scaling factor. The larger the value the smaller
        the detected lines. Making it very large will lead to text
        being detected as lines.
    copy_text* : list, optional (default: None)
        {'h', 'v'}
        Direction in which text in a spanning cell will be copied
        over.
    shift_text* : list, optional (default: ['l', 't'])
        {'l', 'r', 't', 'b'}
        Direction in which text in a spanning cell will flow.
    line_tol* : int, optional (default: 2)
        Tolerance parameter used to merge close vertical and horizontal
        lines.
    joint_tol* : int, optional (default: 2)
        Tolerance parameter used to decide whether the detected lines
        and points lie close to each other.
    threshold_blocksize* : int, optional (default: 15)
        Size of a pixel neighborhood that is used to calculate a
        threshold value for the pixel: 3, 5, 7, and so on.
        For more information, refer to OpenCV's adaptiveThreshold.
    threshold_constant* : int, optional (default: -2)
        Constant subtracted from the mean or weighted mean.
        Normally, it is positive but may be zero or negative as well.
        For more information, refer to OpenCV's adaptiveThreshold.
    iterations* : int, optional (default: 0)
        Number of times for erosion/dilation is applied.
        For more information, refer to OpenCV's dilate.
    resolution* : int, optional (default: 300)
        Resolution used for PDF to
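The walkthrough notes that the parsed header still needs a second cleaning pass, and the parameter list shows that stream-only options such as row_tol and edge_tol are passed as extra keyword arguments. The sketch below combines the two. It rests on assumptions that the article does not confirm: that the header ended up split across the first two rows of the parsed table, and that joining those rows cell by cell is the right repair. The helper names merge_header_rows and read_stream_table are hypothetical and are not part of camelot; only camelot.read_pdf, the .n attribute, and the .df attribute come from the library.

```python
def merge_header_rows(rows, n_header_rows=2):
    """Collapse the first n_header_rows rows into a single header row.

    rows is a list of equal-length lists of strings, for example
    df.values.tolist() for a DataFrame returned by camelot.
    Returns a (header, body) tuple.
    """
    if not 1 <= n_header_rows <= len(rows):
        raise ValueError("n_header_rows must be between 1 and len(rows)")
    header_rows = rows[:n_header_rows]
    # Join the fragments of each column from top to bottom.
    header = [
        "".join(str(cell).strip() for cell in column)
        for column in zip(*header_rows)
    ]
    return header, rows[n_header_rows:]


def read_stream_table(pdf_path, page, n_header_rows=2, **stream_kwargs):
    """Read the first table on one page in stream mode and repair its header.

    stream_kwargs is forwarded to camelot.read_pdf, e.g. edge_tol=200 or
    strip_text="\n". Hypothetical helper, shown only as an illustration.
    """
    # Imported lazily so merge_header_rows can be used without camelot.
    import camelot
    import pandas as pd

    tables = camelot.read_pdf(
        pdf_path, pages=str(page), flavor="stream", **stream_kwargs
    )
    if tables.n == 0:
        raise ValueError(f"no table found on page {page}")
    header, body = merge_header_rows(
        tables[0].df.values.tolist(), n_header_rows
    )
    return pd.DataFrame(body, columns=header)


# Possible usage with the file from the walkthrough (not executed here):
# df = read_stream_table("001914_2007-12-31_2007.pdf", 94, edge_tol=200)
# df.to_excel("table.xlsx", index=False)
```

Keeping the header repair in a function that works on plain lists makes it easy to check by hand before trusting it on a real report. If the header of your table occupies a different number of rows, adjust n_header_rows after inspecting the raw DataFrame.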
