Extract text from a PDF file based on a regular expression?

I have a 300-page PDF file in which each set of pages contains identifying information for one person, such as a Social Security number.

Let's say that pages 1-4 belong to Social Security number 987-65-4320 and pages 5-6 belong to 987-65-4321.

I want to extract all the information for the first employee, from the position of the first Social Security number up to the position of the second, and then save it in a new PDF file.

All the examples I have seen extract all the text from a PDF file, rather than selecting text by a specific criterion like this one:

extract text from pdf files

Please advise how to accomplish this.

This isn't an automated technique, but could you get the text out first (I might just copy and paste the PDF into a text file) and then use a regular expression to find the information you want?

In Java, some of the parsing could look like:

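import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Group 1 matches an SSN: 3 digits, a dash, 2 digits, a dash, and 4 digits.
// Group 2 captures the text that follows, stopping just before the character
// that immediately precedes the next SSN. DOTALL lets '.' match newlines too.
String text = "987-65-4320 some info 987-65-4321 other \ninfo";
Pattern p = Pattern.compile("(\\d{3}-\\d{2}-\\d{4})((?:.(?!\\d{3}-\\d{2}-\\d{4}))*)", Pattern.DOTALL);
Matcher m = p.matcher(text);
while (m.find()) {
    System.out.println(m.group(1) + ": " + m.group(2));
}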

Without seeing the information you want to save, though, I can't help further with extracting it.
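As a variation on the snippet above, here is a minimal sketch that collects each person's text into a map instead of printing it. It finds only the SSNs and slices the text between consecutive matches, which keeps the pattern simple. The class name `SsnSections` and the method `split` are made up for illustration, and the sketch assumes the PDF text has already been obtained as a single string.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SsnSections {
    // Same SSN shape as in the answer: 3 digits, dash, 2 digits, dash, 4 digits.
    private static final Pattern SSN = Pattern.compile("\\d{3}-\\d{2}-\\d{4}");

    /** Maps each SSN to the trimmed text between it and the next SSN (hypothetical helper). */
    public static Map<String, String> split(String text) {
        Map<String, String> sections = new LinkedHashMap<>(); // keeps document order
        Matcher m = SSN.matcher(text);
        String current = null; // SSN whose section is currently open
        int start = 0;         // offset just after that SSN
        while (m.find()) {
            if (current != null) {
                add(sections, current, text.substring(start, m.start()));
            }
            current = m.group();
            start = m.end();
        }
        if (current != null) {
            add(sections, current, text.substring(start)); // last section runs to the end
        }
        return sections;
    }

    // If an SSN shows up more than once, its sections are joined with a newline.
    private static void add(Map<String, String> sections, String ssn, String body) {
        sections.merge(ssn, body.trim(), (a, b) -> a + "\n" + b);
    }

    public static void main(String[] args) {
        String text = "987-65-4320 some info 987-65-4321 other \ninfo";
        split(text).forEach((ssn, body) -> System.out.println(ssn + " -> " + body));
    }
}
```

Text that appears before the first SSN is ignored in this sketch; whether that is acceptable depends on how the real document is laid out.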

If I wanted a new PDF, I would put the information into Microsoft Word or Google Docs and save it as a PDF.

Alternatively, if all you want is to "extract all the information" for a range of employees, would it work to create a copy of the original PDF with some pages removed? I've seen websites that let you do that, but the print dialog in Chrome (which can open local PDFs without a problem) also lets you specify a range of pages and save the result as a PDF.
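To know which page range to keep for each person, the pages can be grouped by the SSN they carry. The following is a minimal sketch under stated assumptions: the text of each page is already available as a list of strings (a PDF library such as Apache PDFBox could supply it, which is not shown here), and a page with no SSN belongs to the most recently seen one. The class `SsnPageRanges` and the method `pageRanges` are hypothetical names.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SsnPageRanges {
    private static final Pattern SSN = Pattern.compile("\\d{3}-\\d{2}-\\d{4}");

    /**
     * pageTexts.get(i) is the text of page i + 1.
     * Returns SSN -> {firstPage, lastPage}, 1-based and inclusive.
     */
    public static Map<String, int[]> pageRanges(List<String> pageTexts) {
        Map<String, int[]> ranges = new LinkedHashMap<>(); // keeps document order
        String current = null;
        for (int i = 0; i < pageTexts.size(); i++) {
            Matcher m = SSN.matcher(pageTexts.get(i));
            if (m.find()) {
                current = m.group(); // first SSN on the page decides the owner
            }
            if (current == null) {
                continue; // pages before the first SSN are skipped
            }
            int page = i + 1;
            int[] range = ranges.get(current);
            if (range == null) {
                ranges.put(current, new int[] {page, page});
            } else {
                range[1] = page; // extend the range to this page
            }
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Stand-in page texts matching the example in the question.
        List<String> pages = Arrays.asList(
            "SSN: 987-65-4320 first page", "continued", "continued", "continued",
            "SSN: 987-65-4321 first page", "continued");
        pageRanges(pages).forEach((ssn, r) ->
            System.out.println(ssn + ": pages " + r[0] + "-" + r[1]));
        // prints:
        // 987-65-4320: pages 1-4
        // 987-65-4321: pages 5-6
    }
}
```

Each resulting range could then be entered in the print dialog as described above, or passed to whatever tool is used to copy pages into a new PDF.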
