Extract text from a PDF file based on a regular expression?

Question

I have a 300-page PDF file in which each set of pages contains identifying information for one person, such as a Social Security number.

Say pages 1-4 are for Social Security number 987-65-4320 and pages 5-6 are for 987-65-4321.

I want to extract all the information for the first employee, from the position of the first Social Security number up to the position of the second, and then save it in a new PDF file.

All the examples I have seen are about extracting all the text from a PDF file, not extracting based on specific criteria like this one:

Extract text from PDF files

Please advise how to accomplish that.

Answer 1

This isn't an automated technique, but can you get the text out (I might just copy and paste the PDF's contents into a text file) and use a regular expression to find the information you want?

In Java, some of the parsing could look like:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Group 1 matches an SSN: 3 digits, a dash, 2 digits, a dash, and 4 digits.
// Group 2 captures the text that follows, stopping one character before the next SSN.
String text = "987-65-4320 some info 987-65-4321 other \ninfo";
Pattern p = Pattern.compile("(\\d{3}-\\d{2}-\\d{4})((?:.(?!\\d{3}-\\d{2}-\\d{4}))*)", Pattern.DOTALL);
Matcher m = p.matcher(text);
while (m.find()) {
    System.out.println(m.group(1) + ": " + m.group(2));
}

Without seeing the information you want to save, though, I can't help you with pulling it out.

If I wanted a new PDF, I would put the information into Microsoft Word or Google Docs and save it as a PDF.

Alternatively, if all you want is to "extract all the information" for a range of employees, would it work to create a copy of the original PDF with some pages removed? I've seen websites that let you do that, but Chrome's print dialog (Chrome opens local PDFs without a problem) also lets you specify a range of pages and save the result as a PDF.
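If you go the page-range route, the ranges themselves could be worked out in code rather than by eye. The following is only a rough sketch, not a tested solution for your file: it assumes you have already pulled the text of each page into a list (a PDF library can do this; Apache PDFBox's PDFTextStripper is one option), that the first SSN found on a page identifies whose page it is, and that each person's pages are contiguous. The class name SsnPageRanges and the method find are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: maps each SSN to the 1-based {firstPage, lastPage} it covers.
final class SsnPageRanges {
    private static final Pattern SSN = Pattern.compile("\\d{3}-\\d{2}-\\d{4}");

    // pageTexts.get(i) is assumed to hold the already-extracted text of page i + 1.
    static Map<String, int[]> find(List<String> pageTexts) {
        Map<String, int[]> ranges = new LinkedHashMap<>();
        String current = null;
        for (int i = 0; i < pageTexts.size(); i++) {
            Matcher m = SSN.matcher(pageTexts.get(i));
            // Assumption: the first SSN on a page names that page's owner;
            // a page with no SSN belongs to the most recently seen one.
            if (m.find()) {
                current = m.group();
            }
            if (current == null) {
                continue; // pages before the first SSN are skipped
            }
            int page = i + 1;
            int[] range = ranges.computeIfAbsent(current, k -> new int[] {page, page});
            range[1] = page; // extend the range to include this page
        }
        return ranges;
    }
}
```

Each resulting range could then be typed into the print dialog, or handed to whatever PDF library you use to copy those pages into a new file.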

Solution 1
1 Accepted 2012-07-17 17:52:28
