Extracting names from PDFs containing emails

I have a very specific question. I have a set of PDF files that contain emails (and email chains) and generally have the following format:

From: Doe, John <john.doe@mail.com>
To: Doe, Jane <john.doe@mail.com>; Doe, John
Subject: Re: Title
text ...
...
From: Doe, John <john.doe@mail.com>
To: Doe, Jane <john.doe@mail.com>; Doe, John
CC: Moe, James; Klein, John
Subject: Title
text ...

So, one PDF file generally contains several "From", "To" and "CC" blocks. Within a name, the last name and first name are always separated by a comma. Different names are separated by a semicolon. However, sometimes the full email address (which I do not need) is included between "<" and ">". I would like to extract all of the names (in the From, To, and CC parts) from these PDF files and end up with output that looks like this:

Last name    first name
Doe          John
Doe          Jane
Moe          James
Klein        John

I have managed to read in the PDF files using the pdftools package:

library(pdftools)
# List every file in the working directory whose name ends in ".pdf"
files <- list.files(pattern = "\\.pdf$")
pdfs <- lapply(files, pdf_text)
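
Note that pdf_text returns a character vector with one element per page, so each element of pdfs may hold several strings rather than one. The sketch below shows one way to join the pages of each file into a single string before extracting anything. The sample pdfs list is made-up stand-in data, not output from real files, and docs is a name introduced here for illustration.

```r
# Stand-in for the result of lapply(files, pdf_text): one character vector
# per file, one element per page (hypothetical sample data).
pdfs <- list(
  c("From: Doe, John\r\nTo: Doe, Jane\r\n", "CC: Moe, James\r\n"),
  c("From: Klein, John\r\n")
)

# Join the pages of each file into a single string, one string per file
docs <- vapply(pdfs, paste, character(1), collapse = "\n")
```

After this step docs[1] plays the role that pdfs[[1]] plays in the example string further down.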

However, I am currently stuck on finding the best way to extract all of the names and save them in a data frame. I have been looking at the str_extract function, e.g. starting with str_extract(pdfs[[1]], regex("From.*To", ignore_case = TRUE)), but haven't been able to find a working solution. Any help would be much appreciated. As an example, assume that pdfs[[1]] contains the following string:

teststring <- "From: Doe, John <john.doe@mail.com>\r\n
To: Doe, Jane <john.doe@mail.com>; Doe, John\r\n
Subject: Re: Title\r\n
text ...\r\n
...\r\n
From: Doe, John <john.doe@mail.com>\r\n
To: Doe, Jane <john.doe@mail.com>; Doe, John\r\n
CC: Moe, James; Klein, John\r\n
Subject: Title\r\n
text ...\r\n"

Try this, using teststring:

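library(stringr)
# Collect every unique "Last, First" pair found in the text
fullnames <- unique(c(str_extract_all(teststring, "[a-zA-Z]+,\\s[a-zA-Z]+", simplify = TRUE)))
# Split on the comma and the whitespace after it, so first names carry no leading space
splitnames <- unlist(strsplit(fullnames, ",\\s"))
# Odd positions are last names, even positions are first names
ans <- data.frame(Last = splitnames[c(TRUE, FALSE)], First = splitnames[c(FALSE, TRUE)])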

Output

   Last  First
1   Doe   John
2   Doe   Jane
3   Moe  James
4 Klein   John
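
One caveat: the pattern above matches any "word, word" pair, so body text such as "Hello, John" would also be picked up, and names containing hyphens or spaces would be cut short. A possible variation, sketched here in base R, is to look only at the From, To and CC lines and split them on semicolons. The function name extract_header_names and the sample string are assumptions made for illustration, not part of the original answer.

```r
# Hypothetical helper (not from the original answer): pull names only from
# From/To/CC header lines so that body text such as "Hello, John" is ignored.
extract_header_names <- function(text) {
  # Split on any line ending (\r\n or \n) and trim surrounding whitespace
  lines <- trimws(unlist(strsplit(text, "\r?\n")))
  # Keep only the header lines, matched case-insensitively
  headers <- grep("^(from|to|cc):", lines, ignore.case = TRUE, value = TRUE)
  # Drop the header label and any <email address>
  headers <- sub("^[^:]+:\\s*", "", headers)
  headers <- gsub("<[^>]*>", "", headers)
  # Separate names on semicolons, trim, and de-duplicate
  fullnames <- unique(trimws(unlist(strsplit(headers, ";"))))
  # Keep only entries that actually contain a comma
  fullnames <- fullnames[grepl(",", fullnames)]
  # Split "Last, First" at the first comma
  data.frame(
    Last  = trimws(sub(",.*$", "", fullnames)),
    First = trimws(sub("^[^,]*,", "", fullnames)),
    stringsAsFactors = FALSE
  )
}

# Made-up sample text with a body line that must not be matched
sample_text <- paste0(
  "From: Doe, John <john.doe@mail.com>\r\n",
  "To: Doe, Jane <john.doe@mail.com>; Doe, John\r\n",
  "Subject: Re: Title\r\n",
  "Hello, John\r\n",
  "CC: Moe, James; Klein, John\r\n"
)
ans2 <- extract_header_names(sample_text)
```

This assumes each header sits on a single line; a To or CC list that wraps onto a second line in the PDF text would need extra handling.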
