How to associate a date extracted from a pdf file with the data extracted from it using R?
I have two .pdf files, each containing a table with buy and sell stock information and a date in the top right corner of each page's header. See the files here . If necessary, save the two .pdf files and the script below into the same folder on your computer and run the script to reproduce the problem.
I want to extract just the table content from each file, join it, transform it into a tibble, and insert a first column containing the dates extracted from the file headers. So, if the first 5 rows in the tibble come from the first pdf file, then the first 5 rows of the first column must be filled with the date extracted from the header of the first file. If the next 2 rows come from the second file, then those 2 rows of the first column must be filled with the date extracted from the header of the second file.
I've already extracted the table from each file, joined them, and created a tibble, as you can see below. I've even written code to extract the dates. But I don't know how to associate the date extracted from each header with the table content of the corresponding file and insert it into the tibble.
Code - Extract Table Information
## EXTRACT PDF FILE INFORMATION AND GENERATE A CLEAN DATASET
# load library
library(pdftools)
library(tidyverse)
# create a character vector with all file names
file_names <- dir(pattern = 'N.*')
# extract text from each file and append it to a list
text_raw <- list()
for (i in seq_along(file_names)) {
  doc <- pdf_text(file_names[i])
  text_raw <- append(text_raw, doc)
}
# clean data
text_clean <- text_raw %>%
  str_split('\r?\n') %>%   # handle both Windows and Unix line endings
  unlist() %>%
  str_to_lower() %>%
  str_squish() %>%
  str_subset('1-bovespa') %>%
  str_replace('1-', '') %>%
  str_remove_all('#2?|on|nm|sa') %>%
  str_squish()
# convert to a tibble (tbl_df() is deprecated in favor of tibble())
df <- tibble(value = text_clean)
# split the single text column into eight columns
df <- separate(df,
               value,
               c('c1', 'c2', 'c3', 'c4', 'c5', 'c6', 'c7', 'c8'),
               sep = ' ')
print(df)
Code - Extract Dates
# keep only the lines ending in a date, then strip everything but the date
dates <- text_raw %>%
  str_split('\r?\n') %>%   # handle both Windows and Unix line endings
  unlist() %>%
  str_squish() %>%
  str_subset('\\d{4}\\s\\d{1}\\s\\d{2}\\/\\d{2}\\/\\d{4}$') %>%
  str_remove_all('(\\d+\\s\\d{1}\\s)')
print(dates)
Current output of print(df):
c1 c2 c3 c4 c5 c6 c7 c8
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 bovespa c vista cielo 800 10,79 8.632,00 d
2 bovespa c vista cielo 200 10,79 2.158,00 d
3 bovespa c vista brf 400 23,81 9.524,00 d
4 bovespa c vista brf 100 23,81 2.381,00 d
Desired output, with the date from each file's header as the first column:
c1 c2 c3 c4 c5 c6 c7 c8 c9
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 10/01/2019 bovespa c vista cielo 800 10,79 8.632,00 d
2 10/01/2019 bovespa c vista cielo 200 10,79 2.158,00 d
3 18/01/2019 bovespa c vista brf 400 23,81 9.524,00 d
4 18/01/2019 bovespa c vista brf 100 23,81 2.381,00 d
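The mapping I'm after could be illustrated on toy data: if I knew how many table rows came from each file, repeating each file's date by its row count would give the new first column. (The values below are stand-ins just to show the desired result; finding the per-file row counts after joining is exactly what I don't know how to do.)

```r
library(tibble)

# toy stand-ins for the per-file dates and table row counts
dates <- c('10/01/2019', '18/01/2019')
rows_per_file <- c(2, 2)

# repeat each file's date once per table row extracted from that file
date_col <- rep(dates, times = rows_per_file)
print(date_col)
# "10/01/2019" "10/01/2019" "18/01/2019" "18/01/2019"
```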
Any help?
I thought the effort to extract dates was unnecessarily complex, not to mention that it appeared to work for some of us but failed when I ran the code. Instead, I constructed a date pattern and extracted it with stringi::stri_extract :
stringi::stri_extract(regex = "[0-3][0-9]/[01][0-9]/20[0-1][0-9]", text_clean)
[1] "18/01/2019"  # this pattern is designed for this-century dates in DD/MM/YYYY format
dates <- stringi::stri_extract(regex = "[0-3][0-9]/[01][0-9]/20[0-1][0-9]", text_clean)
df$C9 <- dates
Furthermore, since there were multiple matches for the date pattern in each pdf, it would be safer to do the extraction before appending the text together, and then use only the first value from each file.
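That per-file approach could be sketched as follows. This is untested here, since it needs the original PDFs; the function name read_note() is purely illustrative, and the cleaning regexes and date pattern are copied from the question and the answer above.

```r
library(pdftools)
library(stringr)
library(dplyr)
library(tidyr)
library(purrr)

# Build one tibble per file, carrying that file's date, then bind the rows.
read_note <- function(file) {
  lines <- pdf_text(file) %>%
    str_split('\r?\n') %>%
    unlist() %>%
    str_to_lower() %>%
    str_squish()
  # first DD/MM/YYYY date found in this file (expected in the page header)
  file_date <- str_extract(lines, '[0-3][0-9]/[01][0-9]/20[0-1][0-9]') %>%
    na.omit() %>%
    first()
  # keep only the table rows, using the question's cleaning steps
  rows <- lines %>%
    str_subset('1-bovespa') %>%
    str_replace('1-', '') %>%
    str_remove_all('#2?|on|nm|sa') %>%
    str_squish()
  # the date column is filled per file, so the association is automatic
  tibble(date = file_date, value = rows) %>%
    separate(value, paste0('c', 1:8), sep = ' ')
}

file_names <- dir(pattern = 'N.*')
df <- map_dfr(file_names, read_note)
print(df)
```

Because each file is reduced to its own tibble before binding, every row inherits the date of the file it came from, which is exactly the association the question asks for.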