How to associate dates extracted from PDF files with the content extracted using R?

Question

What I Have

I have two .pdf files, each containing a table of stock buy and sell information and a date in the top-right corner of each page header. See the files here. If necessary, save the two .pdf files and the script below into the same folder on your computer and run the script to reproduce the problem.

What I Need

I want to extract just the table content from each file, combine it into a single tibble, and insert a first column containing the date extracted from each file's header.

So, if the first 5 rows of the tibble come from the first pdf file, then the first 5 rows of the first column must be filled with the date extracted from the header of the first file. If the next 2 rows come from the second file, then those 2 rows of the first column must be filled with the date extracted from the header of the second file.
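Put differently, the requirement amounts to repeating each file's date once per table row that came from that file. A minimal sketch in base R, assuming the per-file dates and row counts are already known (the names `dates_per_file` and `rows_per_file` are hypothetical, and the values are taken from the 5-row and 2-row example above):

```r
# Hypothetical inputs: one date per file and the number of table rows per file.
dates_per_file <- c("10/01/2019", "18/01/2019")
rows_per_file  <- c(5, 2)

# Repeat each date once for every row that came from its file.
date_column <- rep(dates_per_file, times = rows_per_file)
print(date_column)
```

The resulting vector has one entry per table row, so it can be attached to the combined table as a column.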

What I've Already Tried

I've already extracted the table from each file, joined them, and created a tibble, as you can see below. I even wrote code to extract the dates. But I really don't know how to associate the date extracted from each header with the table content of its file and insert it into the tibble.

Code - Extract Table Information

## EXTRACT PDF FILE INFORMATION AND GENERATE A CLEAN DATASET

# load library
library(pdftools)
library(tidyverse)


# create a list with all file names
file_names <- dir(pattern = 'N.*')


# extract text from each file and append into a character vector (one element per page)
text_raw <- character()
for (i in seq_along(file_names)) {
        doc <- pdf_text(file_names[i])
        text_raw <- c(text_raw, doc)
}


# clean data
text_clean <- text_raw %>% 
        str_split('\\r?\\n') %>%   # accept both Windows (\r\n) and Unix (\n) line endings
        unlist() %>% 
        str_to_lower() %>% 
        str_squish() %>% 
        str_subset('1-bovespa') %>% 
        str_replace('1-', '') %>% 
        str_remove_all('#2?|on|nm|sa') %>% 
        str_squish()


# convert to a tibble with a single column named 'value' (tbl_df() is deprecated)
df <- tibble(value = text_clean)

# split column
df <- separate(df, 
                value, 
                c('c1', 'c2', 'c3', 'c4', 'c5', 'c6', 'c7', 'c8'),
                sep = ' ')
print(df)

Code - Extract Dates

# filter dates
dates <- text_raw %>% 
        str_split('\\r?\\n') %>%   # accept both Windows (\r\n) and Unix (\n) line endings
        unlist() %>% 
        str_squish() %>% 
        str_subset('\\d{4}\\s\\d{1}\\s\\d{2}\\/\\d{2}\\/\\d{4}$') %>% 
        str_remove_all('(\\d+\\s\\d{1}\\s)')

print(dates)

Actual Output

   c1       c2    c3    c4    c5    c6    c7        c8   
  <chr>    <chr> <chr> <chr> <chr> <chr> <chr>     <chr>
1 bovespa  c     vista cielo 800   10,79 8.632,00  d    
2 bovespa  c     vista cielo 200   10,79 2.158,00  d    
3 bovespa  c     vista brf   400   23,81 9.524,00  d    
4 bovespa  c     vista brf   100   23,81 2.381,00  d

Expected Output

   c1           c2       c3    c4    c5    c6    c7     c8        c9
  <chr>        <chr>    <chr> <chr> <chr> <chr> <chr>  <chr>     <chr>
1 10/01/2019   bovespa  c     vista cielo 800   10,79  8.632,00  d    
2 10/01/2019   bovespa  c     vista cielo 200   10,79  2.158,00  d    
3 18/01/2019   bovespa  c     vista brf   400   23,81  9.524,00  d    
4 18/01/2019   bovespa  c     vista brf   100   23,81  2.381,00  d

Any help?

Answer 1

I thought the effort to extract dates was unnecessarily complex, not to mention that it appears to have worked for some of us but failed when I ran the code. Instead, I constructed a date pattern and extracted with stringi::stri_extract:

 stringi::stri_extract( regex="[0-3][0-9]/[01][0-9]/20[0-1][0-9]", text_clean)
[1] "18/01/2019"  # this pattern matches DD/MM/YYYY dates from 2000 to 2019

 dates <- stringi::stri_extract( regex="[0-3][0-9]/[01][0-9]/20[0-1][0-9]", text_clean)

 df$C9 <- dates

Furthermore, since there were multiple matches for the date pattern in each pdf, it would be safer to do the extraction before appending the texts together; then you could use only the first match from each file.
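One way to read that suggestion is to handle each file separately: take the first date found in the file, pick out its table rows, pair the two, and only then bind the files together. The following is a minimal sketch in base R under stated assumptions, not the answerer's actual code. The helper `extract_one` and the sample strings in `pages` are made up for illustration; with the real files, `pages` would presumably come from something like `lapply(file_names, pdftools::pdf_text)`.

```r
# Date pattern from the answer above: DD/MM/YYYY, years 2000 to 2019.
date_pattern <- "[0-3][0-9]/[01][0-9]/20[0-1][0-9]"

# Hypothetical helper: txt holds the page texts of ONE file.
extract_one <- function(txt) {
  # Split the file's text into lines, accepting \r\n or \n line endings.
  lines <- unlist(strsplit(paste(txt, collapse = "\n"), "\r?\n"))
  # Keep only the first date that appears in this file.
  hits <- regmatches(lines, regexpr(date_pattern, lines))
  file_date <- hits[1]
  # Table rows are the lines containing "1-bovespa" (as in the question).
  rows <- grep("1-bovespa", tolower(lines), fixed = TRUE, value = TRUE)
  # One output row per table row, each tagged with the file's date.
  data.frame(date = rep(file_date, length(rows)),
             row = rows,
             stringsAsFactors = FALSE)
}

# Made-up stand-ins for the text of two files (assumed layout).
pages <- list(
  "Header 1234 1 10/01/2019\n1-BOVESPA C VISTA CIELO 800 10,79 8.632,00 D\n1-BOVESPA C VISTA CIELO 200 10,79 2.158,00 D\nFooter 14/01/2019",
  "Header 5678 1 18/01/2019\r\n1-BOVESPA C VISTA BRF 400 23,81 9.524,00 D\r\n1-BOVESPA C VISTA BRF 100 23,81 2.381,00 D"
)

# Process each file on its own, then stack the results.
combined <- do.call(rbind, lapply(pages, extract_one))
print(combined)
```

Because the date is attached before the files are stacked, the number of rows per file never has to be tracked by hand. The `row` column can then be cleaned and split with the same steps used in the question.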

Answer 2

df$c0=dates
print(df)

Hi, I am Chinese.

You should just rename the columns:

colnames(df)=c("c2","c3","c4","c5","c6","c7","c8","c9")
df$c1=dates
print(df)
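Note that assigning a new column with `$` appends it at the end, while the expected output has the date first. A minimal sketch of one way to put it first, using base R and a made-up two-row stand-in for `df` and `dates` (both miniatures are assumptions for illustration):

```r
# Hypothetical miniature of the question's df and dates vector.
df <- data.frame(c1 = c("bovespa", "bovespa"),
                 c2 = c("c", "c"),
                 stringsAsFactors = FALSE)
dates <- c("10/01/2019", "18/01/2019")

# The vector needs exactly one date per row for the columns to line up.
stopifnot(length(dates) == nrow(df))

# Build a new data frame with the date first, then the existing columns.
df_dated <- data.frame(date = dates, df, stringsAsFactors = FALSE)
print(names(df_dated))
```

With the tidyverse loaded, `tibble::add_column(df, date = dates, .before = 1)` should give the same column order on a tibble.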

2 answers

Answer 1
0 2019-05-19 19:05:01

Answer 2
-1 2019-05-19 18:39:33
