
How to associate a date extracted from a pdf file with the data extracted from it using R?

What I Have

I have two .pdf files, each containing a table of stock buy and sell information and a date in the top-right corner of each page header. See the files here. If necessary, save the two .pdf files and the script below into the same folder on your computer and run the script to reproduce the problem.

What I Need

I want to extract just the table content from each file, combine it into a single tibble, and insert a first column containing the date extracted from each file's header.

So, if the first 5 rows in the tibble come from the first pdf file, then the first 5 rows of the first column must be filled with the date extracted from the header of the first file. If the next 2 rows come from the second file, then those 2 rows of the first column must be filled with the date extracted from the header of the second file.

What I've Already Tried

I've already extracted the table from each file, joined the results, and created a tibble, as you can see below. I've even written code to extract the dates. But I don't know how to associate the date extracted from the header with the table content of each file and insert it into the tibble.

Code - Extract Table Information

## EXTRACT PDF FILE INFORMATION AND GENERATE A CLEAN DATASET

# load libraries
library(pdftools)
library(tidyverse)


# create a character vector with all file names
file_names <- dir(pattern = 'N.*')


# extract text from each file and append into a list
text_raw <- list()
for (i in seq_along(file_names)) {
        doc <- pdf_text(file_names[i])
        text_raw <- append(text_raw, doc)
}


# clean data
text_clean <- text_raw %>% 
        str_split('\r?\n') %>%   # accept both Windows and Unix line endings
        unlist() %>% 
        as.vector() %>% 
        str_to_lower() %>% 
        str_squish() %>% 
        str_subset('1-bovespa') %>% 
        str_replace('1-', '') %>% 
        str_remove_all('#2?|on|nm|sa') %>% 
        str_squish()


# convert to a tibble (tbl_df() is deprecated)
df <- tibble(value = text_clean)

# split column
df <- separate(df, 
                value, 
                c('c1', 'c2', 'c3', 'c4', 'c5', 'c6', 'c7', 'c8'),
                sep = ' ')
print(df)

Code - Extract Dates

# filter dates
dates <- text_raw %>% 
        str_split('\r?\n') %>%   # accept both Windows and Unix line endings
        unlist() %>% 
        as.vector() %>% 
        str_squish() %>% 
        str_subset('\\d{4}\\s\\d{1}\\s\\d{2}\\/\\d{2}\\/\\d{4}$') %>% 
        str_remove_all('(\\d+\\s\\d{1}\\s)')

print(dates)

Actual Output

   c1       c2    c3    c4    c5    c6    c7        c8   
  <chr>    <chr> <chr> <chr> <chr> <chr> <chr>     <chr>
1 bovespa  c     vista cielo 800   10,79 8.632,00  d    
2 bovespa  c     vista cielo 200   10,79 2.158,00  d    
3 bovespa  c     vista brf   400   23,81 9.524,00  d    
4 bovespa  c     vista brf   100   23,81 2.381,00  d   

Expected Output

   c1           c2       c3    c4    c5    c6    c7     c8        c9
  <chr>        <chr>    <chr> <chr> <chr> <chr> <chr>  <chr>     <chr>
1 10/01/2019   bovespa  c     vista cielo 800   10,79  8.632,00  d    
2 10/01/2019   bovespa  c     vista cielo 200   10,79  2.158,00  d    
3 18/01/2019   bovespa  c     vista brf   400   23,81  9.524,00  d    
4 18/01/2019   bovespa  c     vista brf   100   23,81  2.381,00  d   

Any help?

I thought the effort to extract dates was unnecessarily complex, not to mention that it appears to have worked for some of us but failed when I ran the code. Instead I constructed a date pattern and extracted with stringi::stri_extract:

 stringi::stri_extract( regex="[0-3][0-9]/[01][0-9]/20[0-1][0-9]", text_clean)
[1] "18/01/2019"  # this pattern designed for this century dates in the DD/MM/YYYY format

 dates <- stringi::stri_extract( regex="[0-3][0-9]/[01][0-9]/20[0-1][0-9]", text_clean)

 df$C9 <- dates

Furthermore, since there are multiple matches for the date pattern in each pdf, it would be safer to do the extraction before appending the text together; then you could use only the first match from each file.
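One way to follow that suggestion is to process each file on its own, take the first date found in it, and repeat that date for every table row from the same file. The sketch below is only an illustration under assumptions: `rows_with_date` is a hypothetical helper name, it uses base R string functions rather than stringr so it runs without extra packages, and the two demo strings are invented to mimic the layout implied by the question's regular expressions (they are not the real file contents).

```r
# Hypothetical helper: takes the text of ONE file (a character vector with one
# element per page, as pdftools::pdf_text() returns) and gives back that file's
# table rows, each tagged with the first DD/MM/YYYY date found in the file.
rows_with_date <- function(pages) {
  lines <- unlist(strsplit(pages, "\r?\n"))
  lines <- trimws(gsub("\\s+", " ", lines))          # squish whitespace

  # first date in the file is assumed to be the header date
  hits <- regmatches(lines, regexpr("[0-9]{2}/[0-9]{2}/[0-9]{4}", lines))
  file_date <- if (length(hits) > 0) hits[1] else NA_character_

  # same cleaning steps as the question's pipeline
  rows <- grep("1-bovespa", tolower(lines), value = TRUE, fixed = TRUE)
  rows <- sub("1-", "", rows, fixed = TRUE)
  rows <- gsub("#2?|on|nm|sa", "", rows)
  rows <- trimws(gsub("\\s+", " ", rows))

  data.frame(date = rep(file_date, length(rows)),
             value = rows,
             stringsAsFactors = FALSE)
}

# Invented stand-ins for the text of two files (assumption, not real data)
file1 <- paste("1234 1 10/01/2019",
               "1-BOVESPA C VISTA CIELO ON NM 800 10,79 8.632,00 D",
               "1-BOVESPA C VISTA CIELO ON NM 200 10,79 2.158,00 D",
               sep = "\r\n")
file2 <- paste("5678 1 18/01/2019",
               "1-BOVESPA C VISTA BRF SA ON NM 400 23,81 9.524,00 D",
               "1-BOVESPA C VISTA BRF SA ON NM 100 23,81 2.381,00 D",
               sep = "\r\n")

demo <- do.call(rbind, lapply(list(file1, file2), rows_with_date))
print(demo)

# With the real files (assumes pdftools is installed and the PDFs are in the
# working directory), the same idea would be:
# file_names <- dir(pattern = "N.*")
# df <- do.call(rbind, lapply(file_names,
#                             function(f) rows_with_date(pdftools::pdf_text(f))))
```

The `value` column can then be split into c1 to c8 with `separate()` exactly as in the question, which leaves the date as the first column.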

df$c0=dates
print(df)

Hi, I am Chinese.


You should just rename the columns:

colnames(df)=c("c2","c3","c4","c5","c6","c7","c8","c9")
df$c1=dates
print(df)
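Note that assigning `df$c1` after the rename appends the date as the last column, whereas the expected output has it first. A minimal sketch of the reordering step follows, using a toy data frame with placeholder values in place of the real table; the dates here are made up for the demo, and the approach assumes `dates` already holds exactly one date per row.

```r
# Toy stand-in for the 8-column table from the question (placeholder values)
df <- data.frame(matrix("x", nrow = 2, ncol = 8), stringsAsFactors = FALSE)
colnames(df) <- paste0("c", 2:9)

# One date per row, made up for this demo
dates <- c("10/01/2019", "18/01/2019")
stopifnot(length(dates) == nrow(df))

df$c1 <- dates                                  # appended as the last column
df <- df[, c("c1", setdiff(names(df), "c1"))]   # move c1 to the front
print(names(df))
```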
