How to associate a date extracted from a pdf file with the data extracted from it using R?

What I Have

I have two .pdf files, each containing a table of stock buy and sell information and a date in the top-right corner of the header on every page. See the files here. If necessary, save the two .pdf files and the script below into the same folder on your computer and run the script to reproduce the problem.

What I Need

I want to extract just the table content from each file, combine it into a single tibble, and insert a first column containing the date extracted from each file's header.

So, if the first 5 rows of the tibble come from the first pdf file, then the first 5 rows of the first column must be filled with the date extracted from the header of the first file. If the next 2 rows come from the second file, then those 2 rows of the first column must be filled with the date extracted from the header of the second file.

What I've Already Tried

I've already extracted the table from each file, joined them, and created a tibble, as you can see below. I've even written code to extract the dates. But I don't know how to associate the date extracted from each header with the table content of that file and insert it into the tibble.

Code - Extract Table Information

## EXTRACT PDF FILE INFORMATION AND GENERATE A CLEAN DATASET

# load library
library(pdftools)
library(tidyverse)


# create a list with all file names
file_names <- dir(pattern = 'N.*')


# extract text from each file and append into a list
text_raw <- list()
for (i in seq_along(file_names)) {
        doc <- pdf_text(file_names[i])
        text_raw <- append(text_raw, doc)
}


# clean data
text_clean <- text_raw %>% 
        str_split('\r?\n') %>%
        unlist() %>% 
        as.vector() %>% 
        str_to_lower() %>% 
        str_squish() %>% 
        str_subset('1-bovespa') %>% 
        str_replace('1-', '') %>% 
        str_remove_all('#2?|on|nm|sa') %>% 
        str_squish()


# convert to a tibble (tbl_df() is deprecated)
df <- tibble(value = text_clean)

# split column
df <- separate(df, 
                value, 
                c('c1', 'c2', 'c3', 'c4', 'c5', 'c6', 'c7', 'c8'),
                sep = ' ')
print(df)

Code - Extract Dates

# filter dates
dates <- text_raw %>% 
        str_split('\r?\n') %>% 
        unlist() %>% 
        as.vector() %>% 
        str_squish() %>% 
        str_subset('\\d{4}\\s\\d{1}\\s\\d{2}\\/\\d{2}\\/\\d{4}$') %>% 
        str_remove_all('(\\d+\\s\\d{1}\\s)')

print(dates)

Actual Output

   c1       c2    c3    c4    c5    c6    c7        c8   
  <chr>    <chr> <chr> <chr> <chr> <chr> <chr>     <chr>
1 bovespa  c     vista cielo 800   10,79 8.632,00  d    
2 bovespa  c     vista cielo 200   10,79 2.158,00  d    
3 bovespa  c     vista brf   400   23,81 9.524,00  d    
4 bovespa  c     vista brf   100   23,81 2.381,00  d   

Expected Output

   c1         c2      c3    c4    c5    c6    c7    c8       c9
  <chr>      <chr>   <chr> <chr> <chr> <chr> <chr> <chr>    <chr>
1 10/01/2019 bovespa c     vista cielo 800   10,79 8.632,00 d
2 10/01/2019 bovespa c     vista cielo 200   10,79 2.158,00 d
3 18/01/2019 bovespa c     vista brf   400   23,81 9.524,00 d
4 18/01/2019 bovespa c     vista brf   100   23,81 2.381,00 d

Any help?

I thought the approach to extracting dates was unnecessarily complex, and although it appears to have worked for some of us, it failed when I ran the code. Instead, I constructed a date pattern and extracted it with stringi::stri_extract:

 stringi::stri_extract( regex="[0-3][0-9]/[01][0-9]/20[0-1][0-9]", text_clean)
[1] "18/01/2019"  # this pattern designed for this century dates in the DD/MM/YYYY format

 dates <- stringi::stri_extract( regex="[0-3][0-9]/[01][0-9]/20[0-1][0-9]", text_clean)

 df$C9 <- dates

Furthermore, since the date pattern matches several times in each pdf, it would be safer to do the extraction before appending the text of the files together, and then keep only the first match from each file.
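A minimal sketch of that per-file idea, in case it helps. The helper name parse_note is hypothetical, the two sample strings are invented stand-ins for what pdf_text() might return (the real page layout may differ), and it assumes the first DD/MM/YYYY match in a file is the header date:

```r
library(tidyverse)

# Hypothetical helper: `pages` is what pdftools::pdf_text() returns for ONE file
parse_note <- function(pages) {
  lines <- pages %>%
    str_split("\r?\n") %>%
    unlist() %>%
    str_squish()

  # Assumption: the first DD/MM/YYYY match in the file is the header date
  found <- str_extract(lines, "\\d{2}/\\d{2}/\\d{4}")
  file_date <- found[!is.na(found)][1]

  # Same cleaning steps as in the question
  rows <- lines %>%
    str_to_lower() %>%
    str_subset("1-bovespa") %>%
    str_replace("1-", "") %>%
    str_remove_all("#2?|on|nm|sa") %>%
    str_squish()

  # Repeat the file's date once per table row taken from that file
  tibble(c1 = rep(file_date, length(rows)), value = rows)
}

# Invented page text standing in for the two PDFs (not the real file contents)
file_a <- "1234 1 10/01/2019\n1-BOVESPA C VISTA CIELO ON NM 800 10,79 8.632,00 D\n1-BOVESPA C VISTA CIELO ON NM 200 10,79 2.158,00 D"
file_b <- "1235 1 18/01/2019\n1-BOVESPA C VISTA BRF SA ON NM 400 23,81 9.524,00 D\n1-BOVESPA C VISTA BRF SA ON NM 100 23,81 2.381,00 D"

# Parse each file separately, then stack the results and split the columns
df <- list(file_a, file_b) %>%
  lapply(parse_note) %>%
  bind_rows() %>%
  separate(value, paste0("c", 2:9), sep = " ")
print(df)
```

With the real files, something like bind_rows(lapply(file_names, function(f) parse_note(pdf_text(f)))) should take the place of the two sample strings. Because the date is attached while each file is still separate, the number of rows per file never has to be tracked by hand.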

df$c0=dates
print(df)

Hi, I am Chinese.

You should just rename the columns:

colnames(df)=c("c2","c3","c4","c5","c6","c7","c8","c9")
df$c1=dates
print(df)
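
If the date column should actually appear first rather than last, one option (a sketch, not necessarily what this answer intended) is tibble::add_column() with .before = 1. The small df and dates below are made-up stand-ins for the objects in the question:

```r
library(tibble)

# Made-up stand-ins for the question's `df` and `dates`
df <- tibble(c2 = c("bovespa", "bovespa"), c3 = c("c", "c"))
dates <- c("10/01/2019", "18/01/2019")

# Insert the dates as the first column instead of appending them at the end
df <- add_column(df, c1 = dates, .before = 1)
print(df)
```

Note that dates must have one value per row of df (or a single value) for this to work.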
