I am trying to extract a few lines from a list of 300 lines that is built from a set of PDF files in a directory.
The text of all the PDF files is in that list of 300 lines. Now I want to extract the lines that contain a matching word.
library(stringr)
library(pdftools)
library(tm)
library(tidyverse)
library(rex)
#Directory with multiple pdf files
files<- list.files(pattern='pdf$')
#Extract all files content into a list
lapply(files, function(x) strsplit(pdf_text(x), "\n")[[1]]) -> result
#change the type for ease of processing
mylist <- unlist(result) %>% str_split("\n")
#Trim each line and collapse repeated whitespace into single spaces (assign the result, otherwise mylist is left unchanged)
mylist <- str_squish(mylist)
#Find the indices of the lines that match each pattern (e.g. "Table" in t)
t <- grep("Table", mylist)
t1 <- grep("T[0-9]", mylist)
f <- grep("Figure", mylist)
f1 <- grep("F[0-9]", mylist)
l <- grep("Listing",mylist[1:300])
l1 <- grep("L[0-9]", mylist)
s <- grep("Source", mylist)
# Output of t with indices where there is a match for string "Table"
> t
[1] 46 71 95 124 153 250 278
#Now, how do I put the lines at these indices into a new list? Or do I go back to mylist, pass it the index numbers, and extract the lines from there? What is the best way to do it?
----------------------------
When I run these lines of code (t, t1, f, f1, l, l1, s), I get the indices of the lines that contain the matching string.
Below is the image with the output showing the lines that have a match.
Now I just need to put those lines into another list. How do I do that? Please advise.
Without test data it's difficult to say; the code below is untested.
Put the patterns in a list and lapply/grep over it with value = TRUE. This returns a list in which each member is a vector of the matching strings.
search_list <- list("Table", "T[0-9]", "Figure", "F[0-9]", "Listing", "L[0-9]", "Source")
matches_list <- lapply(search_list, grep, x = mylist, value = TRUE)
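To make the two options from the question concrete, here is a minimal sketch on made-up data. The sample lines and the names idx, table_lines, table_lines2 and patterns are assumptions for illustration only, not taken from the original files. It shows that indexing mylist with the positions returned by grep gives the same lines as calling grep with value = TRUE, and that naming the patterns makes each result easy to look up.

```r
# Toy stand-in for the asker's `mylist`; these lines are invented for illustration.
mylist <- c(
  "Table 1 Summary of demographics",
  "Some body text",
  "Figure 2 Mean change over time",
  "Listing 3 Adverse events",
  "Source: made-up data"
)

# Option 1: keep the indices from grep() and use them to subset the vector.
idx <- grep("Table", mylist)
table_lines <- mylist[idx]

# Option 2: let grep() return the matching lines directly.
table_lines2 <- grep("Table", mylist, value = TRUE)

# Several patterns at once; the names carry over to the result list.
patterns <- c(table = "Table", figure = "Figure",
              listing = "Listing", source = "Source")
matches_list <- lapply(patterns, grep, x = mylist, value = TRUE)

# Look up the matches for one pattern by name.
print(matches_list$figure)
```

If the indices are not needed for anything else, value = TRUE saves a step; keeping the indices is useful when the neighbouring lines (for example mylist[idx + 1]) are also of interest.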