
Subset a list of strings by the first word

Is it possible to subset a list of strings (e.g. List[1:3]) using grepl? I want the index to start at the element whose first word matches one string and to end at the element whose first word matches another.

The reason I don't want to use a numeric index is that I plan to subset multiple financial statement PDFs, and the contents of the list may differ from one statement to the next.

Here is the data I have:

lst1 <- list(c("CASH $99,999,999.00 $99,999,999.00 0.00"),
             c("CASH SLIPS 1,000,000.00 1,000,000.00 0.00"),
             c("BONDS 500,000.00 (500,000.00)"),
             c("ACCOUNTS RECEIVABLE 1,000,000.00 2,000,000.00 (1,000,000.00)"))

How would I subset starting at CASH (the exact match, not CASH SLIPS) and ending at BONDS?

Desired output:

list(c("CASH $99,999,999.00 $99,999,999.00 0.00"), 
        c("CASH SLIPS 1,000,000.00 1,000,000.00 0.00"), 
        c("BONDS 500,000.00 (500,000.00)"))

We can use Filter from base R to keep the elements that start with CASH or BONDS:

Filter(function(x) grepl("^(CASH|BONDS)", x), lst1)
#[[1]]
#[1] "CASH $99,999,999.00 $99,999,999.00 0.00"

#[[2]]
#[1] "CASH SLIPS 1,000,000.00 1,000,000.00 0.00"

#[[3]]
#[1] "BONDS 500,000.00 (500,000.00)"
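
One thing to keep in mind with the pattern filter: it tests each element on its own, so a line that sits between CASH and BONDS but starts with a different word would be dropped. A minimal sketch of the difference, where the "STOCKS" line and the name lst2 are made up purely for illustration:

```r
# Hypothetical list: the "STOCKS" line is invented to sit between CASH and BONDS
lst2 <- list("CASH $99,999,999.00 $99,999,999.00 0.00",
             "STOCKS 250,000.00 250,000.00 0.00",
             "BONDS 500,000.00 (500,000.00)",
             "ACCOUNTS RECEIVABLE 1,000,000.00 2,000,000.00 (1,000,000.00)")

# Pattern filter: keeps only the elements that start with CASH or BONDS
filtered <- Filter(function(x) grepl("^(CASH|BONDS)", x), lst2)
length(filtered)  # 2, the STOCKS line is not kept

# Positional range: keeps everything from the first match to the last match
hits <- grep("^(CASH|BONDS)", unlist(lst2))
ranged <- lst2[seq(min(hits), max(hits))]
length(ranged)    # 3, the STOCKS line is kept
```

If the statements can contain such in-between lines, a range-based subset is probably the safer choice.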

Another option, if we want to subset by position, from the index of 'CASH' to the index of 'BONDS':

i1 <- sub("\\s+[^A-Z]+", "", unlist(lst1)) %in% c("CASH", "BONDS")
lst1[Reduce(`:`, as.list(range(which(i1))))]
#[[1]]
#[1] "CASH $99,999,999.00 $99,999,999.00 0.00"

#[[2]]
#[1] "CASH SLIPS 1,000,000.00 1,000,000.00 0.00"

#[[3]]
#[1] "BONDS 500,000.00 (500,000.00)"

Or using grep

lst1[Reduce(`:`, as.list(range(grep("^(CASH|BONDS)\\s+([^A-Z])", unlist(lst1)))))]
#[[1]]
#[1] "CASH $99,999,999.00 $99,999,999.00 0.00"

#[[2]]
#[1] "CASH SLIPS 1,000,000.00 1,000,000.00 0.00"

#[[3]]
#[1] "BONDS 500,000.00 (500,000.00)"
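
To see why this pattern picks out CASH but not CASH SLIPS, it can help to look at the match for each element. A small sketch, with the list rebuilt as lst1 so the snippet runs on its own (is_boundary and rng are illustrative names):

```r
lst1 <- list("CASH $99,999,999.00 $99,999,999.00 0.00",
             "CASH SLIPS 1,000,000.00 1,000,000.00 0.00",
             "BONDS 500,000.00 (500,000.00)",
             "ACCOUNTS RECEIVABLE 1,000,000.00 2,000,000.00 (1,000,000.00)")

# The label must be followed by whitespace and then a character that is not
# an uppercase letter, so "CASH SLIPS" (CASH followed by "S") does not match
is_boundary <- grepl("^(CASH|BONDS)\\s+[^A-Z]", unlist(lst1))
is_boundary
# [1]  TRUE FALSE  TRUE FALSE

# The first and last matching positions give the range to keep
rng <- range(which(is_boundary))
lst1[rng[1]:rng[2]]
```

Note that this relies on the labels being written in uppercase and followed directly by an amount, as they are in the sample data.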

Or using keep from purrr

library(stringr)
library(purrr)
keep(lst1, ~ str_detect(.x, '^(CASH|BONDS)'))

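Or with sapply and word (word comes from stringr and returns the first word of each string, so "CASH SLIPS" also counts as "CASH" here)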

lst1[sapply(lst1, word, 1) %in% c("CASH", "BONDS")]

You can extract the words up to the first number or "$" sign in each list element.

first_word <- sapply(data, function(x) sub('(.*?)\\s(\\d+|\\$).*', '\\1', x))
first_word
#[1] "CASH"    "CASH SLIPS"     "BONDS"    "ACCOUNTS RECEIVABLE"

Then use first_word to select the elements of the list from the one labelled "CASH" through the one labelled "BONDS".

data[which(first_word == 'CASH'):which(first_word == 'BONDS')]

#[[1]]
#[1] "CASH $99,999,999.00 $99,999,999.00 0.00"

#[[2]]
#[1] "CASH SLIPS 1,000,000.00 1,000,000.00 0.00"

#[[3]]
#[1] "BONDS 500,000.00 (500,000.00)"
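
Since the goal is to apply this to several statements whose contents may differ, the two steps could be wrapped in a small helper that also checks the labels were actually found. This is only a sketch: subset_between is a hypothetical name, and stopping with an error when a label is missing or the end label comes before the start label is an assumption about the desired behaviour.

```r
# Hypothetical helper: keep the elements from the line labelled `start`
# through the line labelled `end` (labels must match exactly)
subset_between <- function(x, start, end) {
  # Label = text before the first whitespace that is followed by a digit or "$"
  labels <- sub("(.*?)\\s(\\d+|\\$).*", "\\1", unlist(x), perl = TRUE)
  from <- which(labels == start)[1]  # first exact match of the start label
  to   <- which(labels == end)[1]    # first exact match of the end label
  if (is.na(from) || is.na(to)) stop("start or end label not found")
  if (to < from) stop("end label appears before start label")
  x[from:to]
}

data <- list("CASH $99,999,999.00 $99,999,999.00 0.00",
             "CASH SLIPS 1,000,000.00 1,000,000.00 0.00",
             "BONDS 500,000.00 (500,000.00)",
             "ACCOUNTS RECEIVABLE 1,000,000.00 2,000,000.00 (1,000,000.00)")

res <- subset_between(data, "CASH", "BONDS")
length(res)  # 3
```

Taking only the first match of each label avoids the warning that `:` gives when which() returns more than one index.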
