
Subset R list by partial string match of sublist element against character vector, using base R

My actual case is a list of combined header strings and corresponding data as sub-lists; I wish to subset the list to return a list of sub-lists, i.e. the same structure, that only contains the sub-lists whose header strings contain strings that match the strings in a character vector.

Test Data:

lets <- letters
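x <- c(1,4,8,11,13,14,18,22,24)
rnd <- 1:9  # assumption: rnd is not defined in the original post; 1:9 matches the output shown in the answers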

ls <- list()
for (i in 1:9) {
  ls[[i]] <- list(hdr = paste(lets[x[i]:(x[i]+2)], collapse=""), 
                  data = seq(1,rnd[i]))
}

filt <- c("bc", "lm", "rs", "xy")

To produce a result list, as returned by:

logical_match <- c(TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE)
ls_result <- ls[logical_match]

So the function I seek is: ls_result <- fn(ls, filt)

I've looked at: subset list by dataframe; partial match with %in%; nested sublist by condition; subset list by logical condition; and, my favorite, extract sublist elements to array. The last of these uses some neat purrr and dplyr solutions, but unfortunately these aren't viable, as I'm looking for a base R solution to make deployment more straightforward (I'd welcome extended R solutions, for interest, of course).

I'm guessing some variation of logical_match <- lapply(ls, fn, '$hdr', filt) is where I'm heading; I started with pmatch(), and wondered how to incorporate grep, but I'm struggling to see how to generate the logical_match vector.

Can someone set me on the right track, please?

EDIT: when agrepl() is applied to the real data, this becomes trickier; the header string, hdr, may typically be 255 characters long, whilst a string element of the filter vector, filt, is of the order of 16 characters. The default agrepl() max.distance argument of 0.1 needs to be adjusted to somewhere between 0.94 and 0.96 for the example below, which is pretty tight. Even if I use the lower end of this range and apply it to the ~360 list elements, the function returns a load of total non-matches.

> hdr <- "#CCHANNELSDI12-PTx|*|CCHANNELNO2|*|CDASA1570|*|CDASANAMEShenachieBU_1570|*|CTAGSODATSID|*|CTAGKEYWISKI_LIVE,ShenachieBU_1570,SDI12-PTx,Highres|*|LAYOUT(timestamp,value)|*|RINVAL-777|*|RSTATEW6|*|RTIMELVLhigh-resolution|*|TZEtc/GMT|*|ZDATE20210110130805|*|"

> filt <- c("ShenachieBU_1570", "Pitlochry_4056")

> agrepl(hdr, filt, max.distance = 0.94)
[1]  TRUE FALSE
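
One possible reading of the result above, offered as a sketch rather than a definitive diagnosis: agrepl() takes the pattern as its first argument and the text to search as its second, so agrepl(hdr, filt) looks for the 255-character header inside each 16-character filter string, which would explain why such a large max.distance is needed. If the filter strings are expected to appear verbatim in the header (an assumption about the real data), an exact substring test with grepl(..., fixed = TRUE) avoids fuzzy matching altogether. The name hits is illustrative only.

```r
hdr <- "#CCHANNELSDI12-PTx|*|CCHANNELNO2|*|CDASA1570|*|CDASANAMEShenachieBU_1570|*|CTAGSODATSID|*|CTAGKEYWISKI_LIVE,ShenachieBU_1570,SDI12-PTx,Highres|*|LAYOUT(timestamp,value)|*|RINVAL-777|*|RSTATEW6|*|RTIMELVLhigh-resolution|*|TZEtc/GMT|*|ZDATE20210110130805|*|"
filt <- c("ShenachieBU_1570", "Pitlochry_4056")

# For each filter string, test whether it occurs literally inside the header.
# fixed = TRUE treats the pattern as plain text, not as a regular expression.
hits <- vapply(filt, function(p) grepl(p, hdr, fixed = TRUE), logical(1))

print(hits)
# ShenachieBU_1570   Pitlochry_4056
#             TRUE            FALSE
```

Because the pattern is now the short string and the header is the text being searched, no distance threshold has to be tuned.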

You could do:

Filter(function(x) any(agrepl(x$hdr, filt)), ls)

You could reduce the code to:

Filter(function(x) grepl(paste0(filt, collapse = "|"), x$hdr), ls)
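
For reference, here is a minimal base R sketch that also builds the logical_match vector the question asks about. It assumes the filter strings should match as literal substrings, so it uses fixed = TRUE instead of pasting them into one regular expression (which would misbehave if a filter string contained characters such as | or *). The names lst and match_hdr are hypothetical, and data = seq_len(i) is an assumption standing in for the undefined rnd in the test data.

```r
# Rebuild the question's test data (data lengths are assumed, see above).
x <- c(1, 4, 8, 11, 13, 14, 18, 22, 24)
lst <- lapply(seq_along(x), function(i) {
  list(hdr  = paste(letters[x[i]:(x[i] + 2)], collapse = ""),
       data = seq_len(i))
})
filt <- c("bc", "lm", "rs", "xy")

# Hypothetical helper: TRUE for each sub-list whose hdr contains
# at least one of the filter strings as a literal substring.
match_hdr <- function(lst, patterns) {
  vapply(lst, function(el) {
    any(vapply(patterns,
               function(p) grepl(p, el$hdr, fixed = TRUE),
               logical(1)))
  }, logical(1))
}

logical_match <- match_hdr(lst, filt)
ls_result <- lst[logical_match]

print(logical_match)
# [1]  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE
```

Indexing with the logical vector keeps the original list-of-sub-lists structure, which is the same result Filter() gives.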

We can also do:

library(purrr)
library(stringr)
keep(ls, ~ str_detect(.x$hdr, str_c(filt, collapse = "|")))

-output

#[[1]]
#[[1]]$hdr
#[1] "abc"

#[[1]]$data
#[1] 1


#[[2]]
#[[2]]$hdr
#[1] "klm"

#[[2]]$data
#[1] 1 2 3 4


#[[3]]
#[[3]]$hdr
#[1] "rst"

#[[3]]$data
#[1] 1 2 3 4 5 6 7


#[[4]]
#[[4]]$hdr
#[1] "xyz"

#[[4]]$data
#[1] 1 2 3 4 5 6 7 8 9
