
Using R, how does strsplit with fixed = TRUE behave when the separator is at the end of the string?

I am working on a language parser and want to count occurrences of certain substrings (say "</i>") in a larger string. Since the string has been trimmed (str.trim), nothing follows the final tag. I am seeing odd behavior from strsplit: it seems to behave differently depending on whether the separator sep (called split in the documentation) is at the beginning or at the end of the string.

Below is an example:

str1 = "<i>hello friend</i>"; 
str2 = paste0(" ",str1);
str3 = paste0(str1, " ");

sep1="<i>";
sep2="</i>";

str = c(str1, str2, str3);  n = length(str);
sep = c(sep1, sep2);        ns = length(sep);

base = matrix("", nrow=n, ncol=ns);
rownames(base) = str; colnames(base) = sep;
for(i in 1:n)
    {
    for(j in 1:ns)
        {
        base[i, j] = paste0(base::strsplit(str[i], sep[j], fixed=TRUE)[[1]], collapse="|");
        }   
    }
base;
    
stringi = matrix("", nrow=n, ncol=ns);
rownames(stringi) = str; colnames(stringi) = sep;
for(i in 1:n)
    {
    for(j in 1:ns)
        {
        stringi[i, j] = paste0(stringi::stri_split_fixed(str[i], sep[j])[[1]], collapse="|");
        }   
    }
stringi;

stopifnot(identical(base,stringi));

The output for base:

> base;
                     <i>                  </i>               
<i>hello friend</i>  "|hello friend</i>"  "<i>hello friend"  
 <i>hello friend</i> " |hello friend</i>" " <i>hello friend" 
<i>hello friend</i>  "|hello friend</i> " "<i>hello friend| "

The output for stringi:

> stringi;
                     <i>                  </i>               
<i>hello friend</i>  "|hello friend</i>"  "<i>hello friend|" 
 <i>hello friend</i> " |hello friend</i>" " <i>hello friend|"
<i>hello friend</i>  "|hello friend</i> " "<i>hello friend| "

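The difference is in column 2 (separator "</i>"), rows 1 and 2: when the string ends with the separator, stringi returns a trailing empty element and base does not.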

Question: What is the expected behavior, E[strsplit]?

Is base a FEATURE and stringi a BUG? Or vice versa?

Shouldn't EOS (end of string) splits behave the same as BOS (beginning of string) splits?

> R.version
               _                                
platform       x86_64-w64-mingw32               
arch           x86_64                           
os             mingw32                          
crt            ucrt                             
system         x86_64, mingw32                  
status                                          
major          4                                
minor          2.1                              
year           2022                             
month          06                               
day            23                               
svn rev        82513                            
language       R                                
version.string R version 4.2.1 (2022-06-23 ucrt)
nickname       Funny-Looking Kid            

and

> packageVersion("stringi")
[1] ‘1.7.8’
> 

I would say that the stringi behavior is the one I, at least, would expect (and there you have the option to discard empty strings by setting omit_empty = TRUE).

However, since base::strsplit clearly documents this behavior, it is also a "feature". From ?strsplit:

Note that this means that if there is a match at the beginning of a (non-empty) string, the first element of the output is '""', but if there is a match at the end of the string, the output is the same as [the input] with the match removed.
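The documented asymmetry can be checked with base R alone. The sketch below also shows one way to count occurrences without going through strsplit at all; count_fixed is a hypothetical helper name introduced here for illustration, not something from the question or from base R.

```r
x <- "<i>hello friend</i>"

# Match at the beginning: the first element of the result is "".
at_start <- strsplit(x, "<i>", fixed = TRUE)[[1]]

# Match at the end: no trailing "" is produced.
at_end <- strsplit(x, "</i>", fixed = TRUE)[[1]]

# Hypothetical helper: count fixed (non-regex) occurrences of `pattern`
# in each element of `x` by counting the matches found by gregexpr.
count_fixed <- function(x, pattern) {
  lengths(regmatches(x, gregexpr(pattern, x, fixed = TRUE)))
}

counts <- count_fixed(c(x, "no tags here", "</i></i>"), "</i>")
```

Counting matches directly avoids having to reason about whether a leading or trailing empty element is present in the split result.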

stringi provides a much more configurable interface, at the cost of an additional dependency.
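As a minimal sketch of that configurability, assuming the stringi package is installed, the calls below compare the default split with omit_empty = TRUE, and use stri_count_fixed to count occurrences directly. The variable names are illustrative only.

```r
library(stringi)  # assumption: stringi is installed

x <- "<i>hello friend</i>"

# Default: a separator at the end yields a trailing empty string.
with_empty <- stri_split_fixed(x, "</i>")[[1]]

# omit_empty = TRUE drops empty strings from the result.
without_empty <- stri_split_fixed(x, "</i>", omit_empty = TRUE)[[1]]

# Counting occurrences directly, without splitting.
n_close <- stri_count_fixed(x, "</i>")
```

If the goal is only to count tags, stri_count_fixed sidesteps the question of empty elements entirely.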
