Using R, how does strsplit work on fixed elements with the splitter at the end of the string to split?
I was working on a language parser and wanted to count occurrences of certain string elements (say "</i>") in a larger string. Since the string has been cleansed (str.trim), it doesn't have any content after it. I was getting some weird behavior from strsplit, as it seems to behave differently depending on whether the separator sep (called split in the manual) is at the beginning or at the end of the string.
Below is an example:
# Three test strings: the bare string, one with a leading space, one with a trailing space
str1 = "<i>hello friend</i>"
str2 = paste0(" ", str1)
str3 = paste0(str1, " ")
sep1 = "<i>"
sep2 = "</i>"
str = c(str1, str2, str3); n = length(str)
sep = c(sep1, sep2); ns = length(sep)

# Split every string on every separator with base::strsplit
base = matrix("", nrow = n, ncol = ns)
rownames(base) = str; colnames(base) = sep
for (i in seq_len(n))
{
    for (j in seq_len(ns))
    {
        base[i, j] = paste0(base::strsplit(str[i], sep[j], fixed = TRUE)[[1]], collapse = "|")
    }
}
base

# Same splits with stringi::stri_split_fixed
stringi = matrix("", nrow = n, ncol = ns)
rownames(stringi) = str; colnames(stringi) = sep
for (i in seq_len(n))
{
    for (j in seq_len(ns))
    {
        stringi[i, j] = paste0(stringi::stri_split_fixed(str[i], sep[j])[[1]], collapse = "|")
    }
}
stringi

# This check fails: the two matrices differ (see the outputs below)
stopifnot(identical(base, stringi))
The output for base:
> base;
                     <i>                  </i>
<i>hello friend</i>  "|hello friend</i>"  "<i>hello friend"
 <i>hello friend</i> " |hello friend</i>" " <i>hello friend"
<i>hello friend</i>  "|hello friend</i> " "<i>hello friend| "
The output for stringi:
> stringi;
                     <i>                  </i>
<i>hello friend</i>  "|hello friend</i>"  "<i>hello friend|"
 <i>hello friend</i> " |hello friend</i>" " <i>hello friend|"
<i>hello friend</i>  "|hello friend</i> " "<i>hello friend| "
The core difference is ROW=1, COL=2... E[strsplit]? Is base a FEATURE and stringi a BUG? Or vice versa?
Shouldn't EOS (end of string) splits behave the same as BOS (beginning of string) splits?
> R.version
               _
platform       x86_64-w64-mingw32
arch           x86_64
os             mingw32
crt            ucrt
system         x86_64, mingw32
status
major          4
minor          2.1
year           2022
month          06
day            23
svn rev        82513
language       R
version.string R version 4.2.1 (2022-06-23 ucrt)
nickname       Funny-Looking Kid
and
> packageVersion("stringi")
[1] ‘1.7.8’
>
Well, I would say that the stringi behavior is the one that I, at least, would expect (and there you have the option to discard empty strings by setting omit_empty = TRUE).
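As a minimal sketch of that option, the snippet below splits the question's first test string on "</i>" with and without omit_empty = TRUE. It assumes the stringi package is installed and is skipped otherwise; the variable names keep_empty and drop_empty are only illustrative.

```r
# Sketch only: needs the stringi package; the block is skipped if it is missing.
if (requireNamespace("stringi", quietly = TRUE)) {
  x <- "<i>hello friend</i>"

  # Default: a match at the end of the string yields a trailing empty string.
  keep_empty <- stringi::stri_split_fixed(x, "</i>")[[1]]
  print(keep_empty)

  # omit_empty = TRUE drops empty pieces from the result.
  drop_empty <- stringi::stri_split_fixed(x, "</i>", omit_empty = TRUE)[[1]]
  print(drop_empty)
}
```

Note that omit_empty = TRUE drops empty pieces wherever they occur, so it also removes the leading empty string produced by a match at the beginning.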
However, since base::strsplit clearly documents this behavior, it is also a "feature". From ?strsplit:
Note that this means that if there is a match at the beginning of a (non-empty) string, the first element of the output is '""', but if there is a match at the end of the string, the output is the same as [the input] with the match removed.
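The documented asymmetry can be reproduced in base R alone. The sketch below uses the question's first test string; split_keep_trailing is a hypothetical helper (not part of base R or stringi) that appends the trailing empty string when the input ends with the separator, which is one possible way to make end-of-string matches mirror beginning-of-string matches.

```r
x <- "<i>hello friend</i>"

# Match at the beginning: the first element of the result is "".
at_start <- strsplit(x, "<i>", fixed = TRUE)[[1]]
print(at_start)

# Match at the end: the match is removed and no trailing "" is produced.
at_end <- strsplit(x, "</i>", fixed = TRUE)[[1]]
print(at_end)

# Hypothetical helper: add the trailing "" that strsplit leaves out
# when the string ends with the (fixed) separator.
split_keep_trailing <- function(x, sep) {
  parts <- strsplit(x, sep, fixed = TRUE)[[1]]
  if (endsWith(x, sep)) parts <- c(parts, "")
  parts
}
print(split_keep_trailing(x, "</i>"))
```

The helper takes a single string and a fixed separator only; it is an illustration under those assumptions, not a general replacement for strsplit.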
stringi provides a much more configurable interface at the expense of another dependency.
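If the goal is only to count occurrences of a fixed element such as "</i>", as described in the question, splitting can be avoided altogether. The following base R sketch needs no extra dependency; count_fixed is a hypothetical helper name and the sample strings are made up for illustration.

```r
# Hypothetical helper: count non-overlapping occurrences of a fixed pattern
# in each element of a character vector.
count_fixed <- function(x, pattern) {
  hits <- gregexpr(pattern, x, fixed = TRUE)
  # regmatches() returns character(0) for elements without a match,
  # so lengths() gives 0 for them.
  lengths(regmatches(x, hits))
}

x <- c("<i>hello friend</i>", "<i>a</i> and <i>b</i>", "no tags")
counts <- count_fixed(x, "</i>")
print(counts)
```

Because matches are counted directly, the result does not depend on whether the element sits at the beginning or the end of the string.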