R - error in separating text from a string using regex and ifelse condition

I want to split a string into two parts wherever it contains a ":".

Suppose my text contains:

 text$Text[[3]] = "There is a horror movie running in the iNox theater. : Can we go?"

And I want to create a data frame such as:

  Col1                                                    Col2
  There is a horror movie running in the iNox theater.    Can we go?

I am trying to use the following:

 df = data.frame(Text = strsplit(text$Text[[3]], 
                 ifelse(":", ":", text$Text[[3]]))[[1]], stringsAsFactors = F)

dat3$Text[[3]] because the text is in row no. 3 of text$Text.

But the above ifelse() logic did not work. I was trying to use the ifelse condition so that if there is a ":" in the text, the split happens on ":", and otherwise the complete text is kept as it is. So if there is no ":", the result would look something like this:

 text$Text[[3]] = "Hi Mom, You there. Can I go to Jimmy's house?"

 Col1                                                 Col2
 Hi Mom, You there. Can I go to Jimmy's house?         NA

How do I do this correctly?

Please note that there is a catch:

  • What if there are two ":" in the text?
  • I would like to consider only a ":" that is within the first two lines, and not one in the rest of the text.

I find the following too complicated; someone with more knowledge of regular expressions than me will surely come up with a better solution.

test <- c(
"There is a horror movie running in the iNox theater. : Can we go?",
"Hi Mom, You there. Can I go to Jimmy's house?",
"Hi : How are you : Lets go")

fun <- function(x, pattern = ":"){
    re <- regexpr(pattern, x)
    res <- sapply(seq_along(re), function(i){
        if(re[i] > 0){
            Col1 <- trimws(substring(x[i], 1, re[i] - 1))
            Col2 <- trimws(substring(x[i], re[i] + 1))
        } else {
            Col1 <- x[i]
            Col2 <- NA
        }
        c(Col1 = Col1, Col2 = Col2)
    })
    as.data.frame(t(res))
}

fun(test)
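
The question also asks that a ":" be honoured only when it appears within the first two lines of the text, which fun() above does not check. The following is a minimal sketch of one way to add that restriction, not a definitive solution. It assumes that "lines" are separated by "\n" (the question does not say), and split_first_colon() and its max_lines argument are hypothetical names introduced here for illustration.

```r
# Hypothetical helper: split at the first ":" only if it occurs within the
# first `max_lines` lines of the text (lines are assumed to be "\n"-separated).
split_first_colon <- function(x, max_lines = 2) {
    res <- lapply(x, function(s) {
        # Position of the first ":" in s, or -1 if there is none
        colon <- regexpr(":", s, fixed = TRUE)
        # Positions of all newlines in s, or a single -1 if there are none
        newlines <- gregexpr("\n", s, fixed = TRUE)[[1]]
        # Last character position that still belongs to the allowed lines
        limit <- if (newlines[1] > 0 && length(newlines) >= max_lines) {
            newlines[max_lines]
        } else {
            nchar(s)
        }
        if (colon > 0 && colon <= limit) {
            c(Col1 = trimws(substring(s, 1, colon - 1)),
              Col2 = trimws(substring(s, colon + 1)))
        } else {
            c(Col1 = s, Col2 = NA_character_)
        }
    })
    as.data.frame(do.call(rbind, res), stringsAsFactors = FALSE)
}

split_first_colon(c(
    "Hi : How are you : Lets go",
    "line one\nline two\nline three : late colon"))
```

In the second string the only ":" sits on the third line, so under the assumption above the whole text stays in Col1 and Col2 is NA.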

You don't really need an if else statement for this. Regex is built to handle conditions like this.

For the first case, where the data contains just one separator symbol (a colon, ":", in this example), we can use this:

x <- "There is a horror movie running in the iNox theater. : Can we go?"

data.frame(Col1=gsub("(.*)+\\s[:]\\s+(.*)","\\1",x), 
           Col2=gsub("(.*)+\\s[:]\\s+(.*)","\\2",x))

Output:

                                                  Col1            Col2
1 There is a horror movie running in the iNox theater.      Can we go?

Now let's say you have more than one symbol in your string, and you want to keep the text before the first symbol in the first column and the text after the first symbol in the second column. To do this, try using the "?" regex symbol, like this:

x <- "There is a horror movie running in the iNox theater. : Can we go? : Please?"

data.frame(Col1=gsub("\\s\\:.*$","\\1",x), 
           Col2=gsub("^[^:]+(?:).\\s","\\1",x))

Output:

                                                  Col1                      Col2
1 There is a horror movie running in the iNox theater.      Can we go? : Please?
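
The same first-colon split can arguably be written a little more plainly with sub(), which replaces only the first match, and an empty replacement string. This is a sketch under the question's stated requirements, not the approach used above; the has_colon helper vector is a name introduced here, and it supplies the NA that the question asks for when a string has no ":".

```r
x <- c("There is a horror movie running in the iNox theater. : Can we go? : Please?",
       "Hi Mom, You there. Can I go to Jimmy's house?")

# TRUE for strings that contain at least one ":"
has_colon <- grepl(":", x, fixed = TRUE)

df <- data.frame(
    # Drop everything from the first ":" (and any whitespace before it) onwards
    Col1 = sub("\\s*:.*$", "", x),
    # Drop everything up to and including the first ":" and the whitespace
    # after it; strings without a ":" get NA instead
    Col2 = ifelse(has_colon, sub("^[^:]*:\\s*", "", x), NA_character_),
    stringsAsFactors = FALSE
)
df
```

Because [^:]* cannot run past a colon, the second pattern always stops at the first ":" no matter how many follow.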

For more information on using regex symbols in R, this is a helpful reference.

test <- "There is a horror movie running in the iNox theater. : Can we go?"
df = data.frame(Col1 = strsplit(test,":")[[1]][1],
                Col2 = strsplit(test,":")[[1]][2],
                stringsAsFactors = F)
df
#                                                   Col1        Col2
#1 There is a horror movie running in the iNox theater.   Can we go?

Note that strsplit() returns a list, so the first line of its printed output is [[1]]. Much as [1] marks the first element when R prints a vector, [[1]] marks the first element of a list, which is why the code above extracts the pieces with [[1]].
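
To make the point about [[1]] concrete, here is a small sketch of what strsplit() hands back for a vector of two strings. The variable name parts is introduced here for illustration only.

```r
test <- c("There is a horror movie running in the iNox theater. : Can we go?",
          "Hi : How are you : Lets go")

# One list element per input string, each holding that string's pieces
parts <- strsplit(test, ":", fixed = TRUE)

class(parts)        # "list"
length(parts)       # 2, one element per input string
parts[[1]]          # the two pieces of the first string, spaces included
trimws(parts[[2]])  # "Hi" "How are you" "Lets go"
```

Splitting on ":" alone leaves the spaces around the separator in place, so trimws() is worth applying before the pieces go into a data frame.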

You can use the stringr package:

library(stringr) 
str_split_fixed("HI : How are you : Lets go", ":", 3)

In the call to str_split_fixed above, "HI : How are you : Lets go" is the string you want to split, ":" is the separator in the string, and 3 is the number of columns you want the string to be split into.

In your case the last value should be 2, since you want to split into 2 columns.
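
As a rough sketch of that suggestion applied to the question, str_split_fixed() with n = 2 splits at the first ":" only and pads strings without a ":" with an empty string. The conversion of that empty string to NA and the trimws() calls are additions made here to match the output the question asks for; they assume that an empty second piece always means "no value".

```r
library(stringr)

x <- c("Hi : How are you : Lets go",
       "Hi Mom, You there. Can I go to Jimmy's house?")

# Character matrix with exactly 2 columns; only the first ":" is used
m <- str_split_fixed(x, ":", 2)

df <- data.frame(Col1 = trimws(m[, 1]),
                 Col2 = trimws(m[, 2]),
                 stringsAsFactors = FALSE)

# The question asks for NA rather than "" when there is no ":"
df$Col2[df$Col2 == ""] <- NA
df
```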
