How can I use R to delete all characters in a string between two other recurring characters?

Question

The following code successfully gets me the text I need, before I use gsub to help "clean" it.

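library(RCurl)  # provides getURL()
library(XML)    # provides htmlTreeParse(), xpathApply(), xmlValue()

am1 <- getURL("url.com")
ami1 <- htmlTreeParse(am1, useInternalNodes = TRUE)
# Text of every <td> cell, as a character vector
ami1.tree.parse <- unlist(xpathApply(ami1, path = '//td', fun = xmlValue))
ami1.txt <- NULL
# Join all cells except the first and last into one string
for (i in 2:(length(ami1.tree.parse) - 1)) {
  ami1.txt <- paste(ami1.txt, as.character(ami1.tree.parse[i]), sep = ' ')
}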
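As a side note, the accumulation loop above can probably be collapsed into a single vectorised call. This is only a minimal sketch, assuming ami1.tree.parse is a character vector with at least three elements; the cells vector below is a made-up stand-in, not data from the real page, and the result should match the loop's output up to leading whitespace.

```r
# Hypothetical stand-in for ami1.tree.parse (the real vector comes from xpathApply)
cells <- c("header cell",
           "Q. First question?JOE SMITH: First answer.",
           "Q. Second question?JOE SMITH: Second answer.",
           "footer cell")

# Drop the first and last element, then join the rest with single spaces
inner <- cells[-c(1, length(cells))]
txt <- paste(inner, collapse = " ")
print(txt)
```

Using collapse avoids growing the string one element at a time inside a loop.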

The Issue

I can't delete the questions in their entirety from the interview text. For example, the text looks like:

[1] "Q. How well do you think things are going in your marriage?JOE SMITH: It's going quite alright.Q. Where do you see yourself in five years?JOE SMITH: I'll probably move to Los Angeles and get into acting.Q. Okay. How do you think your wife feels about your thinking?JOE SMITH: I think she'd respond positively."

And for formatting's sake:

"Q. How well do you think things are going in your marriage?JOE SMITH: It's going quite alright.Q. Where do you see yourself in five years?JOE SMITH: I'll probably move to Los Angeles and get into acting.Q. Okay. How do you think your wife feels about your thinking?JOE SMITH: I think she'd respond positively."

To be absolutely clear, what I need from the text above is:

[1] "It's going quite alright. I'll probably move to Los Angeles and get into acting. I think she'd respond positively."

I've tried:

 ami1.txt<-gsub("Q.[^?]+H:", "",ami1.txt)
 ami1.txt<-gsub("Q.[^?]+H: ", "",ami1.txt)
 ami1.txt<-gsub("Q.*H:", "",ami1.txt)

It surely comes down to my not grasping regex, but I'd greatly appreciate it if someone could point me in the right direction.
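For readers following along, here is a small check of what two of the attempts above actually do on the sample text. It is a sketch under one assumption that goes beyond the original post: the lazy variant at the end relies on every speaker label ending in "H:", as "JOE SMITH:" does here.

```r
# Sample text copied from the question
str1 <- "Q. How well do you think things are going in your marriage?JOE SMITH: It's going quite alright.Q. Where do you see yourself in five years?JOE SMITH: I'll probably move to Los Angeles and get into acting.Q. Okay. How do you think your wife feels about your thinking?JOE SMITH: I think she'd respond positively."

# Attempt 1: [^?]+ cannot cross a "?", but every question ends in "?"
# right before "JOE SMITH:", so nothing matches and the text is unchanged
attempt1 <- gsub("Q.[^?]+H:", "", str1)
print(identical(attempt1, str1))

# Attempt 3: ".*" is greedy, so a single match runs from the first "Q"
# to the last "H:" and only the final answer survives
attempt3 <- gsub("Q.*H:", "", str1)
print(attempt3)

# Lazy variant (assumes labels end in "H:"): ".*?" stops at the first "H:"
lazy <- trimws(gsub("Q.*?H:", "", str1, perl = TRUE))
print(lazy)
```

The lazy quantifier is the smallest change to the third attempt; it is shown with perl = TRUE so the non-greedy syntax is handled by the PCRE engine.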

Alas, I've lied: the text is apparently a tad more complicated. I've added the more complicated element to the end of the above text, below. Some "questions" (Q.) start with a sentence:

 str2<-"Q. How well do you think things are going in your marriage?JOE SMITH: It's going quite alright.Q. Where do you see yourself in five years?JOE SMITH: I'll probably move to Los Angeles and get into acting.Q. Okay. How do you think your wife feels about your thinking?JOE SMITH: I think she'd respond positively.Q. That's interesting. When would you consider speaking to her?JOE SMITH: Probably, tomorrow. Q. That sounds good. How do you feel now? Better than before?JOE SMITH: Yeah I'm feeling alright."


The task remains the same, and akrun's answer gets me close:

 trimws(gsub("Q[^?]+\\?|[A-Z ]+:", "", str2))
 [1] "It's going quite alright. I'll probably move to Los Angeles and get into acting. I think she'd respond positively. Probably, tomorrow.  Better than before? Yeah I'm feeling alright."

Final Update

akrun's answer:

 trimws(gsub("Q[^?]+\\?|[A-Z ]+:", "", str2))

I'm not totally sure why the above answer wasn't deleting everything between the "Q" and the last question mark, but alas. After the revisions to my question above, I figured out that what I was actually looking for was to delete everything from "Q" to the ":". So I used this tool to help me understand what was wrong with my understanding of regex. I arrived at the following to wipe out all characters between "Q" and the ":".

 gsub("Q[^:]+\\?|[A-Z ]+:", "", str2)
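A likely reason the earlier pattern left text behind is that [^?]+\\? stops at the first "?" after each "Q", so a second question in the same turn survives. The sketch below checks that on str2 and then tries a simpler alternative that deletes from each "Q" through the next ":". It assumes that no question contains a ":" and that no answer contains a capital "Q"; neither assumption is stated in the original post.

```r
# Sample text copied from the question
str2 <- "Q. How well do you think things are going in your marriage?JOE SMITH: It's going quite alright.Q. Where do you see yourself in five years?JOE SMITH: I'll probably move to Los Angeles and get into acting.Q. Okay. How do you think your wife feels about your thinking?JOE SMITH: I think she'd respond positively.Q. That's interesting. When would you consider speaking to her?JOE SMITH: Probably, tomorrow. Q. That sounds good. How do you feel now? Better than before?JOE SMITH: Yeah I'm feeling alright."

# The earlier pattern stops at the first "?" after each "Q",
# so the trailing question "Better than before?" is left in place
first_try <- trimws(gsub("Q[^?]+\\?|[A-Z ]+:", "", str2))
print(grepl("Better than before?", first_try, fixed = TRUE))

# Alternative sketch: delete from each "Q" through the next ":"
cleaned <- gsub("Q[^:]*:", "", str2)
# Collapse runs of whitespace left behind, then trim the ends
cleaned <- trimws(gsub("\\s+", " ", cleaned))
print(cleaned)
```

Collapsing whitespace afterwards removes the doubled space that appears where a question was preceded by a space.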

Answer 1

We could match a Q followed by one or more characters that are not a ? ([^?]+), followed by a question mark, or (|) a run of upper case letters and spaces followed by a :, and replace each match with an empty string. If there are leading/trailing spaces, use trimws

trimws(gsub("Q[^?]+\\?|[A-Z ]+:", "", str1))
#[1] "It's going quite alright. I'll probably move to Los Angeles and get into acting. I think she'd respond positively."

data

str1 <- "Q. How well do you think things are going in your marriage?JOE SMITH: It's going quite alright.Q. Where do you see yourself in five years?JOE SMITH: I'll probably move to Los Angeles and get into acting.Q. Okay. How do you think your wife feels about your thinking?JOE SMITH: I think she'd respond positively."
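To see which pieces each alternative of the pattern picks up, one option is to list the matches with regmatches and gregexpr before deleting them. This is an illustrative sketch on the same str1, not part of the original answer.

```r
# Sample text copied from the answer's data
str1 <- "Q. How well do you think things are going in your marriage?JOE SMITH: It's going quite alright.Q. Where do you see yourself in five years?JOE SMITH: I'll probably move to Los Angeles and get into acting.Q. Okay. How do you think your wife feels about your thinking?JOE SMITH: I think she'd respond positively."

pattern <- "Q[^?]+\\?|[A-Z ]+:"

# Every substring the pattern matches, in order of appearance
matched <- regmatches(str1, gregexpr(pattern, str1))[[1]]
print(matched)

# What is left once those pieces are deleted
remaining <- trimws(gsub(pattern, "", str1))
print(remaining)
```

The matches alternate between a question (first alternative) and the speaker label (second alternative), which is why only the answers remain.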

Solution 1
0 Accepted 2018-12-14 00:48:27
