List of vectors in R - extract an element of the vectors

I have a list which contains some texts, so each element of the list is a text, and a text is a vector of words. In other words, I have a list of vectors, and I am doing some text mining on it. Now I'm trying to extract the words that come after the word "no". I transformed my vectors so that they are now vectors of two-word strings, such as:

list(c("want friend", "friend funny", "funny nice", "nice glad", "glad become", "become no", "no more", "more guys"), c("no comfort", "comfort written", "written conduct","conduct prevent", "prevent manners", "matters no", "no one", "one want", "want be", "be fired"))

My aim is to get a list of vectors like list(c("more"), c("comfort", "one")), so that for a text i I can see the vector of results with liste[i].

I have a formula to extract the word after "no" (in the first vector it will be "more"), but when there are several "no" in my text it doesn't work.

Here is my code:

liste_negation <- vector(length = length(data))
for (i in 1:length(data)){
  for (j in 1:length(data[[i]])){
    if (startsWith((data[[i]])[[j]], 'no') == TRUE){
      liste_neg[i] <- c(liste_neg[i], tail(strsplit((data[[i]])[[j]],split=" ")[[1]],1))
    } else{
      liste_neg[i] <- c(liste_neg[i])
    }
    liste_negation[[i]] <- c(liste_neg[[i]])
  }
}

This works for a vector when there is only one "no":

data <- list(c("want friend", "friend funny", "funny nice", "nice glad", "glad become", "become no", "no more", "more guys"), c("no comfort", "comfort written", "written conduct","conduct prevent", "prevent manners", "matters no", "no one", "one want", "want be", "be fired"))
data

liste_neg <- c()
liste_negation <- vector(length = length(data))
if (startsWith((data[[1]])[[9]], 'no') == TRUE){
  liste_neg[1] <- c(liste_neg[1], tail(strsplit((data[[1]])[[9]],split=" ")[[1]],1))
}

liste_negation[[1]] <- c(liste_neg[[1]])

But if I try to adapt it with a loop to go through each element of the vector, and there is more than one "no" in the text, it doesn't work.

Code:

liste_neg <- c()
liste_negation <- vector(length = length(data))
for (j in 1:length(data[[2]])){
  if (startsWith((data[[2]])[[j]], 'no') == TRUE){
    liste_neg[2] <- append(liste_neg[2], tail(strsplit((data[[2]])[[j]],split=" ")[[1]],1))
  }
}
liste_neg
liste_negation[[2]] <- c(liste_neg[[2]])
liste_negation

Warning message and output:

Warning message:
In liste_neg[2] <- append(liste_neg[2], tail(strsplit((data[[2]])[[j]],  :
  number of items to replace is not a multiple of replacement length
> liste_neg
[1] NA        "comfort"
> liste_negation[[2]] <- c(liste_neg[[2]])
> liste_negation
[1] "FALSE"   "comfort"

As you can see, only one word ("comfort") ends up in the second element; "one" is missing.

I tried many things, and I tried to split the code, run it and work on it piece by piece, but after spending the whole morning on it I haven't found a solution.

Does someone have an idea to help me?

Thank you in advance (and sorry for my English, I'm French ^^')

In base R, we can use sapply to loop over the list and grep to identify the strings that contain the word "no":

output <- sapply(data, function(x) sub(".*no", "", grep("\\bno\\b", x, value = TRUE)))

#[[1]]
#[1] ""      " more"

#[[2]]
#[1] " comfort" ""         " one" 

If you don't need the empty strings, you can remove them to get

sapply(output, function(x) trimws(x[x!= ""]))  
#[[1]]
#[1] "more"

#[[2]]
#[1] "comfort" "one"     
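
One caveat with sapply is that it simplifies its result when every element has the same length, so the output is not always a list. The sketch below illustrates this on a made-up input in which each text has exactly one "no ..." string; the names same_len and after_no are hypothetical and only serve this example.

```r
# Hypothetical input: every text contains exactly one "no ..." string
same_len <- list(c("become no", "no more"),
                 c("no comfort", "comfort written"))

# Hypothetical helper: keep strings starting with "no " and drop that prefix
after_no <- function(x) sub("^no ", "", grep("^no ", x, value = TRUE))

simplified <- sapply(same_len, after_no)  # collapses to a character vector
kept_list  <- lapply(same_len, after_no)  # stays a list, one element per text

is.list(simplified)  # FALSE
is.list(kept_list)   # TRUE
```

If the result must always be a list indexed by text, lapply is probably the safer choice.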

lapply(data, function(x) substr(x[startsWith(x, "no")], 4, 1000))

[[1]]
[1] "more"

[[2]]
[1] "comfort" "one"    
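
Note that startsWith(x, "no") also matches strings whose first word merely begins with "no", such as "nothing". A minimal sketch of a stricter variant, assuming the words are always separated by a single space, is to include the space in the prefix; the vector x below is made up for illustration.

```r
# Hypothetical text with words that merely begin with the letters "no"
x <- c("no more", "nothing else", "north wind", "become no")

# Prefix "no" also catches "nothing" and "north"
loose <- substr(x[startsWith(x, "no")], 4, 1000)

# Prefix "no " (with the space) only catches the word "no";
# substring() runs to the end of the string by default
strict <- substring(x[startsWith(x, "no ")], 4)

loose
strict
```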

You could use a regular expression with a capture group to obtain all substrings that match the desired pattern, then extract just the captured group as follows:

# regex for strings that start with "no " and have any text after that
r <- '^no (.*)'
lapply(data, function(x) gsub(r, '\\1', regmatches(x, regexpr(r, x))))

#output
[[1]]
[1] "more"

[[2]]
[1] "comfort" "one"  

regexpr returns a match object from which regmatches extracts the matching strings, and gsub uses the \\1 replacement to keep only the first captured group.
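
To make the three steps easier to follow, here is a sketch that runs them one at a time on the second text from the question. The intermediate names m, hits and words are introduced only for this illustration.

```r
# Second text from the question
x <- c("no comfort", "comfort written", "written conduct", "conduct prevent",
       "prevent manners", "matters no", "no one", "one want", "want be",
       "be fired")

r <- "^no (.*)"

# Step 1: match position per string, -1 where the pattern does not match
m <- regexpr(r, x)

# Step 2: keep only the strings (here, whole matches) that did match
hits <- regmatches(x, m)

# Step 3: replace each whole match by its first captured group
words <- sub(r, "\\1", hits)

which(m > 0)
hits
words
```

Since the pattern is anchored with ^, each string can match at most once, so sub and gsub behave the same here.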

Steps to extract the word after "no":

  • First, use grep(i, pattern = "^no", value = T) to get the texts which start with "no".

  • Then gsub(pattern = "no ", replacement = "") replaces "no " with "", which leaves the word after "no".

  • lapply() iterates over the list and applies these steps to each element.

  • %>%, the pipe operator, makes the code clearer and passes the result of grep() into gsub().

library(magrittr)   
lapply(data,function(i)grep(i,pattern = "^no",value = T) %>% gsub(pattern = "no ",replacement = ""))
#[[1]]
#[1] "more"
#    
#[[2]]
#[1] "comfort" "one" 
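
For completeness, the loop from the question can probably be repaired as well. The warning appears because liste_neg[2] is a single position in an atomic vector, so assigning two words to it keeps only the first one. A minimal sketch, assuming the result is stored in a list and filled with [[ ]] so that each slot can hold several words (the variable names bigram and word are introduced here for illustration):

```r
data <- list(
  c("want friend", "friend funny", "funny nice", "nice glad", "glad become",
    "become no", "no more", "more guys"),
  c("no comfort", "comfort written", "written conduct", "conduct prevent",
    "prevent manners", "matters no", "no one", "one want", "want be",
    "be fired")
)

# One list slot per text; each slot can hold any number of words
liste_negation <- vector("list", length(data))

for (i in seq_along(data)) {
  for (j in seq_along(data[[i]])) {
    bigram <- data[[i]][[j]]
    if (startsWith(bigram, "no ")) {
      # Last word of the string, appended to the words already found for text i
      word <- tail(strsplit(bigram, split = " ")[[1]], 1)
      liste_negation[[i]] <- c(liste_negation[[i]], word)
    }
  }
}

liste_negation
```

A text without any "no" keeps a NULL slot in this sketch. The vectorised answers above remain shorter and are likely easier to maintain.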
