Extract string within data.table

I have the following data.table called D.

                  ngram        
1          in_the_years          
2  the_years_thereafter        
3 years_thereafter_most 
4        he_wasn't_home        
5           how_are_you          
6    thereafter_most_of  

I need to add a few variables.

1. queryWord (the requirement is to extract the first 2 words). The following is my code:

D[,queryWord:=strsplit(ngram,"_[^_]+$")[[1]],by=ngram]
                   ngram        queryWord
1          in_the_years           in_the
2  the_years_thereafter        the_years
3 years_thereafter_most years_thereafter
4        he_wasn't_home        he_wasn't
5           how_are_you          how_are
6    thereafter_most_of  thereafter_most

2. predict (the requirement is to extract the last word). The following is the desired output:

                  ngram        queryWord    predict
1          in_the_years           in_the      years
2  the_years_thereafter        the_years thereafter
3 years_thereafter_most years_thereafter       most
4        he_wasn't_home        he_wasn't       home
5           how_are_you          how_are        you
6    thereafter_most_of  thereafter_most         of

For this purpose I wrote the following function:

getLastTerm<-function(x){
              y<-strsplit(x,"_")
              y[[1]][length(y[[1]])]
}

getLastTerm("in_the_years") returns "years"; however, it is not working inside the data.table object D.

D[,predict:=getLastTerm(ngram)[[1]],by=ngram] 

Please, I need help.

Your getLastTerm function only selects the first element of the list returned by strsplit. Try the version below.

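getLastTerm <- function(x){
  # strsplit returns a list with one character vector per element of x
  y <- strsplit(x,"_")

  # loop over every element instead of a hard-coded 1:6
  for (i in seq_along(y)) {
    # keep only the last term of the i-th split
    x[i] <- y[[i]][length(y[[i]])]
  }
  x
}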


D$new <- getLastTerm(D$ngram)
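
The same idea can also be written without an explicit loop. The following is only a minimal sketch in base R; the helper name getLastTermVec and the sample vector are made up for illustration and are not part of the original post. It assumes every input string is non-empty.

```r
# Hypothetical helper (name not from the original post): returns the last
# "_"-separated term of every element of a character vector.
getLastTermVec <- function(x) {
  # strsplit gives a list with one character vector per input string
  parts <- strsplit(x, "_", fixed = TRUE)
  # take the last piece of each split; vapply guarantees a character result
  vapply(parts, function(p) p[length(p)], character(1))
}

ngrams <- c("in_the_years", "he_wasn't_home", "thereafter_most_of")
getLastTermVec(ngrams)
```

Because the result is a plain character vector with one entry per input, it could be assigned to a column in the same way as above, for example D$new <- getLastTermVec(D$ngram).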

Before addressing your actual question, you can simplify your first step to:

# option 1
D[, queryWord := strsplit(ngram,"_[^_]+$")][]
# option 2
D[, queryWord := sub('(.*)_.*$','\\1',ngram)][]

To get the predict column, you don't need to write a special function. Using a combination of strsplit, lapply and last:

D[, predict := lapply(strsplit(D$ngram,"_"), last)][]

Or an even easier solution is using only sub:

D[, predict := sub('.*_(.*)$','\\1',ngram)][]

Both approaches give the following final result:

> D
                   ngram        queryWord    predict
1:          in_the_years           in_the      years
2:  the_years_thereafter        the_years thereafter
3: years_thereafter_most years_thereafter       most
4:        he_wasn't_home        he_wasn't       home
5:           how_are_you          how_are        you
6:    thereafter_most_of  thereafter_most         of
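
The two sub patterns rely on `.*` being greedy: the leading `.*` consumes as much as it can, so the `_` in the pattern ends up matching the last underscore of each string. A small base R sketch of just the regex part, outside of data.table; the variable names firstWords and lastWord are chosen here for illustration only:

```r
ngram <- c("in_the_years", "the_years_thereafter", "he_wasn't_home")

# capture everything before the last underscore
firstWords <- sub("(.*)_.*$", "\\1", ngram)
# capture everything after the last underscore
lastWord <- sub(".*_(.*)$", "\\1", ngram)

print(firstWords)  # "in_the" "the_years" "he_wasn't"
print(lastWord)    # "years" "thereafter" "home"
```

A string without any underscore does not match either pattern, so sub would return it unchanged.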

Used data:

D <- fread("ngram
in_the_years
the_years_thereafter
years_thereafter_most
he_wasn't_home
how_are_you
thereafter_most_of", header = TRUE)
