Extracting strings in a data.table

Question

I have the following data.table called D.

                  ngram        
1          in_the_years          
2  the_years_thereafter        
3 years_thereafter_most 
4        he_wasn't_home        
5           how_are_you          
6    thereafter_most_of

I need to add a few variables.

1. queryWord: the requirement is to extract the first two words. The following is my code:

D[,queryWord:=strsplit(ngram,"_[^_]+$")[[1]],by=ngram]
                   ngram        queryWord
1          in_the_years           in_the
2  the_years_thereafter        the_years
3 years_thereafter_most years_thereafter
4        he_wasn't_home        he_wasn't
5           how_are_you          how_are
6    thereafter_most_of  thereafter_most

2. predict: the requirement is to extract the last word. The following is the desired output:

                  ngram        queryWord    predict
1          in_the_years           in_the      years
2  the_years_thereafter        the_years thereafter
3 years_thereafter_most years_thereafter       most
4        he_wasn't_home        he_wasn't       home
5           how_are_you          how_are        you
6    thereafter_most_of  thereafter_most         of

For this purpose I wrote the following function:

getLastTerm<-function(x){
              y<-strsplit(x,"_")
              y[[1]][length(y[[1]])]
}

getLastTerm("in_the_years") returns "years", but it does not work inside the data.table object D.

D[,predict:=getLastTerm(ngram)[[1]],by=ngram]

Please, I need help.

Answer 1

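Your getLastTerm function only selects the first element of the list returned by strsplit, so only the first string is ever processed. Try the version below.

getLastTerm <- function(x){
  # strsplit() returns a list with one character vector per element of x
  y <- strsplit(x,"_")

  # keep the last piece of each element, for however many rows there are
  for (i in seq_along(y)) {
    x[i] <- y[[i]][length(y[[i]])]
  }
  x
}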


D$new <- getLastTerm(D$ngram)
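
To make the point about the first list element concrete, here is a minimal base R sketch. The vector `ngrams` and the helper name `get_last_term` are illustrative assumptions, not part of the original question, and the `vapply` form is just one possible loop-free way to write the same idea.

```r
ngrams <- c("in_the_years", "the_years_thereafter", "how_are_you")

# strsplit() is vectorised: it returns a list with one element per input string
parts <- strsplit(ngrams, "_")

# Indexing with [[1]] looks only at the first string, whatever else was passed
first_only <- parts[[1]][length(parts[[1]])]

# Taking the last piece of every list element handles the whole vector
# (get_last_term is a hypothetical helper name)
get_last_term <- function(x) {
  vapply(strsplit(x, "_"), function(p) p[length(p)], character(1))
}
last_terms <- get_last_term(ngrams)

print(first_only)
print(last_terms)
```

Because `vapply` is told to expect `character(1)` for each element, it fails loudly if a split ever yields something other than a single string.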

Answer 2

Before addressing your actual question, you can simplify your first step to:

# option 1
D[, queryWord := strsplit(ngram,"_[^_]+$")][]
# option 2
D[, queryWord := sub('(.*)_.*$','\\1',ngram)][]

To get the predict column, you don't need to write a special function. You can use a combination of strsplit, lapply and last:

D[, predict := lapply(strsplit(D$ngram,"_"), last)][]
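
One detail worth knowing about this approach: `lapply` returns a list, so the new column is stored as a list column rather than a plain character column. The sketch below assumes the data.table package is installed; the column names `predict_list` and `predict_chr` are made up for illustration. It shows `vapply` as one way to get a character column instead.

```r
library(data.table)

D <- data.table(ngram = c("in_the_years", "he_wasn't_home", "how_are_you"))

# lapply() returns a list, so := stores the result as a list column
D[, predict_list := lapply(strsplit(ngram, "_"), last)]

# vapply() returns a plain character vector, one string per row
D[, predict_chr := vapply(strsplit(ngram, "_"), last, character(1))]

# str() shows the difference between the two column types
str(D)
```

Both columns print the same way, but a character column is usually easier to join, sort, or group on later.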

An even easier solution uses only sub:

D[, predict := sub('.*_(.*)$','\\1',ngram)][]
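
The two sub patterns rely on greedy matching. A small base R sketch of how they behave on a plain character vector follows; the variable names `ngrams`, `predict_word`, `query_word` and `no_underscore` are illustrative only.

```r
ngrams <- c("in_the_years", "he_wasn't_home", "thereafter_most_of")

# '.*_' is greedy, so it consumes everything up to the LAST underscore;
# the captured group '(.*)' is then the final word
predict_word <- sub(".*_(.*)$", "\\1", ngrams)

# '(.*)_' is greedy too, so the group holds everything before the last underscore
query_word <- sub("(.*)_.*$", "\\1", ngrams)

# A string without any underscore does not match and is returned unchanged
no_underscore <- sub(".*_(.*)$", "\\1", "hello")

print(predict_word)
print(query_word)
print(no_underscore)
```

Since sub is vectorised over its input, no by=ngram grouping or explicit loop is needed when it is used inside D[...].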

Both approaches give the following final result:

> D
                   ngram        queryWord    predict
1:          in_the_years           in_the      years
2:  the_years_thereafter        the_years thereafter
3: years_thereafter_most years_thereafter       most
4:        he_wasn't_home        he_wasn't       home
5:           how_are_you          how_are        you
6:    thereafter_most_of  thereafter_most         of

Used data:

D <- fread("ngram
in_the_years
the_years_thereafter
years_thereafter_most
he_wasn't_home
how_are_you
thereafter_most_of", header = TRUE)
