Extract string within data.table
I have the following data.table, called D.
ngram
1 in_the_years
2 the_years_thereafter
3 years_thereafter_most
4 he_wasn't_home
5 how_are_you
6 thereafter_most_of
I need to add a few variables.
1. queryWord (the requirement is to extract the first two words). The following is my code:
D[,queryWord:=strsplit(ngram,"_[^_]+$")[[1]],by=ngram]
ngram queryWord
1 in_the_years in_the
2 the_years_thereafter the_years
3 years_thereafter_most years_thereafter
4 he_wasn't_home he_wasn't
5 how_are_you how_are
6 thereafter_most_of thereafter_most
2. predict. The requirement is to extract the last word. The following is the desired output:
ngram queryWord predict
1 in_the_years in_the years
2 the_years_thereafter the_years thereafter
3 years_thereafter_most years_thereafter most
4 he_wasn't_home he_wasn't home
5 how_are_you how_are you
6 thereafter_most_of thereafter_most of
For this purpose I wrote the following function:
getLastTerm<-function(x){
y<-strsplit(x,"_")
y[[1]][length(y[[1]])]
}
getLastTerm("in_the_years")
returns "years"; however, it does not work inside the data.table object D:
D[,predict:=getLastTerm(ngram)[[1]],by=ngram]
Please, I need help.
Your getLastTerm function only selects the first list element. Try the code below.
getLastTerm <- function(x){
y <- strsplit(x,"_")
for (i in seq_along(y)) {  # loop over every row instead of a hard-coded 1:6
x[i] <- y[[i]][length(y[[i]])]
}
x
}
D$new <- getLastTerm(D$ngram)
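The loop can also be written without an explicit index. The following is a minimal sketch in base R, assuming every ngram contains at least one term; the name getLastTermVec is made up for this illustration and is not part of the original code.

```r
# Hypothetical vectorised variant of getLastTerm (the name is an assumption)
getLastTermVec <- function(x) {
  # Split each string on literal underscores; the result is a list of character vectors
  parts <- strsplit(x, "_", fixed = TRUE)
  # Take the last piece of each split; vapply guarantees one string per input
  vapply(parts, function(p) p[length(p)], character(1))
}

lastTerms <- getLastTermVec(c("in_the_years", "he_wasn't_home", "how_are_you"))
print(lastTerms)
```

Because the function returns a plain character vector with one entry per input string, it can be used on a whole column at once, with no by= grouping needed.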
Before addressing your actual question, you can simplify your first step to:
# option 1
D[, queryWord := strsplit(ngram,"_[^_]+$")][]
# option 2
D[, queryWord := sub('(.*)_.*$','\\1',ngram)][]
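To see that the two options agree, the patterns can be tried on a plain character vector outside of data.table. This is only a sketch; the variable names viaStrsplit and viaSub are made up for the illustration.

```r
ngram <- c("in_the_years", "he_wasn't_home", "how_are_you")

# Option 1: split off the final "_word"; unlist() flattens the list that strsplit returns
viaStrsplit <- unlist(strsplit(ngram, "_[^_]+$"))

# Option 2: the greedy (.*) captures everything before the last underscore
viaSub <- sub("(.*)_.*$", "\\1", ngram)

print(viaSub)
print(identical(viaStrsplit, viaSub))
```

Note that sub leaves a string unchanged when the pattern does not match, so an ngram without any underscore would come back as is.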
To get the predict column, you don't need to write a special function. Using a combination of strsplit, lapply and last:
D[, predict := lapply(strsplit(D$ngram,"_"), last)][]
Or, even easier, use only sub:
D[, predict := sub('.*_(.*)$','\\1',ngram)][]
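The sub pattern works because the leading .* is greedy: it consumes everything up to the last underscore, so the capture group holds only the final word. A small base R check of this, with variable names chosen just for the illustration:

```r
ngram <- c("in_the_years", "the_years_thereafter", "thereafter_most_of")

# Greedy ".*_" runs to the LAST underscore; group 1 is whatever follows it
lastWord <- sub(".*_(.*)$", "\\1", ngram)
print(lastWord)

# Cross-check against splitting on "_" and keeping the final piece of each split
viaSplit <- vapply(strsplit(ngram, "_", fixed = TRUE),
                   function(p) p[length(p)], character(1))
print(identical(lastWord, viaSplit))
```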
Both approaches give the following final result:
> D
                   ngram        queryWord    predict
1:          in_the_years           in_the      years
2:  the_years_thereafter        the_years thereafter
3: years_thereafter_most years_thereafter       most
4:        he_wasn't_home        he_wasn't       home
5:           how_are_you          how_are        you
6:    thereafter_most_of  thereafter_most         of
Used data:
D <- fread("ngram
in_the_years
the_years_thereafter
years_thereafter_most
he_wasn't_home
how_are_you
thereafter_most_of", header = TRUE)