Extracting first names in R
Say I have a vector of people's names in my data frame:
names <- c("Bernice Ingram", "Dianna Dean", "Philip Williamson", "Laurie Abbott",
"Rochelle Price", "Arturo Fisher", "Enrique Newton", "Sarah Mann",
"Darryl Graham", "Arthur Hoffman")
I want to create a vector with the first names. All I know about them is that they come first in the vector above and that they're followed by a space. In other words, this is what I'm looking for:
"Bernice" "Dianna" "Philip" "Laurie" "Rochelle"
"Arturo" "Enrique" "Sarah" "Darryl" "Arthur"
I've found a similar question here, but the answers (especially this one) haven't helped much. So far, I've tried a couple of variations of functions from the grep family, and the closest I could get to something useful was by running strsplit(names, " ") to separate first names and then strsplit(names, " ")[[1]][1] to get just the first name of the first person. I've been trying to tweak this last command to give me a whole vector of first names, to no avail.
Use sapply to extract the first name:
> sapply(strsplit(names, " "), `[`, 1)
[1] "Bernice" "Dianna" "Philip" "Laurie" "Rochelle" "Arturo" "Enrique"
[8] "Sarah" "Darryl" "Arthur"
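For anyone wondering why this works: strsplit returns a list with one character vector per name, and `[` is the ordinary subsetting function, so sapply picks element 1 out of each list entry. Below is a minimal sketch of the same idea with vapply, which additionally checks that every result is a single string. The variable names full_names, parts and first_names are only illustrative and not part of the original answer.

```r
# Shortened, illustrative copy of the vector from the question
full_names <- c("Bernice Ingram", "Dianna Dean", "Philip Williamson")

# strsplit() returns a list: one character vector of pieces per input string
parts <- strsplit(full_names, " ", fixed = TRUE)

# Take the first piece of each element; vapply() stops with an error
# if any result is not a length-1 character vector
first_names <- vapply(parts, `[`, character(1), 1)
first_names
```

Using a name other than names for the vector also avoids masking the base function names(), though the original code runs fine either way.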
Some comments:
The above works just fine. To make it a bit more general, you could change the split parameter of the strsplit function from " " to "\\s+", which covers multiple spaces. You could also use gsub to directly extract everything before a space. This last approach uses only one function call and is likely to be faster (but I haven't checked with a benchmark).
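The two suggestions in this comment could look roughly like the sketch below. The input vector messy is made up for illustration (double spaces and a tab), and sub is used in place of gsub because only the first match matters here; no speed comparison is implied.

```r
# Hypothetical input with irregular spacing, for illustration only
messy <- c("Bernice  Ingram", "Dianna\tDean", "Philip Williamson")

# Split on one or more whitespace characters, then keep the first piece
first_split <- sapply(strsplit(messy, "\\s+"), `[`, 1)

# Or delete everything from the first whitespace character onward
first_sub <- sub("\\s.*$", "", messy)

first_split
first_sub
```

Both approaches assume the strings do not start with whitespace; if they might, wrapping the input in trimws() first is one way to handle that.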
For what you want, here's a pretty unorthodox way to do it:
read.table(text = names, header = FALSE, stringsAsFactors=FALSE, fill = TRUE)[[1]]
# [1] "Bernice" "Dianna" "Philip" "Laurie" "Rochelle" "Arturo" "Enrique" "Sarah"
# [9] "Darryl" "Arthur"
This seems to work:
unlist(strsplit(names,' '))[seq(1,2*length(names),2)]
This assumes that no first or last names have spaces in them.
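To see why that assumption matters, here is a small sketch with a made-up vector (people) in which one entry has three space-separated parts. The fixed-stride index drifts out of alignment after that entry, while extracting per element does not.

```r
# Hypothetical names; the second entry has three space-separated parts
people <- c("Bernice Ingram", "Mary Ann Smith", "Arturo Fisher")

# Fixed-stride indexing assumes exactly two parts per name
by_index <- unlist(strsplit(people, " "))[seq(1, 2 * length(people), 2)]
by_index
# [1] "Bernice" "Mary"    "Smith"

# Per-element extraction keeps each entry's first piece,
# regardless of how many parts the entry has
by_element <- sapply(strsplit(people, " "), `[`, 1)
by_element
# [1] "Bernice" "Mary"    "Arturo"
```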
Using a regular expression with gsub:
> gsub("^(.*?)\\s.*", "\\1", names)
[1] "Bernice" "Dianna" "Philip" "Laurie" "Rochelle" "Arturo" "Enrique" "Sarah"
[9] "Darryl" "Arthur"
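As a rough reading of that pattern: ^ anchors at the start of the string, (.*?) lazily captures as little as possible, \\s matches the first whitespace character, and .* consumes the rest, so replacing the whole match with \\1 leaves only the captured first word. The sketch below uses sub, since an anchored pattern can match at most once per string, and adds a made-up single-word entry to show that strings without whitespace come back unchanged.

```r
# Illustrative input; "Cher" is a hypothetical single-word name
x <- c("Bernice Ingram", "Cher")

# Same pattern as above; sub() is enough because "^" allows only one match
result <- sub("^(.*?)\\s.*", "\\1", x)
result
```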