
Isolating partial text in an R data frame

I have an R data frame that contains US state and county names in one column. The data is in the format:

United States - State name - County name

where each cell is a unique county. For example:

United States - North Carolina - Wake County
United States - North Carolina - Warren County
etc.

I need to break the column into two columns, one containing just the state name and the other containing just the county name. I've experimented with sub and gsub but am not getting any results. I understand this is probably a simple matter for R experts, but I'm a newbie. I would be most grateful if anyone could point me in the right direction.

You can use tidyr's separate function:

library(tidyr)
df <- separate(df, currentColumn, into = c("Country", "State", "County"), sep = " - ")

This assumes the data is as shown in your question (including United States as the country), that your data frame is called df, and that the column holding the data is called currentColumn.

Example:

df <- data.frame(currentColumn = c("United States - North Carolina - Wake County",
 "United States - North Carolina - Warren County"), val = rnorm(2))

df
#                                   currentColumn       val
#1   United States - North Carolina - Wake County 0.8173619
#2 United States - North Carolina - Warren County 0.4941976

separate(df, currentColumn, into = c("Country", "State", "County"), sep = " - ")
#        Country          State        County       val
#1 United States North Carolina   Wake County 0.8173619
#2 United States North Carolina Warren County 0.4941976
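
The question asks for only two columns (state and county), while the call above also keeps the country. A minimal sketch of one way to drop it, assuming the same hypothetical df and currentColumn as above (the val values here are made up), relies on separate accepting NA in into to omit that piece:

```r
library(tidyr)

# Hypothetical sample data mirroring the example above
df <- data.frame(
  currentColumn = c("United States - North Carolina - Wake County",
                    "United States - North Carolina - Warren County"),
  val = c(1, 2)
)

# An NA entry in `into` omits that piece from the output,
# so the leading "United States" part is discarded
two_cols <- separate(df, currentColumn,
                     into = c(NA, "State", "County"), sep = " - ")

names(two_cols)
# [1] "State"  "County" "val"
```

Dropping the column afterwards with two_cols$Country <- NULL would work just as well; the NA form simply saves a step.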

Using read.table, and assuming your data is in df$var (as a character vector):

read.table(text=df$var,sep="-",strip.white=TRUE,
           col.names=c("Country","State","County"))

If speed is an issue, then strsplit will be a lot quicker:

setNames(data.frame(do.call(rbind,strsplit(df$var,split=" - "))),
         c("Country","State","County"))

Both give:

#        Country          State        County
#1 United States North Carolina   Wake County
#2 United States North Carolina Warren County
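
Since the question mentions sub and gsub, here is a rough base R sketch of how a regular expression could pull out the two pieces directly. It assumes every value contains exactly two " - " separators and that df$var is a character vector; the Florida row is a made-up addition to show a hyphenated county name. Note that splitting on a bare "-", as in the read.table call above, would break such a name into extra fields, whereas matching " - " with the surrounding spaces does not.

```r
# Hypothetical sample data; the column name `var` follows the answer above
df <- data.frame(
  var = c("United States - North Carolina - Wake County",
          "United States - Florida - Miami-Dade County"),
  stringsAsFactors = FALSE
)

# Capture the second and third fields delimited by " - "
pattern <- "^.* - (.*) - (.*)$"

# \\1 is the state, \\2 is the county
df$State  <- sub(pattern, "\\1", df$var)
df$County <- sub(pattern, "\\2", df$var)

df$State
# [1] "North Carolina" "Florida"
df$County
# [1] "Wake County"       "Miami-Dade County"
```

Values that do not match the pattern are returned unchanged by sub, so it may be worth checking for rows where State equals the original string.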
