Removing characters after space in a string - R Studio data cleaning
I am attempting to clean some data in R Studio.
Here's an example of my data.
LSOA name:
York 009A
Wychavon 014A
Bath and North East Somerset 001A
Aylesbury Vale 008C
Central Bedfordshire 030C
I want to be able to remove the code from the end of each name, so that the resulting data looks like this:
LSOA name:
York
Wychavon
Bath and North East Somerset
Aylesbury Vale
Central Bedfordshire
I am quite new to regex, so I am finding this quite difficult. From what I can tell, because there is a variable number of words before the code, simply removing everything after the first whitespace is not possible.
Any help would be hugely appreciated!
We can use sub to match one or more spaces (\\s+) followed by one or more digits (\\d+) and an upper case letter ([A-Z]) at the end ($) of the string, and replace the match with an empty string ("").
df1$name <- sub("\\s+\\d+[A-Z]$", "", df1$name)
-output
df1
# name
#1 York
#2 Wychavon
#3 Bath and North East Somerset
#4 Aylesbury Vale
#5 Central Bedfordshire
df1 <- structure(list(name = c("York 009A", "Wychavon 014A",
"Bath and North East Somerset 001A",
"Aylesbury Vale 008C", "Central Bedfordshire 030C")), class = "data.frame",
row.names = c(NA,
-5L))
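The concern in the question, that the number of words before the code varies, does not matter for a pattern anchored with $, because the match is located from the end of the string. As a rough sketch of that idea, the looser pattern below drops whatever the last whitespace-separated token is. The vector name lsoa is made up for illustration and the values are copied from the question. This variant assumes every value really ends in a code; a name without one, such as "Aylesbury Vale", would lose its last word, which is why the stricter \\d+[A-Z] pattern above is the safer choice.

```r
# Sample values copied from the question (the name `lsoa` is illustrative)
lsoa <- c("York 009A", "Wychavon 014A",
          "Bath and North East Somerset 001A",
          "Aylesbury Vale 008C", "Central Bedfordshire 030C")

# Remove the final run of whitespace plus the last non-whitespace token.
# The $ anchor ties the match to the end, so earlier spaces are untouched.
cleaned <- sub("\\s+\\S+$", "", lsoa)
print(cleaned)
```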
You can also use a lookahead (?=\\s\\d+) and the backreference \\1:
sub("(.*)(?=\\s\\d+).*", "\\1", df1$name, perl = TRUE)
[1] "York" "Wychavon" "Bath and North East Somerset" "Aylesbury Vale"
[5] "Central Bedfordshire"
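To make the behaviour of the greedy (.*) more concrete, here is a small sketch comparing it with a plainer pattern that needs neither a lookahead nor a backreference. Because (.*) is greedy, the regex engine backtracks from the end of the string and stops at the last position followed by a space and digits. The plain pattern instead cuts at the first such position. The second test value is invented purely to show the difference and does not come from the question's data.

```r
# First value is from the question; the second is a made-up name containing digits
x <- c("York 009A", "Area 51 North 009A")

# Greedy capture + lookahead: keeps everything before the LAST " <digits>"
greedy <- sub("(.*)(?=\\s\\d+).*", "\\1", x, perl = TRUE)

# Plain pattern: removes everything from the FIRST " <digits>" onwards
first <- sub("\\s\\d+.*$", "", x)

print(greedy)
print(first)
```

For the data in the question both patterns give the same result, since none of the names contains digits.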
Another option is str_extract with the negated character class \\D, which matches any character that is not a digit (trimws removes the trailing whitespace).
library(stringr)
trimws(str_extract(df1$name, "\\D+"))
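If loading stringr is not desirable, roughly the same idea can be sketched in base R: delete everything from the first digit onwards, then trim the leftover space. This assumes, as in the question's data, that the names themselves contain no digits; the vector name lsoa is illustrative.

```r
# Sample values copied from the question (the name `lsoa` is illustrative)
lsoa <- c("York 009A", "Bath and North East Somerset 001A")

# Strip from the first digit to the end, then trim surrounding whitespace
base_cleaned <- trimws(sub("\\d.*$", "", lsoa))
print(base_cleaned)
```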