
R function for removing everything but a certain string in a column

I'm having trouble isolating the country name in a column. Each value contains the country name followed by its coordinates, like this:

CBD, Sydney.mw-parser-output .geo-default,.mw-parser-output .geo-dms,.mw-parser-output .geo-dec{display:inline}.mw-parser-output .geo-nondefault,.mw-parser-output .geo-multi-punct{display:none}.mw-parser-output .longitude,.mw-parser-output .latitude{white-space:nowrap}30°50′30″N 29°39′50″E / 30.84167°N 29.66389°E / 30.84167; 29.66389 (Abu Mena)

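I want to remove everything except the country part, but I don't know how. I only have base R, dplyr, rvest, regex, and textreadr at my disposal. I'd like to use a regular expression, since I have little experience with regex and want to learn. Thanks!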

A few more examples would help to generalize the regex pattern. This works for the one example you shared.

x <- "Sydney, Australia, 65(degrees),20',30'N,78(degrees),45',87\"E"
sub('.*?, (.*?), \\d+\\(degrees\\).*', '\\1', x)
#[1] "Australia"

Or, more simply, just capture the text after the first comma.

sub('.*?, (.*?),.*', '\\1', x)
#[1] "Australia"
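As a minimal sketch of how this second pattern reads, the block below annotates each piece and compares the result with a plain comma split. The variable names (`country`, `fields`, `second_field`) are illustrative only and are not part of the original answer.

```r
x <- "Sydney, Australia, 65(degrees),20',30'N,78(degrees),45',87\"E"

# '.*?, '  lazily consumes everything up to the first ", "
# '(.*?)'  captures as little as possible ...
# ',.*'    ... up to the next comma, then swallows the rest of the string
country <- sub('.*?, (.*?),.*', '\\1', x)

# The same idea without lazy quantifiers: split on commas, take the 2nd field.
fields <- strsplit(x, ",\\s*")[[1]]
second_field <- fields[2]

print(country)
print(second_field)
```

The split version is easier to read, but it only works when the country is cleanly delimited by commas on both sides, which is not the case for the updated data.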

For the updated example, this seems to work:

sub('.*?,\\s+(.*?)(\\d+|\\.).*', '\\1', df$country)
#[1] "Egypt"     "Niger"     "Syria"     "Syria"     "Syria"     "Venezuela" "Peru"  
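One thing to keep in mind is that `sub()` returns its input unchanged when the pattern does not match, so a malformed row would silently keep its full text. A minimal sketch of guarding against that is shown here. The rows are simplified ASCII stand-ins for the real data, and `df_small`, `matched`, and `country_name` are hypothetical names introduced for illustration.

```r
# Two simplified rows modelled on the sample data, plus one made-up row
# that has no comma and therefore cannot match the pattern.
df_small <- data.frame(
  country = c(
    "Arlit Department, Niger18.283N 8.000E",
    "Aleppo Governorate,  Syria36.233N 37.167E",
    "no comma here"
  ),
  stringsAsFactors = FALSE
)

pattern <- '.*?,\\s+(.*?)(\\d+|\\.).*'

# grepl() tells us which rows the pattern actually matches.
matched <- grepl(pattern, df_small$country)

# Extract the country where the pattern matched; use NA elsewhere
# instead of leaving the raw string in place.
df_small$country_name <- ifelse(
  matched,
  sub(pattern, '\\1', df_small$country),
  NA_character_
)

print(df_small$country_name)
```

Writing the result to a new column rather than overwriting `country` keeps the raw text available for checking rows that come back as `NA`.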

Data

It is easier to help if you share your data in a reproducible, easy-to-copy form, for example with dput as below.

df <- structure(list(country = c("EgyAbusir, Egypt.mw-parser-output .geo-default,.mw-parser-output .geo-dms,.mw-parser-output .geo-dec{display:inline}.mw-parser-output .geo-nondefault,.mw-parser-output .geo-multi-punct{display:none}.mw-parser-output .longitude,.mw-parser-output .latitude{white-space:nowrap}30°50′30″N 29°39′50″E / 30.84167°N 29.66389°E / 30.84167; 29.66389 (Abu Mena)", 
"Niger1Arlit Department, Niger18°17′N 8°0′E / 18.283°N 8.000°E / 18.283; 8.000 (Air and Ténéré Natural Reserves)", 
"Aleppo Governorate,  Syria36°14′N 37°10′E / 36.233°N 37.167°E / 36.233; 37.167 (Ancient City of Aleppo)", 
"Daraa Governorate,  Syria32°31′5″N 36°28′54″E / 32.51806°N 36.48167°E / 32.51806; 36.48167 (Ancient City of Bosra)", 
"Damascus Governorate,  Syria33°30′41″N 36°18′23″E / 33.51139°N 36.30639°E / 33.51139; 36.30639 (Ancient City of Damascus)", 
"VenFalcón, Venezuela11°25′N 69°40′W / 11.417°N 69.667°W / 11.417; -69.667 (Coro and its Port)", 
"PerLa Libertad, Peru8°6′40″S 79°4′30″W / 8.11111°S 79.07500°W / -8.11111; -79.07500 (Chan Chan Archaeological Zone)"
)), class = "data.frame", row.names = c(NA, -7L))
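For readers unfamiliar with `dput`, here is a rough sketch of the round trip: it prints R code that rebuilds an object, so pasting that code into a new session recreates the data. The tiny data frame and the names `df_small`, `txt`, and `df_copy` are made up for illustration.

```r
# A small made-up data frame standing in for the real one.
df_small <- data.frame(
  country = c("Town A, Country1", "Town B, Country2"),
  stringsAsFactors = FALSE
)

# dput() prints a structure(...) call that can be pasted into a question.
dput(df_small)

# Simulate copy-and-paste: capture the printed code and evaluate it again.
txt <- paste(capture.output(dput(df_small)), collapse = "\n")
df_copy <- eval(parse(text = txt))

# The rebuilt object should match the original.
print(identical(df_small, df_copy))
```

For large data, `dput(head(df, 10))` is a common way to share just enough rows to reproduce the problem.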

If the country always comes right after the first comma, this works:

library(stringr)
str_extract(df$country, "(?<=,\\s{1,10})[A-Za-z]+(?=[^A-Za-z])")
#[1] "Egypt"     "Niger"     "Syria"     "Syria"     "Syria"     "Venezuela" "Peru"
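If stringr is not available, a similar extraction can be sketched in base R. This is an alternative technique, not the answer's own method: it uses `\\K` with `perl = TRUE` to discard the comma and spaces from the match, because PCRE does not accept a variable-length lookbehind such as `(?<=,\\s{1,10})`. The input rows are simplified ASCII stand-ins for the real data, and `x`, `m`, and `countries` are illustrative names.

```r
# Simplified stand-ins for two rows of the sample data.
x <- c(
  "Arlit Department, Niger18.283N 8.000E",
  "Aleppo Governorate,  Syria36.233N 37.167E"
)

# ',\\s*'      match the first comma and any following whitespace
# '\\K'        forget everything matched so far (PCRE only)
# '[A-Za-z]+'  keep the run of letters that follows
m <- regexpr(",\\s*\\K[A-Za-z]+", x, perl = TRUE)

# regmatches() returns the matched text for each element that matched.
countries <- regmatches(x, m)

print(countries)
```

Note that `regmatches()` drops elements with no match, so the result can be shorter than the input, whereas `str_extract()` returns `NA` in those positions. Both versions also assume the country name is made of ASCII letters only.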

Data: the same `df` as in @Ronak's answer above.

