
R function for removing everything but a certain string in a column

I've run into a problem trying to isolate a country name in a column. Each value contains the name of the country followed by its coordinates, like this:

1
CBD, Sydney.mw-parser-output .geo-default,.mw-parser-output .geo-dms,.mw-parser-output .geo-dec{display:inline}.mw-parser-output .geo-nondefault,.mw-parser-output .geo-multi-punct{display:none}.mw-parser-output .longitude,.mw-parser-output .latitude{white-space:nowrap}30°50′30″N 29°39′50″E / 30.84167°N 29.66389°E / 30.84167; 29.66389 (Abu Mena)

I would like to remove everything but the country part, but I don't know how. I only have base R, dplyr, rvest and textreadr at my disposal. I'd like to use a regex for this, as I have very little experience with regular expressions and would like to learn how. Thanks!

A few more examples would help to generalize the regex pattern. This works for the one example that you have shared.

x <- "Sydney, Australia, 65(degrees),20',30'N,78(degrees),45',87\"E"
sub('.*?, (.*?), \\d+\\(degrees\\).*', '\\1', x)
#[1] "Australia"

Or, more simply, capture just the text between the first and second commas:

sub('.*?, (.*?),.*', '\\1', x)
#[1] "Australia"
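
Since you said you want to learn regex: the part that does the work here is the lazy quantifier `.*?`, which stops at the first place the rest of the pattern can match. A small sketch of the difference on the same string (I pass `perl = TRUE` here so that the usual backtracking rules apply; that argument is my addition, not part of the calls above):

```r
x <- "Sydney, Australia, 65(degrees),20',30'N,78(degrees),45',87\"E"

# Lazy `.*?` stops at the FIRST ", ", so the capture group is the text
# between the first and second commas.
lazy <- sub(".*?, (.*?),.*", "\\1", x, perl = TRUE)

# Greedy `.*` runs to the LAST ", " in the string, so the capture group
# starts after "Australia, " instead.
greedy <- sub(".*, (.*?),.*", "\\1", x, perl = TRUE)

print(lazy)    # "Australia"
print(greedy)  # "65(degrees)"
```

The trailing `.*` makes the pattern consume the whole string, which is why `sub()` returns only the captured group `\\1`.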

For the updated example this seems to work:

sub('.*?,\\s+(.*?)(\\d+|\\.).*', '\\1', df$country)
#[1] "Egypt"     "Niger"     "Syria"     "Syria"     "Syria"     "Venezuela" "Peru"  
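
One thing to watch with `sub()`: when the pattern does not match, the input comes back unchanged, so a row without a comma would silently keep its full text. Below is a minimal sketch that returns `NA` for such rows instead. `extract_country` is a hypothetical helper name, the strings are shortened ASCII-only excerpts of the data plus one made-up row, and `perl = TRUE` is my own choice.

```r
# Shortened excerpts of the rows in the data, plus one invented row
# that contains no comma at all.
x <- c("EgyAbusir, Egypt.mw-parser-output .geo-default",
       "Niger1Arlit Department, Niger18",
       "Aleppo Governorate,  Syria36",
       "no comma here")

# .*?,       everything up to and including the first comma
# \\s+       one or more spaces after it
# (.*?)      the country (as short as possible)
# (\\d+|\\.) ... up to the first digit or literal dot
# .*         the rest of the string
pattern <- ".*?,\\s+(.*?)(\\d+|\\.).*"

# Hypothetical helper: like sub(), but NA where the pattern does not match.
extract_country <- function(x, pattern) {
  out <- sub(pattern, "\\1", x, perl = TRUE)
  out[!grepl(pattern, x, perl = TRUE)] <- NA_character_
  out
}

country <- extract_country(x, pattern)
print(country)
```

With these four inputs the first three give "Egypt", "Niger" and "Syria", and the last one is `NA` rather than the untouched string.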

data

It is easier to help if you share data in a reproducible way (for example with dput, as below), which makes it easy to copy.

df <- structure(list(country = c("EgyAbusir, Egypt.mw-parser-output .geo-default,.mw-parser-output .geo-dms,.mw-parser-output .geo-dec{display:inline}.mw-parser-output .geo-nondefault,.mw-parser-output .geo-multi-punct{display:none}.mw-parser-output .longitude,.mw-parser-output .latitude{white-space:nowrap}30°50′30″N 29°39′50″E / 30.84167°N 29.66389°E / 30.84167; 29.66389 (Abu Mena)", 
"Niger1Arlit Department, Niger18°17′N 8°0′E / 18.283°N 8.000°E / 18.283; 8.000 (Air and Ténéré Natural Reserves)", 
"Aleppo Governorate,  Syria36°14′N 37°10′E / 36.233°N 37.167°E / 36.233; 37.167 (Ancient City of Aleppo)", 
"Daraa Governorate,  Syria32°31′5″N 36°28′54″E / 32.51806°N 36.48167°E / 32.51806; 36.48167 (Ancient City of Bosra)", 
"Damascus Governorate,  Syria33°30′41″N 36°18′23″E / 33.51139°N 36.30639°E / 33.51139; 36.30639 (Ancient City of Damascus)", 
"VenFalcón, Venezuela11°25′N 69°40′W / 11.417°N 69.667°W / 11.417; -69.667 (Coro and its Port)", 
"PerLa Libertad, Peru8°6′40″S 79°4′30″W / 8.11111°S 79.07500°W / -8.11111; -79.07500 (Chan Chan Archaeological Zone)"
)), class = "data.frame", row.names = c(NA, -7L))

If the country is always after the first comma, this works:

library(stringr)
str_extract(df$country, "(?<=,\\s{1,10})[A-Za-z]+(?=[^A-Za-z])")
#[1] "Egypt"     "Niger"     "Syria"     "Syria"     "Syria"     "Venezuela" "Peru"
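
If stringr is not available, roughly the same extraction can be sketched in base R with `regexpr()` and `regmatches()`. This is an assumption-laden variant, not a drop-in translation: I use `\\K` (which discards everything matched so far) in place of the lookbehind, drop the lookahead, and test on shortened ASCII-only excerpts of the data.

```r
# Shortened excerpts of four of the rows in the data.
x <- c("EgyAbusir, Egypt.mw-parser-output .geo-default",
       "Niger1Arlit Department, Niger18",
       "Aleppo Governorate,  Syria36",
       "PerLa Libertad, Peru8")

# ,\\s+      the first comma and the spaces after it
# \\K        forget what was matched so far (PCRE only, hence perl = TRUE)
# [A-Za-z]+  the run of ASCII letters that follows
m <- regexpr(",\\s+\\K[A-Za-z]+", x, perl = TRUE)

# regmatches() drops non-matching elements, so fill a vector of NAs
# to keep one result per input row.
country <- rep(NA_character_, length(x))
country[m > 0] <- regmatches(x, m)

print(country)
```

Like the stringr pattern, `[A-Za-z]+` stops at the first character that is not an ASCII letter, so a country name containing a space or an accented letter would be cut short.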

Data: the same df as in @Ronak's answer above.

