R regex / gsub : extract part of pattern
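I have a list of weather stations and their locations by latitude and longitude. There is a formatting problem: some values have hours and minutes, while others have hours, minutes, and seconds. I can find the pattern with a regex, but I am having trouble extracting the individual pieces.
Here is the data: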
> head(wthrStat1 )
Station lat lon
1940 K01R 31-08N 092-34W
1941 K01T 28-08N 094-24W
1942 K03Y 48-47N 096-57W
1943 K04V 38-05-50N 106-10-07W
1944 K05F 31-25-16N 097-47-49W
1945 K06D 48-53-04N 099-37-15W
I would like something like this:
Station latHr latMin latSec latDir lonHr lonMin lonSec lonDir
1940 K01R 31 08 00 N 092 34 00 W
1941 K01T 28 08 00 N 094 24 00 W
1942 K03Y 48 47 00 N 096 57 00 W
1943 K04V 38 05 50 N 106 10 07 W
1944 K05F 31 25 16 N 097 47 49 W
1945 K06D 48 53 04 N 099 37 15 W
I can get matches with this regex:
data.format <- "\\d{1,3}-\\d{1,3}(?:-\\d{1,3})?[NSWE]{1}"
grep(data.format, wthrStat1$lat)
But I am not sure how to get the individual parts into columns. I have tried a few things, such as:
wthrStat1$latHr <- ifelse(grepl(data.format, wthrStat1$lat), gsub(????), NA)
But no luck.
Here is a dput():
> dput(wthrStat1[1:10,] )
structure(list(Station = c("K01R", "K01T", "K03Y", "K04V", "K05F",
"K06D", "K07G", "K07S", "K08D", "K0B9"), lat = c("31-08N", "28-08N",
"48-47N", "38-05-50N", "31-25-16N", "48-53-04N", "42-34-28N",
"47-58-27N", "48-18-03N", "43-20N"), lon = c("092-34W", "094-24W",
"096-57W", "106-10-07W", "097-47-49W", "099-37-15W", "084-48-41W",
"117-25-42W", "102-24-23W", "070-24W")), .Names = c("Station",
"lat", "lon"), row.names = 1940:1949, class = "data.frame")
Any suggestions?
strapplyc
strapplyc in the gsubfn package extracts each parenthesized group of the regular expression:
library(gsubfn)
data.format <- "(\\d{1,3})-(\\d{1,3})-?(\\d{1,3})?([NSWE]{1})"
parts <- strapplyc(wthrStat1$lat, data.format, simplify = rbind)
parts[parts == ""] <- "00"
This gives:
> parts
[,1] [,2] [,3] [,4]
[1,] "31" "08" "00" "N"
[2,] "28" "08" "00" "N"
[3,] "48" "47" "00" "N"
[4,] "38" "05" "50" "N"
[5,] "31" "25" "16" "N"
[6,] "48" "53" "04" "N"
[7,] "42" "34" "28" "N"
[8,] "47" "58" "27" "N"
[9,] "48" "18" "03" "N"
[10,] "43" "20" "00" "N"
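To get from this matrix to the columns asked for in the question, for both lat and lon, something like the following might work. It is only a sketch in base R (using regexec and regmatches rather than gsubfn), and split_dms is a made-up helper name, not a function from any package.

```r
# Hypothetical helper (not from any package): split "HH-MM[-SS]D" strings
# into hours, minutes, seconds and direction, padding missing seconds with "00".
split_dms <- function(x, prefix) {
  pattern <- "^([0-9]{1,3})-([0-9]{1,3})-?([0-9]{0,3})([NSEW])$"
  # regexec() records the position of every capture group;
  # regmatches() then returns c(full match, group 1, ..., group 4) per string.
  m <- regmatches(x, regexec(pattern, x, perl = TRUE))
  parts <- t(vapply(
    m,
    # Strings that do not match at all become a row of NA.
    function(p) if (length(p) == 5) p[-1] else rep(NA_character_, 4),
    character(4)
  ))
  parts[!is.na(parts) & parts == ""] <- "00"
  colnames(parts) <- paste0(prefix, c("Hr", "Min", "Sec", "Dir"))
  as.data.frame(parts, stringsAsFactors = FALSE)
}

# Two rows taken from the question's data
wthrStat1 <- data.frame(
  Station = c("K01R", "K04V"),
  lat = c("31-08N", "38-05-50N"),
  lon = c("092-34W", "106-10-07W"),
  stringsAsFactors = FALSE
)

result <- cbind(
  wthrStat1["Station"],
  split_dms(wthrStat1$lat, "lat"),
  split_dms(wthrStat1$lon, "lon")
)
```

All columns stay character, so leading zeros such as "092" are preserved.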
It is very inefficient, and I hope someone else has a better solution:
dat <- read.table(text =' Station lat lon
1940 K01R 31-08N 092-34W
1941 K01T 28-08N 094-24W
1942 K03Y 48-47N 096-57W
1943 K04V 38-05-50N 106-10-07W
1944 K05F 31-25-16N 097-47-49W
1945 K06D 48-53-04N 099-37-15W', head=T)
# Capture groups: 1 = hours, 2 = minutes, 3 = seconds (may be empty), 4 = direction
pattern <- '^([0-9]+)-([0-9]+)-?([0-9]*)([NSEW])$'
dat$latHr <- sub(pattern, '\\1', dat$lat, perl = TRUE)
dat$latMin <- sub(pattern, '\\2', dat$lat, perl = TRUE)
latSec <- sub(pattern, '\\3', dat$lat, perl = TRUE)
latSec[nchar(latSec) == 0] <- '00'  # pad missing seconds
dat$latSec <- latSec
dat$latDir <- sub(pattern, '\\4', dat$lat, perl = TRUE)
dat
Station lat lon latHr latMin latSec latDir
1940 K01R 31-08N 092-34W 31 08 00 N
1941 K01T 28-08N 094-24W 28 08 00 N
1942 K03Y 48-47N 096-57W 48 47 00 N
1943 K04V 38-05-50N 106-10-07W 38 05 50 N
1944 K05F 31-25-16N 097-47-49W 31 25 16 N
1945 K06D 48-53-04N 099-37-15W 48 53 04 N
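One way to avoid repeating the substitution for every group might be to normalise the strings first and then split them once. A rough base-R sketch on a few of the values above (the variable names are just for illustration):

```r
lat <- c("31-08N", "38-05-50N", "43-20N")

# Step 1: pad values that lack seconds, e.g. "31-08N" -> "31-08-00N".
# Values that already have seconds do not match this pattern and are left alone.
padded <- sub("^([0-9]+-[0-9]+)([NSEW])$", "\\1-00\\2", lat)

# Step 2: put a dash before the direction letter so every field is dash-separated.
delimited <- sub("([NSEW])$", "-\\1", padded)

# Step 3: split on the dashes and stack the pieces into a character matrix.
parts <- do.call(rbind, strsplit(delimited, "-", fixed = TRUE))
colnames(parts) <- c("latHr", "latMin", "latSec", "latDir")
```

This assumes every value is well formed; a malformed string would produce a different number of pieces and break the rbind step.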
Another answer, using stringr:
# example data
data <-
"Station lat lon
1940 K01R 31-08N 092-34W
1941 K01T 28-08N 094-24W
1942 K03Y 48-47N 096-57W
1943 K04V 38-05-50N 106-10-07W
1944 K05F 31-25-16N 097-47-49W
1945 K06D 48-53-04N 099-37-15W"
## read string into a data.frame
df <- read.table(text=data, head=T, stringsAsFactors=F)
pattern <- "(\\d{1,3})-(\\d{1,3})(?:-(\\d{1,3}))?([NSWE]{1})"
library(stringr)
str_match(df$lat, pattern)
This produces a character matrix with one column for the whole matched string and one column for each capture group.
[,1] [,2] [,3] [,4] [,5]
[1,] "31-08N" "31" "08" "" "N"
[2,] "28-08N" "28" "08" "" "N"
[3,] "48-47N" "48" "47" "" "N"
[4,] "38-05-50N" "38" "05" "50" "N"
[5,] "31-25-16N" "31" "25" "16" "N"
[6,] "48-53-04N" "48" "53" "04" "N"
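To turn that matrix into the columns from the question, a sketch along these lines may help. It assumes stringr is installed. Current stringr versions, which are built on stringi, return NA rather than "" for an optional group that did not match, so the code below handles both cases.

```r
library(stringr)

lat <- c("31-08N", "38-05-50N")
pattern <- "(\\d{1,3})-(\\d{1,3})(?:-(\\d{1,3}))?([NSWE])"

m <- str_match(lat, pattern)
parts <- m[, -1, drop = FALSE]  # drop column 1, the full match
colnames(parts) <- c("latHr", "latMin", "latSec", "latDir")

# Pad seconds when the optional group did not match (NA or "", depending on version)
secs <- parts[, "latSec"]
parts[is.na(secs) | secs == "", "latSec"] <- "00"

lat_cols <- as.data.frame(parts, stringsAsFactors = FALSE)
```

lat_cols can then be combined with the original data frame using cbind().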
R's string-processing capabilities have come a long way in the last few years.