R regex / gsub: extract part of pattern

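I have a list of weather stations and their locations given as latitude and longitude. There is a formatting problem: some entries have hours and minutes, while others have hours, minutes and seconds. I can match the pattern with a regular expression, but I can't extract the individual parts.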

Here is the data:

> head(wthrStat1 )
     Station       lat        lon
1940    K01R    31-08N    092-34W
1941    K01T    28-08N    094-24W
1942    K03Y    48-47N    096-57W
1943    K04V 38-05-50N 106-10-07W
1944    K05F 31-25-16N 097-47-49W
1945    K06D 48-53-04N 099-37-15W

I would like something like this:

     Station latHr latMin latSec latDir lonHr lonMin lonSec lonDir
1940    K01R    31     08     00      N   092     34     00      W
1941    K01T    28     08     00      N   094     24     00      W
1942    K03Y    48     47     00      N   096     57     00      W
1943    K04V    38     05     50      N   106     10     07      W
1944    K05F    31     25     16      N   097     47     49      W
1945    K06D    48     53     04      N   099     37     15      W

I can get a match with this regular expression:

data.format <- "\\d{1,3}-\\d{1,3}(?:-\\d{1,3})?[NSWE]{1}"
grep(data.format, wthrStat1$lat)

But I'm not sure how to get the individual parts into columns. I have tried a few things, such as:

wthrStat1$latHr <- ifelse(grepl(data.format, wthrStat1$lat), gsub(????), NA)

but with no luck.

Here is a dput():

> dput(wthrStat1[1:10,] )
structure(list(Station = c("K01R", "K01T", "K03Y", "K04V", "K05F", 
"K06D", "K07G", "K07S", "K08D", "K0B9"), lat = c("31-08N", "28-08N", 
"48-47N", "38-05-50N", "31-25-16N", "48-53-04N", "42-34-28N", 
"47-58-27N", "48-18-03N", "43-20N"), lon = c("092-34W", "094-24W", 
"096-57W", "106-10-07W", "097-47-49W", "099-37-15W", "084-48-41W", 
"117-25-42W", "102-24-23W", "070-24W")), .Names = c("Station", 
"lat", "lon"), row.names = 1940:1949, class = "data.frame")

Any suggestions?

strapplyc in the gsubfn package extracts each parenthesized group of the regular expression:

library(gsubfn)
data.format <- "(\\d{1,3})-(\\d{1,3})-?(\\d{1,3})?([NSWE]{1})"
parts <- strapplyc(wthrStat1$lat, data.format, simplify = rbind)
parts[parts == ""] <- "00"

which gives:

> parts
      [,1] [,2] [,3] [,4]
 [1,] "31" "08" "00" "N" 
 [2,] "28" "08" "00" "N" 
 [3,] "48" "47" "00" "N" 
 [4,] "38" "05" "50" "N" 
 [5,] "31" "25" "16" "N" 
 [6,] "48" "53" "04" "N" 
 [7,] "42" "34" "28" "N" 
 [8,] "47" "58" "27" "N" 
 [9,] "48" "18" "03" "N" 
[10,] "43" "20" "00" "N" 
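
To get from this matrix to the columns asked for in the question, the parts can be named and bound back onto the data frame. The following is only a sketch: it assumes every value matches the pattern, it rebuilds the first six rows of the question's data by hand, and `split_dms` is a hypothetical helper name, not something provided by gsubfn.

```r
library(gsubfn)

# Sample data: the first six rows from the question
wthrStat1 <- data.frame(
  Station = c("K01R", "K01T", "K03Y", "K04V", "K05F", "K06D"),
  lat = c("31-08N", "28-08N", "48-47N", "38-05-50N", "31-25-16N", "48-53-04N"),
  lon = c("092-34W", "094-24W", "096-57W", "106-10-07W", "097-47-49W", "099-37-15W"),
  stringsAsFactors = FALSE
)

data.format <- "(\\d{1,3})-(\\d{1,3})-?(\\d{1,3})?([NSWE]{1})"

# split_dms (hypothetical helper): split one coordinate column into four
# character columns named <prefix>Hr, <prefix>Min, <prefix>Sec, <prefix>Dir
split_dms <- function(x, prefix) {
  parts <- strapplyc(x, data.format, simplify = rbind)
  parts[parts == ""] <- "00"  # a missing seconds field becomes "00"
  colnames(parts) <- paste0(prefix, c("Hr", "Min", "Sec", "Dir"))
  as.data.frame(parts, stringsAsFactors = FALSE)
}

result <- cbind(wthrStat1["Station"],
                split_dms(wthrStat1$lat, "lat"),
                split_dms(wthrStat1$lon, "lon"))
result
```

The values stay as character strings, so leading zeros such as "08" and "092" are preserved.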

This is very inefficient, and I hope someone else has a better solution:

dat <- read.table(text ='   Station       lat        lon
1940    K01R    31-08N    092-34W
1941    K01T    28-08N    094-24W
1942    K03Y    48-47N    096-57W
1943    K04V 38-05-50N 106-10-07W
1944    K05F 31-25-16N 097-47-49W
1945    K06D 48-53-04N 099-37-15W', header = TRUE, stringsAsFactors = FALSE)


# Groups: 1 = hours, 2 = minutes, 3 = seconds (may be empty), 4 = direction
pattern <- '([0-9]+)-([0-9]+)-?([0-9]*)([NSEW])'

dat$latHr  <- gsub(pattern, '\\1', dat$lat)
dat$latMin <- gsub(pattern, '\\2', dat$lat)

# Rows without a seconds field give an empty string; pad them with '00'
latSec <- gsub(pattern, '\\3', dat$lat)
latSec[nchar(latSec) == 0] <- '00'
dat$latSec <- latSec

# The direction is captured for every row, so no fallback value is needed
dat$latDir <- gsub(pattern, '\\4', dat$lat)

dat
     Station       lat        lon latHr latMin latSec latDir
1940    K01R    31-08N    092-34W    31     08     00      N
1941    K01T    28-08N    094-24W    28     08     00      N
1942    K03Y    48-47N    096-57W    48     47     00      N
1943    K04V 38-05-50N 106-10-07W    38     05     50      N
1944    K05F 31-25-16N 097-47-49W    31     25     16      N
1945    K06D 48-53-04N 099-37-15W    48     53     04      N
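
The repeated gsub calls can also be collapsed into one pass per column using only base R, via regexec and regmatches. This is a sketch rather than a tested replacement: the pattern is a slightly stricter variant written for this example, `extract_dms` is a hypothetical helper name, `perl = TRUE` in regexec needs R 3.3.0 or later, and the code stops with an error if any value fails to match.

```r
# Sample data: the first six rows from the question
dat <- data.frame(
  Station = c("K01R", "K01T", "K03Y", "K04V", "K05F", "K06D"),
  lat = c("31-08N", "28-08N", "48-47N", "38-05-50N", "31-25-16N", "48-53-04N"),
  lon = c("092-34W", "094-24W", "096-57W", "106-10-07W", "097-47-49W", "099-37-15W"),
  stringsAsFactors = FALSE
)

# Groups: hours, minutes, seconds (possibly empty), direction
pattern <- "^([0-9]{1,3})-([0-9]{1,3})-?([0-9]{0,3})([NSEW])$"

# extract_dms (hypothetical helper): one regexec pass, then one column per group
extract_dms <- function(x, prefix) {
  m <- regmatches(x, regexec(pattern, x, perl = TRUE))
  # Each element holds the full match plus four groups when the value matches
  stopifnot(all(lengths(m) == 5))
  parts <- do.call(rbind, m)[, -1, drop = FALSE]  # drop the full-match column
  parts[parts == ""] <- "00"                      # pad missing seconds
  colnames(parts) <- paste0(prefix, c("Hr", "Min", "Sec", "Dir"))
  as.data.frame(parts, stringsAsFactors = FALSE)
}

out <- cbind(dat, extract_dms(dat$lat, "lat"), extract_dms(dat$lon, "lon"))
out
```

Because the direction is captured for each row, data that mixes N/S or E/W values is handled without any fallback.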

Another answer, using stringr:

# example data
data <-
"Station       lat        lon
1940    K01R    31-08N    092-34W
1941    K01T    28-08N    094-24W
1942    K03Y    48-47N    096-57W
1943    K04V 38-05-50N 106-10-07W
1944    K05F 31-25-16N 097-47-49W
1945    K06D 48-53-04N 099-37-15W"

## read string into a data.frame
df <- read.table(text=data, head=T, stringsAsFactors=F)

pattern <- "(\\d{1,3})-(\\d{1,3})(?:-(\\d{1,3}))?([NSWE]{1})"

library(stringr)
str_match(df$lat, pattern)

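This produces a character matrix with one column for the whole matched string and one column for each capture group. In current stringr releases, an optional group that does not take part in the match is reported as NA (older releases returned an empty string instead).

     [,1]        [,2] [,3] [,4] [,5]
[1,] "31-08N"    "31" "08" NA   "N" 
[2,] "28-08N"    "28" "08" NA   "N" 
[3,] "48-47N"    "48" "47" NA   "N" 
[4,] "38-05-50N" "38" "05" "50" "N" 
[5,] "31-25-16N" "31" "25" "16" "N" 
[6,] "48-53-04N" "48" "53" "04" "N" 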

R's string processing abilities have improved a lot over the past few years.
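
To turn the str_match result into the requested columns, the full-match column can be dropped and the missing seconds filled in. A minimal sketch, assuming every row matches the pattern (a row that does not match at all would come back entirely as NA and be filled with "00" as well); the names `lat_parts` and `result` are chosen for this example only.

```r
library(stringr)

# Sample data: the first six rows from the question
df <- data.frame(
  Station = c("K01R", "K01T", "K03Y", "K04V", "K05F", "K06D"),
  lat = c("31-08N", "28-08N", "48-47N", "38-05-50N", "31-25-16N", "48-53-04N"),
  stringsAsFactors = FALSE
)

pattern <- "(\\d{1,3})-(\\d{1,3})(?:-(\\d{1,3}))?([NSWE]{1})"

# Column 1 of the str_match result is the whole match; keep only the groups
lat_parts <- str_match(df$lat, pattern)[, -1, drop = FALSE]

# Missing seconds show up as NA (or "" in old stringr versions); use "00"
lat_parts[is.na(lat_parts) | lat_parts == ""] <- "00"
colnames(lat_parts) <- c("latHr", "latMin", "latSec", "latDir")

result <- cbind(df["Station"], as.data.frame(lat_parts, stringsAsFactors = FALSE))
result
```

The same steps apply to the lon column with a different set of column names.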
