Parse address strings in R

I have address data in R in multiple address formats and would like to parse it into at least the significant address parts, so I can use the address to merge multiple datasets. However, since addresses can come in numerous formats, I need something that can distinguish, for example, the unit or apartment from the street and zip code.

The problem:

testaddress1 <- "20 W 34th St, New York, NY 10001"
testaddress2 <- "20 West 34 St, New York City, NY 10001"
testaddress3 <- "20 WEST 34th, NYC, NY 10001"

Is there an easy way in R to parse the address parts? Ideally into the parts below:

Number: 20; Direction: West; Street: 34; City: New York; State: NY; Zip: 10001

Units and recipients in addresses also present problems:

#Problem with units/apartments
testunit1 <- "UNIT 9A 740 Park Ave, New York, NY 10021"
testunit2 <- "740 Park Ave 9A, New York, NY 10021"
testunit3 <- "APT 9A, 740 Park Ave, New York, NY 10021"

#Ideal parse
Unit: 9A; Number: 740; Street: Park Ave; City: New York; State: NY; Zip: 10021

#Problem with recipient
testrec1 <- "John Doe UNIT 9A, 740 Park Ave, New York, NY 10021"
testrec2 <- "John Doe, 740 Park Ave 9A, New York, NY 10021"
testrec3 <- "JOHN DOE APT 9A, 740 Park Ave, New York, NY 10021"

#Ideal parse
Recipient: John Doe; Unit: 9A; Number: 740; Street: Park Ave; City: New York; State: NY; Zip: 10021

I found this, but it looks like a mess and I had trouble implementing it: https://slu-opengis.github.io/postmastr/articles/postmastr.html

Is there something that parses addresses automatically in R?

postmastr seems to work pretty well...

v.adresses <- c("20 W 34th St, New York, NY 10001", 
              "20 West 34 St, New York City, NY 10001", 
              "20 WEST 34th, NYC, NY 10001")

df <- data.frame(address = v.adresses)

library(postmastr)
library(magrittr)
library(tidycensus)
df
#***************************************************************
# STATES and POSTAL CODES #####
#***************************************************************
# Build states dictionary
stateDict <- pm_dictionary(locale = "us", type = "state")
#parse and get states + postalcodes
answer_1 <- df %>%
  pm_identify(var = "address") %>%
  pm_prep(var = "address", type = "street") 

answer <- answer_1 %>% 
  pm_postal_parse() %>%
  pm_state_parse(dictionary = stateDict)

#***************************************************************
# CITIES #####
#***************************************************************
# Create cities dictionary based on the states found in `answer`
#  A Census API key is needed (see the postmastr vignette)
# Run the two lines below once, with your own key in place of "#####"
#  census_api_key("#####", install = TRUE)
#  readRenviron("~/.Renviron")
# end of one-time setup
cityDict <- pm_dictionary(type = "city", filter = unique(answer$pm.state), locale = "us")
#  There seem to be addresses without correct cities
answer %>% pm_city_none(dictionary = cityDict)
#   pm.uid pm.address                  pm.state pm.zip
#    <int> <chr>                       <chr>    <chr> 
# 1      2 20 West 34 St New York City NY       10001 
# 2      3 20 WEST 34th NYC            NY       10001 
# So we append the cities to the dictionary
missingCity <- pm_append(type = "city", 
                         input = c("New York City", "NYC"), 
                         output = "New York", locale = "us")
# Build new cities dictionary
cityDict <- pm_dictionary(type = "city", filter = unique(answer$pm.state), 
                          append = missingCity, locale = "us")
# Do all lines have cities now?
answer %>% pm_city_all(dictionary = cityDict)
#TRUE
# Parse
answer <- answer %>% pm_city_parse(dictionary = cityDict)
#   pm.uid pm.address    pm.city  pm.state pm.zip
#    <int> <chr>         <chr>    <chr>    <chr> 
# 1      1 20 W 34th St  New York NY       10001 
# 2      2 20 West 34 St New York NY       10001 
# 3      3 20 WEST 34th  New York NY       10001 

#***************************************************************
# HOUSENUMBERS #####
#***************************************************************
answer <- answer %>% pm_house_parse()
#   pm.uid pm.address pm.house pm.city  pm.state pm.zip
#    <int> <chr>      <chr>    <chr>    <chr>    <chr> 
# 1      1 W 34th St  20       New York NY       10001 
# 2      2 West 34 St 20       New York NY       10001 
# 3      3 WEST 34th  20       New York NY       10001 

#***************************************************************
# STREETS #####
#***************************************************************
dirsDict <- pm_dictionary(type = "directional", locale = "us")
answer <- answer %>% 
  pm_streetDir_parse(dictionary = dirsDict) %>%
  pm_streetSuf_parse() %>%
  pm_street_parse(ordinal = TRUE, drop = TRUE)

pm_replace(answer, source = answer_1)
#   pm.uid pm.house pm.preDir pm.street pm.streetSuf pm.city  pm.state pm.zip
#    <int> <chr>    <chr>     <chr>     <chr>        <chr>    <chr>    <chr> 
# 1      1 20       W         34th      St           New York NY       10001 
# 2      2 20       W         34        St           New York NY       10001 
# 3      3 20       W         34th      NA           New York NY       10001 
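
The walkthrough above only covers the plain street examples; the unit and recipient strings from the question are not handled. As a rough pre-processing sketch (base R only, not part of postmastr), one could strip a keyword-style unit and a leading recipient before running the pipeline. The function name `split_unit_recipient` is hypothetical, and it rests on two assumptions: units are introduced by "UNIT" or "APT", and the street address begins at the first digit (the house number). A trailing unit with no keyword, as in "740 Park Ave 9A", is left in the address untouched.

```r
# Hypothetical helper, not a postmastr function: split off a unit such as
# "UNIT 9A" / "APT 9A" and any recipient text in front of the house number.
split_unit_recipient <- function(x) {
  # Unit keyword, optional dot, whitespace, then the unit id (captured),
  # followed by an optional comma and whitespace.
  unit_pat <- "\\b(?:unit|apt)\\.?\\s+([[:alnum:]-]+),?\\s*"

  hits <- regmatches(x, regexec(unit_pat, x, ignore.case = TRUE, perl = TRUE))
  unit <- vapply(
    hits,
    function(h) if (length(h) == 2) h[2] else NA_character_,
    character(1)
  )

  # Remove the unit phrase (first occurrence only) from the string.
  rest <- sub(unit_pat, "", x, ignore.case = TRUE, perl = TRUE)

  # Assumption: everything before the first digit is the recipient.
  lead <- sub("^([^0-9]*).*$", "\\1", rest)
  recipient <- trimws(gsub(",", "", lead))
  recipient[recipient == ""] <- NA_character_

  # What remains, starting at the house number, is the address to parse.
  address <- trimws(sub("^[^0-9]*", "", rest))

  data.frame(recipient = recipient, unit = unit, address = address,
             stringsAsFactors = FALSE)
}

tests <- c("UNIT 9A 740 Park Ave, New York, NY 10021",
           "JOHN DOE APT 9A, 740 Park Ave, New York, NY 10021",
           "John Doe, 740 Park Ave 9A, New York, NY 10021")
split_unit_recipient(tests)
```

The resulting `address` column could then be passed to `pm_identify()` and `pm_prep()` as in the steps above, with `recipient` and `unit` joined back on afterwards. Real data will likely need more unit keywords (e.g. "STE", "#") and a rule for trailing units.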
