Parse address strings in R
I have address data in R in multiple formats and would like to parse it into at least the significant address parts, so that I can use the address to merge multiple datasets. However, since addresses can come in numerous formats, I need something that can distinguish, for example, a unit or apartment from the street and zip code.
The problem:
testaddress1 <- "20 W 34th St, New York, NY 10001"
testaddress2 <- "20 West 34 St, New York City, NY 10001"
testaddress3 <- "20 WEST 34th, NYC, NY 10001"
Is there an easy way in R to parse the address parts? Ideally into the parts below:
Number: 20; Direction: West; Street: 34; City: New York; State: NY; Zip: 10001
Units and recipients in addresses also present problems:
#Problem with units/apartments
testunit1 <- "UNIT 9A 740 Park Ave, New York, NY 10021"
testunit2 <- "740 Park Ave 9A, New York, NY 10021"
testunit3 <- "APT 9A, 740 Park Ave, New York, NY 10021"
#Ideal parse
Unit: 9A; Number: 740; Street: Park Ave; City: New York; State: NY; Zip: 10021
#Problem with recipient
testrec1 <- "John Doe UNIT 9A, 740 Park Ave, New York, NY 10021"
testrec2 <- "John Doe, 740 Park Ave 9A, New York, NY 10021"
testrec3 <- "JOHN DOE APT 9A, 740 Park Ave, New York, NY 10021"
#Ideal parse
Recipient: John Doe; Unit: 9A; Number: 740; Street: Park Ave; City: New York; State: NY; Zip: 10021
I found this, but it looks messy and I had trouble implementing it: https://slu-opengis.github.io/postmastr/articles/postmastr.html
Is there something that parses addresses automatically in R?
postmastr seems to work pretty well:
v.adresses <- c("20 W 34th St, New York, NY 10001",
"20 West 34 St, New York City, NY 10001",
"20 WEST 34th, NYC, NY 10001")
df <- data.frame(address = v.adresses)
library(postmastr)
library(magrittr)
library(tidycensus)
df
#***************************************************************
# STATES and POSTAL CODES #####
#***************************************************************
# Build states dictionary
stateDict <- pm_dictionary(locale = "us", type = "state")
# Parse states and postal codes
answer_1 <- df %>%
  pm_identify(var = "address") %>%
  pm_prep(var = "address", type = "street")
answer <- answer_1 %>%
  pm_postal_parse() %>%
  pm_state_parse(dictionary = stateDict)
#***************************************************************
# CITIES #####
#***************************************************************
# Create cities dictionary based on states in `answer`
# A Census API key is needed (see the postmastr vignette)
# Run the two lines below once to install the key
# census_api_key("#####", install = TRUE)
# readRenviron("~/.Renviron")
# End of one-time setup
cityDict <- pm_dictionary(type = "city", filter = unique(answer$pm.state), locale = "us")
# There seem to be addresses without correct cities
answer %>% pm_city_none(dictionary = cityDict)
# pm.uid pm.address pm.state pm.zip
# <int> <chr> <chr> <chr>
# 1 2 20 West 34 St New York City NY 10001
# 2 3 20 WEST 34th NYC NY 10001
# So we append the cities to the dictionary
missingCity <- pm_append(type = "city",
input = c("New York City", "NYC"),
output = "New York", locale = "us")
# Build new cities dictionary
cityDict <- pm_dictionary(type = "city", filter = unique(answer$pm.state),
append = missingCity, locale = "us")
# Do all lines have cities now?
answer %>% pm_city_all(dictionary = cityDict)
#TRUE
# Parse
answer <- answer %>% pm_city_parse(dictionary = cityDict)
# pm.uid pm.address pm.city pm.state pm.zip
# <int> <chr> <chr> <chr> <chr>
# 1 1 20 W 34th St New York NY 10001
# 2 2 20 West 34 St New York NY 10001
# 3 3 20 WEST 34th New York NY 10001
#***************************************************************
# HOUSENUMBERS #####
#***************************************************************
answer <- answer %>% pm_house_parse()
# pm.uid pm.address pm.house pm.city pm.state pm.zip
# <int> <chr> <chr> <chr> <chr> <chr>
# 1 1 W 34th St 20 New York NY 10001
# 2 2 West 34 St 20 New York NY 10001
# 3 3 WEST 34th 20 New York NY 10001
#***************************************************************
# STREETS #####
#***************************************************************
dirsDict <- pm_dictionary(type = "directional", locale = "us")
answer <- answer %>%
  pm_streetDir_parse(dictionary = dirsDict) %>%
  pm_streetSuf_parse() %>%
  pm_street_parse(ordinal = TRUE, drop = TRUE)
pm_replace(answer, source = answer_1)
# pm.uid pm.house pm.preDir pm.street pm.streetSuf pm.city pm.state pm.zip
# <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
# 1 1 20 W 34th St New York NY 10001
# 2 2 20 W 34 St New York NY 10001
# 3 3 20 W 34th NA New York NY 10001
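The pipeline above does not touch the unit examples from the question. One possible workaround is to strip a keyword-prefixed unit (such as "UNIT 9A" or "APT 9A") with base R before handing the remaining string to postmastr. The sketch below is only an illustration: `extract_unit()` is a hypothetical helper, not a postmastr function, and the keyword list and the "digits plus optional letter" unit format are assumptions. It does not handle a trailing unit without a keyword ("740 Park Ave 9A") or recipient names.

```r
# Hypothetical helper (not part of postmastr): split a keyword-prefixed
# unit off an address string using base R regular expressions.
extract_unit <- function(address) {
  # Assumed unit keywords, followed by an id made of digits and an
  # optional trailing letter (e.g. "9A"); a trailing comma is consumed too.
  pattern <- "\\b(?:unit|apt|apartment|suite|ste)\\.?\\s+([0-9]+[a-z]?)\\b,?\\s*"
  m <- regmatches(address,
                  regexec(pattern, address, ignore.case = TRUE, perl = TRUE))
  # The first capture group holds the unit id; NA when nothing matched
  unit <- vapply(m,
                 function(x) if (length(x) >= 2) toupper(x[2]) else NA_character_,
                 character(1), USE.NAMES = FALSE)
  # Remove the matched unit text from the address
  rest <- trimws(sub(pattern, "", address, ignore.case = TRUE, perl = TRUE))
  data.frame(unit = unit, address = rest, stringsAsFactors = FALSE)
}

units <- c("UNIT 9A 740 Park Ave, New York, NY 10021",
           "APT 9A, 740 Park Ave, New York, NY 10021",
           "740 Park Ave 9A, New York, NY 10021")
res <- extract_unit(units)
res$unit
# "9A" "9A" NA
```

The `address` column of the result could then go through the same `pm_identify()` / `pm_prep()` steps as above, with `unit` kept alongside as an extra merge key.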