
JSON Parsing: Removing All Special Characters in R?

I have a file ("my_file") in R that looks something like this:

  NAME                                                                                                                                                                                     Address_Parse
1 name1 [('372', 'StreetNumber'), ('river', 'StreetName'), ('St', 'StreetType'), ('S', 'StreetDirection'), ('toronto', 'Municipality'), ('ON', 'Province'), ('A1C', 'PostalCode'), ('9R7', 'PostalCode')]
2 name2 [('208', 'StreetNumber'), ('ocean', 'StreetName'), ('St', 'StreetType'), ('E', 'StreetDirection'), ('Toronto', 'Municipality'), ('ON', 'Province'), ('J8N', 'PostalCode'), ('1G8', 'PostalCode')]

In case the structure is confusing, here is what the file looks like:

my_file = structure(list(NAME = c("name1", "name2"), Address_Parse = c("[('372', 'StreetNumber'), ('river', 'StreetName'), ('St', 'StreetType'), ('S', 'StreetDirection'), ('toronto', 'Municipality'), ('ON', 'Province'), ('A1C', 'PostalCode'), ('9R7', 'PostalCode')]", 
"[('208', 'StreetNumber'), ('ocean', 'StreetName'), ('St', 'StreetType'), ('E', 'StreetDirection'), ('Toronto', 'Municipality'), ('ON', 'Province'), ('J8N', 'PostalCode'), ('1G8', 'PostalCode')]"
)), class = "data.frame", row.names = c(NA, -2L))

In a previous question (Parsing JSON in R: lexical error - invalid char in json text), I learned how to parse the JSON elements within this file using the following code:

library(dplyr)
library(purrr)
library(stringr)
library(jsonlite)
library(tidyr)

my_file %>%
  mutate(Address_Parse = str_replace_all(Address_Parse,
      "\\(([^,]+),\\s*([^)]+)\\)", "\\2:\\1") %>%
    str_replace(fixed("["), "[{") %>%
    str_replace(fixed("]"), "}]") %>%
    str_replace_all(fixed("'"), '"') %>%
    map(fromJSON)) %>%
  unnest(Address_Parse) %>%
  type.convert(as.is = TRUE)

This code works for some of my datasets, but sometimes it produces the following error:

Error: lexical error: inside a string, '\' occurs before a character which it may not.
          [["220", "StreetNumber"], ["O\x92Brien", "StreetName"], ["At
                     (right here) ------^

I tried looking at posts with similar errors (e.g. lexical error: inside a string, '\' occurs before a character which it may not). When I inspect the data frame itself, there seem to be "special characters" (e.g. Â, as in WalkerÂ's) that cause this problem. When such special characters are removed, the JSON parsing code above works fine.
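The error text suggests that the string contains a literal Python-style escape (a backslash followed by x92), which JSON does not allow. As a minimal sketch, assuming every offending sequence is a two-digit \xNN escape, those sequences could be rewritten as JSON \u00NN escapes before calling fromJSON. The sample string below is made up for illustration, and the resulting character is the raw code point U+0092, which may still need cleaning if the data originally came from a Windows code page:

```r
library(jsonlite)

# Hypothetical input: JSON-like text with a literal backslash-x escape,
# similar to the fragment shown in the error message.
bad_json <- '["O\\x92Brien", "StreetName"]'

# Rewrite every \xNN as \u00NN, which is a valid JSON escape.
good_json <- gsub("\\\\x([0-9A-Fa-f]{2})", "\\\\u00\\1", bad_json)

# The parser no longer stops at the backslash.
parsed <- fromJSON(good_json)
parsed[2]
```

In the pipeline above, the same gsub() call could be applied to Address_Parse before the map(fromJSON) step.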

This brings me to my question:

  • Is there a way to somehow "force" the above JSON parsing code to continue even when encountering such special characters?

  • And if there is no way to "force" the above JSON parsing code, is there a way to remove/replace ALL special characters in R? (e.g. replace "Â" with "A"?)

I consulted previous posts (e.g. Remove all special characters from a string in R?) and there does not seem to be a general method to remove ALL special characters; it appears that special characters have to be removed individually. This seems tedious, because it would probably involve manually scanning the data for every possible special character and then removing each one.
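One possible general approach, shown here only as a sketch, is to transliterate to ASCII first and then drop whatever is left outside the printable ASCII range. It assumes the stringi package is installed (stringr depends on it) and that losing accents and other non-ASCII characters is acceptable for the data. The sample values are made up:

```r
library(stringi)

# Made-up sample values; \u escapes keep this script ASCII-only.
x <- c("Walker\u00c2's", "O\u2019Brien", "river")

# Step 1: transliterate Latin letters and typographic punctuation to ASCII.
ascii <- stri_trans_general(x, "Latin-ASCII")

# Step 2: drop anything still outside printable ASCII (space through tilde).
clean <- stri_replace_all_regex(ascii, "[^\\x20-\\x7E]", "")
clean
```

Step 2 alone would remove every special character; step 1 is what turns "Â" into "A" instead of deleting it.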

In any case, could someone please suggest how to proceed with this problem?

Thank you!

From the example, I would guess that these are Python lists, not JSON. So I use the reticulate package to evaluate them as Python code (of course, care needs to be taken when evaluating arbitrary code):

my_file = structure(list(NAME = c("name1", "name2"), 
                         Address_Parse = c("[('372', 'StreetNumber'), ('river', 'StreetName'), ('St', 'StreetType'), ('S', 'StreetDirection'), ('toronto', 'Municipality'), ('ON', 'Province'), ('A1C', 'PostalCode'), ('9R7', 'PostalCode')]", 
"[('208', 'StreetNumber'), ('ocean', 'StreetName'), ('St', 'StreetType'), ('E', 'StreetDirection'), ('Toronto', 'Municipality'), ('ON', 'Province'), ('J8N', 'PostalCode'), ('1G8', 'PostalCode')]"
)), class = "data.frame", row.names = c(NA, -2L))

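# Evaluate each string as a Python expression; each result is a list of (value, name) pairs
lst <- lapply(my_file$Address_Parse, reticulate::py_eval)
result <- do.call(rbind, lapply(lst, function(x) {
  xnames <- lapply(x, "[[", 2)   # second element of each pair: the field name
  xvalues <- lapply(x, "[[", 1)  # first element of each pair: the value
  # One-row data frame per address, with the field names as column names
  setNames(data.frame(xvalues), xnames)
}))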
result
#>   StreetNumber StreetName StreetType StreetDirection Municipality Province
#> 1          372      river         St               S      toronto       ON
#> 2          208      ocean         St               E      Toronto       ON
#>   PostalCode PostalCode
#> 1        A1C        9R7
#> 2        J8N        1G8
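If installing Python is not an option, the pairs could also be pulled out with a regular expression instead of being parsed, so that backslashes and non-ASCII characters inside the values never reach a parser. This is only a sketch: the pattern assumes every value and name is wrapped in single quotes and contains no apostrophe of its own, the column names field and value are made up here, and the sample data is a shortened version of the one above.

```r
library(stringr)

# Shortened sample in the same shape as my_file.
my_file <- data.frame(
  NAME = c("name1", "name2"),
  Address_Parse = c("[('372', 'StreetNumber'), ('river', 'StreetName')]",
                    "[('208', 'StreetNumber'), ('ocean', 'StreetName')]"),
  stringsAsFactors = FALSE
)

# Each match yields a matrix row: full match, value (group 1), name (group 2).
pairs <- str_match_all(my_file$Address_Parse,
                       "\\('([^']*)',\\s*'([^']*)'\\)")

# Stack the matches into one long data frame, one row per (NAME, field).
long <- do.call(rbind, Map(function(name, m) {
  data.frame(NAME = rep(name, nrow(m)), field = m[, 3], value = m[, 2],
             stringsAsFactors = FALSE)
}, my_file$NAME, pairs))
rownames(long) <- NULL
long
```

The long layout also avoids the duplicated PostalCode column names seen in the wide result.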


 