
JSON Parsing: Removing All Special Characters in R?

I have a data frame ("my_file") in R that looks something like this:

  NAME                                                                                                                                                                                     Address_Parse
1 name1 [('372', 'StreetNumber'), ('river', 'StreetName'), ('St', 'StreetType'), ('S', 'StreetDirection'), ('toronto', 'Municipality'), ('ON', 'Province'), ('A1C', 'PostalCode'), ('9R7', 'PostalCode')]
2 name2 [('208', 'StreetNumber'), ('ocean', 'StreetName'), ('St', 'StreetType'), ('E', 'StreetDirection'), ('Toronto', 'Municipality'), ('ON', 'Province'), ('J8N', 'PostalCode'), ('1G8', 'PostalCode')]

In case the structure is unclear, here is the dput() output of the file:

my_file = structure(list(NAME = c("name1", "name2"), Address_Parse = c("[('372', 'StreetNumber'), ('river', 'StreetName'), ('St', 'StreetType'), ('S', 'StreetDirection'), ('toronto', 'Municipality'), ('ON', 'Province'), ('A1C', 'PostalCode'), ('9R7', 'PostalCode')]", 
"[('208', 'StreetNumber'), ('ocean', 'StreetName'), ('St', 'StreetType'), ('E', 'StreetDirection'), ('Toronto', 'Municipality'), ('ON', 'Province'), ('J8N', 'PostalCode'), ('1G8', 'PostalCode')]"
)), class = "data.frame", row.names = c(NA, -2L))

In a previous question (Parsing JSON in R: lexical error - invalid char in json text), I learned how to parse the JSON elements within this file using the following code:

library(dplyr)
library(purrr)
library(stringr)
library(jsonlite)
library(tidyr)

my_file  %>% 
  mutate(Address_Parse = str_replace_all(Address_Parse,
      "\\(([^,]+),\\s*([^)]+)\\)", "\\2:\\1") %>% 
   str_replace(fixed("["), "[{") %>%
   str_replace(fixed("]"), "}]") %>%
   str_replace_all(fixed("'"), '"') %>% 
   map(fromJSON)) %>%
   unnest(Address_Parse) %>%
 type.convert(as.is = TRUE)

This code works for some of my datasets, but sometimes it produces the following error:

Error: lexical error: inside a string, '\' occurs before a character which it may not.
          [["220", "StreetNumber"], ["O\x92Brien", "StreetName"], ["At
                     (right here) ------^

I tried looking at posts with similar errors (e.g. lexical error: inside a string, '\' occurs before a character which it may not). When inspecting the data frame itself, there seem to be "special characters" (e.g. Â: WalkerÂ's) that are causing this problem. When such special characters are removed, the above JSON parsing code seems to work fine.
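The byte \x92 in the error message is how a right single quotation mark (’) is stored in Windows-1252, so one possible cause is that the data was read with the wrong encoding rather than containing truly unparseable characters. A minimal base-R sketch, assuming the affected column really is Windows-1252 (this is a guess from the error message, and the sample value below is made up to mirror it):

```r
# Sample value mirroring the error message; \x92 is a raw Windows-1252 byte.
raw_value <- "O\x92Brien"

# validUTF8() reports whether each string is valid UTF-8.
is_valid_before <- validUTF8(raw_value)

# Re-encode from Windows-1252 (CP1252) to UTF-8.
# Assumption: CP1252 is the true source encoding of the data.
fixed_value <- iconv(raw_value, from = "CP1252", to = "UTF-8")
is_valid_after <- validUTF8(fixed_value)
```

If the assumption holds, applying the same iconv() call to Address_Parse before the JSON step would keep names such as O’Brien intact instead of deleting characters from them.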

This brings me to my question:

  • Is there a way to "force" the above JSON parsing code to continue even when it encounters such special characters?

  • If there is no way to "force" the above JSON parsing code, is there a way to remove/replace ALL special characters in R (e.g. replace "Â" with "A")?
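For the first point, one way to keep going past a bad row is to wrap the parser in tryCatch() so that a failure yields NULL instead of stopping the whole pipeline. The sketch below uses a hypothetical wrapper (safe_parse) and a stand-in parser (strict_parser) that are not part of the original code; in practice the parser would be jsonlite::fromJSON:

```r
# Hypothetical wrapper: run any parser and return NULL instead of stopping on error.
safe_parse <- function(s, parser) {
  tryCatch(parser(s), error = function(e) NULL)
}

# Stand-in parser for illustration only; it fails on invalid bytes,
# much like a strict JSON parser would.
strict_parser <- function(s) {
  if (!validUTF8(s)) stop("invalid bytes in input")
  toupper(s)
}

inputs <- c("river", "O\x92Brien")
results <- lapply(inputs, safe_parse, parser = strict_parser)

# Flag the rows that failed so they can be inspected instead of silently lost.
failed <- vapply(results, is.null, logical(1))
```

This only skips the problem rows; it does not repair them, so the flagged rows still need one of the cleaning steps discussed in the question.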

I consulted previous posts (e.g. Remove all special characters from a string in R?) and there does not seem to be a general method to remove ALL special characters - it appears that special characters have to be removed individually. This seems like a tedious task, because it would probably involve manually scanning the data for every possible special character and then removing each one.
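If "special character" is taken to mean "anything outside printable ASCII", a single regular expression can drop all of them at once without listing them one by one. A minimal base-R sketch under that assumption (the sample values are made up):

```r
# Made-up sample values: one with a raw \x92 byte, one with an accented letter.
dirty <- c("O\x92Brien", "Walker\u00c2's", "river")

# useBytes = TRUE matches byte by byte, so invalid or multi-byte sequences
# do not need to be valid in the current locale; every byte outside the
# printable ASCII range (0x20 to 0x7E) is dropped.
clean <- gsub("[^\\x20-\\x7E]", "", dirty, perl = TRUE, useBytes = TRUE)
```

Note that this deletes characters rather than replacing "Â" with "A". For replacement, iconv(x, to = "ASCII//TRANSLIT") is an option, but its output depends on the platform's iconv implementation, so the results are worth checking on your own system.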

In any case, could someone please suggest how to proceed with this problem?

Thank you!

From the example, I would guess that these are Python lists, not JSON. So I use the reticulate package to evaluate them as Python code (of course, care needs to be taken when evaluating arbitrary code):

my_file = structure(list(NAME = c("name1", "name2"), 
                         Address_Parse = c("[('372', 'StreetNumber'), ('river', 'StreetName'), ('St', 'StreetType'), ('S', 'StreetDirection'), ('toronto', 'Municipality'), ('ON', 'Province'), ('A1C', 'PostalCode'), ('9R7', 'PostalCode')]", 
"[('208', 'StreetNumber'), ('ocean', 'StreetName'), ('St', 'StreetType'), ('E', 'StreetDirection'), ('Toronto', 'Municipality'), ('ON', 'Province'), ('J8N', 'PostalCode'), ('1G8', 'PostalCode')]"
)), class = "data.frame", row.names = c(NA, -2L))

lst <- lapply(my_file$Address_Parse, reticulate::py_eval)
result <- do.call(rbind, lapply(lst, function(x) {
  xnames<-lapply(x, "[[", 2)
  xvalues<-lapply(x, "[[", 1)
  setNames(data.frame(xvalues),xnames)}))
result
#>   StreetNumber StreetName StreetType StreetDirection Municipality Province
#> 1          372      river         St               S      toronto       ON
#> 2          208      ocean         St               E      Toronto       ON
#>   PostalCode PostalCode
#> 1        A1C        9R7
#> 2        J8N        1G8
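If evaluating the strings as Python code is a concern, the same pairs can be pulled out with a regular expression in base R. This is only a sketch: parse_pairs is a hypothetical helper, and it assumes every value and label is single-quoted and contains no embedded single quote:

```r
# Hypothetical helper: turn "[('372', 'StreetNumber'), ...]" into a
# named character vector (names are the labels, elements are the values).
parse_pairs <- function(s) {
  pattern <- "\\('([^']*)',\\s*'([^']*)'\\)"
  # Extract every "('value', 'label')" tuple from the string.
  pairs <- regmatches(s, gregexpr(pattern, s, perl = TRUE))[[1]]
  values <- sub(pattern, "\\1", pairs, perl = TRUE)
  labels <- sub(pattern, "\\2", pairs, perl = TRUE)
  setNames(values, labels)
}

x <- "[('372', 'StreetNumber'), ('river', 'StreetName'), ('ON', 'Province')]"
parsed <- parse_pairs(x)
```

Strings with invalid bytes would still need their encoding fixed first, since the regular expression functions can reject invalid input.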
