
R - How to remove special characters + escape sequences from JSON data so I can use jsonlite on it?

I'm going to provide a ton of info below, with data, code, etc., to finally get this solved. The first portion is just an explanation.

Explanation - A program to automatically extract tweets from Twitter. For various reasons, I am storing certain values I care about (description, location, tweet ID, etc.) in a comma-delimited CSV file. I use httr to GET the tweets and the httr::content function to store them, then convert these to a more workable form using jsonlite::fromJSON(toJSON()). This works 90% of the time, but sometimes tweets have invisible escape characters embedded in them - things like \003, "GS" as it appears in Notepad++, and other such special characters. These cause jsonlite to crash. I'd like to remove them.

So the code that works for 90% of the tweets looks like this:

mentions = GET(final_url, sig)             # httr::GET request to the Twitter search API
json = content(mentions)                   # parse the response body into an R list
json2 = jsonlite::fromJSON(toJSON(json))   # round-trip through JSON to simplify the structure
allMentions = json2$statuses
colNames = names( unlist(allMentions[1,], use.names=TRUE ) )

Then a bunch more code to actually parse the tweets and pull out things like user IDs, text, and latitude.

It crashes here, at the json2 = line, with this error:

Error: lexical error: invalid character inside string.
  Foundation and 42nd President of the United States. Follow 
              (right here) ------^

Or:

Error: lexical error: invalid character inside string.
   No Mission Too Difficult, No Sacrifice Too Great, Duty First. DAV, VFW.
                               (right here) ------^

So in the first set of tweets, which produces the first error, Notepad++ shows an escape character (003) after the word "of". In the second, you can see a "GS" character after "Great".

So the attempted fix was to use gsub to replace the special characters. The problem was that my data then wasn't in UTF-8 format anymore, for some reason, so I convert it back to UTF-8:

json = content(mentions)
json = gsub("[\001-\026]*", "", json, fixed=TRUE)
json = iconv(json, "UTF-8")
json2 = jsonlite::fromJSON(toJSON(json))
allMentions = json2$statuses

Now the jsonlite part works! Perfect - but not really.

Now I crash at the allMentions = json2$statuses line with:

    $ operator is invalid for atomic vectors

Which makes sense, because the output of json2 is now:

   [1] "NA"
   [2] "list(completed_in = 0.131, max_id = 660500744261382144, max_id_str = \"660500744261382146\", next_results = \"?max_id=660499749334859776&q=%40HillaryClinton%20until%3A2015-11-01&lang=en&count=100&include_entities=1&result_type=recent\", query = \"%40HillaryClinton+until%3A2015-11-01\", refresh_url = \"?since_id=660500744261382146&q=%40HillaryClinton%20until%3A2015-11-01&lang=en&result_type=recent&include_entities=1\", count = 100, since_id = 658634677922738176, since_id_str = \"658634677922738176\")"
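For what it's worth, those deparsed "list(...)" strings are exactly what base R produces when a list is coerced to character: gsub() runs as.character() on its input, so applying it to the parsed response flattens the list into a character vector, after which $statuses can no longer work. A minimal sketch of that coercion (variable names here are illustrative):

```r
# gsub() coerces non-character input with as.character(), which deparses
# each element of a list into a string such as "list(...)".
parsed <- list(statuses = "NA",
               search_metadata = list(completed_in = 0.131, count = 100))

flattened <- gsub("[\001-\026]*", "", parsed, fixed = TRUE)

class(flattened)    # "character" rather than "list", so $ no longer applies
```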

This puts me at a loss. Do I pursue fixing the error from this end now, hunting it down, and risk screwing up what was working for me previously? Or do I go back to trying to figure out how to get rid of those escape/invisible characters some other way?

Any advice on these errors would be much appreciated.

OK, this might not be the perfect answer - I'm not experienced with R.

Regarding the bio of @pinoybreed808:

No Mission Too Difficult, No Sacrifice Too Great, Duty First. DAV, VFW.

The unusual character appears after "Duty First" and before the "." - it's actually U+001D (the "group separator", GS). I've no idea why they're using it, but there are a couple of strategies for coping with unusual characters like this.
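If it helps, you can confirm the hidden code point from within R itself. A quick check (the bio below is typed with the control character as an explicit escape, since the real one is invisible):

```r
bio <- "No Mission Too Difficult, No Sacrifice Too Great, Duty First\u001d. DAV, VFW."

# Scan for control characters (code points below 0x20) and report them.
codes <- utf8ToInt(bio)
sprintf("U+%04X", codes[codes < 32L])    # "U+001D"
```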

Firstly, you can simply gsub them out - although I'm not sure of the R syntax, it doesn't look like you're capturing the character correctly.
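A sketch of what that could look like, on the assumption that the question's gsub call has two problems: fixed = TRUE makes R treat "[\001-\026]*" as a literal string rather than a character class, and the octal range \001-\026 stops short of \035 (U+001D) anyway. Stripping the raw JSON text with a real regex before parsing avoids both:

```r
# Raw JSON with an embedded U+001D, as in the failing tweet bio.
raw_json <- "{\"description\": \"No Sacrifice Too Great\u001d. Duty First.\"}"

# Strip all ASCII control characters (octal \001-\037 covers U+0001-U+001F).
# Note: no fixed = TRUE, so the pattern is interpreted as a regex.
clean_json <- gsub("[\001-\037]", "", raw_json)

grepl("[\001-\037]", clean_json)    # FALSE - the hidden character is gone
# jsonlite::fromJSON(clean_json) should then parse without the lexical error
```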

Secondly, you could try URL-encoding the data before storing it. That would turn the character into %1D.
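In base R that could be done with utils::URLencode; with reserved = TRUE it percent-encodes everything outside the unreserved set, control characters included, and URLdecode restores the original when the data is read back:

```r
bio <- "No Sacrifice Too Great\u001d. Duty First."

encoded <- URLencode(bio, reserved = TRUE)   # the control character becomes %1D
decoded <- URLdecode(encoded)                # round-trips back to the original
```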

Thirdly, I don't see why this character is causing issues for jsonlite. It may be worth raising a bug against R or jsonlite about how it copes with weird characters.
