
R - How to remove special characters + escape sequences from JSON data so I can use jsonlite on it?

I'm going to provide a ton of info below, with data, code, etc., to finally get this solved. The first portion is just an explanation.

Explanation - A program to automatically extract tweets from Twitter. For various reasons, I am storing certain values I care about (description, location, tweet ID, etc.) in a comma-delimited CSV file. I use httr to GET the tweets and the httr::content function to store them, then convert these to a more workable form using jsonlite::fromJSON(toJSON()). This works 90% of the time, but sometimes tweets have invisible escape characters embedded in them - things like \003, "GS" as it appears in Notepad++, and other such special characters. These cause jsonlite to crash. I'd like to remove them.

So the code that works for 90% of the tweets looks like this:

mentions = GET(final_url, sig)             # httr::GET request to the Twitter search API
json = content(mentions)                   # parse the response body into an R list
json2 = jsonlite::fromJSON(toJSON(json))   # round-trip through JSON to simplify the structure
allMentions = json2$statuses
colNames = names( unlist(allMentions[1,], use.names=TRUE ) )

Then a bunch more code to actually parse the tweets and pull out things like user IDs, text, and latitude.

It crashes here, at the json2 = line, with this error:

Error: lexical error: invalid character inside string.
  Foundation and 42nd President of the United States. Follow 
              (right here) ------^

Or:

Error: lexical error: invalid character inside string.
   No Mission Too Difficult, No Sacrifice Too Great, Duty First. DAV, VFW.
                               (right here) ------^

So in the first set of tweets, which produces the first error, Notepad++ shows an escape character (003) after the word "of". In the second, you can see a "GS" character after "Great".

So the attempted fix was to use gsub to replace the special characters. The problem was that my data then wasn't in UTF-8 format anymore, for some reason, so I convert it back to UTF-8:

json = content(mentions)
json = gsub("[\001-\026]*", "", json, fixed=TRUE)
json = iconv(json, "UTF-8")
json2 = jsonlite::fromJSON(toJSON(json))
allMentions = json2$statuses

Now the jsonlite part works! Perfect - but not really.

Now I crash at the allMentions = json2$statuses line with:

    $ operator is invalid for atomic vectors

Which makes sense, because the output of json2 is now:

   [1] "NA"
   [2] "list(completed_in = 0.131, max_id = 660500744261382144, max_id_str = \"660500744261382146\", next_results = \"?max_id=660499749334859776&q=%40HillaryClinton%20until%3A2015-11-01&lang=en&count=100&include_entities=1&result_type=recent\", query = \"%40HillaryClinton+until%3A2015-11-01\", refresh_url = \"?since_id=660500744261382146&q=%40HillaryClinton%20until%3A2015-11-01&lang=en&result_type=recent&include_entities=1\", count = 100, since_id = 658634677922738176, since_id_str = \"658634677922738176\")"
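For what it's worth, those deparsed "list(...)" strings are exactly what base R produces when a list is coerced to character: gsub() runs as.character() on its input, so applying it to the parsed response flattens the list into a character vector, after which $statuses can no longer work. A minimal sketch of that coercion (variable names here are illustrative):

```r
# gsub() coerces non-character input with as.character(), which deparses
# each element of a list into a string such as "list(...)".
parsed <- list(statuses = "NA",
               search_metadata = list(completed_in = 0.131, count = 100))

flattened <- gsub("[\001-\026]*", "", parsed, fixed = TRUE)

class(flattened)    # "character" rather than "list", so $ no longer applies
```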

This puts me at a loss. Do I pursue fixing the error from this end now, hunting it down, and risk screwing up what was working for me previously? Or do I go back to trying to figure out how to get rid of those escape/invisible characters some other way?

Any advice on these errors would be much appreciated.

OK, this might not be the perfect answer - I'm not experienced with R.

Regarding the bio of @pinoybreed808:

No Mission Too Difficult, No Sacrifice Too Great, Duty First. DAV, VFW.

The unusual character appears after "Duty First" and before the "." - it's actually U+001D (the "group separator", GS). I've no idea why they're using it, but there are a couple of strategies for coping with unusual characters like this.
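If it helps, you can confirm the hidden code point from within R itself. A quick check (the bio below is typed with the control character as an explicit escape, since the real one is invisible):

```r
bio <- "No Mission Too Difficult, No Sacrifice Too Great, Duty First\u001d. DAV, VFW."

# Scan for control characters (code points below 0x20) and report them.
codes <- utf8ToInt(bio)
sprintf("U+%04X", codes[codes < 32L])    # "U+001D"
```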

Firstly, you can simply gsub them out - although I'm not sure of the R syntax, it doesn't look like you're capturing the character correctly.
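A sketch of what that could look like, on the assumption that the question's gsub call has two problems: fixed = TRUE makes R treat "[\001-\026]*" as a literal string rather than a character class, and the octal range \001-\026 stops short of \035 (U+001D) anyway. Stripping the raw JSON text with a real regex before parsing avoids both:

```r
# Raw JSON with an embedded U+001D, as in the failing tweet bio.
raw_json <- "{\"description\": \"No Sacrifice Too Great\u001d. Duty First.\"}"

# Strip all ASCII control characters (octal \001-\037 covers U+0001-U+001F).
# Note: no fixed = TRUE, so the pattern is interpreted as a regex.
clean_json <- gsub("[\001-\037]", "", raw_json)

grepl("[\001-\037]", clean_json)    # FALSE - the hidden character is gone
# jsonlite::fromJSON(clean_json) should then parse without the lexical error
```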

Secondly, you could try URL-encoding the data before storing it. That would turn the character into %1D.
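In base R that could be done with utils::URLencode; with reserved = TRUE it percent-encodes everything outside the unreserved set, control characters included, and URLdecode restores the original when the data is read back:

```r
bio <- "No Sacrifice Too Great\u001d. Duty First."

encoded <- URLencode(bio, reserved = TRUE)   # the control character becomes %1D
decoded <- URLdecode(encoded)                # round-trips back to the original
```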

Thirdly, I don't see why this character is causing issues for jsonlite. It may be worth raising a bug against R or jsonlite about how it copes with weird characters.
