
Twitter JSON data in Hadoop

I have streamed Twitter data into HDFS. This is my Flume Twitter-agent configuration:

 #setting properties of agent
 Twitter-agent.sources=source1
 Twitter-agent.channels=channel1
 Twitter-agent.sinks=sink1

 #configuring sources
 Twitter-agent.sources.source1.type=com.cloudera.flume.source.TwitterSource
 Twitter-agent.sources.source1.channels=channel1
 Twitter-agent.sources.source1.consumerKey=<consumer-key>
 Twitter-agent.sources.source1.consumerSecret=<consumer-secret>
 Twitter-agent.sources.source1.accessToken=<access-token>
 Twitter-agent.sources.source1.accessTokenSecret=<Access-Token-secret>
 Twitter-agent.sources.source1.keywords=morning, night, hadoop, bigdata

 #configuring channels
 Twitter-agent.channels.channel1.type=memory
 Twitter-agent.channels.channel1.capacity=10000
 Twitter-agent.channels.channel1.transactionCapacity=100

 #configuring sinks
 Twitter-agent.sinks.sink1.channel=channel1
 Twitter-agent.sinks.sink1.type=hdfs
 Twitter-agent.sinks.sink1.hdfs.path=flume/tweets
 Twitter-agent.sinks.sink1.rollSize=0
 Twitter-agent.sinks.sink1.rollCount=10000
 Twitter-agent.sinks.sink1.batchSize=1000
 Twitter-agent.sinks.sink1.fileType=DataStream
 Twitter-agent.sinks.sink1.writeFormat=Text

The Twitter data is streamed successfully, but every FlumeData file in HDFS looks like this:

 SEQ!org.apache.hadoop.io.LongWritable"org.apache.hadoop.io.BytesWritable     ^ kd  h? tN    h{"in_reply_to_status_id_str":null,"in_reply_to_status_id":null,"created_at":"Tue Jun 23 15:09:32 +0000 2015","in_reply_to_user_id_str":null,"source":"<a href=\\"http://tweetlogix.com\\" rel=\\"nofollow\\">Tweetlogix<\\/a>","retweet_count":0,"retweeted":false,"geo":null,"filter_level":"low","in_reply_to_screen_name":null,"id_str":"613363262709723139","in_reply_to_user_id":null,"favorite_count":0,"id":613363262709723139,"text":"Morning.","place":null,"lang":"en","favorited":false,"possibly_sensitive":false,"coordinates":null,"truncated":false,"timestamp_ms":"1435072172225","entities":{"urls":[],"hashtags":[],"user_mentions":[],"trends":[],"symbols":[]},"contributors":null,"user":{"utc_offset":-14400,"friends_count":195,"profile_image_url_https":"https://pbs.twimg.com/profile_images/613121771093532673/mA5NPv6X_normal.jpg","listed_count":16,"profile_background_image_url":"http://pbs.twimg.com/profile_background_images/378800000045222063/847094549362b20f2b1e3c1ff137a80f.png","default_profile_image":false,"favourites_count":891,"description":"See, I was actually on my way to get a piece of burger from Burger King.....","created_at":"Sat Apr 30 00:51:06 +0000 2011","is_translator":false,"profile_background_image_url_https":"https://pbs.twimg.com/profile_background_images/378800000045222063/847094549362b20f2b1e3c1ff137a80f.png","protected":false,"screen_name":"NilesDontCurrr","id_str":"290266873","profile_link_color":"FF0000","id":290266873,"geo_enabled":false,"profile_background_color":"FFFFFF","lang":"en","profile_sidebar_border_color":"FFFFFF","profile_text_color":"34AA7A","verified":false,"profile_image_url":"http://pbs.twimg.com/profile_images/613121771093532673/mA5NPv6X_normal.jpg","time_zone":"Eastern Time (US & 
Canada)","url":null,"contributors_enabled":false,"profile_background_tile":true,"profile_banner_url":"https://pbs.twimg.com/profile_banners/290266873/1432844093","statuses_count":68154,"follow_request_sent":null,"followers_count":4611,"profile_use_background_image":true,"default_profile":false,"following":null,"name":"niles.","location":"New York City.","profile_sidebar_fill_color":"AFDFB7","notifications":null}} 

When I parse this JSON data in Hive, I get errors like:

 Caused by: org.apache.hadoop.hive.serde2.SerDeException: org.codehaus.jackson.JsonParseException: Unexpected character ('S' (code 83)): expected a valid value (number, String, array, object, 'true', 'false' or 'null') at [Source: java.io.StringReader@5fdcaa40; line: 1, column: 2] 

I think the error is caused by this line, which is the first line in every FlumeData file:

 SEQ!org.apache.hadoop.io.LongWritable"org.apache.hadoop.io.BytesWritable ^ kd h? tN h

Am I right?

Isn't the Twitter JSON data supposed to start like this: {"in_reply_to_status_id_str":......}?

Flume is generating files in binary SequenceFile format instead of plain text (the SEQ...LongWritable...BytesWritable prefix you see is the SequenceFile header). This is because a few properties in your configuration file are not set correctly, including the following two:

Twitter-agent.sinks.sink1.fileType=DataStream
Twitter-agent.sinks.sink1.writeFormat=Text

These are HDFS sink options, so they must carry the hdfs. prefix. The correct way to set them is:

Twitter-agent.sinks.sink1.hdfs.fileType=DataStream
Twitter-agent.sinks.sink1.hdfs.writeFormat=Text
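
For reference, a fully corrected sink section might look like the sketch below. It keeps the path and roll settings from the question's config; note that per the Flume HDFS sink documentation, rollSize, rollCount, and batchSize also take the hdfs. prefix, so they likely need the same fix:

```properties
# HDFS sink: write plain-text events instead of SequenceFiles
Twitter-agent.sinks.sink1.channel=channel1
Twitter-agent.sinks.sink1.type=hdfs
Twitter-agent.sinks.sink1.hdfs.path=flume/tweets
Twitter-agent.sinks.sink1.hdfs.rollSize=0
Twitter-agent.sinks.sink1.hdfs.rollCount=10000
Twitter-agent.sinks.sink1.hdfs.batchSize=1000
Twitter-agent.sinks.sink1.hdfs.fileType=DataStream
Twitter-agent.sinks.sink1.hdfs.writeFormat=Text
```

With fileType=DataStream and writeFormat=Text under the hdfs. prefix, new FlumeData files should start directly with the raw JSON (e.g. {"in_reply_to_status_id_str":...}), which Hive's JSON SerDe can then parse.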
