
Converting a JSON file to pyspark dataframe and then to RDD

I have a JSON file which I converted to a PySpark DataFrame.
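(For context, a minimal sketch of that conversion step, assuming Spark 2.x's SparkSession API; the application name and file path below are placeholders, not from the original post:)

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tweet-mentions").getOrCreate()
tweets_df = spark.read.json("tweets.json")   # placeholder path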

The converted tweets DataFrame looks like this:

+-------------+--------------------+-------------------+
|     tweet_id|               tweet|               user|
+-------------+--------------------+-------------------+
|1112223445455|@xxx_yyyzdfgf @Yoko |             user_1|
|1112223445456|sample test tweet   |             user_2|
|1112223445457|test mention @xxx_y |             user_1|
|1112223445458|testing @yyyyy      |             user_3|
|1112223445459|@xxx_yyzdfgdd @frnd |             user_4|
+-------------+--------------------+-------------------+

I am now trying to extract all the mentions (words that start with "@") from the tweet column.

I did this by converting it to an RDD and splitting all the lines using the code below.

tweets_rdd = tweets_df.select("tweet").rdd.flatMap(list)
tweets_rdd_split = (tweets_rdd.flatMap(lambda text: text.split(" "))
                              .filter(lambda word: word.startswith('@'))
                              .map(lambda x: x.split('@')[1]))

Now my output is in the following format:

[u'xxx_yyyzdfgf',
 u'Yoko',
 u'xxx_y',
 u'yyyyy',
 u'xxx_yyzdfgdd',
 u'frnd']

Every element is wrapped in u''. I think this is appearing because the initial file is a JSON file. I tried removing it using functions like split and replace, but it is not working. Could someone help me with removing these?

Is there a better approach than this to extract the mentions?

The leading u'' appears because each element is a unicode object. You can easily convert it to string format.

You can refer to this question to understand the difference between unicode and string: What is the difference between u' ' prefix and unicode() in python?
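For illustration (this snippet is not in the original answer), the distinction in Python 2, which is where the u'' prefix comes from:

s = u'Yoko'
print(type(s))        # <type 'unicode'>
print(type(str(s)))   # <type 'str'>  (a plain byte string)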

You can map the RDD using a lambda function:

tweets_rdd_split = tweets_rdd_split.map(lambda x: str(x))
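For illustration only (not part of the original answer): with the all-ASCII sample data above, collecting a few elements after this conversion shows plain str values without the u'' prefix:

print(tweets_rdd_split.take(3))
# e.g. ['xxx_yyyzdfgf', 'Yoko', 'xxx_y']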

Initially I tried

tweets_rdd_split = tweets_rdd_split.map(lambda x: str(x))

as suggested by pisall, to remove the unicode prefix.

But there were foreign characters in the tweets, which caused an encoding error when using str(x). Hence I used the following to correct this issue:

tweets_rdd_split = tweets_rdd_split.map(lambda x: x.encode("ascii","ignore"))

This resolved the encoding issue.
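Note that encode("ascii", "ignore") silently drops any non-ASCII characters from the extracted mentions.

As for the question about a better approach: one option (not from the original answers, sketched here under the assumption that the built-in pyspark.sql.functions split, explode and regexp_extract are available) is to stay within the DataFrame API and avoid the RDD round trip:

from pyspark.sql.functions import explode, split, regexp_extract

# One word per row, keep only words starting with "@", then strip the "@"
words_df = tweets_df.select(explode(split("tweet", r"\s+")).alias("word"))
mentions_df = (words_df.filter(words_df.word.startswith("@"))
                       .select(regexp_extract("word", r"^@(\S+)", 1).alias("mention")))
mentions_df.show()

Because this never leaves the DataFrame, the unicode/str question only comes up once the values are collected to the driver.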
