
Converting a JSON file to a PySpark DataFrame and then to an RDD

I have a JSON file which I converted to a PySpark DataFrame. The converted DataFrame is shown below.

Below is the tweets DataFrame:

+-------------+--------------------+-------------------+
|     tweet_id|               tweet|               user|
+-------------+--------------------+-------------------+
|1112223445455|@xxx_yyyzdfgf @Yoko |             user_1|
|1112223445456|sample test tweet   |             user_2|
|1112223445457|test mention @xxx_y |             user_1|
|1112223445458|testing @yyyyy      |             user_3|
|1112223445459|@xxx_yyzdfgdd @frnd |             user_4|
+-------------+--------------------+-------------------+

I am now trying to extract all the mentions (words that start with "@") from the tweet column.

I did this by converting the column into an RDD and splitting the lines with the code below.

tweets_rdd = tweets_df.select("tweet").rdd.flatMap(list)
tweets_rdd_split = (tweets_rdd
    .flatMap(lambda text: text.split(" "))
    .filter(lambda word: word.startswith('@'))
    .map(lambda x: x.split('@')[1]))

My output is now in the following format:

[u'xxx_yyyzdfgf',
 u'Yoko',
 u'xxx_y',
 u'yyyyy',
 u'xxx_yyzdfgdd',
 u'frnd']

Every row has the mention wrapped in u' '. I think it is appearing because the initial file is a JSON file. I tried removing it with functions like split and replace, but it is not working. Could someone help me remove these?

Is there a better approach than this to extract the mentions?

The u'' prefix appears because the value is a unicode object. You can easily convert it to a str.

You can refer to this question to understand the difference between unicode and str: What is the difference between u' ' prefix and unicode() in python?
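
For example, in Python 2 (which is what the u'' prefix indicates), a quick illustration:

word = u'xxx_yyyzdfgf'   # a unicode object, displayed as u'xxx_yyyzdfgf'
str(word)                # 'xxx_yyyzdfgf' -- works here because the text is pure ASCII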

You can map over the RDD with a lambda function:

tweets_rdd_split = tweets_rdd_split.map(lambda x: str(x))
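
Collecting after this map should give plain str values instead of unicode objects (assuming the mentions contain only ASCII characters, which is exactly what breaks in the follow-up below):

tweets_rdd_split.collect()
# ['xxx_yyyzdfgf', 'Yoko', 'xxx_y', 'yyyyy', 'xxx_yyzdfgdd', 'frnd']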

Initially I tried

tweets_rdd_split = tweets_rdd_split.map(lambda x: str(x))

as suggested by pisall to remove the unicode prefixes.

But there were foreign characters in the tweets which caused an encoding error when using str(x). Hence I used the following to fix the issue.

tweets_rdd_split = tweets_rdd_split.map(lambda x: x.encode("ascii","ignore"))

This resolved the encoding issue (note that the "ignore" option silently drops any non-ASCII characters from the mentions).
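
As for the "better approach" part of the question, one alternative (just a sketch, assuming the tweets_df DataFrame shown above and Spark's built-in SQL functions) is to extract the mentions without leaving the DataFrame API:

from pyspark.sql import functions as F

# Split each tweet on spaces, explode into one word per row,
# keep only words starting with "@", and strip the leading "@".
mentions_df = (tweets_df
    .select(F.explode(F.split(F.col("tweet"), " ")).alias("word"))
    .filter(F.col("word").startswith("@"))
    .select(F.regexp_replace("word", "^@", "").alias("mention")))

mentions_df.show()

This avoids the RDD round-trip entirely; the u'' prefix only shows up when unicode values are collected and printed in Python 2.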
