
Converting a JSON file to a PySpark DataFrame and then to an RDD

I have a JSON file which I converted to a PySpark DataFrame. The converted DataFrame is shown below.

Below is the tweets DataFrame:

+-------------+--------------------+-------------------+
|     tweet_id|               tweet|               user|
+-------------+--------------------+-------------------+
|1112223445455|@xxx_yyyzdfgf @Yoko |             user_1|
|1112223445456|sample test tweet   |             user_2|
|1112223445457|test mention @xxx_y |             user_1|
|1112223445458|testing @yyyyy      |             user_3|
|1112223445459|@xxx_yyzdfgdd @frnd |             user_4|
+-------------+--------------------+-------------------+

I am now trying to extract all the mentions (words that start with "@") from the tweet column.

I did this by converting the column into an RDD and splitting the lines with the code below.

tweets_rdd = tweets_df.select("tweet").rdd.flatMap(list)
tweets_rdd_split = (tweets_rdd
    .flatMap(lambda text: text.split(" "))
    .filter(lambda word: word.startswith('@'))
    .map(lambda x: x.split('@')[1]))

My output is now in the following format:

[u'xxx_yyyzdfgf',
 u'Yoko',
 u'xxx_y',
 u'yyyyy',
 u'xxx_yyzdfgdd',
 u'frnd']

Every row has the mention wrapped in u' '. I think it is appearing because the initial file is a JSON file. I tried removing it with functions like split and replace, but it is not working. Could someone help me remove these?

Is there a better approach than this to extract the mentions?

The u'' prefix appears because the value is a unicode object. You can easily convert it to a str.

You can refer to this question to understand the difference between unicode and str: What is the difference between u' ' prefix and unicode() in python?
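
For example, in Python 2 (which is what the u'' prefix indicates), a quick illustration:

word = u'xxx_yyyzdfgf'   # a unicode object, displayed as u'xxx_yyyzdfgf'
str(word)                # 'xxx_yyyzdfgf' -- works here because the text is pure ASCII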

You can map over the RDD with a lambda function:

tweets_rdd_split = tweets_rdd_split.map(lambda x: str(x))
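
Collecting after this map should give plain str values instead of unicode objects (assuming the mentions contain only ASCII characters, which is exactly what breaks in the follow-up below):

tweets_rdd_split.collect()
# ['xxx_yyyzdfgf', 'Yoko', 'xxx_y', 'yyyyy', 'xxx_yyzdfgdd', 'frnd']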

Initially I tried

tweets_rdd_split = tweets_rdd_split.map(lambda x: str(x))

as suggested by pisall to remove the unicode prefixes.

But there were foreign characters in the tweets which caused an encoding error when using str(x). Hence I used the following to fix the issue.

tweets_rdd_split = tweets_rdd_split.map(lambda x: x.encode("ascii","ignore"))

This resolved the encoding issue (note that the "ignore" option silently drops any non-ASCII characters from the mentions).
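
As for the "better approach" part of the question, one alternative (just a sketch, assuming the tweets_df DataFrame shown above and Spark's built-in SQL functions) is to extract the mentions without leaving the DataFrame API:

from pyspark.sql import functions as F

# Split each tweet on spaces, explode into one word per row,
# keep only words starting with "@", and strip the leading "@".
mentions_df = (tweets_df
    .select(F.explode(F.split(F.col("tweet"), " ")).alias("word"))
    .filter(F.col("word").startswith("@"))
    .select(F.regexp_replace("word", "^@", "").alias("mention")))

mentions_df.show()

This avoids the RDD round-trip entirely; the u'' prefix only shows up when unicode values are collected and printed in Python 2.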
