I have a JSON file that I converted to a PySpark DataFrame. The resulting tweets DataFrame looks like this:
+-------------+--------------------+-------------------+
|     tweet_id|               tweet|               user|
+-------------+--------------------+-------------------+
|1112223445455|@xxx_yyyzdfgf @Yoko |             user_1|
|1112223445456|  sample test tweet |             user_2|
|1112223445457|test mention @xxx_y |             user_1|
|1112223445458|     testing @yyyyy |             user_3|
|1112223445459|@xxx_yyzdfgdd @frnd |             user_4|
+-------------+--------------------+-------------------+
I am now trying to extract all the mentions (words that start with "@") from the tweet column.
I did this by converting the column to an RDD and splitting each line with the code below.
tweets_rdd = tweets_df.select("tweet").rdd.flatMap(list)
tweets_rdd_split = (tweets_rdd
    .flatMap(lambda text: text.split(" "))
    .filter(lambda word: word.startswith('@'))
    .map(lambda word: word.split('@')[1]))
Now my output is in the format below.
[u'xxx_yyyzdfgf',
u'Yoko',
u'xxx_y',
u'yyyyy',
u'xxx_yyzdfgdd',
u'frnd']
Every mention is wrapped in u' '. I think it appears because the original file is a JSON file. I tried removing it with functions like split and replace, but that did not work. Could someone help me remove these?
Is there a better approach than this to extract the mentions?
The leading u'' is there because each value is a unicode object. You can easily convert it to a string.
To understand the difference between unicode and string, see the question "What is the difference between u' ' prefix and unicode() in python?"
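As a small illustration of the point above (a sketch, not part of the original answer): the u prefix is only how Python 2 displays a unicode object when the list containing it is printed; it is not stored in the text itself. The variable names are made up for this example.

```python
# The mentions from the question, written as unicode literals.
mentions = [u'xxx_yyyzdfgf', u'Yoko', u'xxx_y']

# The prefix is not part of the value: lengths and comparisons ignore it.
first_len = len(mentions[1])         # 4 characters: Y, o, k, o
same_text = (mentions[1] == 'Yoko')  # True

# Printing each element (rather than the whole list) shows the bare text.
for m in mentions:
    print(m)
```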
You can map over the RDD with a lambda function:
tweets_rdd_split = tweets_rdd_split.map(lambda x: str(x))
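On the question of a better way to extract the mentions, one option is a regular expression instead of splitting on spaces. The sketch below is plain Python; extract_mentions is a hypothetical helper name, and the pattern assumes a mention is an "@" at the start of a word followed by letters, digits, or underscores. Unlike the split-based version, it also drops trailing punctuation such as the comma in "@frnd,".

```python
import re

# "@" must be at the start of the text or follow whitespace, so
# e-mail addresses such as a@b.com are not treated as mentions.
MENTION_RE = re.compile(r'(?<!\S)@(\w+)')

def extract_mentions(text):
    """Return the mentions in text, without the leading "@"."""
    return MENTION_RE.findall(text)

print(extract_mentions(u'@xxx_yyyzdfgf @Yoko'))
print(extract_mentions(u'sample test tweet'))
```

Because the helper returns a list, it could be passed straight to flatMap, for example tweets_rdd.flatMap(extract_mentions), in place of the split, filter, and map chain.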
Initially I tried
tweets_rdd_split = tweets_rdd_split.map(lambda x: str(x))
as suggested by pisall, to remove the unicode prefixes.
But the tweets contained foreign characters, which caused an encoding error with str(x). Hence I used the following to fix the issue:
tweets_rdd_split = tweets_rdd_split.map(lambda x: x.encode("ascii","ignore"))
This resolved the encoding issue.
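One caveat worth illustrating (a sketch with a made-up sample value): encode("ascii", "ignore") silently drops every non-ASCII character, so a mention containing accented or non-Latin letters comes out shortened. If those characters need to survive, encoding to UTF-8 is one alternative.

```python
sample = u'caf\xe9_user'  # hypothetical mention containing an accented "e"

ascii_only = sample.encode('ascii', 'ignore')  # the accented letter is discarded
utf8_bytes = sample.encode('utf-8')            # the accented letter is kept as two bytes

print(ascii_only)
print(utf8_bytes.decode('utf-8') == sample)    # True: nothing was lost
```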