How do I compare two RDDs in Spark?

Question

I have loaded two CSV files into two Spark RDDs, one containing country codes and the other containing tweet data. I am trying to find the following:

How many different countries are mentioned in the tweets?
The total number of times any country is mentioned.

Code:

country_lines = sc.textFile('country-data.csv')
words = country_lines.flatMap(lambda line: line.split(" "))
country_tuples = words.map(lambda word : (word, 1))
countryDF = sqlContext.createDataFrame(country_tuples, ["country" , "code"])

tweets = sc.textFile("tweet_data.csv")

I am trying to find how many times each country in countryDF occurs in the tweets CSV (it has only one column, which holds the tweet text).

countryDF looks like this:

Afghanistan  AFG
Albania  ALB
Algeria  ALG
American Samoa   ASA
Andorra  AND

How do I count the occurrences of each country in the tweets RDD using PySpark?

Answer 1

You can group tweetDF (built from tweet.csv) by country to get the count for each country, then join the result with countryDF.

df =  tweetDF.groupby("CountryName").count().join(countryDF,["CountryName"])
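
Since the tweets file holds only raw text, the country names have to be matched inside each tweet before anything can be grouped. Here is a minimal RDD-based sketch of that idea, not a definitive implementation. It assumes each line of the country file looks like `Afghanistan, AFG` (comma-separated, no header row) and that `sc` is an existing SparkContext. The helper names (`parse_country_line`, `build_patterns`, `mentions_in_tweet`, `count_country_mentions`) are made up for illustration.

```python
import re
from operator import add


def parse_country_line(line):
    """Return (country, code) from a line such as 'Afghanistan, AFG' (assumed layout)."""
    country, code = line.rsplit(",", 1)
    return country.strip(), code.strip()


def build_patterns(country_names):
    """Compile one case-insensitive, whole-word pattern per country name."""
    return {
        name: re.compile(r"\b" + re.escape(name) + r"\b", re.IGNORECASE)
        for name in country_names
    }


def mentions_in_tweet(tweet, patterns):
    """Return [(country, n)] for every country that occurs n > 0 times in the tweet."""
    pairs = []
    for name, pattern in patterns.items():
        n = len(pattern.findall(tweet))
        if n:
            pairs.append((name, n))
    return pairs


def count_country_mentions(sc, country_path, tweet_path):
    """Wire the helpers into an RDD pipeline; `sc` is an existing SparkContext."""
    countries = (
        sc.textFile(country_path)
        .filter(lambda line: "," in line)  # skip blank or malformed lines
        .map(parse_country_line)
    )
    # The country list is small, so collect it on the driver and broadcast it.
    patterns = sc.broadcast(build_patterns(countries.keys().collect()))
    counts = (
        sc.textFile(tweet_path)
        .flatMap(lambda tweet: mentions_in_tweet(tweet, patterns.value))
        .reduceByKey(add)  # (country, total mentions across all tweets)
        .cache()           # reused by the two actions below
    )
    distinct_countries = counts.count()     # how many different countries appear
    total_mentions = counts.values().sum()  # total mentions of any country
    return counts, distinct_countries, total_mentions
```

Calling `count_country_mentions(sc, 'country-data.csv', 'tweet_data.csv')` would return the per-country counts as an RDD along with the two totals asked for in the question. Matching on whole names rather than on single words keeps multi-word countries such as American Samoa intact, though this naive text matching will still miss abbreviations and misspellings.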

Solution 1
0 2017-03-14 10:38:20
