
Joining two csv files into a key-value rdd using pyspark

I am trying to combine two csv files with nothing in common (no common key) into a key-value paired rdd using pyspark.

Let's say A.csv has

a
b
c

and B.csv has

1
2
3

Is there an option in pyspark to get an rdd by joining these two, like this:

a:1
b:2
c:3

Of course, the number of rows in both csv files should match. Is this something that is easy in pyspark, or should it be done in regular python first? That is, loop over both files, build a tuple of tuples like ((a,1),(b,2)...), and then pass that to parallelize.

Just a tool solution, showing the general principle, but not focusing on your specific data structures:

with open('A.csv', 'r') as f:
    a = f.read().splitlines()   # one entry per line of A.csv
with open('B.csv', 'r') as f:
    b = f.read().splitlines()   # one entry per line of B.csv
dic = dict(zip(a, b))           # {'a': '1', 'b': '2', 'c': '3'}
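
To get from that dictionary to the key-value rdd the question asks for, one more step along these lines should work (a sketch, assuming an existing SparkContext named sc):

key_rdd = sc.parallelize(list(dic.items()))
print(key_rdd.collect())   # [('a', '1'), ('b', '2'), ('c', '3')]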

If you have more complex data structures, you should add a proper CSV parser (e.g. the csv module from the standard Python library).
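
A rough sketch of that variant, assuming plain comma-separated files where only the first column of each is wanted:

import csv

with open('A.csv', newline='') as fa, open('B.csv', newline='') as fb:
    a = [row[0] for row in csv.reader(fa) if row]   # first column of A.csv
    b = [row[0] for row in csv.reader(fb) if row]   # first column of B.csv
dic = dict(zip(a, b))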

I am writing this for people who may need it in the future. I am just modifying the code from @sciroccorics slightly.

with open("/dbfs/FileStore/tables/a.csv", 'r') as f:
    a = f.read().splitlines()   # values from a.csv, one per line
with open("/dbfs/FileStore/tables/b.csv", 'r') as f:
    b = f.read().splitlines()   # values from b.csv, one per line
tup = tuple(zip(a, b))          # (('a', '1'), ('b', '2'), ('c', '3'))
key_rdd = spark.sparkContext.parallelize(tup)   # key-value pair RDD

Notice the use of tuple(zip(a,b)), which pairs the two lists element-wise before parallelizing them.
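
For anyone who wants to keep everything inside Spark instead of reading the files on the driver, here is a minimal sketch of a pure-pyspark alternative, assuming each file holds one value per line and an existing SparkSession named spark. zipWithIndex preserves the original line order, so joining on that index pairs line i of a.csv with line i of b.csv:

sc = spark.sparkContext
# (value, index) -> (index, value) so the line number becomes the join key
rdd_a = sc.textFile("/dbfs/FileStore/tables/a.csv").zipWithIndex().map(lambda kv: (kv[1], kv[0]))
rdd_b = sc.textFile("/dbfs/FileStore/tables/b.csv").zipWithIndex().map(lambda kv: (kv[1], kv[0]))
key_rdd = rdd_a.join(rdd_b).values()   # e.g. [('a', '1'), ('b', '2'), ('c', '3')] (order not guaranteed)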
