
How to print an RDD in Python in Spark

I have two files on HDFS, and I just want to join them on a column, say employee ID.

I am trying to simply print the files to make sure we are reading them correctly from HDFS.

lines = sc.textFile("hdfs://ip:8020/emp.txt")
print(lines.count())

I have tried the foreach and println functions as well, but I am not able to display the file data. I am working in Python and am totally new to both Python and Spark.

This is really easy: just do a collect(). You must be sure that all the data fits in memory on your master (the driver).

# collect() pulls the entire RDD back to the driver, so it must fit in memory there
my_rdd = sc.parallelize(range(10000000))
print(my_rdd.collect())

If that is not the case, you can just take a sample by using the take method.

# I use an exaggerated number to remind you it is very large and won't fit in memory on your master, so collect wouldn't work
my_rdd = sc.parallelize(range(100000000000000000))
print(my_rdd.take(100))
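Applied to the file from the question, the same approach prints the first few lines so you can check that the read from HDFS works:

lines = sc.textFile("hdfs://ip:8020/emp.txt")
# take(10) brings only the first 10 lines back to the driver, so it is safe for large files
for line in lines.take(10):
    print(line)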

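Since the end goal in the question is to join the two files on employee ID, here is a minimal sketch of one way to do that with pair RDDs. The second file name (dept.txt) and the assumption that both files are comma-separated with the employee ID in the first field are mine, not from the question:

emp = sc.textFile("hdfs://ip:8020/emp.txt")
dept = sc.textFile("hdfs://ip:8020/dept.txt")  # assumed name of the second file

# key each line by its employee ID (assumed to be the first CSV field)
emp_pairs = emp.map(lambda line: (line.split(",")[0], line))
dept_pairs = dept.map(lambda line: (line.split(",")[0], line))

# join on the employee ID key and inspect a few joined records
joined = emp_pairs.join(dept_pairs)
print(joined.take(10))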
