
How to print an RDD in Python in Spark

I have two files on HDFS, and I just want to join them on a column, say employee ID.

I am trying to simply print the files to make sure we are reading them correctly from HDFS.

lines = sc.textFile("hdfs://ip:8020/emp.txt")
print(lines.count())

I have tried the foreach and println functions as well, but I am not able to display the file data. I am working in Python and am totally new to both Python and Spark.

This is really easy: just do a collect(). You must be sure that all the data fits in memory on your master (the driver).

# collect() pulls the entire RDD back to the driver, so it must fit in memory there
my_rdd = sc.parallelize(range(10000000))
print(my_rdd.collect())

If that is not the case, you can just take a sample by using the take method.

# I use an exaggerated number to remind you it is very large and won't fit in memory on your master, so collect wouldn't work
my_rdd = sc.parallelize(range(100000000000000000))
print(my_rdd.take(100))
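Applied to the file from the question, the same approach prints the first few lines so you can check that the read from HDFS works:

lines = sc.textFile("hdfs://ip:8020/emp.txt")
# take(10) brings only the first 10 lines back to the driver, so it is safe for large files
for line in lines.take(10):
    print(line)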

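Since the end goal in the question is to join the two files on employee ID, here is a minimal sketch of one way to do that with pair RDDs. The second file name (dept.txt) and the assumption that both files are comma-separated with the employee ID in the first field are mine, not from the question:

emp = sc.textFile("hdfs://ip:8020/emp.txt")
dept = sc.textFile("hdfs://ip:8020/dept.txt")  # assumed name of the second file

# key each line by its employee ID (assumed to be the first CSV field)
emp_pairs = emp.map(lambda line: (line.split(",")[0], line))
dept_pairs = dept.map(lambda line: (line.split(",")[0], line))

# join on the employee ID key and inspect a few joined records
joined = emp_pairs.join(dept_pairs)
print(joined.take(10))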
