
Line count discrepancy in pyspark write csv

I have a pyspark dataframe that I want to write to HDFS. I am using the following command: df.write.mode("overwrite").option("header", "true").option("sep", "|").csv(outfile, compression="bzip2")

I am observing a weird thing. The dataframe has 366,000 rows, which I obtained using the df.count() function. However, the output of the write command only has 72,557 lines (wc -l command). Ideally each row should have a corresponding line in the output. Is there anything wrong with the write command I have been using?

It turns out that there were some rows with all elements as null, and this led to the discrepancy in the line count.

And the rows were null because, while reading the dataframe, I was passing a manually defined schema. The rows that did not conform to the schema were inserted as null rows in the dataframe.

