
Pyspark Dropping RDD Rows Without Filter

I wrote a Pyspark program that takes two identical copies of the same input file and transforms the data into two new files, each with its own format. I read both files into DataFrames, which contain the same number of rows. After that, I convert each DataFrame back into an RDD and apply different mapping logic to transform the fields of each row (no filters are applied while mapping). However, the output DataFrames don't contain the same number of rows as the input - rows are being dropped without explanation.

I have tried changing the order of the logic, printing out the row counts at various stages, etc. The logs contain no errors or warnings, only my own print statements that show the decrease in row count.

print("Input rows (f2): " + str(f2_df_count))
print("Input rows (f1): " + str(f1_df_count))


f2_rdd = f2_temp_df.rdd.map(list).map(lambda line:
    ("A",
    line[52].strip(),
    ...
    line[2].zfill(5)))
f2_df = sqlContext.createDataFrame(f2_rdd, f2_SCHEMA).dropDuplicates()
f2_df.write.format(OUTPUT_FORMAT).options(delimiter='|').save(f2_OUTPUT)
f2_count = f2_df.count()


f1_rdd = f1_temp_df.rdd.map(list).map(lambda line:
    ("B",
    line[39],
    ...
    line[13] if line[16] != "D" else "C"))
f1_df = sqlContext.createDataFrame(f1_rdd, f1_SCHEMA).dropDuplicates()
f1_df.write.format(OUTPUT_FORMAT).options(delimiter='|').save(f1_OUTPUT)
f1_count = f1_df.count()


print("F2 output rows: " + str(f2_count) + " rows (dropped " + str(f2_df_count - f2_count) + ").")
print("F1 output rows: " + str(f1_count) + " rows (dropped " + str(f1_df_count - f1_count) + ").")

There are no error messages, but my logs clearly show that rows are being dropped. Stranger still, they are being dropped inconsistently: f1 loses a different number of rows than f2.

Input rows (f2): 261
Input rows (f1): 261
F2 output rows: 260 rows (dropped 1).
F1 output rows: 259 rows (dropped 2).

On larger runs the difference is sometimes higher, on the order of 100-200 rows. I would appreciate it if someone could explain what might be happening and how I can work around it.
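For reference, this is roughly what I mean by counting at various stages - a minimal sketch only (the f2_mapped_df and f2_deduped_df names are just illustrative; it reuses f2_rdd, f2_SCHEMA, and sqlContext from the snippet above):

# Count immediately after the map/createDataFrame step and again after
# dropDuplicates(), so the step that loses rows shows up in the logs.
f2_mapped_df = sqlContext.createDataFrame(f2_rdd, f2_SCHEMA)
print("After map/createDataFrame: " + str(f2_mapped_df.count()))

f2_deduped_df = f2_mapped_df.dropDuplicates()
print("After dropDuplicates:      " + str(f2_deduped_df.count()))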

The answer: I had assumed the duplicates were already removed earlier in the pipeline, but I had included an extra dropDuplicates() call after converting the RDD back into a DataFrame, and that call was what silently removed the rows. Sorry to anyone who spent time on this unnecessarily!
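For anyone hitting the same symptom, here is a minimal sketch of the effect (it uses a local SparkSession instead of the sqlContext above, and the column names are made up): two rows that are distinct in the raw file can collapse into the same tuple once the mapping keeps only a subset of the fields and normalises them, and dropDuplicates() then removes one of them without any warning.

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Two rows that are distinct in the raw input...
src = spark.createDataFrame([("001", "A ", "D"), ("002", "A", "D")],
                            ["id", "code", "flag"])
print(src.count())  # 2

# ...but the mapping keeps only two fields and strips whitespace,
# so both rows map to the same tuple ("A", "D").
mapped = src.rdd.map(lambda r: (r["code"].strip(), r["flag"]))
out = spark.createDataFrame(mapped, ["code", "flag"]).dropDuplicates()
print(out.count())  # 1 -- the "duplicate" row was dropped silently

Counting before and after the dropDuplicates() call (or removing the call entirely when the input is already de-duplicated) makes the discrepancy disappear.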
