PySpark `monotonically_increasing_id()` returns 0 for each row

Question

This code

from pyspark.sql.functions import lit, monotonically_increasing_id

def foobar(df):
    return (
        df.withColumn("id", monotonically_increasing_id())
        .withColumn("foo", lit("bar"))
        .withColumn("bar", lit("foo"))
    )

somedf = foobar(somedf)
somedf.show()  # <-- each `id` has value 0

creates and prints a data frame where each id has value 0.

I am really confused, because this is the description of monotonically_increasing_id in the documentation:

The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive. The current implementation puts the partition ID in the upper 31 bits, and the record number within each partition in the lower 33 bits. The assumption is that the data frame has less than 1 billion partitions, and each partition has less than 8 billion records.

It clearly says that each row will have a unique value. It also indicates that ids are unique across partitions, which means that it is safe to use this method in a distributed environment, as each row will have a unique id across all of the nodes:

It puts the partition ID in the upper 31 bits and the record number within each partition in the lower 33 bits.
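
To make the quoted layout concrete, here is a minimal plain-Python sketch of how an ID with that bit layout could be composed and taken apart. `make_id` and `split_id` are hypothetical helpers written for illustration only; they are not part of PySpark, and the sketch only mirrors what the documentation text describes.

```python
from typing import Tuple

RECORD_BITS = 33  # lower 33 bits hold the record number, per the quoted docs


def make_id(partition_id: int, record_number: int) -> int:
    """Compose an ID: partition ID in the upper bits, record number in the lower 33."""
    return (partition_id << RECORD_BITS) | record_number


def split_id(generated_id: int) -> Tuple[int, int]:
    """Recover (partition_id, record_number) from a composed ID."""
    record_mask = (1 << RECORD_BITS) - 1
    return generated_id >> RECORD_BITS, generated_id & record_mask


# Two rows in each of three partitions
for partition_id in range(3):
    for record_number in range(2):
        print(partition_id, record_number, make_id(partition_id, record_number))
```

The first row of partition 1 gets 8589934592 (2 to the power of 33), which already does not fit in a 32-bit integer.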

What is even more confusing is that in a single-instance environment (on my local machine) the above code works flawlessly (each row has a unique id), but when I deploy the same code to AWS and run it on EMR, I get only 0s for the ids.

Answer 1

In case someone else also has a problem with monotonically_increasing_id returning 0s (the issue was much sillier than I anticipated):

Make sure that you aren't casting to int32, because monotonically_increasing_id returns int64, and it seems that values that overflow are cast to 0.
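
One possible explanation for the zeros, sketched in plain Python: assuming the cast to a 32-bit integer simply keeps the low 32 bits (as a Java-style narrowing conversion does), the partition ID is discarded entirely, because it sits above bit 32, and only the per-partition record number survives. `make_id` and `to_int32` are hypothetical helpers for illustration, not PySpark functions, and the truncation behaviour is an assumption rather than something confirmed in this thread.

```python
def make_id(partition_id, record_number):
    """Bit layout from the docs: partition ID above the lower 33 bits."""
    return (partition_id << 33) | record_number


def to_int32(value):
    """Assumed cast behaviour: keep the low 32 bits, reinterpret as signed."""
    low = value & 0xFFFFFFFF
    return low - (1 << 32) if low >= (1 << 31) else low


# Three partitions holding one row each (plausible on a cluster)
cluster_ids = [make_id(partition_id, 0) for partition_id in range(3)]
print(cluster_ids)                         # [0, 8589934592, 17179869184]
print([to_int32(i) for i in cluster_ids])  # [0, 0, 0]

# One partition holding three rows (plausible on a local machine)
local_ids = [make_id(0, record_number) for record_number in range(3)]
print([to_int32(i) for i in local_ids])    # [0, 1, 2]
```

Under that assumption, a run where every partition holds a single row would show only 0s after the cast, while a single-partition local run would still show distinct values. Keeping the column as a 64-bit type avoids the loss.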

solution1
0 ACCEPTED 2023-01-13 11:05:40
