
How do I select all records from one DataFrame whose value exists (or does not exist) in another DataFrame, using PySpark?

I have two PySpark DataFrames. I want to select all records from voutdf whose "hash" does not exist in vindf.tx_hash.

How can I do this with PySpark DataFrames? I tried a semi join, but I end up with out-of-memory errors.

voutdf = sqlContext.createDataFrame(voutRDD,["hash", "value","n","pubkey"])

vindf = sqlContext.createDataFrame(vinRDD,["txid", "tx_hash","vout"])

You can do this with a left-anti join:

df = voutdf.join(vindf.withColumnRenamed("tx_hash", "hash"), "hash", 'left_anti')

A left-anti join returns every row from the left dataset that has no match in the right dataset, and it returns only the left dataset's columns.
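To make the semantics concrete, here is a minimal plain-Python sketch of what a left-anti join keeps, alongside its counterpart, the left-semi join, which covers the "exists" case. It models the behavior only and is not how Spark executes the join. The helper names and sample rows are made up for illustration. In PySpark the corresponding join types are the strings 'left_anti' and 'left_semi'.

```python
# Plain-Python model of left-semi / left-anti join semantics (no Spark needed).
# Helper names and sample rows are hypothetical, for illustration only.

def left_semi(left_rows, right_rows, left_key, right_key):
    """Keep left rows whose key value appears somewhere in the right rows."""
    right_keys = {row[right_key] for row in right_rows}
    return [row for row in left_rows if row[left_key] in right_keys]


def left_anti(left_rows, right_rows, left_key, right_key):
    """Keep left rows whose key value appears nowhere in the right rows."""
    right_keys = {row[right_key] for row in right_rows}
    return [row for row in left_rows if row[left_key] not in right_keys]


vout_rows = [
    {"hash": "a1", "value": 50, "n": 0, "pubkey": "k1"},
    {"hash": "b2", "value": 20, "n": 0, "pubkey": "k2"},
    {"hash": "c3", "value": 5, "n": 1, "pubkey": "k3"},
]
vin_rows = [
    {"txid": "t9", "tx_hash": "a1", "vout": 0},
    {"txid": "t8", "tx_hash": "a1", "vout": 1},  # second match for "a1"
]

# Rows of vout whose hash is absent from vin.tx_hash (the anti join).
not_in_vin = left_anti(vout_rows, vin_rows, "hash", "tx_hash")

# Rows of vout whose hash is present in vin.tx_hash (the semi join).
# A left row appears at most once, even when it matches several right rows.
in_vin = left_semi(vout_rows, vin_rows, "hash", "tx_hash")

print([row["hash"] for row in not_in_vin])
print([row["hash"] for row in in_vin])
```

Neither result ever contains columns from the right side, which is what distinguishes these joins from an inner or left outer join.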


 