
Joining two dataframes through an inner join and a filter condition in PySpark (Python)

I need to join two dataframes with an inner join AND a filter condition based on the values of one of the columns in the right dataframe. I've tried some of the solutions from questions posted here, but nothing has worked so far. Could anyone please help out?

I have two dataframes, df_consumos_diarios and df_facturas_mes_actual_flg, which share one key: id_cliente.

These are the two DFs:

df_consumos_diarios.show(5)
+----------+----------------+------------+----------------------+---------------------+----------+
|id_cliente|consumo_datos_MB|sms_enviados|minutos_llamadas_movil|minutos_llamadas_fijo|     fecha|
+----------+----------------+------------+----------------------+---------------------+----------+
|         1|             664|           3|                    25|                    0|2020-08-01|
|         1|             943|           0|                    12|                    5|2020-08-02|
|         1|            1035|           1|                    46|                   10|2020-08-03|
|         1|             760|           3|                    17|                    0|2020-08-04|
|         1|            1409|           1|                    31|                    4|2020-08-05|
+----------+----------------+------------+----------------------+---------------------+----------+


df_facturas_mes_actual_flg.show(5)
+----------+---------+-------+----------+----+-----------+
|id_cliente|id_oferta|importe|     fecha|edad|flg_mes_ant|
+----------+---------+-------+----------+----+-----------+
|         1|        9|   36.5|2020-07-31|  26|          1|
|         1|        6|  118.6|2020-07-31|  26|          1|
|         1|        6|  124.5|2020-07-31|  26|          1|
|         2|        4|   95.0|2020-07-31|  58|          1|
|         3|        5|  102.5|2020-07-31|  68|          1|
+----------+---------+-------+----------+----+-----------+

The reason I want to do an inner join rather than a merge or concatenate is that these are pyspark.sql DataFrames, and I thought it would be easier this way.

What I want to do is create a new dataframe out of these two, keeping only the rows where "flg_mes_ant" in the right dataframe is NOT equal to 1. The inner join on its own works fine, but adding the filter condition breaks everything. This is what I've tried so far:

   df2 = df_consumos_diarios.join(df_facturas_mes_actual_flg, on=["id_cliente"] & 
         [df_facturas_mes_actual_flg["flg_mes_ant"] != "1"], how='inner')

The error message I'm getting is:

TypeError: unsupported operand type(s) for &: 'list' and 'list'

Does anyone know what I'm doing wrong and how I could get past this error?

The error comes from applying & to two Python lists: the on argument expects a column name (or a list of names) or a Column expression, not lists combined with &. You can simply apply the filter after the join:

import pyspark.sql.functions as F

df2 = df_consumos_diarios.join(
    df_facturas_mes_actual_flg, 
    on="id_cliente", 
    how='inner'
).filter(F.col("flg_mes_ant") != "1")

Or you can filter the right dataframe before joining (which should be more efficient):

df2 = df_consumos_diarios.join(
    df_facturas_mes_actual_flg.filter(df_facturas_mes_actual_flg["flg_mes_ant"] != "1"), 
    on="id_cliente", 
    how='inner'
)
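
If you prefer to express everything in one call, the equi-join condition and the filter can also be combined into a single Column expression with &. This is only a sketch using the same dataframe names as above; note that joining on an expression (rather than on the column name) keeps id_cliente from both sides, so one copy is dropped afterwards:

# Sketch: combine the equi-join condition and the filter with & on Column
# expressions (not on Python lists, which caused the original TypeError).
cond = (
    (df_consumos_diarios["id_cliente"] == df_facturas_mes_actual_flg["id_cliente"])
    & (df_facturas_mes_actual_flg["flg_mes_ant"] != "1")
)

df2 = (
    df_consumos_diarios
    .join(df_facturas_mes_actual_flg, on=cond, how="inner")
    # joining on an expression keeps id_cliente from both dataframes, so drop one copy
    .drop(df_facturas_mes_actual_flg["id_cliente"])
)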
