
How to compare two PySpark columns with Pytest, without using dataframes?

I have a situation where I need to compare two columns before creating the DataFrame in a test suite, something like this:

import pytest
import pyspark.sql.functions as F

def test_first():  # pytest only collects functions named test_*
    c1 = F.col("First Column").alias("1st Column")
    c2 = F.col("Second Column").alias("2nd Column")
    c3 = F.col("Second Column").alias("2nd Column")

    print(c1)
    print(c2)
    print(c3)

    assert c1 != c2
    assert c2 == c3

When I run pytest with the -s and -vv options I see the following:

Column<'`First Column` AS `1st Column`'>
Column<'`Second Column` AS `2nd Column`'>
Column<'`Second Column` AS `2nd Column`'>

self = Column<'(`First Column` AS `1st Column` = `Second Column` AS `2nd Column`)'>

    def __nonzero__(self) -> None:
        raise ValueError(
>           "Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
            "'~' for 'not' when building DataFrame boolean expressions."
        )
E       ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.

I get the same error if I comment out the first assertion (c1 != c2) and keep only the second (c2 == c3).

How can I assert that 2 columns are the same in this simple case scenario?

Since Column overloads == (and !=) to build a new Spark expression instead of returning a bool, the columns cannot be compared directly. The best solution I found was to use str() to convert each column to its string representation and compare the strings; I also compared the data types.
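To illustrate the mechanism without needing a running SparkSession, here is a minimal sketch using a hypothetical stand-in class (FakeColumn, not the real pyspark.sql.Column) that mimics how Column overloads `__eq__` to build an expression and refuses truth-testing, and how comparing str() representations sidesteps the problem:

```python
class FakeColumn:
    """Hypothetical stand-in mimicking pyspark.sql.Column's behavior."""

    def __init__(self, expr):
        self.expr = expr

    def __eq__(self, other):
        # Like Column.__eq__: returns a new expression, not a bool.
        return FakeColumn(f"({self.expr} = {other.expr})")

    def __bool__(self):
        # Like Column.__bool__: refuses conversion to bool,
        # which is what makes `assert c2 == c3` blow up.
        raise ValueError("Cannot convert column into bool")

    def __str__(self):
        return f"Column<'{self.expr}'>"


c2 = FakeColumn("`Second Column` AS `2nd Column`")
c3 = FakeColumn("`Second Column` AS `2nd Column`")

# `assert c2 == c3` would raise ValueError via __bool__;
# comparing the string forms works:
assert str(c2) == str(c3)
```

With the real library the same pattern applies: `assert str(c2) == str(c3)` passes for columns built from identical expressions, though it compares representations rather than semantic equivalence, so two differently written but equivalent expressions will still compare unequal.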
