
Is there a way to test PySpark regexes?

I'd like to test different inputs against a PySpark regex to see whether they succeed or fail before running a build. Is there a way to test this in Foundry without running a full build/checks?

You can downsample your input using the Preview functionality in Authoring, where you can specify a filter to craft your input for testing.

Then, you can run your PySpark code on this custom sample to verify it does what you expect.

After clicking the Preview button, click the gear icon in the view shown below.

[Screenshot: Sampling]

Then, you can describe what sample you want.

[Screenshot: Filter]

Once you have this sample, running your regex against it is fast, so it's easy to iterate and test.
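
For example, if the regex lives in a transform like the sketch below (the column name raw_value and the pattern are made up for illustration, not from the original post), you can run it on the Preview sample and eyeball which rows matched:

from pyspark.sql import functions as F

PATTERN = r"^(\d{3})-(\d{4})$"  # hypothetical pattern under test

def apply_regex(df):
    # Flag rows that match, and extract the first capture group,
    # so mismatches stand out in the Preview output.
    df = df.withColumn("matches", F.col("raw_value").rlike(PATTERN))
    return df.withColumn("prefix", F.regexp_extract("raw_value", PATTERN, 1))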

I am also a fan of writing unit tests. Create a small input DataFrame and the desired output DataFrame, then write a simple function that takes the input, applies the regex, and returns the output. For example:

import pytest
import pandas as pd  # noqa
import numpy as np
from myproject.analysis.simple_discount import calc

columns = [
    "date",
    "id",
    "other",
    "brand",
    "grp_id",
    "amounth",
    "pct",
    "max_amount",
    "unit",
    "total_units"
]

output_columns = [
    "date",
    "id",
    "other",
    "brand",
    "grp_id",
    "amount",
    "pct",
    "max_amount",
    "qty",
    "total_amount"
]


@pytest.fixture
def input_df(spark_session):
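    # spark_session is assumed to be provided by the test environment
    # (e.g. a conftest.py fixture or the transforms-python test harness).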
    data = [
        ['4/1/21', 'a', '1', 'mn', '567', 0.54, 50, 1.08, 1.08, 1],
        ['4/1/21', 'a', '1', 'mn', '567', 0.54, 50, 1.08, 1.08, 1],
        ['4/1/21', 'a', '1', 'mn', '567', 0.54, 50, 1.08, 1.08, 1],
        ['4/1/21', 'a', '1', 'mn', '567', 0.54, 50, 1.08, 1.08, 2],
        ['4/1/21', 'a', '1', 'mn', '567', 0.54, 50, 1.08, 1.08, 4],
        ['4/1/21', 'a', '1', 'mn', '567', 0.54, 50, 1.08, 1.08, 2],
        ['4/1/21', 'a', '1', 'mn', '567', 0.54, 50, 1.08, 1.08, 2],
        ['4/1/21', 'a', '1', 'mn', '567', 0.54, 50, 1.08, 1.08, 2],
        ['4/1/21', 'a', '1', 'mn', '567', 0.54, 50, 1.08, 1.08, 2],
        ['4/1/21', 'a', '1', 'mn', '567', 0.54, 50, 1.08, 1.08, 2],
        ['4/1/21', 'a', '1', 'mn', '567', 0.54, 50, 1.08, 1.08, 3],
        ['4/1/21', 'a', '1', 'mn', '567', 0.54, 50, 1.08, 1.08, 4],
        ['4/1/21', 'a', '1', 'mn', '567', 0.54, 50, 1.08, 1.08, 1],
        ['3/1/21', 'b', '2', 'mn', '555', 1.3, 50, 2.6, 2.6, 1],
        ['6/1/21', 'b', '2', 'mn', '555', 1.3, 50, 2.6, 2.6, 1],
        ['6/1/21', 'b', '2', 'mn', '555', 1.3, 50, 2.6, 2.6, 1],
        ['6/1/21', 'b', '2', 'mn', '555', 1.3, 50, 2.6, 2.6, 1],
        ['6/1/21', 'b', '2', 'mn', '555', 1.3, 50, 2.6, 2.6, 1],
    ]
    pdf = pd.DataFrame(data, columns=columns)
    pdf = pdf.replace({np.nan: None})
    return spark_session.createDataFrame(pdf)


@pytest.fixture
def output_df(spark_session):
    data = [
        ['4/1/21', 'a', '1', 'mn', '567', 0.54, 50, 1.08, 27, 14.580000000000002],
        ['3/1/21', 'b', '2', 'mn', '555', 1.3, 50, 2.6, 1, 1.3],
    ]
    pdf = pd.DataFrame(data, columns=output_columns)
    pdf = pdf.replace({np.nan: None})
    return spark_session.createDataFrame(pdf)


# ======= FIRST RUN CASE

def test_normal_input(input_df, output_df):
    calc_output_df = calc(input_df)
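    # Sort both sides: Spark makes no row-ordering guarantee for collect().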
    assert sorted(calc_output_df.collect()) == sorted(output_df.collect())




#
# Folder Structure
#
# transforms-python/
# ├── ...
# └── src/
#     ├── ...
#     ├── myproject/
#     │   ├── ...
#     │   └── analysis/
#     │       ├── ...
#     │       └── simple_discount.py
#     └── tests/
#         ├── ...
#         └── unit_tests.py
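
The calc function above happens to be a discount calculation rather than a regex; for the regex case in the question, the function under test and its test would look something like this sketch (column name, pattern, and expected values are hypothetical):

from pyspark.sql import functions as F


def extract_prefix(df):
    # Pull the leading digits out of raw_value; Spark's regexp_extract
    # returns an empty string when the pattern does not match.
    return df.withColumn(
        "prefix", F.regexp_extract("raw_value", r"^(\d+)-", 1)
    )


def test_extract_prefix(spark_session):
    input_df = spark_session.createDataFrame(
        [("123-abc",), ("no-digits",)], ["raw_value"]
    )
    result = {r["raw_value"]: r["prefix"] for r in extract_prefix(input_df).collect()}
    assert result == {"123-abc": "123", "no-digits": ""}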
