Shuffling rows in PySpark dataframe (Foundry code-repository environment)

I want to shuffle the input dataset in a code repository before I start filtering it. Foundry preview only loads the top 10,000 rows, so when I apply filters, preview returns 0 rows because the filters only see those first 10,000 rows (and I am certain that the rows I am looking for are at the bottom of the dataset). I want to shuffle the dataset as it is loaded into Foundry memory so that the preview sample includes some of the rows I am filtering for.

My code:

from pyspark.sql import functions as F
from transforms.api import transform_df, Input, Output


@transform_df(
    output=Output("Path_to_output_file"),
    input_file=Input("path_to_input_file")
)
def compute(input_file):
    keep = ["MALE", "FEMALE"]
    new_df = (
        input_file
        .filter(
            # (F.col("SEX") == "MALE") &
            (F.col("SEX").isin(keep)) &
            # Alternatives are grouped so that ^ and $ anchor every one of them
            (F.col("AGE").rlike(r"^(1[6-9]|[2-8][0-9]|9[0-4]|95\+)$")) &
            (F.col("ORG_TYPE") == "XXXX")
        )
    )
    return new_df

When I inspected the dataset, only the bottom 15,000 of its 276K rows have ORG_TYPE equal to "XXXX", so Foundry preview shows me 0 rows. When I build the transform I get all the expected rows, but builds take a long time.

If I could somehow shuffle the dataset before applying the filters, I might catch some of those 15,000 rows in the preview, which would let me develop further transformations without building every time.

Note: I can make this script work in a code workbook; all I want to know is the equivalent approach in a code repository.

I don't think you can shuffle the dataframe, but you can filter the input dataset used in the preview!

After clicking the button in the screenshot below, you can filter the preview input so that it only includes rows where ORG_TYPE is "XXXX".

[Screenshot: how to edit the preview settings]
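
A side note on the AGE filter: in a regex, | has lower precedence than the ^ and $ anchors, so in an ungrouped alternation such as ^16|17|18|19|[2-8][0-9]|9[0-4]|95\+$ the ^ applies only to the first alternative and the $ only to the last. The sketch below illustrates this with Python's re module rather than Spark itself. It assumes rlike behaves like a "find anywhere in the string" search, which is how I understand it to work, and the sample values are made up for illustration.

```python
import re

# Ungrouped: "^" binds only to "16" and "$" binds only to "95\+".
UNGROUPED = r"^16|17|18|19|[2-8][0-9]|9[0-4]|95\+$"
# Grouped: both anchors apply to every alternative.
GROUPED = r"^(1[6-9]|[2-8][0-9]|9[0-4]|95\+)$"


def matches(pattern, value):
    # re.search looks for a match anywhere in the string
    # (assumed here to mirror how rlike evaluates a pattern).
    return re.search(pattern, value) is not None


# Hypothetical AGE values, not taken from the real dataset.
samples = ["16", "94", "95+", "5", "117", "160", "217"]

ungrouped_hits = [s for s in samples if matches(UNGROUPED, s)]
grouped_hits = [s for s in samples if matches(GROUPED, s)]

print("ungrouped:", ungrouped_hits)
print("grouped:  ", grouped_hits)
```

With these samples, the ungrouped pattern also accepts values like "117" and "160", while the grouped pattern accepts only "16", "94" and "95+". If AGE only ever holds well-formed values this makes no practical difference, but the grouped form states the intent more safely.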
