I want to shuffle the input dataset in my Code Repository transform before filtering it. Foundry's preview only loads the top 10,000 rows, so when I apply filters, the preview returns 0 rows because the filters run against those top 10,000 rows only (and I am certain the rows I am looking for are at the bottom of the dataset). I want the dataset shuffled as it is loaded into Foundry memory so that my filter catches some of the desired rows.
My code:

from transforms.api import transform_df, Input, Output
from pyspark.sql import functions as F


@transform_df(
    output=Output("Path_to_output_file"),
    input_file=Input("path_to_input_file"),
)
def compute(input_file):
    keep = ["MALE", "FEMALE"]
    new_df = (
        input_file
        .filter(
            # (F.col("SEX") == "MALE") &
            (F.col("SEX").isin(keep))
            # Group the age alternation so ^ and $ anchor the whole
            # pattern (matches ages 16-94 and "95+"); without the
            # parentheses the anchors bind only to the first and last
            # alternatives.
            & (F.col("AGE").rlike(r"^(1[6-9]|[2-8][0-9]|9[0-4]|95\+)$"))
            & (F.col("ORG_TYPE") == "XXXX")
        )
    )
    return new_df
When I inspected the dataset, only the bottom 15,000 of its 276K rows have ORG_TYPE equal to "XXXX", which is why the Foundry preview shows 0 rows. When I build the transform I do get all the matching rows, but builds take a long time.
If I could somehow shuffle the dataset before applying the filters, the preview would likely catch some of those 15,000 rows, letting me iterate on further transformations without running a full build every time.
Note: I can already make this script work in Code Workbook; all I want to know is the equivalent methodology in Code Repositories.