
How to combine multiple pyspark sql queries to the same table into one query

I currently have a bad solution to a problem, but it works. I would like suggestions on how to improve it. I have a Spark DataFrame that consists of sensor data from many different machines, currently in long format. Here is an example:

machine | run | timestep            | sensor1 | sensor2 |
A       |  1  | 2020-10-11 00:00:10 | 10      | 200     |
A       |  1  | 2020-10-11 00:00:20 | 11      | 200     |
A       |  1  | 2020-10-11 00:00:30 | 1       | 200     |
B       |  1  | 2020-10-11 01:10:10 | 10      | 10      |
B       |  1  | 2020-10-11 01:10:20 | 1000    | 5       |
A       |  1  | 2020-10-11 00:00:40 | 10      | 200     |
A       |  2  | 2020-11-20 00:00:10 | 10      | 200     |
...

I have, in code, a dictionary of machines (keys) with a list of associated time ranges (values). I would like to extract all the information for each specified machine, but only for the provided time ranges. For example,

{"A": [("2020-10-1 00:00:00", "2020-12-30"), ("2021-1-15", "2021-3-30"))], ...}

is a sample entry in the dictionary. In this case I would like to extract two sets of data, over the given time ranges, for one machine. Currently I iterate over the dictionary and run one query per time range, saving each result to a file. I then loop through the saved files and combine all the individual dataframes into one dataframe.

Here is an example of the process in code:

    # lifetimes, MACHINE_PART_LIFETIME_DIR and get_spark_context() are defined elsewhere.
    for machine, machine_parts in lifetimes.items():
        for machine_part in machine_parts:
            # One query per (machine, time range) pair.
            query = f"""
            select `timestamp`, sensor1, run, machine
            from database.table
            where machine = '{machine}'
            and start >= '{machine_part.install}'
            and end <= '{machine_part.removal}'
            order by start, `timestamp` asc
            """

            print(f"Executing query: {query}")
            spark = get_spark_context()
            df = spark.sql(query).toPandas()

            # Save each result to its own CSV file.
            filename = f"{machine}_{machine_part.install}_{machine_part.removal}.csv".replace(
                " ", "_"
            )

            MACHINE_PART_LIFETIME_DIR.mkdir(parents=True, exist_ok=True)

            filepath = os.path.join(MACHINE_PART_LIFETIME_DIR, filename)
            print(f"Saving to: {filepath}")
            df.to_csv(filepath, index=False)
            print("-" * 20)

Ideally I think it should be possible (and probably better) to write a single query that does all of this in one go, instead of running multiple queries, saving each output, re-opening the files, and combining them into one dataframe. That would also spare me converting each Spark DataFrame to a pandas one, saving it to disk, and then re-opening and combining everything. Is there a way to do this dynamically with PySpark?
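
For reference, one way to fold the whole loop into a single query is to build the WHERE clause dynamically, with one OR'd predicate per (machine, time range) pair. This is only a sketch based on the table and dictionary above; the table name, column names, `get_spark_context()` and the output filename are taken from (or assumed to match) the question:

    # Build one predicate per (machine, time range) pair and OR them together.
    predicates = []
    for machine, machine_parts in lifetimes.items():
        for machine_part in machine_parts:
            predicates.append(
                f"(machine = '{machine}' "
                f"and start >= '{machine_part.install}' "
                f"and end <= '{machine_part.removal}')"
            )

    where_clause = " or ".join(predicates)
    query = f"""
    select `timestamp`, sensor1, run, machine
    from database.table
    where {where_clause}
    order by machine, start, `timestamp` asc
    """

    spark = get_spark_context()
    df_all = spark.sql(query)  # one Spark DataFrame covering every machine/range
    # Hypothetical output name; write once instead of once per range.
    df_all.toPandas().to_csv("all_machine_parts.csv", index=False)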

As @mck suggested, I was able to drastically improve this by using a join. For those interested, the relevant code I used is below.

To go from the dictionary to a Spark DataFrame:

    # Flatten the dictionary into (machine, install, removal) rows.
    values = []
    for machine, machine_parts in lifetimes.items():
        for machine_part in machine_parts:
            values.append((machine, machine_part.install, machine_part.removal))
    columns = ["machine", "install_date", "removal_date"]
    df = spark.createDataFrame(values, columns)
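
One thing to watch (an assumption on my part, depending on how the values are stored): if `install`/`removal` are plain strings while `start`/`end` in the table are timestamps, casting the range bounds before the join keeps the comparison well defined:

    # Optional: cast the range bounds to timestamps if they arrive as strings.
    from pyspark.sql.functions import to_timestamp

    df = df.withColumn("install_date", to_timestamp("install_date")) \
           .withColumn("removal_date", to_timestamp("removal_date"))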

To do the join:

    df_joined = df1.join(df).where(
        (df1.machine == df.machine)
        & (df1.start >= df.install_date)
        & (df1.end <= df.removal_date)
    )
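
An equivalent, slightly more idiomatic form (a sketch using the same `df1`/`df` names as above; the `broadcast` hint is my assumption, given that the ranges DataFrame is small) passes the condition to `join()` directly and drops the duplicate `machine` column:

    from pyspark.sql.functions import broadcast

    cond = (
        (df1.machine == df.machine)
        & (df1.start >= df.install_date)
        & (df1.end <= df.removal_date)
    )
    # Broadcasting the small ranges DataFrame keeps this from being planned
    # as a large cross join followed by a filter.
    df_joined = df1.join(broadcast(df), cond).drop(df.machine)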
