I currently have a working but inelegant solution to a problem, and I would like suggestions on how to improve it. I have a Spark DataFrame consisting of sensor data from many different machines, currently in long format. Here is an example:
machine | run | timestep | sensor1 | sensor2 |
A | 1 | 2020-10-11 00:00:10 | 10 | 200 |
A | 1 | 2020-10-11 00:00:20 | 11 | 200 |
A | 1 | 2020-10-11 00:00:30 | 1 | 200 |
B | 1 | 2020-10-11 01:10:10 | 10 | 10 |
B | 1 | 2020-10-11 01:10:20 | 1000 | 5 |
A | 1 | 2020-10-11 00:00:40 | 10 | 200 |
A | 2 | 2020-11-20 00:00:10 | 10 | 200 |
...
In my code I have a dictionary mapping machines (keys) to lists of associated time ranges (values). I would like to extract all the information for each specified machine, but only for the provided time ranges. For example,
{"A": [("2020-10-1 00:00:00", "2020-12-30"), ("2021-1-15", "2021-3-30")], ...}
is a sample entry in the dictionary. In this case I would like to extract two sets of data, over the given time ranges, for one machine. I currently iterate over the dictionary and run one query per time range, saving each result to a file. I then loop through the saved files and combine all the individual dataframes into one dataframe.
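The "loop through the saved files and combine" step described above could be sketched as follows (a minimal illustration, not the exact code from my project; the directory and column names are placeholders):

```python
# Recombination step: read every per-range CSV back in and
# concatenate the pieces into one DataFrame.
import glob
import os

import pandas as pd


def combine_csvs(directory):
    """Concatenate all CSV files in `directory` into a single DataFrame."""
    paths = sorted(glob.glob(os.path.join(directory, "*.csv")))
    frames = [pd.read_csv(p) for p in paths]
    return pd.concat(frames, ignore_index=True)
```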
Here is an example of the process in code:
import os

spark = get_spark_context()

for machine, machine_parts in lifetimes.items():
    for machine_part in machine_parts:
        query = f"""
        select `timestamp`, sensor1, run, machine
        from database.table
        where machine = '{machine}'
        and start >= '{machine_part.install}'
        and end <= '{machine_part.removal}'
        order by start, `timestamp` asc
        """
        print(f"Executing query: {query}")
        df = spark.sql(query).toPandas()
        filename = f"{machine}_{machine_part.install}_{machine_part.removal}.csv".replace(
            " ", "_"
        )
        MACHINE_PART_LIFETIME_DIR.mkdir(parents=True, exist_ok=True)
        filepath = os.path.join(MACHINE_PART_LIFETIME_DIR, filename)
        print(f"Saving to: {filepath}")
        df.to_csv(filepath, index=False)
        print("-" * 20)
Ideally, I think it should be possible (and probably better) to do all of this with a single query, instead of running multiple queries, saving the output, re-opening the files, combining them into one dataframe, and then saving the result. That would let me avoid converting each Spark dataframe to a pandas one, saving it to disk, and then re-opening and combining everything. Is there a way to do this dynamically with PySpark?
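One naive way to collapse the loop into a single query is to build the WHERE clause dynamically, OR-ing one predicate per (machine, time range) pair. This is an illustrative sketch only (the table and column names follow the snippet above; the dictionary values are assumed to be (install, removal) tuples), and note that string-built SQL scales poorly compared to the join approach shown below:

```python
# Build one SQL query covering every (machine, time range) pair
# from the lifetimes dictionary, instead of one query per range.
def build_query(lifetimes):
    clauses = []
    for machine, parts in lifetimes.items():
        for install, removal in parts:
            clauses.append(
                f"(machine = '{machine}' "
                f"and start >= '{install}' and end <= '{removal}')"
            )
    predicate = "\n   or ".join(clauses)
    return (
        "select `timestamp`, sensor1, run, machine\n"
        "from database.table\n"
        f"where {predicate}\n"
        "order by machine, start, `timestamp` asc"
    )
```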
As @mck suggested, I was able to improve this drastically by using a join. For those interested, here is the relevant code I used.
To go from the dictionary to a Spark dataframe:
values = []
for machine, machine_parts in lifetimes.items():
    for machine_part in machine_parts:
        values.append((machine, machine_part.install, machine_part.removal))

columns = ["machine", "install_date", "removal_date"]
df = spark.createDataFrame(values, columns)
To do the join:
df_joined = df1.join(df).where(
    (df1.machine == df.machine)
    & (df1.start >= df.install_date)
    & (df1.end <= df.removal_date)
)