
Concatenating two DataFrames with respect to dates

I think my problem involves a few parts. What do I have?

  • Two data frames, both indexed by timestamps. They cover similar periods, say 14:00 to 18:00 and 13:30 to 18:30, but the sampling intervals differ: one has a row every 3 s, the other at an irregular interval of roughly 0.6 s. The contents differ too: one holds GPS coordinates (2 columns + index), the other NO2 concentrations (1 column + index).

What do I want in the end?

  • One dataframe (again indexed by timestamps) with all 3 columns (GPS + NO2). I want to set the time interval of the index to, say, 1 s. That means both dataframes have to be interpolated, since neither may have a value at, for example, 15:30:56 (but only at 15:30:55.635 and 15:30:58.001).

What did I try so far?

  • Concatenating the two dataframes. The result does include all 3 columns I want, but its index is the time of the NO2 dataset and only the NO2 column is filled correctly (the other two contain NaN).

Here is the line of code:

allTheData = pd.concat([gpsDataFrame, no2DataFrame], axis=1)

I am new to pandas and relatively new to Python. I hope you can help me with these two steps:

  1. Create a DataFrame 'allTheData' that contains, in chronological order, every measured time (from either GPS or NO2) with the correct data. For example, if both dataframes have data at 15:30:05, add only one row containing all 3 columns; if only the GPS has data at 15:30:07, include the GPS data and set NO2 to NaN or similar.

  2. Interpolate the values so that I can choose a 1 s interval and get interpolated GPS AND NO2 data for every second, i.e. for each row.

Use DataFrame.resample to bring the two dataframes onto the same timestamp index:

import pandas as pd
import numpy as np

# generate some sample data according to your question
# (time-only strings are parsed as times on today's date)
date1 = pd.date_range("14:00", "18:00", freq="3s")
df1 = pd.DataFrame({"time": date1, "gps": np.random.rand(len(date1))})
date2 = pd.date_range("13:30", "18:30", freq="600ms")
df2 = pd.DataFrame({"time": date2, "no2": np.random.rand(len(date2))})

# set the timestamps as index
df1 = df1.set_index("time")
df2 = df2.set_index("time")

final_freq = "1s"

# upsample df1 to the final frequency; without interpolate() the new rows would be NaN
df1 = df1.resample(final_freq).interpolate(method='linear')

# downsample df2, averaging
df2 = df2.resample(final_freq).mean()

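Then you can just join them (join defaults to a left join, so the result covers df1's time range; pass how="outer" to keep every timestamp from both):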

df = df1.join(df2)
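
If you also want the first step from the question, one row per measured time with NaN where a sensor has no reading, the following is a minimal sketch rather than the only way to do it. The sample values and the names gps, no2, all_the_data, grid and on_grid are made up for illustration. Instead of averaging, it interpolates both sensors by elapsed time (method="time") and then keeps only the rows on the 1 s grid.

```python
import pandas as pd

# Hypothetical sample data: GPS every 3 s, NO2 at irregular times.
gps = pd.DataFrame(
    {"lat": [50.0, 50.3, 50.6], "lon": [8.0, 8.3, 8.6]},
    index=pd.to_datetime([
        "2024-01-01 14:00:00",
        "2024-01-01 14:00:03",
        "2024-01-01 14:00:06",
    ]),
)
no2 = pd.DataFrame(
    {"no2": [10.0, 12.0, 14.0, 20.0]},
    index=pd.to_datetime([
        "2024-01-01 14:00:00.000",
        "2024-01-01 14:00:00.500",
        "2024-01-01 14:00:01.500",
        "2024-01-01 14:00:06.000",
    ]),
)

# Step 1: one row per measured time; NaN where a sensor has no reading.
all_the_data = pd.concat([gps, no2], axis=1).sort_index()

# Step 2: add the 1 s grid points, interpolate by elapsed time,
# then keep only the rows that lie on the grid.
grid = pd.date_range(
    all_the_data.index.min().ceil("s"),
    all_the_data.index.max().floor("s"),
    freq="1s",
)
on_grid = (
    all_the_data.reindex(all_the_data.index.union(grid))
    .interpolate(method="time")
    .reindex(grid)
)
print(on_grid)
```

Whether time-weighted interpolation or per-second averaging is the better choice for the NO2 column depends on how noisy the sensor is; averaging smooths noise, interpolation does not.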

Note that you might have to change this slightly if your GPS position is stored as a tuple in a single column. In that case you would have to split it into two columns, latitude and longitude, for the upsampling to work.
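
As a rough sketch of that split, assuming the position sits in a column of (latitude, longitude) tuples; the column names position, lat and lon and the sample values are hypothetical:

```python
import pandas as pd

# Hypothetical frame with the position stored as (latitude, longitude) tuples.
gps = pd.DataFrame(
    {"position": [(50.0, 8.0), (50.3, 8.6)]},
    index=pd.to_datetime(["2024-01-01 14:00:00", "2024-01-01 14:00:03"]),
)

# Expand the tuples into two numeric columns that keep the original index.
gps_split = pd.DataFrame(
    gps["position"].tolist(), columns=["lat", "lon"], index=gps.index
)

# Numeric columns can now be upsampled and linearly interpolated.
gps_1s = gps_split.resample("1s").interpolate(method="linear")
print(gps_1s)
```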

Instead of averaging when downsampling, a different aggregation might make more sense. For example, if your NO2 sensor reports how much NO2 it saw in the last 0.6 seconds, you would want .sum() instead.
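
To see the difference on a small example, the sketch below aggregates the same readings per second with mean, sum and count side by side. The readings and the name per_second are invented for illustration.

```python
import pandas as pd

# Hypothetical NO2 readings roughly every 0.6 s.
no2 = pd.DataFrame(
    {"no2": [1.0, 2.0, 3.0, 4.0]},
    index=pd.to_datetime([
        "2024-01-01 14:00:00.0",
        "2024-01-01 14:00:00.6",
        "2024-01-01 14:00:01.2",
        "2024-01-01 14:00:01.8",
    ]),
)

# Aggregate each 1 s bin three ways; count shows how many readings fell into it.
per_second = no2["no2"].resample("1s").agg(["mean", "sum", "count"])
print(per_second)
```

The count column is worth checking with an irregular sensor: bins with different numbers of readings make per-second sums hard to compare.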
