Find the values from another data frame with repeated ids using another frame with unique ids in Python
I am really stuck on this problem and have no idea how to solve it. I have two data frames. One is for humidity, and its data are reported every 15 minutes. I have three different sensors reporting, so the table includes the sensor id, the date, and the time of each report. Here it is:
import pandas as pd
df_h = pd.DataFrame({'id_h': {0: 1, 1: 1, 2: 2, 3: 2, 4: 3, 5: 3}, 'date': {0: '2021-01-01', 1: '2021-01-01', 2: '2021-01-01', 3: '2021-01-01', 4: '2021-01-01', 5: '2021-01-01'}, 'time_hour': {0: '6:00:00', 1: '6:15:00', 2: '6:00:00', 3: '6:15:00', 4: '6:00:00', 5: '6:15:00'}, 'VALUE': {0: 10, 1: 12, 2: 20, 3: 22, 4: 30, 5: 32}})
id_h date time_hour VALUE
0 1 2021-01-01 6:00:00 10
1 1 2021-01-01 6:15:00 12
2 2 2021-01-01 6:00:00 20
3 2 2021-01-01 6:15:00 22
4 3 2021-01-01 6:00:00 30
5 3 2021-01-01 6:15:00 32
With the following code, I can stick its data together so that, for each id on each day, all the humidity values sit in one row.
humidity_sticked = df_h.pivot(index=["id_h", "date"], columns="time_hour", values="VALUE")
humidity_sticked.columns = [f"value_{i+1}" for i in range(humidity_sticked.shape[1])]
humidity_sticked = humidity_sticked.reset_index()
As we can see, the result is a table with three rows (one per sensor) and two value columns (value_1 and value_2).
Also, I have another table which shows the temperature. But the ids for the weather center are different. For example, for id_h (id of humidity) = 1, 2 we only have id_t (id of temperature) = 5. So, we have exactly the same kind of table for the temperature, but since the ids are different, I cannot create the same sticked table as for the humidity. Here is the table for the temperature:
df_t = pd.DataFrame({'id_t': {0: 5, 1: 5, 2: 5, 3: 5, 4: 7}, 'date': {0: '2021-01-01', 1: '2021-01-01', 2: '2021-01-01', 3: '2021-01-01', 4: '2021-01-01'}, 'time_hour': {0: '6:00:00', 1: '6:15:00', 2: '6:00:00', 3: '6:15:00', 4: '6:00:00'}, 'VALUE': {0: -1, 1: -8, 2: -2, 3: -9, 4: -3}})
id_t date time_hour VALUE
0 5 2021-01-01 6:00:00 -1
1 5 2021-01-01 6:15:00 -8
2 5 2021-01-01 6:00:00 -2
3 5 2021-01-01 6:15:00 -9
4 7 2021-01-01 6:00:00 -3
When I try to stick the values for id_t=5, I get an error. The desired output is:
Explanation: for id_h=1 and id_h=2 we have id_t=5 twice. So the first two rows with id_t=5 are treated as id_h=1, the second two rows as id_h=2, and the remaining rows, with id_t=7, belong to id_h=3.
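The error from the pivot most likely comes from the duplicate (id_t, date, time_hour) combinations for id_t=5, which `pivot` cannot reshape. A minimal sketch of one way around it, assuming that repeated readings of the same time slot always appear in sensor order: number the repeats with `groupby(...).cumcount()` and add that counter to the pivot index. The column name `occurrence` and the variable `temperature_sticked` are made up for this illustration.

```python
import pandas as pd

df_t = pd.DataFrame({
    "id_t": [5, 5, 5, 5, 7],
    "date": ["2021-01-01"] * 5,
    "time_hour": ["6:00:00", "6:15:00", "6:00:00", "6:15:00", "6:00:00"],
    "VALUE": [-1, -8, -2, -9, -3],
})

# Number repeated readings of the same (id_t, date, time_hour) slot: 0, 1, ...
# "occurrence" is a hypothetical helper column, not part of the original data.
df_t["occurrence"] = df_t.groupby(["id_t", "date", "time_hour"]).cumcount()

# With the counter in the index, every (index, column) pair is unique again.
temperature_sticked = df_t.pivot(
    index=["id_t", "date", "occurrence"], columns="time_hour", values="VALUE"
)
temperature_sticked.columns = [
    f"value_{i+1}" for i in range(temperature_sticked.shape[1])
]
temperature_sticked = temperature_sticked.reset_index()
print(temperature_sticked)
```

This gives three rows: two for id_t=5 and one for id_t=7, whose 6:15 value is NaN because that reading is absent. If a reading is missing for one of the sensors that share id_t=5, the counter shifts and the rows may be assigned to the wrong sensor, so this only holds while the shared id has complete data.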
Any help would save me! Thanks.
Update: I've used the merge by the index. However, when there are missing values in one of the data frames (for example, for a specific date at 6:00 I have the humidity but not the temperature), the results are wrong. Here are the results of the merge; we can see that the times are not the same, but it puts them all in one row.
df_t['rank'] = df_t.id_t.rank(method='dense')
df_h['rank'] = df_h.id_h.rank(method='dense')
df = df_t.merge(df_h, on=['rank', 'date', 'time_hour'], suffixes=['_1', '_2'])
print(df)
Output:
id_t date time_hour VALUE_1 rank id_h VALUE_2
0 5 2021-01-01 6:00:00 -1 1.0 1 10
1 5 2021-01-01 6:00:00 -2 1.0 1 10
2 5 2021-01-01 6:15:00 -8 1.0 1 12
3 5 2021-01-01 6:15:00 -9 1.0 1 12
4 7 2021-01-01 6:00:00 -3 2.0 2 20
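One possible reading of why this rank-based merge pairs the wrong sensors: a dense rank assigns one rank per distinct id, so all four id_t=5 rows share rank 1 and id_t=7 gets rank 2, while the humidity ids get ranks 1, 2 and 3. The small check below only reproduces the rank columns from the frames above; it is an illustration, not a fix.

```python
import pandas as pd

id_t = pd.Series([5, 5, 5, 5, 7], name="id_t")
id_h = pd.Series([1, 1, 2, 2, 3, 3], name="id_h")

# Dense rank: equal ids share a rank, and ranks increase by 1 per distinct id.
rank_t = id_t.rank(method="dense")
rank_h = id_h.rank(method="dense")

print(rank_t.tolist())  # [1.0, 1.0, 1.0, 1.0, 2.0]
print(rank_h.tolist())  # [1.0, 1.0, 2.0, 2.0, 3.0, 3.0]
```

Because both pairs of id_t=5 readings land on rank 1, they are all matched with id_h=1, and id_t=7 is matched with id_h=2 instead of id_h=3.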
You can use pd.merge by index. This is a shortcut for making your 'sticked dataframe'.
pd.merge(df_t, df_h, left_index=True, right_index=True, suffixes=['_t', '_h'])
Output:
id_t date_t time_hour_t VALUE_t id_h date_h time_hour_h \
0 5 2021-01-01 6:00:00 -1 1 2021-01-01 6:00:00
1 5 2021-01-01 6:15:00 -8 1 2021-01-01 6:15:00
2 5 2021-01-01 6:00:00 -2 2 2021-01-01 6:00:00
3 5 2021-01-01 6:15:00 -9 2 2021-01-01 6:15:00
4 7 2021-01-01 6:00:00 -3 3 2021-01-01 6:00:00
VALUE_h
0 10
1 12
2 20
3 22
4 30
The output above contains redundant columns, so you can merge df_t with only the columns of df_h that you need, like below:
pd.merge(df_t, df_h[['id_h','VALUE']], left_index=True, right_index=True, suffixes=['_t', '_h'])
Output:
id_t date time_hour VALUE_t id_h VALUE_h
0 5 2021-01-01 6:00:00 -1 1 10
1 5 2021-01-01 6:15:00 -8 1 12
2 5 2021-01-01 6:00:00 -2 2 20
3 5 2021-01-01 6:15:00 -9 2 22
4 7 2021-01-01 6:00:00 -3 3 30
This is the simplest way to get what you want.
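Note that an index merge pairs rows purely by position, so it only works while both frames list their readings in the same order with nothing missing. The sketch below checks this for the case raised in the question's update. The gap is simulated by dropping one temperature row, and the names `df_t_gap` and `misaligned` are invented for this example.

```python
import pandas as pd

df_h = pd.DataFrame({
    "id_h": [1, 1, 2, 2, 3, 3],
    "date": ["2021-01-01"] * 6,
    "time_hour": ["6:00:00", "6:15:00"] * 3,
    "VALUE": [10, 12, 20, 22, 30, 32],
})
df_t = pd.DataFrame({
    "id_t": [5, 5, 5, 5, 7],
    "date": ["2021-01-01"] * 5,
    "time_hour": ["6:00:00", "6:15:00", "6:00:00", "6:15:00", "6:00:00"],
    "VALUE": [-1, -8, -2, -9, -3],
})

# Hypothetical gap: the 6:15 temperature reading in row 1 was never recorded.
df_t_gap = df_t.drop(index=1).reset_index(drop=True)

merged = pd.merge(df_t_gap, df_h, left_index=True, right_index=True,
                  suffixes=["_t", "_h"])

# Rows whose two sides disagree on date or time were paired by position only.
misaligned = merged[(merged["date_t"] != merged["date_h"])
                    | (merged["time_hour_t"] != merged["time_hour_h"])]
print(misaligned[["id_t", "time_hour_t", "id_h", "time_hour_h"]])
```

With this single missing reading, every row after the gap is paired with a humidity value from a different time. A check like this can at least flag the problem. A reliable merge would need an explicit key that says which humidity sensor each temperature row belongs to, which the data shown here does not contain.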