
Create new DataFrame based on existing DataFrame

I want to reorganize data from an existing Pandas DataFrame so that a subset of the values in one column ('City' in the minimal example below) become the columns of a new DataFrame, and rows that have none of the chosen values ('New York' and 'Amsterdam' below) are dropped.

To illustrate, I have a DataFrame that looks like this:

# Id |  Start_Time                       | End_Time                           | City       | Price
--------------------------------------
1 | 2022-01-01 00:00:00.0000000 +01:00 | 2022-01-01 01:00:00.0000000 +01:00 | New York   | 100
2 | 2022-01-01 01:00:00.0000000 +01:00 | 2022-01-01 02:00:00.0000000 +01:00 | New York   | 90
3 | 2022-01-01 02:00:00.0000000 +01:00 | 2022-01-01 03:00:00.0000000 +01:00 | New York   | 95
4 | 2022-01-01 01:00:00.0000000 +01:00 | 2022-01-01 02:00:00.0000000 +01:00 | Amsterdam  | 300
5 | 2022-01-01 02:00:00.0000000 +01:00 | 2022-01-01 03:00:00.0000000 +01:00 | Amsterdam  | 250

I would like to create a DataFrame that looks like this:

# Period                                                                | New York | Amsterdam | Difference
------------------------------------------------------------------------------------------
2022-01-01 00:00:00.0000000 +01:00 - 2022-01-01 01:00:00.0000000 +01:00 | 100      | NaN       | NaN
2022-01-01 01:00:00.0000000 +01:00 - 2022-01-01 02:00:00.0000000 +01:00 | 90       | 300       | -210
2022-01-01 02:00:00.0000000 +01:00 - 2022-01-01 03:00:00.0000000 +01:00 | 95       | 250       | -155

Here is my code so far:

import pandas as pd

data = [
    [1, '2022-01-01 00:00:00.0000000 +01:00', '2022-01-01 01:00:00.0000000 +01:00', 'New York', 100],
    [2, '2022-01-01 01:00:00.0000000 +01:00', '2022-01-01 02:00:00.0000000 +01:00', 'New York', 90], 
    [3, '2022-01-01 02:00:00.0000000 +01:00', '2022-01-01 03:00:00.0000000 +01:00', 'New York', 95],
    [4, '2022-01-01 01:00:00.0000000 +01:00', '2022-01-01 02:00:00.0000000 +01:00', 'Amsterdam', 300], 
    [5, '2022-01-01 02:00:00.0000000 +01:00', '2022-01-01 03:00:00.0000000 +01:00', 'Amsterdam', 250]
]

df = pd.DataFrame(data, columns=['Id', 'Start_Time', 'End_Time', 'City', 'Price'])

ny = df[df['City'] == 'New York']
amsterdam = df[df['City'] == 'Amsterdam']

# Here I naively try to create a new DataFrame with the prices in the two cities as columns.
# I figure I can then add the difference as a new column with
# df_new['Difference'] = df_new['New York'] - df_new['Amsterdam']
df_new = pd.DataFrame().assign(ny=ny['Price'], amsterdam=amsterdam['Price'])

How do I go from a DataFrame that looks like the first one to one that looks like the second?

Since the column order matters, first turn City into an ordered categorical column. Any city that is not listed in the categories becomes NaN and so drops out of the pivot.

Then we can build the pivot with pivot_table:

# Assumes Start_Time and End_Time are already strings; if not, cast them first.
df['period'] = df['Start_Time'] + " - " + df['End_Time']

# An ordered categorical fixes the column order of the pivot.
df['City'] = pd.Categorical(df['City'], ordered=True,
                            categories=['New York', 'Amsterdam'])

df1 = df.pivot_table(index='period', columns='City',
                     values='Price', aggfunc='first')
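
The string concatenation above only works if the two time columns hold text. As a minimal sketch of the cast mentioned in the comment, assuming the columns hold timezone-aware datetimes instead, they can be formatted with dt.strftime first. The frame df_ts and the format string fmt are made up for this illustration and are not part of the question's data:

```python
import pandas as pd

# Hypothetical input: timestamps stored as tz-aware datetimes rather than
# strings (an assumption; the question stores them as text).
df_ts = pd.DataFrame({
    'Start_Time': pd.to_datetime(['2022-01-01 00:00:00+01:00',
                                  '2022-01-01 01:00:00+01:00']),
    'End_Time': pd.to_datetime(['2022-01-01 01:00:00+01:00',
                                '2022-01-01 02:00:00+01:00']),
})

# Format both columns as text, then join them into one period label.
fmt = '%Y-%m-%d %H:%M %z'  # illustrative format, pick whatever suits the data
df_ts['period'] = (df_ts['Start_Time'].dt.strftime(fmt)
                   + ' - '
                   + df_ts['End_Time'].dt.strftime(fmt))

print(df_ts['period'].tolist())
```

A plain astype(str) would also produce strings, but an explicit format keeps the labels predictable.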

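Finally, add the difference column. Note that diff(axis=1) subtracts the previous column from the current one, so it gives Amsterdam minus New York, the opposite sign of the table in the question. Use df1['New York'] - df1['Amsterdam'] if you want New York minus Amsterdam.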

df1['sub'] = df1.diff(axis=1).iloc[:,-1]


print(df1)

City                                                New York  Amsterdam    sub
period
2022-01-01 00:00:00.0000000 +01:00 - 2022-01-01...     100.0        NaN    NaN
2022-01-01 01:00:00.0000000 +01:00 - 2022-01-01...      90.0      300.0  210.0
2022-01-01 02:00:00.0000000 +01:00 - 2022-01-01...      95.0      250.0  155.0
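
For comparison, here is a sketch of a variant that avoids the categorical column: it filters the chosen cities with isin, orders the pivoted columns with reindex, and subtracts explicitly so the sign matches the table in the question. The names cities, subset, and wide, as well as the extra 'London' row, are assumptions added for illustration and do not appear in the original post:

```python
import pandas as pd

data = [
    [1, '2022-01-01 00:00:00.0000000 +01:00', '2022-01-01 01:00:00.0000000 +01:00', 'New York', 100],
    [2, '2022-01-01 01:00:00.0000000 +01:00', '2022-01-01 02:00:00.0000000 +01:00', 'New York', 90],
    [3, '2022-01-01 02:00:00.0000000 +01:00', '2022-01-01 03:00:00.0000000 +01:00', 'New York', 95],
    [4, '2022-01-01 01:00:00.0000000 +01:00', '2022-01-01 02:00:00.0000000 +01:00', 'Amsterdam', 300],
    [5, '2022-01-01 02:00:00.0000000 +01:00', '2022-01-01 03:00:00.0000000 +01:00', 'Amsterdam', 250],
    # Hypothetical extra row (not in the question) to show the filtering.
    [6, '2022-01-01 03:00:00.0000000 +01:00', '2022-01-01 04:00:00.0000000 +01:00', 'London', 500],
]
df = pd.DataFrame(data, columns=['Id', 'Start_Time', 'End_Time', 'City', 'Price'])

cities = ['New York', 'Amsterdam']  # chosen cities, in the desired column order

# Keep only rows for the chosen cities and build the period label.
subset = df[df['City'].isin(cities)].copy()
subset['Period'] = subset['Start_Time'] + ' - ' + subset['End_Time']

# Pivot to one column per city, then put the columns in the chosen order.
wide = (subset.pivot_table(index='Period', columns='City',
                           values='Price', aggfunc='first')
              .reindex(columns=cities))

# Explicit subtraction: New York minus Amsterdam, as in the question.
wide['Difference'] = wide['New York'] - wide['Amsterdam']

print(wide)
```

With the question's data, Difference is NaN for the first period (Amsterdam has no price there) and -210 and -155 for the two overlapping periods. The London row never reaches the pivot.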
