
Replicate row in Pandas dataframe based on condition and change values for a specific column

Start_Year   End_Year   Opp1              Opp2          Duration
1500         1501       ['A','B']        ['C','D']      1
1500         1510       ['P','Q','R']    ['X','Y']      10
1520         1520       ['A','X']        ['C']          0
...          ....        ........        .....          ..
1809         1820       ['M']            ['F','H','Z']  11

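My dataset (in csv format) is a list of armed wars between different entities (countries, states and factions), which are denoted by capital letters A, B, P, Q, etc. and given as lists in the Opp1 (opposition) and Opp2 columns. Start_Year and End_Year are the years in which a war started and ended. The Duration column was created by subtracting Start_Year from End_Year.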

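I want to replicate the rows whose Duration is greater than 0 by a factor of the war's duration: if the duration is 6 years, replicate that row 6 times, decreasing Duration by 1 and increasing Start_Year by 1 with each replication, while keeping the values in the other columns the same. (If the duration is 1 year, the row should be replicated 2 times, so that after the last replication step the duration of each war becomes 0 years.)

I have no idea how to approach this, as I am a beginner in data science and analytics, so please excuse me for not showing any attempted code.

My desired output looks like this: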

Start_Year   End_Year   Opp1              Opp2          Duration
1500         1501       ['A','B']        ['C','D']      1
1501         1501       ['A','B']        ['C','D']      0
1500         1510       ['P','Q','R']    ['X','Y']      10
1501         1510       ['P','Q','R']    ['X','Y']      9
1502         1510       ['P','Q','R']    ['X','Y']      8
1503         1510       ['P','Q','R']    ['X','Y']      7
1504         1510       ['P','Q','R']    ['X','Y']      6
1505         1510       ['P','Q','R']    ['X','Y']      5
....         ....       .............    ........       ..
1510         1510       ['P','Q','R']    ['X','Y']      0
1520         1520       ['A','X']        ['C']          0
...          ....        ........        .....          ..
1809         1820       ['M']            ['F','H','Z']  11
1810         1820       ['M']            ['F','H','Z']  10
....         ....       .....            .............. ..
1820         1820       ['M']            ['F','H','Z']  0 

Edit 1: Some sample data from the dataset

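You can use pandas.Index.repeat to repeat each row according to its Duration value (Duration + 1 times, so that the last copy ends at Duration 0), and then use pandas.core.groupby.GroupBy.cumcount to add an increasing cumulative count to the Start_Year column.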
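To see what these two calls do in isolation, here is a minimal sketch on a toy frame. The frame `toy` and its `war` column are made up for illustration, and grouping on the repeated index label (`level=0`) is an assumption chosen for brevity rather than what the answer's code does.

```python
import pandas as pd

# Hypothetical two-row frame, only for illustration
toy = pd.DataFrame({"war": ["x", "y"], "Duration": [2, 0]})

# Index.repeat repeats each index label; .loc then selects a row once per label
repeated = toy.loc[toy.index.repeat(toy["Duration"] + 1)]
print(repeated.index.tolist())  # [0, 0, 0, 1]

# cumcount numbers the rows inside each group 0, 1, 2, ...
step = repeated.groupby(level=0).cumcount()
print(step.tolist())            # [0, 1, 2, 0]
```

The per-group counter is what later gets added to Start_Year, so each copy of a row moves one year forward.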

Read the data

import pandas as pd

data = [[1500, 1501, ['A','B'], ['C','D'], 1],
        [1500, 1510, ['P','Q','R'], ['X','Y'], 10],
        [1520, 1520, ['A','X'], ['C'], 0],
        [1809, 1820, ['M'], ['F','H','Z'], 11]]
df = pd.DataFrame(data, columns = ['Start_Year', 'End_Year', 'Opp1', 'Opp2', 'Duration'])

Repeat the values

mask = df['Duration'].gt(0)
df1 = df[mask].copy()
df1 = df1.loc[df1.index.repeat(df1['Duration'] + 1)]

Assign increasing values within each group

df1['Start_Year'] += df1[['Start_Year', 'End_Year', 'Opp1', 'Opp2']].astype(str).groupby(['Start_Year', 'End_Year', 'Opp1', 'Opp2']).cumcount()

Produce the output

df1['Duration'] = df1['End_Year'] - df1['Start_Year']
df = pd.concat([df1, df[~mask]]).sort_index(kind = 'mergesort').reset_index(drop=True)

This gives us the expected output:

    Start_Year  End_Year       Opp1       Opp2  Duration
0         1500      1501     [A, B]     [C, D]         1
1         1501      1501     [A, B]     [C, D]         0
2         1500      1510  [P, Q, R]     [X, Y]        10
3         1501      1510  [P, Q, R]     [X, Y]         9
4         1502      1510  [P, Q, R]     [X, Y]         8
5         1503      1510  [P, Q, R]     [X, Y]         7
6         1504      1510  [P, Q, R]     [X, Y]         6
7         1505      1510  [P, Q, R]     [X, Y]         5
8         1506      1510  [P, Q, R]     [X, Y]         4
9         1507      1510  [P, Q, R]     [X, Y]         3
10        1508      1510  [P, Q, R]     [X, Y]         2
11        1509      1510  [P, Q, R]     [X, Y]         1
12        1510      1510  [P, Q, R]     [X, Y]         0
13        1520      1520     [A, X]        [C]         0
14        1809      1820        [M]  [F, H, Z]        11
15        1810      1820        [M]  [F, H, Z]        10
16        1811      1820        [M]  [F, H, Z]         9
17        1812      1820        [M]  [F, H, Z]         8
18        1813      1820        [M]  [F, H, Z]         7
19        1814      1820        [M]  [F, H, Z]         6
20        1815      1820        [M]  [F, H, Z]         5
21        1816      1820        [M]  [F, H, Z]         4
22        1817      1820        [M]  [F, H, Z]         3
23        1818      1820        [M]  [F, H, Z]         2
24        1819      1820        [M]  [F, H, Z]         1
25        1820      1820        [M]  [F, H, Z]         0
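For reference, the same result can be sketched as a single self-contained function. This is a variation, not the code above: it assumes Duration holds non-negative integers, repeats every row (rows with Duration 0 are simply kept once, so no mask or concat is needed), and groups on the repeated index label instead of on the stringified columns. The name `expand_by_year` is hypothetical.

```python
import pandas as pd

def expand_by_year(df):
    """Return one row per year of each war (hypothetical helper)."""
    # Repeat each row Duration + 1 times; a row with Duration 0 is kept once
    out = df.loc[df.index.repeat(df["Duration"] + 1)].copy()
    # Position of each copy within its original row: 0, 1, 2, ...
    step = out.groupby(level=0).cumcount()
    out["Start_Year"] = out["Start_Year"] + step
    out["Duration"] = out["End_Year"] - out["Start_Year"]
    return out.reset_index(drop=True)

data = [[1500, 1501, ['A', 'B'], ['C', 'D'], 1],
        [1500, 1510, ['P', 'Q', 'R'], ['X', 'Y'], 10],
        [1520, 1520, ['A', 'X'], ['C'], 0],
        [1809, 1820, ['M'], ['F', 'H', 'Z'], 11]]
df = pd.DataFrame(data, columns=['Start_Year', 'End_Year', 'Opp1', 'Opp2', 'Duration'])

result = expand_by_year(df)
print(len(result))  # 26
```

Grouping on the index label keeps two distinct wars apart even if they happen to share the same years and opponents, which grouping on column values would not.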

Alternatively

After the "Repeat the values" step you can also go the other way round: first assign Duration as a decreasing cumulative count, then recompute Start_Year from it.

df1['Duration'] = df1[['Start_Year', 'End_Year', 'Opp1', 'Opp2']].astype(str).groupby(['Start_Year', 'End_Year', 'Opp1', 'Opp2']).cumcount(ascending=False)
df1['Start_Year'] = df1['End_Year'] - df1['Duration']
df = pd.concat([df1, df[~mask]]).sort_index(kind = 'mergesort').reset_index(drop=True)

Output:

This gives you the same expected output:

    Start_Year  End_Year       Opp1       Opp2  Duration
0         1500      1501     [A, B]     [C, D]         1
1         1501      1501     [A, B]     [C, D]         0
2         1500      1510  [P, Q, R]     [X, Y]        10
3         1501      1510  [P, Q, R]     [X, Y]         9
4         1502      1510  [P, Q, R]     [X, Y]         8
5         1503      1510  [P, Q, R]     [X, Y]         7
6         1504      1510  [P, Q, R]     [X, Y]         6
7         1505      1510  [P, Q, R]     [X, Y]         5
8         1506      1510  [P, Q, R]     [X, Y]         4
9         1507      1510  [P, Q, R]     [X, Y]         3
10        1508      1510  [P, Q, R]     [X, Y]         2
11        1509      1510  [P, Q, R]     [X, Y]         1
12        1510      1510  [P, Q, R]     [X, Y]         0
13        1520      1520     [A, X]        [C]         0
14        1809      1820        [M]  [F, H, Z]        11
15        1810      1820        [M]  [F, H, Z]        10
16        1811      1820        [M]  [F, H, Z]         9
17        1812      1820        [M]  [F, H, Z]         8
18        1813      1820        [M]  [F, H, Z]         7
19        1814      1820        [M]  [F, H, Z]         6
20        1815      1820        [M]  [F, H, Z]         5
21        1816      1820        [M]  [F, H, Z]         4
22        1817      1820        [M]  [F, H, Z]         3
23        1818      1820        [M]  [F, H, Z]         2
24        1819      1820        [M]  [F, H, Z]         1
25        1820      1820        [M]  [F, H, Z]         0

You can reset the index with pandas.DataFrame.reset_index.

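Summary:

Basically, what we are doing here is replicating rows based on the value in the Duration column and on a condition.

We first set aside the rows that could disappear when repeating rows with pandas.Index.repeat (a row repeated 0 times is dropped). On the rows with Duration > 0 we repeat and apply the logic, replacing the column values with the increasing or decreasing cumulative count from pandas.core.groupby.GroupBy.cumcount. We then concatenate the two dataframes and sort them on the index with pandas.DataFrame.sort_index. This works because repeating rows with pandas.Index.repeat also repeats their index labels, so a sort on the index gives us a dataframe in the same order as the original one.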
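The ordering argument can be checked on a small made-up frame. This is only a sketch: the `name` column and its values are hypothetical, and `kind="mergesort"` is used because it is a stable sort, so copies that share an index label keep their relative order.

```python
import pandas as pd

# Hypothetical three-row frame, only for illustration
df = pd.DataFrame({"name": ["a", "b", "c"], "Duration": [2, 0, 1]})
mask = df["Duration"].gt(0)

part = df[mask]
expanded = part.loc[part.index.repeat(part["Duration"] + 1)]

# After concat, the Duration == 0 row sits at the end, out of order
combined = pd.concat([expanded, df[~mask]])
print(combined.index.tolist())   # [0, 0, 0, 2, 2, 1]

# A stable sort on the repeated index labels restores the original order
ordered = combined.sort_index(kind="mergesort")
print(ordered["name"].tolist())  # ['a', 'a', 'a', 'b', 'c', 'c']
```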

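Almost the same approach as the other answer posted, but I think it is a bit simpler:

# Repeat every row Duration + 1 times (applied column by column)
df2 = df.apply(lambda x: x.repeat(df['Duration'].iloc[x.index] + 1))
# Number the copies of each war 0, 1, 2, ...; this grouping assumes no two
# wars share both Start_Year and End_Year
counts = df2.loc[df.Duration > 0].groupby(['Start_Year', 'End_Year']).cumcount()
# Shift each copy: Duration counts down while Start_Year counts up
df2.loc[df.Duration > 0, 'Duration'] -= counts
df2.loc[df.Duration > 0, 'Start_Year'] += counts
df2.reset_index(drop=True, inplace=True)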

Try this (it needs import numpy as np):

(df
 # replace Duration with a descending array: Duration, ..., 1, 0
 .assign(Duration=df['Duration'].map(lambda x: np.arange(0, x + 1)[::-1]))
 # explode the arrays so that each element becomes its own row
 .explode('Duration')
 # cumcount numbers the copies of each original row 0, 1, 2, ...;
 # adding it to Start_Year increases the year by one for each row
 .assign(Start_Year=lambda x: x['Start_Year'].add(x.groupby(level=0).cumcount()))
 .reset_index(drop=True))
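To make the explode step easier to follow, here is a small sketch of the intermediate result on a made-up two-row frame. The frame `toy` is hypothetical, and `np.arange(d, -1, -1)` is just another way to write the descending countdown.

```python
import numpy as np
import pandas as pd

# Hypothetical frame, only for illustration
toy = pd.DataFrame({"Start_Year": [1500, 1520], "Duration": [2, 0]})

# Each Duration becomes a countdown array: 2 -> [2, 1, 0], 0 -> [0]
countdown = toy.assign(Duration=toy["Duration"].map(lambda d: np.arange(d, -1, -1)))

# explode turns every array element into its own row and repeats the index label
exploded = countdown.explode("Duration")
print(exploded.index.tolist())                       # [0, 0, 0, 1]
print([int(v) for v in exploded["Duration"]])        # [2, 1, 0, 0]
```

After explode the Duration column has object dtype, so if integer arithmetic on it is needed later, casting it with `astype(int)` is one option.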

