Replicate row in Pandas dataframe based on condition and change values for a specific column
Start_Year End_Year Opp1 Opp2 Duration
1500 1501 ['A','B'] ['C','D'] 1
1500 1510 ['P','Q','R'] ['X','Y'] 10
1520 1520 ['A','X'] ['C'] 0
... .... ........ ..... ..
1809 1820 ['M'] ['F','H','Z'] 11
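My dataset (in CSV format) describes armed wars between different entities (countries, states and factions), which are denoted by capital letters such as A, B, P, Q and stored as lists in the Opp1 (opposition) and Opp2 columns. Start_Year and End_Year are the years in which a war started and ended. The Duration column was created by subtracting Start_Year from End_Year.
I want to replicate the rows whose Duration is greater than 0 by a factor of the war's duration: if the duration is 6 years, replicate the row 6 times, decreasing Duration by 1 and increasing Start_Year by 1 with each replication while keeping the values in the other columns unchanged. (If the duration is 1 year, the row should appear 2 times, so that the duration of every war reaches 0 years at the last step; in other words, a war with Duration d becomes d + 1 rows.)
I have no idea how to go about this, as I am a beginner in data science and analytics, so please excuse me for not showing any attempted code here. My desired output looks like this: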
Start_Year End_Year Opp1 Opp2 Duration
1500 1501 ['A','B'] ['C','D'] 1
1501 1501 ['A','B'] ['C','D'] 0
1500 1510 ['P','Q','R'] ['X','Y'] 10
1501 1510 ['P','Q','R'] ['X','Y'] 9
1502 1510 ['P','Q','R'] ['X','Y'] 8
1503 1510 ['P','Q','R'] ['X','Y'] 7
1504 1510 ['P','Q','R'] ['X','Y'] 6
1505 1510 ['P','Q','R'] ['X','Y'] 5
.... .... ............. ........ ..
1510 1510 ['P','Q','R'] ['X','Y'] 0
1520 1520 ['A','X'] ['C'] 0
... .... ........ ..... ..
1809 1820 ['M'] ['F','H','Z'] 11
1810 1820 ['M'] ['F','H','Z'] 10
.... .... ..... .............. ..
1820 1820 ['M'] ['F','H','Z'] 0
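The rule in the question can be written down as a plain-Python reference, which is handy for checking any pandas solution against. This is only a sketch: the function name `expand_war` is hypothetical, and it assumes Duration always equals End_Year minus Start_Year, as the question states.

```python
def expand_war(start_year, end_year, opp1, opp2):
    """Return one (Start_Year, End_Year, Opp1, Opp2, Duration) tuple per year."""
    rows = []
    # One row for every year from the start up to and including the end year
    for year in range(start_year, end_year + 1):
        rows.append((year, end_year, opp1, opp2, end_year - year))
    return rows

rows = expand_war(1500, 1501, ['A', 'B'], ['C', 'D'])
for row in rows:
    print(row)
```

A war with Duration d therefore yields d + 1 rows, and a war with Duration 0 yields a single unchanged row.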
Edit 1: a sample of the dataset
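You can use pandas.Index.repeat to repeat each row according to its Duration column (Duration + 1 times, so that the last copy ends at Duration 0), and then use pandas.core.groupby.GroupBy.cumcount to add an increasing running count to the Start_Year column.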
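To make the two building blocks concrete, here is a minimal sketch on a toy frame. The frame `toy` and the variable names are made up for illustration, and grouping on the index with `groupby(level=0)` is a simplification assumed here; the answer's own code groups on the column values instead.

```python
import pandas as pd

# Toy frame (hypothetical): two rows to be repeated 2 and 3 times.
toy = pd.DataFrame({'Start_Year': [1500, 1600], 'Duration': [1, 2]})

# Index.repeat repeats each index label; .loc then pulls the matching rows.
repeated_index = toy.index.repeat(toy['Duration'] + 1)
repeated = toy.loc[repeated_index]

# cumcount numbers the rows inside each group 0, 1, 2, ...;
# ascending=False counts down to 0 instead.
count_up = repeated.groupby(level=0).cumcount()
count_down = repeated.groupby(level=0).cumcount(ascending=False)

print(list(repeated_index))   # [0, 0, 1, 1, 1]
print(count_up.tolist())      # [0, 1, 0, 1, 2]
print(count_down.tolist())    # [1, 0, 2, 1, 0]
```

Adding `count_up` to Start_Year, or assigning `count_down` to Duration, gives the per-year values the question asks for.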
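import pandas as pd

data = [[1500, 1501, ['A','B'], ['C','D'], 1],
        [1500, 1510, ['P','Q','R'], ['X','Y'], 10],
        [1520, 1520, ['A','X'], ['C'], 0],
        [1809, 1820, ['M'], ['F','H','Z'], 11]]
df = pd.DataFrame(data, columns = ['Start_Year', 'End_Year', 'Opp1', 'Opp2', 'Duration'])

# Work only on the wars that last longer than 0 years
mask = df['Duration'].gt(0)
df1 = df[mask].copy()
# Repeat each row Duration + 1 times; the copies keep their original index label
df1 = df1.loc[df1.index.repeat(df1['Duration'] + 1)]
# Running count 0, 1, 2, ... within each war, added to Start_Year
# (astype(str) makes the list columns usable as group keys)
df1['Start_Year'] += df1[['Start_Year', 'End_Year', 'Opp1', 'Opp2']].astype(str).groupby(['Start_Year', 'End_Year', 'Opp1', 'Opp2']).cumcount()
# Recompute Duration from the shifted Start_Year
df1['Duration'] = df1['End_Year'] - df1['Start_Year']
# Put the Duration == 0 rows back; a stable sort on the index restores the original order
df = pd.concat([df1, df[~mask]]).sort_index(kind = 'mergesort').reset_index(drop=True)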
This gives us the expected output:
Start_Year End_Year Opp1 Opp2 Duration
0 1500 1501 [A, B] [C, D] 1
1 1501 1501 [A, B] [C, D] 0
2 1500 1510 [P, Q, R] [X, Y] 10
3 1501 1510 [P, Q, R] [X, Y] 9
4 1502 1510 [P, Q, R] [X, Y] 8
5 1503 1510 [P, Q, R] [X, Y] 7
6 1504 1510 [P, Q, R] [X, Y] 6
7 1505 1510 [P, Q, R] [X, Y] 5
8 1506 1510 [P, Q, R] [X, Y] 4
9 1507 1510 [P, Q, R] [X, Y] 3
10 1508 1510 [P, Q, R] [X, Y] 2
11 1509 1510 [P, Q, R] [X, Y] 1
12 1510 1510 [P, Q, R] [X, Y] 0
13 1520 1520 [A, X] [C] 0
14 1809 1820 [M] [F, H, Z] 11
15 1810 1820 [M] [F, H, Z] 10
16 1811 1820 [M] [F, H, Z] 9
17 1812 1820 [M] [F, H, Z] 8
18 1813 1820 [M] [F, H, Z] 7
19 1814 1820 [M] [F, H, Z] 6
20 1815 1820 [M] [F, H, Z] 5
21 1816 1820 [M] [F, H, Z] 4
22 1817 1820 [M] [F, H, Z] 3
23 1818 1820 [M] [F, H, Z] 2
24 1819 1820 [M] [F, H, Z] 1
25 1820 1820 [M] [F, H, Z] 0
You can also go the other way round after repeating the rows: first assign Duration as a decreasing cumulative count, then recompute Start_Year from it.
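# df1 is the repeated frame from the first snippet, right after the Index.repeat step
df1['Duration'] = df1[['Start_Year', 'End_Year', 'Opp1', 'Opp2']].astype(str).groupby(['Start_Year', 'End_Year', 'Opp1', 'Opp2']).cumcount(ascending=False)
# End_Year is unchanged, so Start_Year follows from the new Duration
df1['Start_Year'] = df1['End_Year'] - df1['Duration']
df = pd.concat([df1, df[~mask]]).sort_index(kind = 'mergesort').reset_index(drop=True)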
This gives you the same expected output:
Start_Year End_Year Opp1 Opp2 Duration
0 1500 1501 [A, B] [C, D] 1
1 1501 1501 [A, B] [C, D] 0
2 1500 1510 [P, Q, R] [X, Y] 10
3 1501 1510 [P, Q, R] [X, Y] 9
4 1502 1510 [P, Q, R] [X, Y] 8
5 1503 1510 [P, Q, R] [X, Y] 7
6 1504 1510 [P, Q, R] [X, Y] 6
7 1505 1510 [P, Q, R] [X, Y] 5
8 1506 1510 [P, Q, R] [X, Y] 4
9 1507 1510 [P, Q, R] [X, Y] 3
10 1508 1510 [P, Q, R] [X, Y] 2
11 1509 1510 [P, Q, R] [X, Y] 1
12 1510 1510 [P, Q, R] [X, Y] 0
13 1520 1520 [A, X] [C] 0
14 1809 1820 [M] [F, H, Z] 11
15 1810 1820 [M] [F, H, Z] 10
16 1811 1820 [M] [F, H, Z] 9
17 1812 1820 [M] [F, H, Z] 8
18 1813 1820 [M] [F, H, Z] 7
19 1814 1820 [M] [F, H, Z] 6
20 1815 1820 [M] [F, H, Z] 5
21 1816 1820 [M] [F, H, Z] 4
22 1817 1820 [M] [F, H, Z] 3
23 1818 1820 [M] [F, H, Z] 2
24 1819 1820 [M] [F, H, Z] 1
25 1820 1820 [M] [F, H, Z] 0
You can use pandas.DataFrame.reset_index to reset the index.
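Basically, what we do here is replicate rows based on the value in the Duration column and on a condition. We first set aside the rows that could disappear when rows are repeated with pandas.Index.repeat (a row repeated 0 times is dropped). We then replicate the rows with Duration > 0 and apply the logic to them, replacing the column values with the increasing or decreasing running count from pandas.core.groupby.GroupBy.cumcount. Finally, we concatenate the two dataframes and sort them on the index with pandas.DataFrame.sort_index. This works because repeating rows with pandas.Index.repeat repeats their index labels as well, so a stable sort on the index gives us a dataframe in the same order as the original one.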
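The steps described above can also be condensed into one helper. This is a sketch under stated assumptions, not the answer's own code: the name `expand_by_duration` is hypothetical, the input index is assumed to have unique labels, and grouping on the index (`groupby(level=0)`) is swapped in for grouping on the column values. Because every row is repeated Duration + 1 times, no row is repeated 0 times, so in this variant the mask, the concat and the sort are not needed.

```python
import pandas as pd

def expand_by_duration(df):
    """Hypothetical helper: one output row per year of each war."""
    # Repeat every row Duration + 1 times; a Duration of 0 keeps the row once.
    out = df.loc[df.index.repeat(df['Duration'] + 1)].copy()
    # Repeated rows share their original index label, so grouping on the
    # index numbers the copies of each war 0, 1, 2, ...
    step = out.groupby(level=0).cumcount()
    out['Start_Year'] = out['Start_Year'] + step
    out['Duration'] = out['End_Year'] - out['Start_Year']
    return out.reset_index(drop=True)

data = [[1500, 1501, ['A', 'B'], ['C', 'D'], 1],
        [1500, 1510, ['P', 'Q', 'R'], ['X', 'Y'], 10],
        [1520, 1520, ['A', 'X'], ['C'], 0],
        [1809, 1820, ['M'], ['F', 'H', 'Z'], 11]]
df = pd.DataFrame(data, columns=['Start_Year', 'End_Year', 'Opp1', 'Opp2', 'Duration'])
expanded = expand_by_duration(df)
print(expanded.shape)  # (26, 5)
```

Grouping on the index also keeps two wars apart when they happen to share the same years and opponents, which grouping on the column values would merge into one group.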
Almost the same approach as the other answer posted, but I think it is a bit simpler:
# df is the original sample frame (before any expansion), with unique index labels
# Repeat every row Duration + 1 times; a row with Duration 0 is kept once
df2 = df.loc[df.index.repeat(df['Duration'] + 1)].copy()
# Copies of a row share its index label, so this counts 0, 1, 2, ... per war
counts = df2.groupby(level=0).cumcount()
df2['Duration'] -= counts
df2['Start_Year'] += counts
df2 = df2.reset_index(drop=True)
Try this:
import numpy as np
import pandas as pd  # df is the original sample frame, before any expansion

result = (
    # replace Duration with a descending array: Duration, ..., 1, 0
    df.assign(Duration=df['Duration'].map(lambda x: np.arange(0, x + 1)[::-1]))
    # one row per array element; the original index label is repeated
    .explode('Duration')
    # cumcount gives 0, 1, 2, ... within each original row; add it to Start_Year
    .assign(Start_Year=lambda x: x['Start_Year'].add(x.groupby(level=0).cumcount()))
    .reset_index(drop=True)
)
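One detail that may be worth checking with the explode approach is the dtype of the exploded column, since pandas documents that exploded rows come back as object dtype. The sketch below uses a small made-up frame and casts Duration back to integers; the cast is a suggestion added here, not part of the answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Start_Year': [1500, 1520],
                   'End_Year': [1501, 1520],
                   'Duration': [1, 0]})

# Descending array per row, then one output row per array element
exploded = (df.assign(Duration=df['Duration'].map(lambda d: np.arange(d, -1, -1)))
              .explode('Duration'))
print(exploded['Duration'].dtype)          # object

# Cast back so later arithmetic and comparisons use integers.
exploded = exploded.astype({'Duration': 'int64'}).reset_index(drop=True)
print(exploded['Duration'].tolist())       # [1, 0, 0]
```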