Count row value change for each group in pandas DataFrame
I have a pandas DataFrame that contains information about people's locations over time. It has 300+ million rows.
Sample:
import pandas as pd
inp = [{'Name': 'John', 'Year':2018, 'Address':'Beverly hills'}, {'Name': 'John', 'Year':2018, 'Address':'Beverly hills'}, {'Name': 'John', 'Year':2019, 'Address':'Beverly hills'}, {'Name': 'John', 'Year':2019, 'Address':'Orange county'}, {'Name': 'John', 'Year':2019, 'Address':'New York'}, {'Name': 'Steve', 'Year':2018, 'Address':'Canada'}, {'Name': 'Steve', 'Year':2019, 'Address':'Canada'}, {'Name': 'Steve', 'Year':2019, 'Address':'Canada'}, {'Name': 'Steve', 'Year':2020, 'Address':'California'}, {'Name': 'Steve', 'Year':2020, 'Address':'Canada'}, {'Name': 'John', 'Year':2020, 'Address':'Canada'}, {'Name': 'John', 'Year':2021, 'Address':'Canada'}, {'Name': 'John', 'Year':2021, 'Address':'Beverly hills'}, {'Name': 'Steve', 'Year':2021, 'Address':'California'}, {'Name': 'Steve', 'Year':2022, 'Address':'California'}, {'Name': 'Steve', 'Year':2018, 'Address':'NewYork'}, {'Name': 'Steve', 'Year':2018, 'Address':'California'}, {'Name': 'Steve', 'Year':2022, 'Address':'NewYork'}]
df = pd.DataFrame(inp)
print(df)
Output:
          Address   Name  Year
0   Beverly hills   John  2018
1   Beverly hills   John  2018
2   Beverly hills   John  2019
3   Orange county   John  2019
4        New York   John  2019
5          Canada  Steve  2018
6          Canada  Steve  2019
7          Canada  Steve  2019
8      California  Steve  2020
9          Canada  Steve  2020
10         Canada   John  2020
11         Canada   John  2021
12  Beverly hills   John  2021
13     California  Steve  2021
14     California  Steve  2022
15        NewYork  Steve  2018
16     California  Steve  2018
17        NewYork  Steve  2022
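I want to count the total number of changes between Addresses within a specific Year. In other words: how many people moved from "Canada" to "California" in 2018?
Ideal output:
1) A matrix for each year, as below. Example: all address changes in 2019 (including changes from 2018 to 2019).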
+---------------+---------------+---------------+----------+------------+
| From\ To | Beverly hills | Orange county | New York | California |
+---------------+---------------+---------------+----------+------------+
| Beverly hills | 0 | 1 | 0 | 0 |
+---------------+---------------+---------------+----------+------------+
| Orange county | 0 | 0 | 1 | 0 |
+---------------+---------------+---------------+----------+------------+
| New York | 0 | 2 | 0 | 0 |
+---------------+---------------+---------------+----------+------------+
| California | 0 | 0 | 0 | 0 |
+---------------+---------------+---------------+----------+------------+
2) Address changes across all years.
+---------------+---------------+------+------+------+
| Address 1 | Address 2 | 2018 | 2019 | 2020 |
+---------------+---------------+------+------+------+
| Beverly hills | Orange county | 0 | 1 | 0 |
+---------------+---------------+------+------+------+
| New York | Canada | 0 | 0 | 1 |
+---------------+---------------+------+------+------+
| Canada | New York | 1 | 0 | 0 |
+---------------+---------------+------+------+------+
| California | Canada | 0 | 1 | 2 |
+---------------+---------------+------+------+------+
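My solution so far: thanks to @QuangHoang, I can capture changes in "Year" and changes in "Address" with the following code:
groups = df.groupby('Name')
for col in ['Year', 'Address']:
    # 1 where the value differs from the previous row of the same Name, else 0
    df[f'cng-{col}'] = groups[col].shift().fillna(df[col]).ne(df[col]).astype(int)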
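groups[col].shift() shifts the corresponding column down by one row within each Name. fillna(df[col]) fills the first row of each (shifted) group with the original value, which marks it as "no change". Finally, ne(df[col]) compares the shifted values with the original values to flag the changes.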
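To see the three steps separately, here is a minimal sketch on a tiny made-up frame. The data and the names demo, prev and prev_filled are illustrative only and are not part of the original code.

```python
import pandas as pd

# Tiny illustrative frame (hypothetical data, not the sample above).
demo = pd.DataFrame({
    "Name": ["A", "A", "A", "B", "B"],
    "Address": ["X", "X", "Y", "Z", "Z"],
})

# Step 1: previous row's Address within each Name.
# The first row of every group has no predecessor, so it becomes NaN.
prev = demo.groupby("Name")["Address"].shift()

# Step 2: replace that NaN with the row's own value,
# so the first row of a group compares equal to itself ("no change").
prev_filled = prev.fillna(demo["Address"])

# Step 3: flag rows whose previous value differs from the current one.
demo["cng-Address"] = prev_filled.ne(demo["Address"]).astype(int)
print(demo)
```

Only the third row of group A is flagged, because it is the only row whose Address differs from the row before it within the same Name.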
This yields:
+----+---------------+-------+------+----------+-------------+
| ID | Address | Name | Year | cng-Year | cng-Address |
+----+---------------+-------+------+----------+-------------+
| 0 | Beverly hills | John | 2018 | 0 | 0 |
+----+---------------+-------+------+----------+-------------+
| 1 | Beverly hills | John | 2018 | 0 | 0 |
+----+---------------+-------+------+----------+-------------+
| 2 | Beverly hills | John | 2019 | 1 | 0 |
+----+---------------+-------+------+----------+-------------+
| 3 | Orange county | John | 2019 | 0 | 1 |
+----+---------------+-------+------+----------+-------------+
| 4 | New York | John | 2019 | 0 | 1 |
+----+---------------+-------+------+----------+-------------+
| 10 | Canada | John | 2020 | 1 | 1 |
+----+---------------+-------+------+----------+-------------+
| 11 | Canada | John | 2021 | 1 | 0 |
+----+---------------+-------+------+----------+-------------+
| 12 | Beverly hills | John | 2021 | 0 | 1 |
+----+---------------+-------+------+----------+-------------+
| 5 | Canada | Steve | 2018 | 0 | 0 |
+----+---------------+-------+------+----------+-------------+
| 15 | NewYork | Steve | 2018 | 1 | 1 |
+----+---------------+-------+------+----------+-------------+
| 16 | California | Steve | 2018 | 0 | 1 |
+----+---------------+-------+------+----------+-------------+
| 6 | Canada | Steve | 2019 | 1 | 0 |
+----+---------------+-------+------+----------+-------------+
| 7 | Canada | Steve | 2019 | 0 | 0 |
+----+---------------+-------+------+----------+-------------+
| 8 | California | Steve | 2020 | 1 | 1 |
+----+---------------+-------+------+----------+-------------+
| 9 | Canada | Steve | 2020 | 0 | 1 |
+----+---------------+-------+------+----------+-------------+
| 13 | California | Steve | 2021 | 1 | 1 |
+----+---------------+-------+------+----------+-------------+
| 14 | California | Steve | 2022 | 1 | 0 |
+----+---------------+-------+------+----------+-------------+
| 17 | NewYork | Steve | 2022 | 1 | 1 |
+----+---------------+-------+------+----------+-------------+
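One possible way to go from the per-row flags toward the two desired tables is to keep the previous address as a "From" column and then count From/To pairs. The following is a rough sketch under several assumptions, not a confirmed solution: rows are ordered by Year (ties keep their original order), a move is attributed to the Year of the row where the new address first appears, and the names moves, matrix_2019 and by_year are made up for illustration. The sample data contains both "New York" and "NewYork"; the sketch treats them as different addresses.

```python
import pandas as pd

# Same rows as the sample above, written as (Name, Year, Address) tuples.
rows = [
    ("John", 2018, "Beverly hills"), ("John", 2018, "Beverly hills"),
    ("John", 2019, "Beverly hills"), ("John", 2019, "Orange county"),
    ("John", 2019, "New York"), ("Steve", 2018, "Canada"),
    ("Steve", 2019, "Canada"), ("Steve", 2019, "Canada"),
    ("Steve", 2020, "California"), ("Steve", 2020, "Canada"),
    ("John", 2020, "Canada"), ("John", 2021, "Canada"),
    ("John", 2021, "Beverly hills"), ("Steve", 2021, "California"),
    ("Steve", 2022, "California"), ("Steve", 2018, "NewYork"),
    ("Steve", 2018, "California"), ("Steve", 2022, "NewYork"),
]
df = pd.DataFrame(rows, columns=["Name", "Year", "Address"])

# Assumption: chronological order matters, so sort by Year first.
# A stable sort keeps the original row order within the same Year.
df = df.sort_values("Year", kind="stable")

# Previous address of the same person (NaN for each person's first row).
df["From"] = df.groupby("Name")["Address"].shift()

# Keep only rows where a previous address exists and differs from the current one.
moves = df[df["From"].notna() & df["From"].ne(df["Address"])]
moves = moves.rename(columns={"Address": "To"})

# 1) From/To matrix for a single year (here 2019).
m2019 = moves[moves["Year"] == 2019]
matrix_2019 = pd.crosstab(m2019["From"], m2019["To"])
print(matrix_2019)

# 2) One row per (From, To) pair, one column per year.
by_year = moves.groupby(["From", "To", "Year"]).size().unstack("Year", fill_value=0)
print(by_year)
```

The crosstab only lists addresses that actually occur as a source or destination in the chosen year. If a full square matrix is wanted, reindexing both axes with the complete list of addresses would be one option. For a frame with hundreds of millions of rows, converting Address to a categorical dtype might reduce memory use, though that has not been measured here.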
If I understand the question correctly:
df.drop_duplicates().groupby(['Name','Year']).size().reset_index(name="changes")
This gives the following output:
    Name  Year  changes
0   John  2018        1
1   John  2019        3
2   John  2020        1
3   John  2021        2
4  Steve  2018        3
5  Steve  2019        1
6  Steve  2020        2
7  Steve  2021        1
8  Steve  2022        2
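Note that this one-liner counts the distinct addresses per Name and Year rather than the transitions between them, so a person who stays at one address all year still gets a count of 1. When the frame has only the Name, Year and Address columns and no missing addresses, the same numbers should come out of nunique, as this small sketch on made-up data suggests (demo, via_size and via_nunique are illustrative names):

```python
import pandas as pd

# Hypothetical mini frame: A has two distinct addresses in 2018, B has one.
demo = pd.DataFrame({
    "Name": ["A", "A", "A", "B"],
    "Year": [2018, 2018, 2018, 2018],
    "Address": ["X", "X", "Y", "Z"],
})

# Approach from the answer: drop exact duplicate rows, then count rows per group.
via_size = (
    demo.drop_duplicates()
        .groupby(["Name", "Year"])
        .size()
        .reset_index(name="changes")
)

# Alternative: count distinct addresses per group directly.
via_nunique = (
    demo.groupby(["Name", "Year"])["Address"]
        .nunique()
        .reset_index(name="changes")
)

print(via_size)
print(via_nunique)
```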