Splitting a pandas DataFrame based on a column value

Question

Given the following data:

import pandas as pd

test_data = pd.DataFrame({
    "col": ["wall", "wall", "lamp", "lamp", "desk", "desk", "desk",
            "mug", "floor"],
})

I want to create three datasets (two in edge cases) based on a specific value in a given column (col in this example).

For example, given the value col = lamp, I would expect:

df 1
| col   |
|:------|
| wall  |
| wall  |

df 2
| col   |
|:------|
| lamp  |
| lamp  |

df 3
| col   |
|:------|
| desk  |
| desk  |
| desk  |
| mug   |
| floor |

I tried the following:

match_str = "mug"

match_start, match_end = (
    test_data["col"].eq(match_str).loc[lambda x: x].index.min(),
    test_data["col"].eq(match_str).loc[lambda x: x].index.max(),
)

df1_filt = pd.Series(test_data.index).lt(match_start)
df2_filt = pd.Series(test_data.index).between(match_start, match_end)
df3_filt = pd.Series(test_data.index).gt(match_end)

df1, df2, df3 = (
    test_data.loc[df1_filt],
    test_data.loc[df2_filt],
    test_data.loc[df3_filt],
)

This seems to meet the requirement. It assumes col is ordered, but if it is not ordered, the operation would not make sense anyway.
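The ordering assumption can be checked before splitting. Below is a minimal sketch, assuming "ordered" means that all rows equal to the match value form one contiguous block; is_single_run is a hypothetical helper name, not part of pandas.

```python
import pandas as pd

test_data = pd.DataFrame({
    "col": ["wall", "wall", "lamp", "lamp", "desk", "desk", "desk",
            "mug", "floor"],
})

def is_single_run(frame, column, value):
    """Hypothetical helper: True if all rows equal to `value` are adjacent."""
    mask = frame[column].eq(value).to_numpy()
    if not mask.any():
        return False  # no match at all, so there is nothing to split on
    positions = mask.nonzero()[0]  # positional offsets of the matching rows
    # A contiguous block spans exactly as many positions as it has rows.
    return bool(positions[-1] - positions[0] + 1 == len(positions))

print(is_single_run(test_data, "col", "lamp"))

# Interleave the lamp rows with the wall rows to break the assumption.
shuffled = test_data.iloc[[2, 0, 3, 1, 4, 5, 6, 7, 8]]
print(is_single_run(shuffled, "col", "lamp"))
```

If the check fails, a min/max based split like the one above would place non-matching rows in the middle frame.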

Answer 1

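This is like the behaviour of itertools.groupby, right? We need to group things that are adjacent to each other, depending on whether or not they equal the searched value. The pandas idiom that mimics Python's groupby is "diff-ne(0)-cumsum", so here we go: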

In [301]: df
Out[301]:
     col
0   wall
1   wall
2   lamp
3   lamp
4   desk
5   desk
6   desk
7    mug
8  floor

In [302]: [sub_frame
           for _, sub_frame in df.groupby(df.col.eq("lamp").diff().ne(0).cumsum())]
Out[302]:
[    col
 0  wall
 1  wall,
     col
 2  lamp
 3  lamp,
      col
 4   desk
 5   desk
 6   desk
 7    mug
 8  floor]

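This gives a list of 3 dataframes: before the "lamp stream", during the lamp stream, and after it. It also respects the edge cases: when the lamp stream sits at the very start or end of the frame, the list has two dataframes.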
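To make the comparison with itertools.groupby concrete, here is a rough sketch that runs both side by side and breaks the idiom into its three steps. The function name split_on_value is made up for illustration.

```python
from itertools import groupby

import pandas as pd

df = pd.DataFrame({
    "col": ["wall", "wall", "lamp", "lamp", "desk", "desk", "desk",
            "mug", "floor"],
})

# Plain-Python analogue: itertools.groupby starts a new group whenever the key changes.
runs = [list(items) for _, items in groupby(df["col"], key=lambda v: v == "lamp")]
print(runs)

def split_on_value(frame, column, value):
    """Hypothetical helper: split `frame` into consecutive runs that do / do not equal `value`."""
    is_match = frame[column].eq(value)  # True where the row equals `value`
    changed = is_match.diff().ne(0)     # True on the first row and wherever the flag flips
    run_id = changed.cumsum()           # 1, 2, 3, ... one id per consecutive run
    return [part for _, part in frame.groupby(run_id)]

print(len(split_on_value(df, "col", "lamp")))   # match in the middle
print(len(split_on_value(df, "col", "wall")))   # match at the very start
print(len(split_on_value(df, "col", "floor")))  # match at the very end
```

A match in the middle yields three frames, while a match at either end yields two, which is the edge case from the question.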

Answer 2

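Whenever you find yourself trying to dynamically split something into an unknown number of variables, that should raise a red flag. I suggest creating a group flag in the dataset and then using it to group or iterate.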

import pandas as pd
test_data = pd.DataFrame(
    {
        "col": ["wall", "wall", "lamp", "lamp", "desk", "desk", "desk", "mug", "floor"],
    }
)

test_data['group'] = test_data['col'].eq('mug').diff().ne(0).cumsum()
print(test_data)

Output

     col  group
0   wall      1
1   wall      1
2   lamp      1
3   lamp      1
4   desk      1
5   desk      1
6   desk      1
7    mug      2
8  floor      3
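As an illustration of "group or iterate", the flag can be used directly without creating separate dataframes. This is only a sketch; the summary columns chosen here (first, last, count) are an arbitrary example.

```python
import pandas as pd

test_data = pd.DataFrame(
    {
        "col": ["wall", "wall", "lamp", "lamp", "desk", "desk", "desk", "mug", "floor"],
    }
)
test_data["group"] = test_data["col"].eq("mug").diff().ne(0).cumsum()

# Aggregate per group: first value, last value, and number of rows.
summary = test_data.groupby("group")["col"].agg(["first", "last", "count"])
print(summary)

# Or iterate over the groups one at a time.
for group_id, rows in test_data.groupby("group"):
    print(group_id, rows["col"].tolist())
```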

If you must split them for some reason, at least use a dictionary to store them, so that you can handle the varying number of dataframes returned.

import pandas as pd

test_data = pd.DataFrame(
    {
        "col": ["wall", "wall", "lamp", "lamp", "desk", "desk", "desk", "mug", "floor"],
    }
)

# Key each sub-frame by its run number (1, 2, 3, ...)
output = {group: data for group, data in test_data.groupby(test_data['col'].eq('mug').diff().ne(0).cumsum())}

print(output[2])

Result

   col
7  mug
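The point of the dictionary is that the number of pieces can vary with the search value. A small sketch of that, where split_to_dict is a hypothetical wrapper around the same expression:

```python
import pandas as pd

test_data = pd.DataFrame(
    {
        "col": ["wall", "wall", "lamp", "lamp", "desk", "desk", "desk", "mug", "floor"],
    }
)

def split_to_dict(frame, column, value):
    """Hypothetical helper: map run number (1, 2, ...) to the rows of that run."""
    run_id = frame[column].eq(value).diff().ne(0).cumsum()
    return {int(group): data for group, data in frame.groupby(run_id)}

for value in ("mug", "wall"):
    pieces = split_to_dict(test_data, "col", value)
    # The number of pieces depends on where the value sits, so loop instead of unpacking.
    for group, data in pieces.items():
        print(value, group, data["col"].tolist())
    # .get returns None instead of raising KeyError when a piece does not exist.
    print(value, "has a third piece:", pieces.get(3) is not None)
```

With "mug" there are three pieces, and with "wall" (at the very start) only two, so code that unpacks into exactly three variables would fail for the second case.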

Answer 3

match_str = 'lamp'
# breaking point: index labels of the rows equal to match_str
bp = test_data.loc[test_data['col'] == match_str, :].index

# before bp (smaller than bp's head)
b_bp = test_data.index < bp[0]

# after bp (greater than bp's tail)
a_bp = test_data.index > bp[-1]

df1 = test_data.loc[b_bp]
df1
###
    col
0  wall
1  wall

df2 = test_data.loc[bp]
df2
###
    col
2  lamp
3  lamp

df3 = test_data.loc[a_bp]
df3
###
     col
4   desk
5   desk
6   desk
7    mug
8  floor

3 answers

Answer 1
1 2022-08-13 16:27:53

Answer 2
1 2022-08-13 16:28:40

Answer 3
0 2022-08-13 16:18:21
