Concatenating strings from DataFrame columns in a loop (Python 3.8)

Question

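Suppose I have a DataFrame "DS_df" containing strings and numbers. The three columns "LAultimateparentcountry", "borrowerultimateparentcountry", and "tot" form a relationship.

How can I create a dictionary from these three columns (for the whole dataset, where order matters)? I need to access the two countries as one variable and tot as another. So far I have tried the code below, but it only gives me a list of separate items. For some reason I also can't get .join to work, since the df is quite large (900k+ rows).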

new_list =[]

for i, row in DS_df.iterrows():
    new_list.append(row["LAultimateparentcountry"])
    new_list.append(row["borrowerultimateparentcountry"])
    new_list.append(row["tot"])

The preferred result is a dictionary in which I can look up, for example, "Germany_Switzerland": 56708. Any help or advice is much appreciated.

Cheers

Answer 1

You can use a dictionary like this:

countries_map = {}

for index, row in DS_df.iterrows():
    curr_rel = f'{row["LAultimateparentcountry"]}_{row["borrowerultimateparentcountry"]}'
    countries_map[curr_rel] = row["tot"]

If you don't want existing keys to be overwritten (and want to keep their first occurrence instead):

countries_map = {}
for index, row in DS_df.iterrows():
    curr_rel = f'{row["LAultimateparentcountry"]}_{row["borrowerultimateparentcountry"]}'
    if curr_rel not in countries_map:
        countries_map[curr_rel] = row["tot"]

Answer 2

When working with a DataFrame, it is better to think of solutions column-wise rather than row-wise.

If your DataFrame has 900k+ rows, applying vectorized operations to it may be a good choice.
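As a rough illustration of what "column-wise" means here, the sketch below builds the same keys twice: once with one Python-level f-string per row, and once with a single Series.str.cat call over the whole column. The two-row frame is made-up sample data that only reuses the column names from the question.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
DS_df = pd.DataFrame({
    "LAultimateparentcountry": ["Germany", "India"],
    "borrowerultimateparentcountry": ["Switzerland", "France"],
    "tot": [56708, 87902],
})

# Row-wise: one Python-level string format per row.
row_keys = [
    f"{lender}_{borrower}"
    for lender, borrower in zip(
        DS_df["LAultimateparentcountry"],
        DS_df["borrowerultimateparentcountry"],
    )
]

# Column-wise: one call that joins the two columns element by element.
col_keys = DS_df["LAultimateparentcountry"].str.cat(
    DS_df["borrowerultimateparentcountry"], sep="_"
)

print(col_keys.tolist())
```

Both approaches produce the same keys; the column-wise form simply hands the whole column to pandas in one call instead of driving the loop from Python.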

Here are two solutions:

Using pd.Series + to_dict():

pd.Series(DS_df.tot.values, index=DS_df.LAultimateparentcountry.str.cat(DS_df.borrowerultimateparentcountry, sep="_")).to_dict()

Using zip() + dict():

dict(zip(DS_df.LAultimateparentcountry.str.cat(DS_df.borrowerultimateparentcountry, sep="_"), DS_df.tot))

Test DataFrame:

import pandas as pd

# Small test frame with one duplicated country pair (rows 0 and 2)
DS_df = pd.DataFrame({
    'LAultimateparentcountry': ['India', 'Germany', 'India'],
    'borrowerultimateparentcountry': ['France', 'Ireland', 'France'],
    'tot': [56708, 87902, 91211]
})
DS_df


   LAultimateparentcountry  borrowerultimateparentcountry    tot
0                    India                         France  56708
1                  Germany                        Ireland  87902
2                    India                         France  91211

Output of both solutions:

{'India_France': 91211, 'Germany_Ireland': 87902}

If the generated keys contain duplicates, the value is overwritten, so the last row with a given key wins.
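If the first occurrence should be kept instead (as in the second loop of Answer 1), one possible vectorized variant is to drop the repeated keys before converting. This is only a sketch on the test frame above; the variable names keys, pairs, last_wins and first_wins are made up for illustration.

```python
import pandas as pd

DS_df = pd.DataFrame({
    "LAultimateparentcountry": ["India", "Germany", "India"],
    "borrowerultimateparentcountry": ["France", "Ireland", "France"],
    "tot": [56708, 87902, 91211],
})

# Build the "lender_borrower" keys once and use them as the index of tot.
keys = DS_df["LAultimateparentcountry"].str.cat(
    DS_df["borrowerultimateparentcountry"], sep="_"
)
pairs = pd.Series(DS_df["tot"].to_numpy(), index=keys)

# Default behaviour: a repeated key is overwritten by the later row.
last_wins = pairs.to_dict()

# Keep only the first row for each key before converting.
first_wins = pairs[~pairs.index.duplicated(keep="first")].to_dict()

print(last_wins)
print(first_wins)
```

Index.duplicated(keep="first") marks every repeat after the first as True, so negating the mask keeps exactly one row per key.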

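Which solution performs better?

Short answer:
zip() + dict() # if there are roughly fewer than 1 million rows
pd.Series + to_dict() # if there are roughly more than 1 million rows (the gap measured at about 6.3 million rows is small)

Long answer: here are the tests.

Test with 30 rows and 3 columns

zip() + dict():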

%timeit dict(zip(DS_df.LAultimateparentcountry.str.cat(DS_df.borrowerultimateparentcountry, sep="_"), DS_df.tot))

297 µs ± 21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

pd.Series + to_dict():

%timeit pd.Series(DS_df.tot.values, index=DS_df.LAultimateparentcountry.str.cat(DS_df.borrowerultimateparentcountry, sep="_")).to_dict()

506 µs ± 35.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Test with 6291456 rows and 3 columns

pd.Series + to_dict():

%timeit pd.Series(DS_df.tot.values, index=DS_df.LAultimateparentcountry.str.cat(DS_df.borrowerultimateparentcountry, sep="_")).to_dict()
3.92 s ± 77.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

zip() + dict():

%timeit dict(zip(DS_df.LAultimateparentcountry.str.cat(DS_df.borrowerultimateparentcountry, sep="_"), DS_df.tot))
3.97 s ± 226 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
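The timings above use IPython's %timeit. To rerun a comparison like this in a plain script, something along the following lines could work with the standard-library timeit module. It is a sketch only: the frame is synthetic (the three test rows repeated), and N_COPIES, N_CALLS, with_series and with_zip are names invented here. Absolute numbers will differ by machine and pandas version.

```python
import timeit

import pandas as pd

N_COPIES = 1000  # hypothetical size knob: 3 rows * 1000 copies = 3000 rows
N_CALLS = 5      # how many times each function is timed

base = pd.DataFrame({
    "LAultimateparentcountry": ["India", "Germany", "India"],
    "borrowerultimateparentcountry": ["France", "Ireland", "France"],
    "tot": [56708, 87902, 91211],
})
DS_df = pd.concat([base] * N_COPIES, ignore_index=True)


def with_series():
    """pd.Series + to_dict() variant."""
    keys = DS_df["LAultimateparentcountry"].str.cat(
        DS_df["borrowerultimateparentcountry"], sep="_"
    )
    return pd.Series(DS_df["tot"].to_numpy(), index=keys).to_dict()


def with_zip():
    """zip() + dict() variant."""
    keys = DS_df["LAultimateparentcountry"].str.cat(
        DS_df["borrowerultimateparentcountry"], sep="_"
    )
    return dict(zip(keys, DS_df["tot"]))


for fn in (with_series, with_zip):
    total = timeit.timeit(fn, number=N_CALLS)
    print(f"{fn.__name__}: {total / N_CALLS:.6f} s per call")
```

Raising N_COPIES moves the row count toward the sizes tested above; both functions should return the same dictionary at any size.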

Answer 1
0 | accepted | 2021-04-21 09:22:47

Answer 2
0 | 2021-04-21 09:45:08