Concatenating strings from DataFrame columns in a loop (Python 3.8)

Question

Suppose I have a DataFrame "DS_df" containing strings and numbers. The three columns "LAultimateparentcountry", "borrowerultimateparentcountry" and "tot" form a relationship.

How can I create a dictionary out of those three columns (for the entire dataset, where order matters)? I need to access the two countries as one variable and tot as another. I've tried the code below so far, but it merely yields a list with separate items. For some reason I am also not able to get .join to work, as the df is quite big (900k+ rows).

new_list = []

for i, row in DS_df.iterrows():
    new_list.append(row["LAultimateparentcountry"])
    new_list.append(row["borrowerultimateparentcountry"])
    new_list.append(row["tot"])

The preferred outcome would be a dictionary in which I could look up, for example, "Germany_Switzerland": 56708. Any help or advice is much appreciated.

Cheers

Answer 1

You can use a dict this way:

countries_map = {}

for index, row in DS_df.iterrows():
    curr_rel = f'{row["LAultimateparentcountry"]}_{row["borrowerultimateparentcountry"]}'
    countries_map[curr_rel] = row["tot"]

If you do not want to overwrite the values of existing keys (and want to keep their first appearance instead):

countries_map = {}
for index, row in DS_df.iterrows():
    curr_rel = f'{row["LAultimateparentcountry"]}_{row["borrowerultimateparentcountry"]}'
    if curr_rel not in countries_map:
        countries_map[curr_rel] = row["tot"]

Answer 2

When performing operations on a DataFrame, it is generally better to look for a column-wise solution rather than a row-wise one.

If your DataFrame has 900k+ rows, applying vectorized operations to it is likely a good option.

Below are two solutions:

Using pd.Series + to_dict():

pd.Series(DS_df.tot.values, index=DS_df.LAultimateparentcountry.str.cat(DS_df.borrowerultimateparentcountry, sep="_")).to_dict()

Using zip() + dict():

dict(zip(DS_df.LAultimateparentcountry.str.cat(DS_df.borrowerultimateparentcountry, sep="_"), DS_df.tot))

Test DataFrame:

import pandas as pd

# Small test frame; rows 0 and 2 produce the same key on purpose
DS_df = pd.DataFrame({
    'LAultimateparentcountry': ['India', 'Germany', 'India'],
    'borrowerultimateparentcountry': ['France', 'Ireland', 'France'],
    'tot': [56708, 87902, 91211]
})
DS_df

  LAultimateparentcountry borrowerultimateparentcountry    tot
0                   India                        France  56708
1                 Germany                       Ireland  87902
2                   India                        France  91211

Output of both solutions:

{'India_France': 91211, 'Germany_Ireland': 87902}

If the formed key has duplicates, the value from the last matching row overwrites the earlier ones.
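If overwriting is not the behavior you want, the following is a minimal sketch of two alternatives, assuming the same column names as the test DataFrame above. The names `key`, `keyed`, `first_map` and `sum_map` are illustrative and do not come from the original answer. It builds the combined key once, then either keeps the first value per key or sums all values that share a key.

```python
import pandas as pd

DS_df = pd.DataFrame({
    "LAultimateparentcountry": ["India", "Germany", "India"],
    "borrowerultimateparentcountry": ["France", "Ireland", "France"],
    "tot": [56708, 87902, 91211],
})

# Build the combined key once with a vectorized string concatenation.
key = DS_df["LAultimateparentcountry"].str.cat(
    DS_df["borrowerultimateparentcountry"], sep="_"
)

# Series of tot values indexed by the combined key (duplicates allowed here).
keyed = pd.Series(DS_df["tot"].to_numpy(), index=key)

# Option A: keep only the first value seen for each key.
first_map = keyed[~keyed.index.duplicated(keep="first")].to_dict()

# Option B: sum every value that shares the same key.
sum_map = keyed.groupby(level=0).sum().to_dict()

print(first_map)
print(sum_map)
```

Which option is appropriate depends on what a repeated country pair means in your data, which the question does not specify.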

Which solution is more performant?

Short answer:
zip() + dict() # if the number of rows is below roughly 1,000,000
pd.Series + to_dict() # if the number of rows is above roughly 1,000,000

Long answer: below are the tests.

Test with 30 rows and 3 columns

zip() + dict():

%timeit dict(zip(DS_df.LAultimateparentcountry.str.cat(DS_df.borrowerultimateparentcountry, sep="_"), DS_df.tot))

297 µs ± 21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

pd.Series + to_dict():

%timeit pd.Series(DS_df.tot.values, index=DS_df.LAultimateparentcountry.str.cat(DS_df.borrowerultimateparentcountry, sep="_")).to_dict()

506 µs ± 35.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Test with 6291456 rows and 3 columns

pd.Series + to_dict():

%timeit pd.Series(DS_df.tot.values, index=DS_df.LAultimateparentcountry.str.cat(DS_df.borrowerultimateparentcountry, sep="_")).to_dict()
3.92 s ± 77.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

zip() + dict():

%timeit dict(zip(DS_df.LAultimateparentcountry.str.cat(DS_df.borrowerultimateparentcountry, sep="_"), DS_df.tot))
3.97 s ± 226 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
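The %timeit lines above are IPython magics and only work in IPython or Jupyter. To repeat the comparison from a plain Python script, a rough sketch using the standard timeit module could look like the following. The synthetic data, the row count `n_rows`, and the helper names `with_zip` and `with_series` are assumptions made for illustration, not part of the original benchmark, and the timings will differ by machine and pandas version.

```python
import timeit

import pandas as pd

# Synthetic stand-in for the real data (hypothetical size; increase as needed).
n_rows = 30_000
DS_df = pd.DataFrame({
    "LAultimateparentcountry": ["India", "Germany", "India"] * (n_rows // 3),
    "borrowerultimateparentcountry": ["France", "Ireland", "France"] * (n_rows // 3),
    "tot": range(n_rows),
})


def with_zip(df):
    # zip() + dict() variant from the answer
    keys = df["LAultimateparentcountry"].str.cat(
        df["borrowerultimateparentcountry"], sep="_"
    )
    return dict(zip(keys, df["tot"]))


def with_series(df):
    # pd.Series + to_dict() variant from the answer
    keys = df["LAultimateparentcountry"].str.cat(
        df["borrowerultimateparentcountry"], sep="_"
    )
    return pd.Series(df["tot"].to_numpy(), index=keys).to_dict()


calls = 3
for fn in (with_zip, with_series):
    # Best of three repeats, averaged over `calls` calls per repeat.
    best = min(timeit.repeat(lambda: fn(DS_df), number=calls, repeat=3)) / calls
    print(f"{fn.__name__}: {best * 1e3:.2f} ms per call")
```

Both helpers return the same dictionary, so only the running time differs between them.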

Answer 1
0 accepted 2021-04-21 09:22:47

Answer 2
0 2021-04-21 09:45:08
