
Concat strings from dataframe columns in a loop (Python 3.8)

Suppose I have a DataFrame "DS_df" containing strings and numbers. The three columns "LAultimateparentcountry", "borrowerultimateparentcountry" and "tot" form a relationship.

How can I create a dictionary out of those three columns (for the entire dataset, while order matters)? I would need to access the two countries as one variable, and tot as another. I've tried the code below so far, but this merely yields me a list with separate items. For some reason, I am also not able to get .join to work, as the df is quite big (900k+ rows).

new_list =[]

for i, row in DS_df.iterrows():
    new_list.append(row["LAultimateparentcountry"])
    new_list.append(row["borrowerultimateparentcountry"])
    new_list.append(row["tot"])

Preferred outcome would be a dictionary, where I could access "Germany_Switzerland": 56708 for example. Any help or advice is much appreciated.

Cheers

You can use a dict this way:

countries_map = {}

for index, row in DS_df.iterrows():
    curr_rel = f'{row["LAultimateparentcountry"]}_{row["borrowerultimateparentcountry"]}'
    countries_map[curr_rel] = row["tot"]
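Put together as a runnable sketch (the small DataFrame below is hypothetical sample data, using the column names from the question), the combined key then supports direct lookups like the one the question asks for:

```python
import pandas as pd

# Hypothetical sample data with the question's column names
DS_df = pd.DataFrame({
    "LAultimateparentcountry": ["Germany", "India"],
    "borrowerultimateparentcountry": ["Switzerland", "France"],
    "tot": [56708, 91211],
})

countries_map = {}
for index, row in DS_df.iterrows():
    # Join the two country columns into a single "A_B" key
    curr_rel = f'{row["LAultimateparentcountry"]}_{row["borrowerultimateparentcountry"]}'
    countries_map[curr_rel] = row["tot"]

print(countries_map["Germany_Switzerland"])  # 56708
```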

If you don't want to overwrite the values of existing keys

(i.e. keep their first appearance):

countries_map = {}
for index, row in DS_df.iterrows():
    curr_rel = f'{row["LAultimateparentcountry"]}_{row["borrowerultimateparentcountry"]}'
    if curr_rel not in countries_map:
        countries_map[curr_rel] = row["tot"]
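The membership check can also be written with `dict.setdefault`, which only stores a value when the key is absent. This is a minor stylistic alternative to the loop above, not part of the original answer; the sample data is again hypothetical:

```python
import pandas as pd

# Hypothetical data containing a duplicate country pair
DS_df = pd.DataFrame({
    "LAultimateparentcountry": ["India", "India"],
    "borrowerultimateparentcountry": ["France", "France"],
    "tot": [56708, 91211],
})

countries_map = {}
for index, row in DS_df.iterrows():
    curr_rel = f'{row["LAultimateparentcountry"]}_{row["borrowerultimateparentcountry"]}'
    # setdefault only inserts if the key is missing, so the first value wins
    countries_map.setdefault(curr_rel, row["tot"])

print(countries_map["India_France"])  # 56708
```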

When performing operations on a dataframe, it's always good to think of a solution column-wise rather than row-wise.

If your dataframe has 900k+ rows, applying vectorized operations to it is a good option.

Below are two solutions:

Using pd.Series + to_dict():

pd.Series(DS_df.tot.values, index=DS_df.LAultimateparentcountry.str.cat(DS_df.borrowerultimateparentcountry, sep="_")).to_dict()

Using zip() + dict():

dict(zip(DS_df.LAultimateparentcountry.str.cat(DS_df.borrowerultimateparentcountry, sep="_"), DS_df.tot))

Test Dataframe:

import pandas as pd

DS_df = pd.DataFrame({
    'LAultimateparentcountry': ['India', 'Germany', 'India'],
    'borrowerultimateparentcountry': ['France', 'Ireland', 'France'],
    'tot': [56708, 87902, 91211]
})
DS_df


  LAultimateparentcountry borrowerultimateparentcountry    tot
0                   India                        France  56708
1                 Germany                       Ireland  87902
2                   India                        France  91211

Output of both solutions:

{'India_France': 91211, 'Germany_Ireland': 87902}

If a formed key occurs multiple times, its value is overwritten, so the last occurrence wins.
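If you want the first occurrence to win with the vectorized approach instead, one option (an assumption on my part, not part of the original answer) is to mask out duplicated keys with Series.duplicated before building the dict:

```python
import pandas as pd

# Same test DataFrame as above
DS_df = pd.DataFrame({
    'LAultimateparentcountry': ['India', 'Germany', 'India'],
    'borrowerultimateparentcountry': ['France', 'Ireland', 'France'],
    'tot': [56708, 87902, 91211]
})

# Build the combined "A_B" key column
key = DS_df.LAultimateparentcountry.str.cat(DS_df.borrowerultimateparentcountry, sep="_")

# keep="first" marks all repeats after the first as duplicates
mask = ~key.duplicated(keep="first")
result = dict(zip(key[mask], DS_df.tot[mask]))

print(result["India_France"])  # 56708 (first occurrence, not 91211)
```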

Which solution is more performant?

Short answer:
zip() + dict()        # if the rows are approx. below 1,000,000
pd.Series + to_dict() # if the rows are approx. above 1,000,000

Long answer - below are the tests:

Test with 30 rows and 3 columns

zip() + dict()

%timeit dict(zip(DS_df.LAultimateparentcountry.str.cat(DS_df.borrowerultimateparentcountry, sep="_"), DS_df.tot))

297 µs ± 21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

pd.Series + to_dict():

%timeit pd.Series(DS_df.tot.values, index=DS_df.LAultimateparentcountry.str.cat(DS_df.borrowerultimateparentcountry, sep="_")).to_dict()

506 µs ± 35.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Test with 6291456 rows and 3 columns

pd.Series + to_dict()

%timeit pd.Series(DS_df.tot.values, index=DS_df.LAultimateparentcountry.str.cat(DS_df.borrowerultimateparentcountry, sep="_")).to_dict()
3.92 s ± 77.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

zip() + dict()

%timeit dict(zip(DS_df.LAultimateparentcountry.str.cat(DS_df.borrowerultimateparentcountry, sep="_"), DS_df.tot))
3.97 s ± 226 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
