
How to create a new column of dictionaries based on groupby, pandas DataFrame?

I have the following pandas DataFrame in Python 3.x, with two columns of strings and one column of integers.

import pandas as pd

dict1 = {'column1':['MXRBMVQDHF', 'LJNVTJOY', 'WHLAOECVQR'], 
         'column2':['DPBVNJYANX', 'UWRAWDOB', 'CUTQVWHRIJ'], 'start':[79, 31, 52]}

df1 = pd.DataFrame(dict1)
print(df1)

#       column1     column2  start
# 0  MXRBMVQDHF  DPBVNJYANX     79
# 1    LJNVTJOY    UWRAWDOB     31
# 2  WHLAOECVQR  CUTQVWHRIJ     52

Each row contains two strings of the same length. The strings use different coordinate systems, and I am writing a dictionary to translate between them. The string in column1 is 0-based (as expected). The integer in the start column is the "starting index" of the string in column2. In the first row, the starting index is 79.

The goal is to create a dictionary based on these indices. For the first row, the string in column1 begins at 0 and the string in column2 begins at 79. The dictionary "converting" these coordinates is as follows:

{0: 79, 1: 80, 2: 81, 3: 82, 4: 83, 5: 84, 6: 85, 7: 86, 8: 87, 9: 88}
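The mapping for a single row can be sketched as a small helper. The name coordinate_map is hypothetical (it does not appear in the question); it only assumes that position i in the column1 string corresponds to position i + start in the column2 coordinates.

```python
def coordinate_map(sequence, start):
    """Map each 0-based position in `sequence` to its position offset by `start`.

    `coordinate_map` is an illustrative name, not something from the question.
    """
    return {i: i + start for i in range(len(sequence))}


# First row of df1: a 10-character string whose column2 coordinates start at 79.
first_row_map = coordinate_map('MXRBMVQDHF', 79)
print(first_row_map)
```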

My goal is to create a new column in the pandas DataFrame holding these dictionaries. This is quite straightforward to do with a loop (though I suspect there is a faster way with .apply()):

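df1['new'] = None  # create an object column first so each cell can hold a dict
for index, row in df1.iterrows():
    # .at sets a single cell, so the dict is stored as-is
    df1.at[index, 'new'] = {i: i + row['start'] for i in range(len(row['column1']))}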

Now there is a column in df1 called new:

df1.new
0    {0: 79, 1: 80, 2: 81, 3: 82, 4: 83, 5: 84, 6: ...
1    {0: 31, 1: 32, 2: 33, 3: 34, 4: 35, 5: 36, 6: ...
2    {0: 52, 1: 53, 2: 54, 3: 55, 4: 56, 5: 57, 6: ...
Name: new, dtype: object
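As one possible loop-free alternative, the same column can be built with a list comprehension over the two relevant columns. This is a sketch rather than a benchmarked answer: it is used here instead of .apply(), which would still call a Python function once per row, and it assumes the same df1 as above.

```python
import pandas as pd

df1 = pd.DataFrame({
    'column1': ['MXRBMVQDHF', 'LJNVTJOY', 'WHLAOECVQR'],
    'column2': ['DPBVNJYANX', 'UWRAWDOB', 'CUTQVWHRIJ'],
    'start': [79, 31, 52],
})

# One dict per row: position i in column1 maps to i + start.
# int(start) keeps plain Python ints in the dictionaries.
df1['new'] = [
    {i: i + int(start) for i in range(len(sequence))}
    for sequence, start in zip(df1['column1'], df1['start'])
]

print(df1['new'])
```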

My problem is this: let's say the same string appears multiple times in column1. Here's an example:

import pandas as pd

dict2 = {'column1':['MXRBMVQDHF', 'LJNVTJOY', 'LJNVTJOY', 'LJNVTJOY', 'WHLAOECVQR'], 'column2':['DPBVNJYANX', 'UWRAWDOB', 'PEKUYUQR', 'WPMLFVFZ', 'CUTQVWHRIJ'], 'start':[79, 31, 52, 84, 18]}

df2 = pd.DataFrame(dict2)
print(df2)
#       column1     column2  start
# 0  MXRBMVQDHF  DPBVNJYANX     79
# 1    LJNVTJOY    UWRAWDOB     31
# 2    LJNVTJOY    PEKUYUQR     52
# 3    LJNVTJOY    WPMLFVFZ     84
# 4  WHLAOECVQR  CUTQVWHRIJ     18

In this case, the dictionary for the coordinates with LJNVTJOY should be:

{0: [31, 52, 84], 1: [32, 53, 85], 2: [33, 54, 86], 3: [34, 55, 87], 
     4: [35, 56, 88], 5: [36, 57, 89], 6: [37, 58, 90], 7: [38, 59, 91]}

which is a dictionary of lists built from the three per-row dictionaries:

{0: 31, 1: 32, 2: 33, 3: 34, 4: 35, 5: 36, 6: 37, 7: 38}
{0: 52, 1: 53, 2: 54, 3: 55, 4: 56, 5: 57, 6: 58, 7: 59}
{0: 84, 1: 85, 2: 86, 3: 87, 4: 88, 5: 89, 6: 90, 7: 91}
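The combination step can be sketched in plain Python: for each position, collect the value from every per-row dictionary in row order. The names row_maps and merged are illustrative only.

```python
# Per-row dictionaries for the three 'LJNVTJOY' rows (8 characters each),
# with starting indices 31, 52 and 84.
row_maps = [
    {i: i + start for i in range(8)}
    for start in (31, 52, 84)
]

# For every position, gather the translated coordinate from each row's dict.
merged = {key: [row_map[key] for row_map in row_maps] for key in row_maps[0]}
print(merged)
```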

EDIT: Here is the desired output. The result should be a DataFrame with one row per unique string in column1 and a column 'new' that looks like the following:

df2.new
0    {0: 79, 1: 80, 2: 81, 3: 82, 4: 83, 5: 84, 6: ...
1    {0: [31, 52, 84], 1: [32, 53, 85], 2: [33, 54, 86], 3: [34, 55, 87], 4: [35, 56, 88], 5: [36, 57, 89], 6: [37, 58, 90], 7: [38, 59, 91]}
2    {0: 18, 1: 19, 2: 20, 3: 21, 4: 22, 5: 23, 6: ...
Name: new, dtype: object

You can use cumcount to create the dict key:

df2['dictkey']=df2.groupby('column1').cumcount()
df2.groupby('column1').apply(lambda x : dict(zip(x['dictkey'],x['start'])))
Out[94]: 
column1
LJNVTJOY      {0: 31, 1: 52, 2: 84}
MXRBMVQDHF                  {0: 79}
WHLAOECVQR                  {0: 18}
dtype: object
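The output above maps the occurrence number within each group to its start value. To get closer to the format requested in the question (position mapped to a list of translated coordinates), one possible sketch iterates over the groups directly. The helper name group_map is hypothetical, and the choice to return plain integers for single-row groups is an assumption made to match the desired output shown in the question.

```python
import pandas as pd

df2 = pd.DataFrame({
    'column1': ['MXRBMVQDHF', 'LJNVTJOY', 'LJNVTJOY', 'LJNVTJOY', 'WHLAOECVQR'],
    'column2': ['DPBVNJYANX', 'UWRAWDOB', 'PEKUYUQR', 'WPMLFVFZ', 'CUTQVWHRIJ'],
    'start': [79, 31, 52, 84, 18],
})


def group_map(group):
    """Build the position -> coordinate(s) dict for one column1 group."""
    starts = [int(s) for s in group['start']]
    length = len(group['column1'].iloc[0])
    if len(starts) == 1:
        # Single occurrence: plain integers, as in the question's first example.
        return {i: i + starts[0] for i in range(length)}
    # Repeated string: one list entry per occurrence, in row order.
    return {i: [i + s for s in starts] for i in range(length)}


keys, maps = [], []
# sort=False keeps the groups in order of first appearance.
for key, group in df2.groupby('column1', sort=False):
    keys.append(key)
    maps.append(group_map(group))

result = pd.Series(maps, index=keys, name='new')
print(result)
```

The resulting Series is indexed by the column1 string, so result.loc['LJNVTJOY'] returns the dictionary of lists for that string.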
