I'd like to learn how to create a data frame column holding a code mapped from multiple columns.
In the partial example below I was trying what may be a clumsy approach: get the unique values as a temporary data frame, concatenate a prefix string with the temp row number as a new column, and then join the two data frames.
import pandas as pd
df = pd.DataFrame({'col1' : ['A1', 'A2', 'A1', 'A3'],
'col2' : ['B1', 'B2', 'B1', 'B1'],
'value' : [100, 200, 300, 400],
})
tmp = df[['col1','col2']].drop_duplicates(['col1', 'col2'])
# col1 col2
# 0 A1 B1
# 1 A2 B2
# 3 A3 B1
The first question is how to get 'temp' row number and its value to a tmp column?
And what is the clever pythonic way to achieve the result below from df?
dfnew = pd.DataFrame({'col1' : ['A1', 'A2', 'A1', 'A3'],
'col2' : ['B1', 'B2', 'B1', 'B1'],
'code' : ['CODE0','CODE1', 'CODE0', 'CODE3'],
'value' : [100, 200, 300, 400],
})
code col1 col2 value
0 CODE0 A1 B1 100
1 CODE1 A2 B2 200
2 CODE0 A1 B1 300
3 CODE3 A3 B1 400
thanks.
After the answers and just as an exercise I kept working on the non-pythonic version I had in mind with insights I got from great answers, and reached this:
tmp = df[['col1','col2']].drop_duplicates(['col1', 'col2'])
tmp.reset_index(inplace=True)
tmp.drop('index', axis=1, inplace=True)
tmp['code'] = tmp.index.to_series().apply(lambda x: 'code' + format(x, '04d'))
dfnew = pd.merge(df, tmp, on=['col1', 'col2'])
At the time of posting this question, I did not realize that it would be nicer to reset the index and get a fresh sequence instead of the original index numbers.
I tried some variations but I did not get how to chain 'reset_index' and 'drop' in just one command.
I'm starting to enjoy Python. Thank you all.
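On chaining reset_index and drop: reset_index accepts a drop=True argument, which discards the old index instead of keeping it as an 'index' column, so the separate drop call should not be needed. A minimal sketch of the same idea as above, using the sample df from the question (the list comprehension for building the code strings is just one possible choice):

```python
import pandas as pd

df = pd.DataFrame({'col1': ['A1', 'A2', 'A1', 'A3'],
                   'col2': ['B1', 'B2', 'B1', 'B1'],
                   'value': [100, 200, 300, 400]})

# reset_index(drop=True) throws the old index away instead of turning it
# into a column, so no follow-up drop('index', axis=1) is required
tmp = df[['col1', 'col2']].drop_duplicates().reset_index(drop=True)

# build the code from the fresh 0..n-1 index, zero-padded to 4 digits
tmp['code'] = ['code{:04d}'.format(i) for i in tmp.index]

# attach the code to every row of df through the (col1, col2) pair
dfnew = pd.merge(df, tmp, on=['col1', 'col2'])
print(dfnew)
```

With a fresh index the codes are numbered consecutively per unique pair, so the last row gets code0002 rather than a code based on its original row number 3.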
You can first sort by columns col1 and col2 with sort_values, then find all duplicates with duplicated:
df = df.sort_values(['col1', 'col2'])
mask = df.duplicated(['col1','col2'])
print (mask)
0 False
2 True
1 False
3 False
dtype: bool
Then use insert, if you need to specify the position of the output column code, together with numpy.where, and forward-fill the missing values with ffill. Finally, restore the original row order with sort_index:
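import numpy as np
# first row of each (col1, col2) group gets 'CODE' + its index; repeated rows get NaN for now
df.insert(0, 'code', np.where(mask, np.nan, 'CODE' + df.index.astype(str)))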
df.code = df.code.ffill()
df = df.sort_index()
print (df)
code col1 col2 value
0 CODE0 A1 B1 100
1 CODE1 A2 B2 200
2 CODE0 A1 B1 300
3 CODE3 A3 B1 400
How to get 'temp' row number and its value to a tmp column?
The value column is not propagating because you filter it out at the beginning: df[['col1','col2']]. Hence, this is fixed by changing it to tmp = df.drop_duplicates(['col1', 'col2']).
The original index is preserved as the index of tmp; if you want to copy it explicitly into a data column, just do tmp['index'] = tmp.index.
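A small sketch of the two fixes just described, using the sample df from the question, in case it helps to see them together:

```python
import pandas as pd

df = pd.DataFrame({'col1': ['A1', 'A2', 'A1', 'A3'],
                   'col2': ['B1', 'B2', 'B1', 'B1'],
                   'value': [100, 200, 300, 400]})

# drop duplicates on the key columns only, keeping every column (including value)
tmp = df.drop_duplicates(['col1', 'col2'])

# copy the surviving original row labels into an ordinary data column
tmp['index'] = tmp.index
print(tmp)
```

The first occurrence of each (col1, col2) pair is kept, so the row labels that survive here are 0, 1 and 3.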
What is the clever pythonic way to achieve the result below from df?
I do not know whether it is particularly clever, as this is subjective, but one way of achieving that is:
pd.concat([gr.assign(code='CODE{}'.format(min(gr.index))) for _, gr in df.groupby(['col1', 'col2'])])
Finally, to get the result in the form you specified, you can append .sort_index() and [['code', 'col1', 'col2', 'value']] to the above, to restore the row order and set the column order, giving:
newdf = pd.concat([gr.assign(code='CODE{}'.format(min(gr.index))) for _, gr in df.groupby(['col1', 'col2'])]).sort_index()[['code', 'col1', 'col2', 'value']]
A possible performance bottleneck is the combination of groupby and concat, which may matter if you operate on large data sets.
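If the per-group concat turns out to be slow, one alternative worth trying (a different technique from the one above, not a drop-in equivalent) is groupby(...).ngroup(), which numbers the groups without building intermediate frames. Note that it yields consecutive group numbers, so the last row gets CODE2 here instead of CODE3. A sketch on the sample df from the question:

```python
import pandas as pd

df = pd.DataFrame({'col1': ['A1', 'A2', 'A1', 'A3'],
                   'col2': ['B1', 'B2', 'B1', 'B1'],
                   'value': [100, 200, 300, 400]})

# ngroup() labels every row with the number of its (col1, col2) group;
# sort=False numbers the groups in order of first appearance
group_no = df.groupby(['col1', 'col2'], sort=False).ngroup()

# prefix the group number and put the new column first
newdf = df.assign(code='CODE' + group_no.astype(str))[['code', 'col1', 'col2', 'value']]
print(newdf)
```

The row order of df is left untouched, so no sort_index() is needed afterwards.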
Use groupby on df.index with ['col1', 'col2'], using transform('first') and map:
df.assign(
code=df.index.to_series().groupby(
[df.col1, df.col2]
).transform('first').map('CODE{}'.format)
)[['code'] + df.columns.tolist()]
code col1 col2 value
0 CODE0 A1 B1 100
1 CODE1 A2 B2 200
2 CODE0 A1 B1 300
3 CODE3 A3 B1 400
Explanation:
# turn index to series so I can perform a groupby on it
idx_series = df.index.to_series()
# groupby col1 and col2 to establish uniqueness
idx_gb = idx_series.groupby([df.col1, df.col2])
# get first index value in each unique group
# and broadcast over entire group with transform
idx_tf = idx_gb.transform('first')
# map a format function to get desired string
code = idx_tf.map('CODE{}'.format)
# use assign to create new column
df.assign(code=code)
If you have a DataFrame df like this:
state year population
0 California 2000 33871648
1 California 2010 37253956
2 New York 2000 18976457
3 New York 2010 19378102
4 Texas 2000 20851820
5 Texas 2010 25145561
you can create an index from the state and year columns with:
df2 = df.set_index(['state','year'])
which gives you a DataFrame with a multi-index constructed from the state and year columns.
Accessing the multi-indexed data
pop = df2['population']  # the examples below work on this Series
pop.loc['California', 2000]
Result: 33871648
pop.loc[:, 2010]
Result:
state
California 37253956
New York 19378102
Texas 25145561
dtype: int64
pop.loc['California':'New York']
Result:
state year
California 2000 33871648
2010 37253956
New York 2000 18976457
2010 19378102
dtype: int64
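For convenience, here is a self-contained sketch that rebuilds the table above and runs the three lookups. The names single, by_year and state_range are only illustrative, and .loc is used throughout as the explicit label-based accessor:

```python
import pandas as pd

df = pd.DataFrame({
    'state': ['California', 'California', 'New York', 'New York', 'Texas', 'Texas'],
    'year': [2000, 2010, 2000, 2010, 2000, 2010],
    'population': [33871648, 37253956, 18976457, 19378102, 20851820, 25145561],
})

# move state and year into a two-level index, then take the population Series
df2 = df.set_index(['state', 'year'])
pop = df2['population']

# one value: both index levels given
single = pop.loc[('California', 2000)]

# all states for one year: slice the first level, fix the second
by_year = pop.loc[:, 2010]

# a label range on the first level (relies on the index being sorted)
state_range = pop.loc['California':'New York']

print(single)
print(by_year)
print(state_range)
```

The label-range slice depends on the index being sorted; if the rows were in arbitrary order, calling sort_index() on df2 first would be a reasonable precaution.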