pandas pivot table for heatmap

I am trying to generate a heatmap using seaborn, however I am having a small problem with the formatting of my data.

Currently, my data is in the form:

Name     Diag   Date
A        1       2006-12-01
A        1       1994-02-12
A        2       2001-07-23
B        2       1999-09-12
B        1       2016-10-12
C        3       2010-01-20
C        2       1998-08-20

I would like to create a heatmap (preferably in Python) showing Name on one axis against Diag on the other, marking whether each Diag occurred. I tried to pivot the table using pd.pivot, but I got the error

ValueError: Index contains duplicate entries, cannot reshape

this came from:

piv = df.pivot_table(index='Name',columns='Diag')

Time is irrelevant, but I would like to show which Names have had which Diag, and which Diag combinations cluster together. Do I need to create a new table for this, or is it possible with the one I have? In some cases a Name is not associated with every Diag.

EDIT: I have since tried: piv = df.pivot_table(index='Name',columns='Diag', values='Time', aggfunc='mean')

However, as Time is in datetime format, I end up with:
pandas.core.base.DataError: No numeric types to aggregate

You need pivot_table with an aggregate function, because some index/column combinations have multiple values, and pivot requires each combination to be unique:

print (df)
  Name  Diag  Time
0    A     1    12 <-duplicates for same A, 1 different value
1    A     1    13 <-duplicates for same A, 1 different value
2    A     2    14
3    B     2    18
4    B     1     1
5    C     3     9
6    C     2     8

df = df.pivot_table(index='Name',columns='Diag', values='Time', aggfunc='mean')
print (df)
Diag     1     2    3
Name                 
A     12.5  14.0  NaN
B      1.0  18.0  NaN
C      NaN   8.0  9.0
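
To make the difference concrete, here is a minimal sketch that rebuilds the sample frame above and runs both calls on it. The variable names pivot_error and piv are only illustrative; the column values are the ones from the example.

```python
import pandas as pd

# Same sample data as in the example above.
df = pd.DataFrame({
    "Name": ["A", "A", "A", "B", "B", "C", "C"],
    "Diag": [1, 1, 2, 2, 1, 3, 2],
    "Time": [12, 13, 14, 18, 1, 9, 8],
})

# pivot() only reshapes, so the two (A, 1) rows collide and it raises.
try:
    df.pivot(index="Name", columns="Diag", values="Time")
    pivot_error = None
except ValueError as err:
    pivot_error = str(err)
print(pivot_error)

# pivot_table() aggregates each (Name, Diag) group first, so duplicates
# are combined by aggfunc instead of causing an error.
piv = df.pivot_table(index="Name", columns="Diag", values="Time", aggfunc="mean")
print(piv)
```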

Alternative solution:

df = df.groupby(['Name','Diag'])['Time'].mean().unstack()
print (df)
Diag     1     2    3
Name                 
A     12.5  14.0  NaN
B      1.0  18.0  NaN
C      NaN   8.0  9.0

EDIT:

You can also list all duplicated rows with duplicated:

df = df.loc[df.duplicated(['Name','Diag'], keep=False), ['Name','Diag']]
print (df)
  Name  Diag
0    A     1
1    A     1

EDIT:

Taking the mean of datetimes is not straightforward: convert the dates to nanoseconds, take the mean, and then convert back to datetimes. There is another problem: NaN has to be replaced with some scalar, e.g. 0, which is converted to the zero datetime, 1970-01-01.

import numpy as np
import pandas as pd

df.Date = pd.to_datetime(df.Date)
df['dates_in_ns'] = pd.Series(df.Date.values.astype(np.int64), index=df.index)
df = df.pivot_table(index='Name',
                    columns='Diag', 
                    values='dates_in_ns', 
                    aggfunc='mean', 
                    fill_value=0)
df = df.apply(pd.to_datetime)
print (df)
Diag                   1          2          3
Name                                          
A    2000-07-07 12:00:00 2001-07-23 1970-01-01
B    2016-10-12 00:00:00 1999-09-12 1970-01-01
C    1970-01-01 00:00:00 1998-08-20 2010-01-20
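
Since the question says the dates themselves do not matter, averaging them may not be needed at all. As a rough sketch, assuming the goal is only to show whether each Name ever had each Diag, pd.crosstab can count rows per pair, and the counts can then be reduced to 1 or 0. The names counts and presence are made up for this example.

```python
import pandas as pd

# Data from the question; Date is kept only to mirror the original table.
df = pd.DataFrame({
    "Name": ["A", "A", "A", "B", "B", "C", "C"],
    "Diag": [1, 1, 2, 2, 1, 3, 2],
    "Date": ["2006-12-01", "1994-02-12", "2001-07-23", "1999-09-12",
             "2016-10-12", "2010-01-20", "1998-08-20"],
})

# Count rows for every (Name, Diag) pair; pairs that never occur get 0,
# so no NaN handling is needed.
counts = pd.crosstab(df["Name"], df["Diag"])

# Reduce counts to 1 (occurred at least once) or 0 (never occurred).
presence = (counts > 0).astype(int)
print(presence)
```

Passing presence to seaborn.heatmap(presence) should then draw the grid. For the clustering part of the question, seaborn.clustermap(presence) may be worth trying, since it reorders rows and columns by similarity.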
