I have the following example dataframe:
df = pd.DataFrame(data = {'RecordID' : [1,1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,5,5,5,5], 'DisplayLabel' : ['Source','Test','Value 1','Value 2','Value3','Source','Test','Value 1','Value 2','Source','Test','Value 1','Value 2','Source','Test','Value 1','Value 2','Source','Test','Value 1','Value 2'],
'Value' : ['Web','Logic','S','I','Complete','Person','Voice','>20','P','Mail','OCR','A','I','Dictation','Understandable','S','I','Web','Logic','R','S']})
which creates this dataframe:
+-------+----------+---------------+----------------+
| Index | RecordID | Display Label | Value |
+-------+----------+---------------+----------------+
| 0 | 1 | Source | Web |
| 1 | 1 | Test | Logic |
| 2 | 1 | Value 1 | S |
| 3 | 1 | Value 2 | I |
| 4 | 1 | Value 3 | Complete |
| 5 | 2 | Source | Person |
| 6 | 2 | Test | Voice |
| 7 | 2 | Value 1 | >20 |
| 8 | 2 | Value 2 | P |
| 9 | 3 | Source | Mail |
| 10 | 3 | Test | OCR |
| 11 | 3 | Value 1 | A |
| 12 | 3 | Value 2 | I |
| 13 | 4 | Source | Dictation |
| 14 | 4 | Test | Understandable |
| 15 | 4 | Value 1 | S |
| 16 | 4 | Value 2 | I |
| 17 | 5 | Source | Web |
| 18 | 5 | Test | Logic |
| 19 | 5 | Value 1 | R |
| 20 | 5 | Value 2 | S |
+-------+----------+---------------+----------------+
I am trying to "unmelt" though not exactly the source and test columns into new dataframe Columns such that it will look like this:
+-------+----------+-----------+----------------+---------------+----------+
| Index | RecordID | Source | Test | Result | Value |
+-------+----------+-----------+----------------+---------------+----------+
| 0 | 1 | Web | Logic | Value 1 | S |
| 1 | 1 | Web | Logic | Value 2 | I |
| 2 | 1 | Web | Logic | Value 3 | Complete |
| 3 | 2 | Person | Voice | Value 1 | >20 |
| 4 | 2 | Person | Voice | Value 2 | P |
| 5 | 3 | Mail | OCR | Value 1 | A |
| 6 | 3 | Mail | OCR | Value 2 | I |
| 7 | 4 | Dictation | Understandable | Value 1 | S |
| 8 | 4 | Dictation | Understandable | Value 2 | I |
| 9 | 5 | Web | Logic | Value 1 | R |
| 10 | 5 | Web | Logic | Value 2 | S |
+-------+----------+-----------+----------------+---------------+----------+
It's my understanding that pivot and melt will do the entire DisplayLabel column and not just some of the values.
Any help would be greatly appreciated as I have read the Pandas Melt and the Pandas Pivot as well as some references on stackoverflow and I can't seem to figure out a way to do this quickly.
Thanks!
set_index
, unstack
, then melt
df.set_index(['RecordID', 'DisplayLabel']).Value.unstack().reset_index() \
.melt(['RecordID', 'Source', 'Test'], var_name='Result', value_name='Value') \
.sort_values('RecordID').dropna(subset=['Value'])
RecordID Source Test Result Value
0 1 Web Logic Value 1 S
5 1 Web Logic Value 2 I
10 1 Web Logic Value 3 Complete
1 2 Person Voice Value 1 >20
6 2 Person Voice Value 2 P
2 3 Mail OCR Value 1 A
7 3 Mail OCR Value 2 I
3 4 Dictation Understandable Value 1 S
8 4 Dictation Understandable Value 2 I
4 5 Web Logic Value 1 R
9 5 Web Logic Value 2 S
groupby
def f(t):
name, df = t
d = dict(zip(df['DisplayLabel'], df['Value']))
source = d.pop('Source')
test = d.pop('Test')
result, value = zip(*d.items())
return pd.DataFrame(
dict(RecordID=name, Source=source, Test=test, Result=result, Value=value)
)
pd.concat(map(f, df.groupby('RecordID')))
RecordID Source Test Result Value
0 1 Web Logic Value 1 S
1 1 Web Logic Value 2 I
2 1 Web Logic Value 3 Complete
0 2 Person Voice Value 1 >20
1 2 Person Voice Value 2 P
0 3 Mail OCR Value 1 A
1 3 Mail OCR Value 2 I
0 4 Dictation Understandable Value 1 S
1 4 Dictation Understandable Value 2 I
0 5 Web Logic Value 1 R
1 5 Web Logic Value 2 S
df = pd.DataFrame(data={
'RecordID': [1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5],
'DisplayLabel': [
'Source', 'Test', 'Value 1', 'Value 2', 'Value 3',
'Source', 'Test', 'Value 1', 'Value 2',
'Source', 'Test', 'Value 1', 'Value 2',
'Source', 'Test', 'Value 1', 'Value 2',
'Source', 'Test', 'Value 1', 'Value 2'
],
'Value': [
'Web', 'Logic', 'S', 'I', 'Complete',
'Person', 'Voice', '>20', 'P',
'Mail', 'OCR', 'A', 'I',
'Dictation', 'Understandable', 'S', 'I',
'Web', 'Logic', 'R', 'S'
]
})
We can achieve your result by applying logic and pivotting, we split your data by checking if DisplayLabel
contains Value
and then we join
them back together:
mask = df['DisplayLabel'].str.contains('Value')
df2 = df[~mask].pivot(index='RecordID', columns='DisplayLabel', values='Value')
dfpiv = (
df[mask].rename(columns={'DisplayLabel':'Result'})
.set_index('RecordID')
.join(df2)
.reset_index()
)
RecordID Result Value Source Test
0 1 Value 1 S Web Logic
1 1 Value 2 I Web Logic
2 1 Value3 Complete Web Logic
3 2 Value 1 >20 Person Voice
4 2 Value 2 P Person Voice
5 3 Value 1 A Mail OCR
6 3 Value 2 I Mail OCR
7 4 Value 1 S Dictation Understandable
8 4 Value 2 I Dictation Understandable
9 5 Value 1 R Web Logic
10 5 Value 2 S Web Logic
If you want the exact column order as your example, use DataFrame.reindex
:
dfpiv.reindex(columns=['RecordID', 'Source', 'Test', 'Result', 'Value'])
RecordID Source Test Result Value
0 1 Web Logic Value 1 S
1 1 Web Logic Value 2 I
2 1 Web Logic Value3 Complete
3 2 Person Voice Value 1 >20
4 2 Person Voice Value 2 P
5 3 Mail OCR Value 1 A
6 3 Mail OCR Value 2 I
7 4 Dictation Understandable Value 1 S
8 4 Dictation Understandable Value 2 I
9 5 Web Logic Value 1 R
10 5 Web Logic Value 2 S
# mask all rows where "Value" is in column DisplayLabel
mask = df['DisplayLabel'].str.contains('Value')
0 False
1 False
2 True
3 True
4 True
5 False
6 False
7 True
8 True
9 False
10 False
11 True
12 True
13 False
14 False
15 True
16 True
17 False
18 False
19 True
20 True
Name: DisplayLabel, dtype: bool
# select all rows which do NOT have "Value" in DisplayLabel
df[~mask]
RecordID DisplayLabel Value
0 1 Source Web
1 1 Test Logic
5 2 Source Person
6 2 Test Voice
9 3 Source Mail
10 3 Test OCR
13 4 Source Dictation
14 4 Test Understandable
17 5 Source Web
18 5 Test Logic
# pivot the values in DisplayLabel to columns
df2 = df[~mask].pivot(index='RecordID', columns='DisplayLabel', values='Value')
DisplayLabel Source Test
RecordID
1 Web Logic
2 Person Voice
3 Mail OCR
4 Dictation Understandable
5 Web Logic
df[mask].rename(columns={'DisplayLabel':'Result'}) # rename the column DisplayLabel to Result
.set_index('RecordID') # set RecordId as index so we can join df2
.join(df2) # join df2 back to our dataframe based RecordId
.reset_index() # reset index so we get RecordId back as column
RecordID Result Value Source Test
0 1 Value 1 S Web Logic
1 1 Value 2 I Web Logic
2 1 Value3 Complete Web Logic
3 2 Value 1 >20 Person Voice
4 2 Value 2 P Person Voice
5 3 Value 1 A Mail OCR
6 3 Value 2 I Mail OCR
7 4 Value 1 S Dictation Understandable
8 4 Value 2 I Dictation Understandable
9 5 Value 1 R Web Logic
10 5 Value 2 S Web Logic
I tried a different approach of first pivot
ing using unstack
and then partially converting wide_to_long
(Sorry if it's not efficient but this seems to get the desired output)
# first converting all long to wide
df2 = df.set_index(['RecordID','DisplayLabel']).unstack()
# flattening the unstacked columns
df2.columns = df2.columns.to_series().str.join('_')
df2.columns = df2.columns.str.replace('Value_','',regex=True) #just removing the junk in the column name
df2 = df2.reset_index() #resetting index to access RecordID
df2 = (pd.melt(df2,id_vars=['RecordID',"Source","Test"],var_name='Result', value_name='Value')
.sort_values(['RecordID',"Source","Test"])
.dropna()
.reset_index())
index RecordID Source Test Result Value
0 0 1 Web Logic Value 1 S
1 5 1 Web Logic Value 2 I
2 10 1 Web Logic Value 3 Complete
3 1 2 Person Voice Value 1 >20
4 6 2 Person Voice Value 2 P
5 2 3 Mail OCR Value 1 A
6 7 3 Mail OCR Value 2 I
7 3 4 Dictation Understandable Value 1 S
8 8 4 Dictation Understandable Value 2 I
9 4 5 Web Logic Value 1 R
10 9 5 Web Logic Value 2 S
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.