Fill missing data and transform rows to columns in Python Pandas
I have a dataframe like this:
import numpy as np
import pandas as pd

df_nba = pd.DataFrame({
    'col1': ['name', np.nan, np.nan, 'course', 'eca', 'pages',
             'name', np.nan, np.nan, 'course', 'pages',
             'name', np.nan, np.nan, 'course', 'eca', 'pages',
             'name', np.nan, np.nan, 'course', 'eca', 'pages',
             'name', np.nan, np.nan, 'course', 'pages',
             'name', np.nan, np.nan, 'course', 'eca', 'pages',
             ],
    'col2': ['jim', 'California', 'M', 'Biology', 'Biology Club', 1,
             'jim', 'California', 'M', 'Physics', 2,
             'greg', 'Arizona', 'M', 'Geography', 'Jazz Band', 3,
             'greg', 'Arizona', 'M', 'Physics', 'Photography', 4,
             'jesse', 'Washington', 'F', 'Economics', 5,
             'jesse', 'Washington', 'F', 'Literature', 'Photography', 6,
             ]})
col1 col2
0 name jim
1 NaN California
2 NaN M
3 course Biology
4 eca Biology Club
5 pages 1
6 name jim
7 NaN California
8 NaN M
9 course Physics
10 pages 2
11 name greg
12 NaN Arizona
13 NaN M
14 course Geography
15 eca Jazz Band
16 pages 3
17 name greg
18 NaN Arizona
19 NaN M
20 course Physics
21 eca Photography
22 pages 4
23 name jesse
24 NaN Washington
25 NaN F
26 course Economics
27 pages 5
28 name jesse
29 NaN Washington
30 NaN F
31 course Literature
32 eca Photography
33 pages 6
For each person, the two rows that follow the name row always have a missing label in col1. Can I first fill those labels with states and gender, and then reshape the data into a column-wise view?
The expected output looks like this:
name states gender course eca pages
0 jim California M Biology Biology Club 1
1 jim California M Physics NaN 2
2 greg Arizona M Geography Jazz Band 3
3 greg Arizona M Physics Photography 4
4 jesse Washington F Economics NaN 5
5 jesse Washington F Literature Photography 6
You can build a mask that is True where col1 equals "name", then use shift on it (by one and by two rows) to write the right labels, "states" and "gender", into col1. Then reshape the result with unstack, after calling set_index with two levels: the cumsum of the mask, which increases by one at every "name" row and so numbers the records, and col1 itself.
# get a mask that is True where col1 is 'name'
mask = df_nba['col1'].eq('name')
# fill the two following NaN labels with the right values
df_nba.loc[mask.shift(1, fill_value=False), 'col1'] = 'states'
df_nba.loc[mask.shift(2, fill_value=False), 'col1'] = 'gender'
# reshape: index by (record number, label), then move the labels to columns
df_ = (df_nba.set_index([mask.cumsum(),
                         df_nba['col1'].to_numpy()])
       ['col2'].unstack()
       .rename_axis(None)  # cosmetic: drop the index name
       [['name', 'states', 'gender', 'course', 'eca', 'pages']]  # reorder the columns
       )
print(df_)
name states gender course eca pages
1 jim California M Biology Biology Club 1
2 jim California M Physics NaN 2
3 greg Arizona M Geography Jazz Band 3
4 greg Arizona M Physics Photography 4
5 jesse Washington F Economics NaN 5
6 jesse Washington F Literature Photography 6
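To make the mask logic above easier to follow, here is a minimal sketch that prints the intermediate values on a tiny two-record sample. The sample data and the names small, after_name, two_after and record_id are made up for illustration and are not part of the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical two-record sample in the same layout as df_nba.
small = pd.DataFrame({
    'col1': ['name', np.nan, np.nan, 'course', 'pages',
             'name', np.nan, np.nan, 'course', 'eca', 'pages'],
    'col2': ['ann', 'Ohio', 'F', 'Math', 1,
             'bob', 'Utah', 'M', 'Art', 'Chess', 2],
})

mask = small['col1'].eq('name')                # True on every 'name' row
after_name = mask.shift(1, fill_value=False)   # True one row after each 'name'
two_after = mask.shift(2, fill_value=False)    # True two rows after each 'name'
record_id = mask.cumsum()                      # grows by one at every 'name' row

print(pd.DataFrame({'col1': small['col1'], 'mask': mask,
                    'after_name': after_name, 'two_after': two_after,
                    'record_id': record_id}))
```

The after_name and two_after columns mark exactly the rows that receive "states" and "gender", and record_id is the value used as the first index level before unstack.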
This is not an efficient solution, but it does what you want, provided you supply col1 and col2 as lists:
# assumption: the two lists are taken from the question's dataframe
col1 = df_nba['col1'].tolist()
col2 = df_nba['col2'].tolist()

# fill the missing labels in col1
for i in range(1, len(col1)):
    if col1[i-1] == "name":
        col1[i] = "states"
    if col1[i-1] == "states":
        col1[i] = "gender"

# build one dictionary per record; a record ends at its "pages" row
data = []
temp = {}
for i in range(len(col1)):
    temp[col1[i]] = col2[i]
    if col1[i] == "pages":
        data.append(temp)
        temp = {}

pd.DataFrame(data)
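The last step relies on how pd.DataFrame treats a list of dictionaries: a key that is absent from one dictionary becomes NaN in that row. A small sketch with two hand-written records (the records list is hypothetical, shaped like the data list built above):

```python
import pandas as pd

# Two records as the loop above would produce them; the second has no 'eca' key.
records = [
    {'name': 'jim', 'states': 'California', 'gender': 'M',
     'course': 'Biology', 'eca': 'Biology Club', 'pages': 1},
    {'name': 'jim', 'states': 'California', 'gender': 'M',
     'course': 'Physics', 'pages': 2},
]

out = pd.DataFrame(records)
print(out)
```

This is why records without an eca row need no special handling in the loop.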
You can do the following:
name_index = df_nba.loc[df_nba['col1'] == 'name'].index
for i in name_index:
    # label-based slice, so both i+1 and i+2 are included
    df_nba.loc[i+1:i+2, 'col1'] = ['states', 'gender']
Now to get the transposed table:
pivot = df_nba.pivot(columns='col1')
pivot_nba = pd.DataFrame()
for col in pivot['col2']:
    # drop the NaN gaps in each column and renumber the rows from 0
    pivot_nba[col] = pivot['col2'][col].dropna().reset_index(drop=True)
pivot_nba
course eca gender name pages states
0 Biology Biology Club M jim 1 California
1 Physics Jazz Band M jim 2 California
2 Geography Photography M greg 3 Arizona
3 Physics Photography M greg 4 Arizona
4 Economics NaN F jesse 5 Washington
5 Literature NaN F jesse 6 Washington
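Note that in the output above the eca values have moved up: jim's Physics row shows "Jazz Band" although that record has no eca in the input, and both NaN values end up in the last two rows. This happens because dropna() compacts each column independently, so a record with a missing label shifts every later value in that column. One possible fix, sketched here on a hypothetical two-record sample (the record column and the wide name are assumptions for illustration, not part of the original answer), is to number the records first and pivot on that number:

```python
import pandas as pd

# Hypothetical sample with labels already filled; the first record has no 'eca' row.
df = pd.DataFrame({
    'col1': ['name', 'states', 'gender', 'course', 'pages',
             'name', 'states', 'gender', 'course', 'eca', 'pages'],
    'col2': ['jim', 'California', 'M', 'Physics', 2,
             'greg', 'Arizona', 'M', 'Geography', 'Jazz Band', 3],
})

# Number the records: the counter increases at every 'name' row.
df['record'] = df['col1'].eq('name').cumsum()

# Pivot on the record number so a missing 'eca' stays NaN in its own row.
wide = df.pivot(index='record', columns='col1', values='col2')
wide = wide[['name', 'states', 'gender', 'course', 'eca', 'pages']].reset_index(drop=True)
wide.columns.name = None  # cosmetic: drop the 'col1' label on the columns
print(wide)
```

With the record number as the index, each value stays attached to the record it came from.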