
Fill missing data and transform rows to columns in Python Pandas

I have a dataframe like this:

import numpy as np
import pandas as pd

df_nba = pd.DataFrame({'col1': ['name', np.nan, np.nan, 'course', 'eca', 'pages',
                                'name', np.nan, np.nan, 'course', 'pages',
                                'name', np.nan, np.nan, 'course', 'eca', 'pages',
                                'name', np.nan, np.nan, 'course', 'eca', 'pages',
                                'name', np.nan, np.nan, 'course', 'pages',
                                'name', np.nan, np.nan, 'course', 'eca', 'pages',
                               ],
                       'col2': ['jim', 'California', 'M', 'Biology', 'Biology Club', 1,
                                'jim', 'California', 'M', 'Physics', 2,
                                'greg', 'Arizona', 'M', 'Geography', 'Jazz Band', 3,
                                'greg', 'Arizona', 'M', 'Physics', 'Photography', 4,
                                'jesse', 'Washington', 'F', 'Economics', 5,
                                'jesse', 'Washington', 'F', 'Literature', 'Photography', 6,
                               ]})

col1    col2
0   name    jim
1   NaN California
2   NaN M
3   course  Biology
4   eca Biology Club
5   pages   1
6   name    jim
7   NaN California
8   NaN M
9   course  Physics
10  pages   2
11  name    greg
12  NaN Arizona
13  NaN M
14  course  Geography
15  eca Jazz Band
16  pages   3
17  name    greg
18  NaN Arizona
19  NaN M
20  course  Physics
21  eca Photography
22  pages   4
23  name    jesse
24  NaN Washington
25  NaN F
26  course  Economics
27  pages   5
28  name    jesse
29  NaN Washington
30  NaN F
31  course  Literature
32  eca Photography
33  pages   6

For each person, the two rows that follow the name row always have a missing label in col1. Can I first fill those labels in as states and gender, and then transpose the data into a column-wise view?

The expected output looks like this:

        name      states     gender   course           eca           pages
                                      
0       jim      California    M       Biology       Biology Club     1
1       jim      California    M       Physics       NaN              2
2       greg     Arizona       M       Geography     Jazz Band        3
3       greg     Arizona       M       Physics       Photography      4
4      jesse     Washington    F       Economics     NaN              5
5      jesse     Washington    F       Literature    Photography      6

You can build a mask marking where col1 equals "name", then use shift on that mask to fill in the correct labels in col1. Then reshape the result with unstack, after calling set_index with two keys: the cumsum of the mask (a value that increments at every "name" in col1) and col1 itself.

# get a mask that is True where col1 is 'name'
mask = df_nba['col1'].eq('name')

# fill the two following NaN labels with the right values
df_nba.loc[mask.shift(1, fill_value=False), 'col1'] = 'states'
df_nba.loc[mask.shift(2, fill_value=False), 'col1'] = 'gender'

# reshape: one row per record (mask.cumsum()), one column per label in col1
df_ = (df_nba.set_index([mask.cumsum(),
                         df_nba['col1'].to_numpy()])
             ['col2'].unstack()
             .rename_axis(None)  # cosmetic: drop the index name
             [['name', 'states', 'gender', 'course', 'eca', 'pages']]  # reorder the columns
      )

print(df_)
    name      states gender      course           eca pages
1    jim  California      M     Biology  Biology Club     1
2    jim  California      M     Physics           NaN     2
3   greg     Arizona      M   Geography     Jazz Band     3
4   greg     Arizona      M     Physics   Photography     4
5  jesse  Washington      F   Economics           NaN     5
6  jesse  Washington      F  Literature   Photography     6
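
To make the intermediate steps easier to follow, here is a minimal sketch on a shortened, hypothetical copy of the question's data (the names toy, is_states, is_gender and record_id are introduced only for illustration). It shows which rows the shifted masks select and how the cumulative sum gives every row of a record the same id.

```python
import numpy as np
import pandas as pd

# Toy input (hypothetical, shorter than the question's data): two records,
# the second one has no 'eca' row.
toy = pd.DataFrame({
    'col1': ['name', np.nan, np.nan, 'course', 'eca', 'pages',
             'name', np.nan, np.nan, 'course', 'pages'],
    'col2': ['jim', 'California', 'M', 'Biology', 'Biology Club', 1,
             'jim', 'California', 'M', 'Physics', 2],
})

mask = toy['col1'].eq('name')

# Shifting the mask down marks the rows one and two positions after each 'name'.
is_states = mask.shift(1, fill_value=False)
is_gender = mask.shift(2, fill_value=False)

# The running sum of the mask increases by one at every 'name',
# so all rows belonging to one record share the same id.
record_id = mask.cumsum()

print(is_states[is_states].index.tolist())  # [1, 7]
print(is_gender[is_gender].index.tolist())  # [2, 8]
print(record_id.tolist())                   # [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2]
```

Because the record id comes from the 'name' rows rather than from a fixed block length, a record with no eca row still ends up in its own row after unstack, with NaN in the eca column.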

This is not an efficient solution, but it does what you want if you provide col1 and col2 as lists:

# assumption: the two lists are taken from the question's dataframe
col1 = df_nba['col1'].tolist()
col2 = df_nba['col2'].tolist()

# fill the missing labels in col1
for i in range(1, len(col1)):
    if col1[i-1] == "name":
        col1[i] = "states"
    if col1[i-1] == "states":
        col1[i] = "gender"

# build one dictionary per record; a record ends at its "pages" row
data = []
temp = {}
for i in range(len(col1)):
    temp[col1[i]] = col2[i]
    if col1[i] == "pages":
        data.append(temp)
        temp = {}

pd.DataFrame(data)

You can do the following:

name_index = df_nba.loc[df_nba['col1']=='name'].index
for i in name_index:
    df_nba.loc[i+1:i+2, 'col1'] = ['states', 'gender']

Now to get the transposed table:

pivot = df_nba.pivot(columns = 'col1')
pivot_nba = pd.DataFrame()
for col in pivot['col2']:
    pivot_nba[col] = pivot['col2'][col].dropna().reset_index(drop = True)
pivot_nba

    course        eca               gender  name    pages   states
0   Biology       Biology Club      M       jim     1       California
1   Physics       Jazz Band         M       jim     2       California
2   Geography     Photography       M       greg    3       Arizona
3   Physics       Photography       M       greg    4       Arizona
4   Economics     NaN               F       jesse   5       Washington
5   Literature    NaN               F       jesse   6       Washington
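
Note that in the output above the eca values have moved up: dropping NaN from each column independently loses track of which record a value belongs to, so "Jazz Band" lands on jim's Physics row, which has no eca in the input. A minimal sketch of one way to keep the rows aligned, assuming the labels in col1 have already been filled in as shown earlier: number the records by counting the "name" rows, then pivot on that number. The column name record and the shortened sample data are hypothetical, introduced only for this illustration.

```python
import pandas as pd

# Shortened sample with the labels in col1 already filled in.
df = pd.DataFrame({
    'col1': ['name', 'states', 'gender', 'course', 'eca', 'pages',
             'name', 'states', 'gender', 'course', 'pages',
             'name', 'states', 'gender', 'course', 'eca', 'pages'],
    'col2': ['jim', 'California', 'M', 'Biology', 'Biology Club', 1,
             'jim', 'California', 'M', 'Physics', 2,
             'greg', 'Arizona', 'M', 'Geography', 'Jazz Band', 3],
})

# 'record' is a hypothetical helper column: it increases by one at every 'name' row.
df['record'] = df['col1'].eq('name').cumsum()

# Pivot on the record id so every value stays in the row of its own record.
aligned = (df.pivot(index='record', columns='col1', values='col2')
             .reindex(columns=['name', 'states', 'gender', 'course', 'eca', 'pages'])
             .reset_index(drop=True)
             .rename_axis(None, axis=1))  # cosmetic: drop the columns name

print(aligned)
```

With this sample, the second row keeps NaN in eca and "Jazz Band" stays on greg's row.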
