How to combine multiple rows of same category to one in pandas?

I'm trying to get from table 1 to table 2 in the image, but I can't seem to get it right. I tried a pivot table to turn columns A-D from rows into columns. Then I tried groupby, but instead of giving me a single row it messed up my dataframe.

You can fill the null values in each column with that column's non-null value and then drop the duplicates:

With:

import numpy as np
import pandas as pd

df = pd.DataFrame([["A", np.nan, np.nan, "Y", "Z"],
                   [np.nan, "B", np.nan, "Y", "Z"],
                   [np.nan, np.nan, "C", "Y", "Z"]], columns=list("ABCDE"))
df
     A    B    C  D  E
0    A  NaN  NaN  Y  Z
1  NaN    B  NaN  Y  Z
2  NaN  NaN    C  Y  Z

df.ffill().bfill().drop_duplicates()
   A  B  C  D  E
0  A  B  C  Y  Z

The intermediate step, df.ffill().bfill(), gives:

   A  B  C  D  E
0  A  B  C  Y  Z
1  A  B  C  Y  Z
2  A  B  C  Y  Z
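
If the frame holds more than one category, calling ffill().bfill() on the whole frame would carry values across category boundaries. Below is a minimal sketch of a per-category variant. It assumes a column that identifies the category, here given the hypothetical name key (it is not part of the example above), and relies on GroupBy.first(), which returns the first non-null value of each column within each group.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "key" marks which rows belong to the same category
df = pd.DataFrame({
    "key": ["k1", "k1", "k1", "k2", "k2"],
    "A": ["A", np.nan, np.nan, "P", np.nan],
    "B": [np.nan, "B", np.nan, np.nan, "Q"],
    "C": [np.nan, np.nan, "C", "R", np.nan],
})

# first() skips nulls, so each group collapses to one row
# holding the first non-null value of every column
combined = df.groupby("key", as_index=False).first()
print(combined)
```

If a column has several different non-null values within one group, only the first is kept, so this is suitable only when each group has at most one real value per column.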

As per your comment, you could define a function that fills the missing values of the first row with the unique value found elsewhere in the same column.

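def fillna_uniq(df, col):
    # Accept either a single column name or a list of names
    cols = col if isinstance(col, list) else [col]
    for c in cols:
        # Copy the first non-null value of the column into the first row
        df.loc[df.index[0], c] = df[c].dropna().iloc[0]
    # Return only the first row; note that df itself has been modified
    return df.iloc[[0]]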

You could then do:

fillna_uniq(df.copy(), ["B", "C", "D"])
       A  B   C     D       E     F
0  Hello  I  am  lost  Pandas  Data

I think this is a bit faster. You can also modify your df in place by passing the dataframe itself rather than a copy.
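
To illustrate the difference between passing a copy and passing the frame itself, here is a small self-contained sketch. The sample frame is the A to F example used elsewhere on this page. The variable names row, b_after_copy_call and b_after_inplace_call are only illustrative.

```python
import numpy as np
import pandas as pd

def fillna_uniq(df, col):
    cols = col if isinstance(col, list) else [col]
    for c in cols:
        # Write the first non-null value of the column into the first row
        df.loc[df.index[0], c] = df[c].dropna().iloc[0]
    return df.iloc[[0]]

df = pd.DataFrame({"A": ["Hello", np.nan, np.nan, np.nan],
                   "B": [np.nan, "I", np.nan, np.nan],
                   "C": [np.nan, np.nan, "am", np.nan],
                   "D": [np.nan, np.nan, np.nan, "lost"],
                   "E": ["Pandas"] * 4,
                   "F": ["Data"] * 4})

# Passing a copy: the combined row is returned and df is left untouched
row = fillna_uniq(df.copy(), ["B", "C", "D"])
b_after_copy_call = df.loc[0, "B"]      # still null

# Passing df itself: its first row is filled in place, other rows remain
fillna_uniq(df, ["B", "C", "D"])
b_after_inplace_call = df.loc[0, "B"]   # now filled
```

Note that the in-place call only fills the first row of df. The remaining rows are still there, so the returned one-row frame is usually what you want to keep.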

HTH

One way to do this is with apply and dropna:

Assuming the blanks in your table above are really nulls:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['Hello', np.nan, np.nan, np.nan],
                   'B': [np.nan, 'I', np.nan, np.nan],
                   'C': [np.nan, np.nan, 'am', np.nan],
                   'D': [np.nan, np.nan, np.nan, 'lost'],
                   'E': ['Pandas'] * 4,
                   'F': ['Data'] * 4})

print(df)
       A    B    C     D       E     F
0  Hello  NaN  NaN   NaN  Pandas  Data
1    NaN    I  NaN   NaN  Pandas  Data
2    NaN  NaN   am   NaN  Pandas  Data
3    NaN  NaN  NaN  lost  Pandas  Data

Using apply, you can apply a lambda function to each column of the dataframe, first dropping the null values and then taking the max:

df.apply(lambda x: x.dropna().max()).to_frame().T

       A  B   C     D       E     F
0  Hello  I  am  lost  Pandas  Data
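
Taking the max works here because every column contains a single distinct non-null value. If a column held several different strings, max would return the lexicographically largest one. The sketch below shows a variant that keeps the first non-null value instead. The frame and the names by_max and by_first are hypothetical, chosen only to show the difference.

```python
import numpy as np
import pandas as pd

# Hypothetical frame in which column A has two different non-null values
df = pd.DataFrame({"A": ["Hello", np.nan, "World", np.nan],
                   "B": [np.nan, "I", np.nan, np.nan],
                   "E": ["Pandas"] * 4})

# max() picks the lexicographically largest string in each column
by_max = df.apply(lambda x: x.dropna().max()).to_frame().T

# iloc[0] after dropna() picks the first non-null value in each column
by_first = df.apply(lambda x: x.dropna().iloc[0]).to_frame().T

print(by_max)
print(by_first)
```

The iloc[0] variant raises an IndexError for a column that is entirely null, so such columns would need to be handled separately.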

Or, if your blanks are really empty strings, you can do this:

df1 = df.replace(np.nan,'')
df1
       A  B   C     D       E     F
0  Hello               Pandas  Data
1         I            Pandas  Data
2            am        Pandas  Data
3                lost  Pandas  Data

df1.apply(lambda x: x[x!=''].max()).to_frame().T

       A  B   C     D       E     F
0  Hello  I  am  lost  Pandas  Data
