Pandas data frame, duplicated rows with only two columns with unique information, move these columns to new columns in previous row

I am working with a few CSV datasets in order to create a new synthesized output that tells the user which data types from certain surveys need to be archived. After some normalizing and merging, I'm left with two final data frames to merge:

df1
    Cruise ID   needs_ctd   needs_adcp
0   1505          FALSE         TRUE
1   1506          FALSE         TRUE

df2
    Cruise ID   needs_wc    WC Instrument
0   NF1505         TRUE         EM710
1   NF1505         TRUE         Reson7125
2   NF1506         TRUE         EK60

Currently, I'm merging using: df_out = df1.merge(df2, how="left", on="Cruise ID")

Which gives the following result:

df_out
    Cruise ID   needs_ctd   needs_adcp  needs_wc    WC Instrument 
0   1505           FALSE        TRUE      TRUE          EM710
1   1505           FALSE        TRUE      TRUE          Reson7125
2   1506           FALSE        TRUE      TRUE          EK60

The problem here is that users may be confused about why "needs_adcp" is repeated on two lines. So I'd like to instead move the second WC Instrument information into new columns on the 1505 row.

What I'd like to see instead:

df_out
    Cruise ID   needs_ctd   needs_adcp  needs_wc    WC Instrument   needs_wc2   WC Instrument2
0   1505           FALSE        TRUE      TRUE          EM710        TRUE           Reson7125
1   1506           FALSE        TRUE      TRUE          EK60    

Thank you for your help!

I don't think it's a good idea to have two columns with the same name "WC Instrument" in a dataframe. Maybe combine EM710 and Reson7125 into a list in your df2 so that each Cruise ID is unique in df2.

For how to combine EM710 and Reson7125 into a list, see: How to use groupby to concatenate strings in python pandas?
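As a minimal sketch of that suggestion: the snippet below collapses df2 to one row per cruise with groupby, gathering the instruments into a list (or a joined string). The sample frame is rebuilt by hand, and it assumes "Cruise ID" has already been normalized to the same values used in df1 (no "NF" prefix); the names df2_unique and df2_joined are only illustrative.

```python
import pandas as pd

# Sample data rebuilt from the question; Cruise ID is assumed to be
# already normalized so that it matches df1 (assumption, not from the post).
df2 = pd.DataFrame({
    "Cruise ID": [1505, 1505, 1506],
    "needs_wc": [True, True, True],
    "WC Instrument": ["EM710", "Reson7125", "EK60"],
})

# One row per cruise: needs_wc is True if any row was True,
# and all instruments for the cruise are collected into a list.
df2_unique = (
    df2.groupby("Cruise ID", as_index=False)
       .agg({"needs_wc": "any", "WC Instrument": list})
)

# Same idea, but with the instruments joined into a single string.
df2_joined = (
    df2.groupby("Cruise ID", as_index=False)["WC Instrument"]
       .agg(", ".join)
)

print(df2_unique)
print(df2_joined)
```

Either result has unique Cruise IDs, so a left merge onto df1 would no longer repeat the df1 rows.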

If you are OK with renaming the second (and possibly any following) WC Instrument and needs_wc columns, you can do something like this (I have to admit it's a bit far-fetched and there might be a more elegant way to do it):

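# Number each row within its cruise (0, 1, ...) and use that as a second index level.
df2_reindex = df2.set_index(['Cruise ID', df2.groupby('Cruise ID').cumcount()])

# Pivot each df2 column into one column per occurrence, then merge onto df1.
df_out = df1.merge(
    df2_reindex['WC Instrument']
        .unstack(fill_value='')
        .add_prefix('WC Instrument_')
        .reset_index()
).merge(
    df2_reindex['needs_wc']
        .unstack(fill_value='')
        .add_prefix('needs_wc_')
        .reset_index()
)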

Outputs as expected:

   Cruise ID  needs_ctd  needs_adcp  ... WC Instrument_1 needs_wc_0 needs_wc_1
0      1505      False        True  ...      Reson7125       True       True
1      1506      False        True  ...                      True           

Note that you can make it work without knowing in advance the names of the columns that need a suffix, with something like this:

res = df1.copy()
df2_reindex = df2.set_index(['Cruise ID', df2.groupby('Cruise ID').cumcount()])
for col in df2_reindex.columns:
    res = res.merge(
        df2_reindex[col]
            .unstack(fill_value='')
            .add_prefix(col + '_')
            .reset_index())
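The per-column loop above can probably be collapsed into a single unstack over all of df2's columns, followed by flattening the resulting two-level column index. The following is only a sketch under some assumptions: the frames are rebuilt by hand from the question with Cruise ID already normalized to match, the helper name widen is hypothetical, missing values are left as NaN instead of '', and the merge uses how="left" (as in the question) so cruises absent from df2 are kept.

```python
import pandas as pd

# Sample data rebuilt from the question (Cruise ID assumed already normalized).
df1 = pd.DataFrame({
    "Cruise ID": [1505, 1506],
    "needs_ctd": [False, False],
    "needs_adcp": [True, True],
})
df2 = pd.DataFrame({
    "Cruise ID": [1505, 1505, 1506],
    "needs_wc": [True, True, True],
    "WC Instrument": ["EM710", "Reson7125", "EK60"],
})


def widen(df, key):
    """Hypothetical helper: one row per key, repeated rows spread into suffixed columns."""
    # 0, 1, 2, ... for each successive row of the same key.
    occurrence = df.groupby(key).cumcount()
    # Move the occurrence number from the index into the columns:
    # columns become a MultiIndex of (original column, occurrence).
    wide = df.set_index([key, occurrence]).unstack()
    # Flatten (column, occurrence) pairs into names like "needs_wc_0".
    wide.columns = [f"{col}_{n}" for col, n in wide.columns]
    # Group the columns by occurrence so each instrument sits next to its flag.
    value_cols = [c for c in df.columns if c != key]
    n_max = int(occurrence.max()) + 1
    ordered = [f"{col}_{n}" for n in range(n_max) for col in value_cols]
    return wide[ordered].reset_index()


df_out = df1.merge(widen(df2, "Cruise ID"), how="left", on="Cruise ID")
print(df_out)
```

Leaving the gaps as NaN keeps them distinguishable from real values; a later fillna('') on df_out would give blank cells like those in the output shown earlier.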
