How to rearrange the rows of a dataframe so that each row starts with the same string

I have this dataframe:

mp4             mp3              txt               csv
123IT_DB1.mp4   123IT_DB1.mp3    123IT_DB1.txt     123IT_FDG_DB1.csv
NaN             123IT_DB1_2.mp3  NaN               NaN
123IT_DB1_2.mp4 NaN              NaN               NaN
NaN             NaN              123IT_DB_2.txt    NaN
NaN             NaN              NaN               123IT_GUY_DB1_2.csv
234IT_DB1.mp4   NaN              234IT_DB1.txt     234IT_FDG_DB1.csv 
234IT_DB1_2.mp4 234IT_DB1.mp3    NaN               NaN
345IT_DB1.mp4   345IT_DB1.mp3    345IT_DB1.txt     345IT_FDG_DB1.csv    
345IT_DB1_2.mp4 345IT_DB1_2.mp3  NaN               NaN
345IT_DB1_3.mp4 NaN              NaN               NaN
456IT_DB1.mp4   456IT_DB1.mp3    456IT_DB1.txt     456_DB1.csv

I want to rearrange this dataframe so that all values sharing the same prefix (the part before the first underscore) are on the same row. However, if there is more than one value that starts with that prefix, then the extra row should contain only that element and the rest of the columns should be blank. The resulting output should look like this:

    mp4             mp3             txt               csv
    123IT_DB1.mp4   123IT_DB1.mp3   123IT_DB1.txt     123IT_FDG_DB1.csv
    123IT_DB1_2.mp4 123IT_DB1_2.mp3 123IT_DB_2.txt    123IT_2_DB1.csv
    234IT_DB1.mp4   234IT_DB1.mp3   234IT_DB1.txt     234IT_FDG_DB1.csv 
    234IT_DB1_2.mp4 NaN             NaN               NaN
    345IT_DB1.mp4   345IT_DB1.mp3   345IT_DB1.txt     345IT_FDG_DB1.csv    
    345IT_DB1_2.mp4 345IT_DB1_2.mp3 NaN               NaN
    345IT_DB1_3     NaN             NaN               NaN
    456IT_DB1.mp4   456IT_DB1.mp3   456IT_DB1.txt     456_DB1.csv

As you can see, I can't just delete the NaNs because I need some of them to stay. Any help would be much appreciated.

To get to your target:

  • reshape the table into a simple set of rows
  • generate grouping columns: h is the part before the first underscore, t is the part after the last underscore
  • sort and groupby() on these, then use cumcount to get an incremental number for each file within its group
  • reshape back into a table
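
The numbering step in the list above can be seen in isolation on toy data. This is only a minimal sketch: the `files` frame and its contents are made up for illustration, and only `groupby(...).cumcount()` is the pandas call being shown.

```python
import pandas as pd

# Toy "long" table: one file per row, with its grouping key h.
files = pd.DataFrame({
    "h": ["123IT", "123IT", "123IT", "234IT"],
    "o": ["123IT_DB1.mp3", "123IT_DB1.mp4", "123IT_DB1.txt", "234IT_DB1.mp4"],
})

# cumcount numbers the rows inside each group starting at 0;
# adding 1 gives the target column position 1, 2, 3, ...
files["col"] = files.groupby("h").cumcount() + 1
print(files)
```

Here the three `123IT` files get positions 1, 2, 3 and the single `234IT` file restarts at 1, which is what lets the final unstack place them side by side.
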
import io
import pandas as pd

# raw string so that \s is passed to the regex engine unchanged
df = pd.read_csv(io.StringIO("""1               2                3                  4
123IT_DB1.mp4   123IT_DB1.mp3    123IT_DB1.txt     123IT_FDG_DB1.csv
NaN             123IT_DB1_2.mp3  NaN               NaN
123IT_DB1_2.mp4 NaN              NaN               NaN
NaN             NaN              123IT_DB_2.txt    NaN
NaN             NaN              NaN               123IT_GUY_DB1_2.csv
234IT_DB1.mp4   NaN              234IT_DB1.txt     234IT_FDG_DB1.csv
234IT_DB1_2.mp4 234IT_DB1.mp3    NaN               NaN
345IT_DB1.mp4   345IT_DB1.mp3    345IT_DB1.txt     345IT_FDG_DB1.csv
345IT_DB1_2.mp4 345IT_DB1_2.mp3  NaN               NaN
345IT_DB1_3.mp4 NaN              NaN               NaN
456IT_DB1.mp4   456IT_DB1.mp3    456IT_DB1.txt     456_DB1.csv"""), sep=r"\s+")

# change from a table to a list, create columns that are the head & tail
df2 = df.rename_axis("col", axis=1).unstack().reset_index(drop=True).dropna().apply(lambda s: {
    "h":s.split(".")[0].split("_")[0],
    "t":s.split(".")[0].split("_")[-1],
    "o":s}).apply(pd.Series).sort_values(["h","t","o"])

# work out the position of each file within its (h, t) group, then transform back into a table
df2 = (df2.assign(col=df2.groupby(["h", "t"]).cumcount() + 1)
          .set_index(["col", "h", "t"]).unstack(0)
          .reset_index(drop=True).droplevel(0, axis=1))

Output:

   1                2                3               4
0  123IT_DB1_2.mp3  123IT_DB1_2.mp4  123IT_DB_2.txt  123IT_GUY_DB1_2.csv
1  123IT_DB1.mp3    123IT_DB1.mp4    123IT_DB1.txt   123IT_FDG_DB1.csv
2  234IT_DB1_2.mp4  nan              nan             nan
3  234IT_DB1.mp3    234IT_DB1.mp4    234IT_DB1.txt   234IT_FDG_DB1.csv
4  345IT_DB1_2.mp3  345IT_DB1_2.mp4  nan             nan
5  345IT_DB1_3.mp4  nan              nan             nan
6  345IT_DB1.mp3    345IT_DB1.mp4    345IT_DB1.txt   345IT_FDG_DB1.csv
7  456_DB1.csv      nan              nan             nan
8  456IT_DB1.mp3    456IT_DB1.mp4    456IT_DB1.txt   nan
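
Rows 7 and 8 in the output above are separate because `456_DB1.csv` and `456IT_DB1.mp3` have different heads. To check which group a given file name falls into, the key logic of the lambda can be pulled out into a small helper. `head_tail` is a hypothetical name used only for this sketch; the splitting itself is the same as in the code above.

```python
def head_tail(name):
    """Return (h, t) for a file name, using the same splits as the lambda above."""
    stem = name.split(".")[0]   # drop the extension
    parts = stem.split("_")
    return parts[0], parts[-1]  # before first underscore, after last underscore

for name in ["123IT_GUY_DB1_2.csv", "456IT_DB1.mp3", "456_DB1.csv"]:
    print(name, head_tail(name))
```

`456_DB1.csv` gets the head `456` rather than `456IT`, so it can never share a row with the other `456IT` files unless the file is renamed or the key is computed differently.
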

Updated: grouping on just the head of the name

# change from a table to a list, create columns that are the head 
df2 = df.rename_axis("col", axis=1).unstack().reset_index(drop=True).dropna().apply(lambda s: {
    "h":s.split(".")[0].split("_")[0],
    "o":s}).apply(pd.Series).sort_values(["h","o"])

# work out the position of each file within its h group, then transform back into a table
df2 = (df2.assign(col=df2.groupby(["h"]).cumcount() + 1)
          .set_index(["col", "h"]).unstack(0)
          .reset_index(drop=True).droplevel(0, axis=1))

   1              2              3              4                5                  6                7                  8
0  123IT_DB1.mp3  123IT_DB1.mp4  123IT_DB1.txt  123IT_DB1_2.mp3  123IT_DB1_2.mp4    123IT_DB_2.txt   123IT_FDG_DB1.csv  123IT_GUY_DB1_2.csv
1  234IT_DB1.mp3  234IT_DB1.mp4  234IT_DB1.txt  234IT_DB1_2.mp4  234IT_FDG_DB1.csv  nan              nan                nan
2  345IT_DB1.mp3  345IT_DB1.mp4  345IT_DB1.txt  345IT_DB1_2.mp3  345IT_DB1_2.mp4    345IT_DB1_3.mp4  345IT_FDG_DB1.csv  nan
3  456_DB1.csv    nan            nan            nan              nan                nan              nan                nan
4  456IT_DB1.mp3  456IT_DB1.mp4  456IT_DB1.txt  nan              nan                nan              nan                nan
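
Both versions above number the files within each group, so values land in columns 1, 2, 3, ... rather than under their original file-type column. If the columns should keep their type, as in the question's expected output, one possible variant is to number the files within each (head, column) pair instead. This is a sketch and not part of the answer above: the small input frame and the names `tidy`, `ext` and `k` are illustrative assumptions.

```python
import pandas as pd

# Small stand-in for the question's dataframe (None marks a missing value).
df = pd.DataFrame({
    "mp4": ["123IT_DB1.mp4", None, "123IT_DB1_2.mp4", "234IT_DB1.mp4"],
    "mp3": ["123IT_DB1.mp3", "123IT_DB1_2.mp3", None, None],
})

# Long format: one row per file, remembering which column it came from.
tidy = df.melt(var_name="ext", value_name="name").dropna()
tidy["h"] = tidy["name"].str.split("_", n=1).str[0]  # part before the first underscore

# Number the files inside each (head, original column) pair.
tidy = tidy.sort_values(["h", "ext", "name"])
tidy["k"] = tidy.groupby(["h", "ext"]).cumcount()

# Back to wide: one row per (head, k), one column per original column.
wide = (tidy.set_index(["h", "k", "ext"])["name"]
            .unstack("ext")
            .reindex(columns=df.columns)
            .reset_index(drop=True))
print(wide)
```

With this input, the two `123IT` rows are filled in both columns, while the `234IT` row has a value under `mp4` and NaN under `mp3`. Sorting by name decides which files end up side by side, so whether that pairing is the intended one depends on the naming scheme.
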
