
How to rearrange the rows of a dataframe so that each row starts with the same string

I have this dataframe:

mp4             mp3              txt               csv
123IT_DB1.mp4   123IT_DB1.mp3    123IT_DB1.txt     123IT_FDG_DB1.csv
NaN             123IT_DB1_2.mp3  NaN               NaN
123IT_DB1_2.mp4 NaN              NaN               NaN
NaN             NaN              123IT_DB_2.txt    NaN
NaN             NaN              NaN               123IT_GUY_DB1_2.csv
234IT_DB1.mp4   NaN              234IT_DB1.txt     234IT_FDG_DB1.csv 
234IT_DB1_2.mp4 234IT_DB1.mp3    NaN               NaN
345IT_DB1.mp4   345IT_DB1.mp3    345IT_DB1.txt     345IT_FDG_DB1.csv    
345IT_DB1_2.mp4 345IT_DB1_2.mp3  NaN               NaN
345IT_DB1_3.mp4 NaN              NaN               NaN
456IT_DB1.mp4   456IT_DB1.mp3    456IT_DB1.txt     456_DB1.csv

I want to rearrange this dataframe so that all values sharing the same prefix (the part before the first underscore) are on the same row. However, if more than one value starts with that prefix, then the extra value should go on its own row and the rest of the columns in that row should be blank. The resulting output should look like this:

    mp4             mp3             txt               csv
    123IT_DB1.mp4   123IT_DB1.mp3   123IT_DB1.txt     123IT_FDG_DB1.csv
    123IT_DB1_2.mp4 123IT_DB1_2.mp3 123IT_DB_2.txt    123IT_2_DB1.csv
    234IT_DB1.mp4   234IT_DB1.mp3   234IT_DB1.txt     234IT_FDG_DB1.csv 
    234IT_DB1_2.mp4 NaN             NaN               NaN
    345IT_DB1.mp4   345IT_DB1.mp3   345IT_DB1.txt     345IT_FDG_DB1.csv    
    345IT_DB1_2.mp4 345IT_DB1_2.mp3 NaN               NaN
    345IT_DB1_3.mp4 NaN             NaN               NaN
    456IT_DB1.mp4   456IT_DB1.mp3   456IT_DB1.txt     456_DB1.csv

As you can see, I can't just delete the NaN's because I need some of them to stay. Any help would be much appreciated.

To get to your target:

  • reshape the table into a simple list of values, one file name per row
  • generate grouping columns: h, the part before the first underscore, and t, the part after the last underscore
  • sort and groupby() on these, then use cumcount to get an incremental number for each file within its group
  • reshape back into a table
import io
import pandas as pd

df = pd.read_csv(io.StringIO("""1               2                3                  4
123IT_DB1.mp4   123IT_DB1.mp3    123IT_DB1.txt     123IT_FDG_DB1.csv
NaN             123IT_DB1_2.mp3  NaN               NaN
123IT_DB1_2.mp4 NaN              NaN               NaN
NaN             NaN              123IT_DB_2.txt    NaN
NaN             NaN              NaN               123IT_GUY_DB1_2.csv
234IT_DB1.mp4   NaN              234IT_DB1.txt     234IT_FDG_DB1.csv 
234IT_DB1_2.mp4 234IT_DB1.mp3    NaN               NaN
345IT_DB1.mp4   345IT_DB1.mp3    345IT_DB1.txt     345IT_FDG_DB1.csv    
345IT_DB1_2.mp4 345IT_DB1_2.mp3  NaN               NaN
345IT_DB1_3.mp4 NaN              NaN               NaN
456IT_DB1.mp4   456IT_DB1.mp3    456IT_DB1.txt     456_DB1.csv"""), sep=r"\s+")

# change from a table to a list, create columns that are the head & tail
df2 = df.rename_axis("col", axis=1).unstack().reset_index(drop=True).dropna().apply(lambda s: {
    "h":s.split(".")[0].split("_")[0],
    "t":s.split(".")[0].split("_")[-1],
    "o":s}).apply(pd.Series).sort_values(["h","t","o"])

# work out the position of each file within its (h, t) group, then transform back into a table
df2 = (
    df2.assign(col=df2.groupby(["h", "t"]).cumcount() + 1)
    .set_index(["col", "h", "t"])
    .unstack(0)
    .reset_index(drop=True)
    .droplevel(0, axis=1)
)

Output:

   1                2                3               4
0  123IT_DB1_2.mp3  123IT_DB1_2.mp4  123IT_DB_2.txt  123IT_GUY_DB1_2.csv
1  123IT_DB1.mp3    123IT_DB1.mp4    123IT_DB1.txt   123IT_FDG_DB1.csv
2  234IT_DB1_2.mp4  nan              nan             nan
3  234IT_DB1.mp3    234IT_DB1.mp4    234IT_DB1.txt   234IT_FDG_DB1.csv
4  345IT_DB1_2.mp3  345IT_DB1_2.mp4  nan             nan
5  345IT_DB1_3.mp4  nan              nan             nan
6  345IT_DB1.mp3    345IT_DB1.mp4    345IT_DB1.txt   345IT_FDG_DB1.csv
7  456_DB1.csv      nan              nan             nan
8  456IT_DB1.mp3    456IT_DB1.mp4    456IT_DB1.txt   nan
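
For readers who find the chained one-liners hard to follow, the same four steps can be spelled out one at a time. This is only a sketch on a shortened sample of the question's table: the names `long_df`, `stem`, `pos` and `wide` are made up for illustration, and `melt` is swapped in for the `unstack()` used above as the way to flatten the table.

```python
import io
import pandas as pd

# Shortened sample of the question's table (an assumption, not the full data).
sample = """mp4 mp3 txt csv
123IT_DB1.mp4 123IT_DB1.mp3 123IT_DB1.txt 123IT_FDG_DB1.csv
NaN 123IT_DB1_2.mp3 NaN NaN
123IT_DB1_2.mp4 NaN NaN NaN
234IT_DB1.mp4 NaN 234IT_DB1.txt NaN
"""
df = pd.read_csv(io.StringIO(sample), sep=r"\s+")

# Step 1: flatten the table into one file name per row, dropping the NaNs.
long_df = df.melt(var_name="col", value_name="name").dropna(subset=["name"])

# Step 2: derive the grouping keys from the name without its extension:
# h = part before the first underscore, t = part after the last underscore.
stem = long_df["name"].str.split(".").str[0]
long_df = long_df.assign(
    h=stem.str.split("_").str[0],
    t=stem.str.split("_").str[-1],
)

# Step 3: sort, then number the files within each (h, t) group starting at 1.
long_df = long_df.sort_values(["h", "t", "name"])
long_df = long_df.assign(pos=long_df.groupby(["h", "t"]).cumcount() + 1)

# Step 4: pivot the running number back out into columns.
wide = (
    long_df.set_index(["h", "t", "pos"])["name"]
    .unstack("pos")
    .reset_index(drop=True)
)
print(wide)
```

As in the output above, the rows come back sorted by (h, t), so a tail such as "2" sorts ahead of "DB1", and the columns are numbered positions rather than the original file-type columns.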

Update: grouping on just the head of the name (the part before the first underscore)

# change from a table to a list, create columns that are the head 
df2 = df.rename_axis("col", axis=1).unstack().reset_index(drop=True).dropna().apply(lambda s: {
    "h":s.split(".")[0].split("_")[0],
    "o":s}).apply(pd.Series).sort_values(["h","o"])

# work out the position of each file within its h group, then transform back into a table
df2 = (
    df2.assign(col=df2.groupby(["h"]).cumcount() + 1)
    .set_index(["col", "h"])
    .unstack(0)
    .reset_index(drop=True)
    .droplevel(0, axis=1)
)

   1              2              3              4                5                  6                7                  8
0  123IT_DB1.mp3  123IT_DB1.mp4  123IT_DB1.txt  123IT_DB1_2.mp3  123IT_DB1_2.mp4    123IT_DB_2.txt   123IT_FDG_DB1.csv  123IT_GUY_DB1_2.csv
1  234IT_DB1.mp3  234IT_DB1.mp4  234IT_DB1.txt  234IT_DB1_2.mp4  234IT_FDG_DB1.csv  nan              nan                nan
2  345IT_DB1.mp3  345IT_DB1.mp4  345IT_DB1.txt  345IT_DB1_2.mp3  345IT_DB1_2.mp4    345IT_DB1_3.mp4  345IT_FDG_DB1.csv  nan
3  456_DB1.csv    nan            nan            nan              nan                nan              nan                nan
4  456IT_DB1.mp3  456IT_DB1.mp4  456IT_DB1.txt  nan              nan                nan              nan                nan
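
If the goal is to keep every file under its own file-type column, as in the question's expected table, one possible variation is to pivot on the original column name instead of a running count. This is a sketch, not part of the answer above: it groups on both the head and the tail, it assumes at most one file per (head, tail, column) combination, and the names `long_df`, `stem` and `by_type` are made up for illustration.

```python
import io
import pandas as pd

# Shortened sample of the question's table (an assumption, not the full data).
sample = """mp4 mp3 txt csv
123IT_DB1.mp4 123IT_DB1.mp3 123IT_DB1.txt 123IT_FDG_DB1.csv
NaN 123IT_DB1_2.mp3 NaN NaN
123IT_DB1_2.mp4 NaN NaN NaN
234IT_DB1.mp4 NaN 234IT_DB1.txt NaN
"""
df = pd.read_csv(io.StringIO(sample), sep=r"\s+")

# Flatten to one file name per row, remembering which column it came from.
long_df = df.melt(var_name="col", value_name="name").dropna(subset=["name"])

# h = part before the first underscore, t = part after the last underscore.
stem = long_df["name"].str.split(".").str[0]
long_df = long_df.assign(
    h=stem.str.split("_").str[0],
    t=stem.str.split("_").str[-1],
)

# Pivot on the original column name. Assumption: (h, t, col) is unique;
# if it is not, unstack raises a ValueError about duplicate entries.
by_type = (
    long_df.set_index(["h", "t", "col"])["name"]
    .unstack("col")
    .reindex(columns=df.columns)  # restore the original column order
    .reset_index(drop=True)
)
print(by_type)
```

Here each (head, tail) pair becomes one row and gaps stay as NaN, so the files are compacted without moving between columns.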
