How to rearrange the rows of a dataframe so that each row starts with the same string

Question

I have this dataframe:

mp4             mp3              txt               csv
123IT_DB1.mp4   123IT_DB1.mp3    123IT_DB1.txt     123IT_FDG_DB1.csv
NaN             123IT_DB1_2.mp3  NaN               NaN
123IT_DB1_2.mp4 NaN              NaN               NaN
NaN             NaN              123IT_DB_2.txt    NaN
NaN             NaN              NaN               123IT_GUY_DB1_2.csv
234IT_DB1.mp4   NaN              234IT_DB1.txt     234IT_FDG_DB1.csv 
234IT_DB1_2.mp4 234IT_DB1.mp3    NaN               NaN
345IT_DB1.mp4   345IT_DB1.mp3    345IT_DB1.txt     345IT_FDG_DB1.csv    
345IT_DB1_2.mp4 345IT_DB1_2.mp3  NaN               NaN
345IT_DB1_3.mp4 NaN              NaN               NaN
456IT_DB1.mp4   456IT_DB1.mp3    456IT_DB1.txt     456_DB1.csv

I want to rearrange this dataframe so that all values that share the same prefix (the part before the first underscore) are on the same row. However, if there is more than one value that starts with that string, then that row should contain only that element and the rest of the columns should be blank. The resulting output should look like this:

    mp4             mp3             txt               csv
    123IT_DB1.mp4   123IT_DB1.mp3   123IT_DB1.txt     123IT_FDG_DB1.csv
    123IT_DB1_2.mp4 123IT_DB1_2.mp3 123IT_DB_2.txt    123IT_2_DB1.csv
    234IT_DB1.mp4   234IT_DB1.mp3   234IT_DB1.txt     234IT_FDG_DB1.csv 
    234IT_DB1_2.mp4 NaN             NaN               NaN
    345IT_DB1.mp4   345IT_DB1.mp3   345IT_DB1.txt     345IT_FDG_DB1.csv    
    345IT_DB1_2.mp4 345IT_DB1_2.mp3 NaN               NaN
    345IT_DB1_3.mp4 NaN             NaN               NaN
    456IT_DB1.mp4   456IT_DB1.mp3   456IT_DB1.txt     456_DB1.csv

As you can see, I can't just delete the NaN's because I need some of them to stay. Any help would be much appreciated.

Answer 1

To get to your target:

1. Reshape the table into a simple list of values, one file name per row.
2. Generate grouping columns: h is the part before the first underscore, t is the part after the last underscore (extension removed).
3. Sort and groupby() on these, then use cumcount to get an incremental number for each file within its group.
4. Reshape back into a table.
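As a quick illustration of the grouping keys, here is a minimal sketch in plain Python. The helper name `head_tail` is hypothetical (it is not part of the answer's code); it only mirrors the `split` calls used below, assuming the extension is everything after the first dot.

```python
def head_tail(name):
    """Return (h, t): the parts before the first and after the last underscore."""
    stem = name.split(".")[0]   # drop the extension
    parts = stem.split("_")
    return parts[0], parts[-1]


for name in ["123IT_DB1.mp4", "123IT_DB1_2.mp3", "123IT_GUY_DB1_2.csv", "456_DB1.csv"]:
    print(name, head_tail(name))
```

Files that produce the same (h, t) pair end up on the same row, which is why `456_DB1.csv` (head `456`) is separated from the `456IT` files.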

import io
import pandas as pd
df = pd.read_csv(io.StringIO("""1               2                3                  4
123IT_DB1.mp4   123IT_DB1.mp3    123IT_DB1.txt     123IT_FDG_DB1.csv
NaN             123IT_DB1_2.mp3  NaN               NaN
123IT_DB1_2.mp4 NaN              NaN               NaN
NaN             NaN              123IT_DB_2.txt    NaN
NaN             NaN              NaN               123IT_GUY_DB1_2.csv
234IT_DB1.mp4   NaN              234IT_DB1.txt     234IT_FDG_DB1.csv 
234IT_DB1_2.mp4 234IT_DB1.mp3    NaN               NaN
345IT_DB1.mp4   345IT_DB1.mp3    345IT_DB1.txt     345IT_FDG_DB1.csv    
345IT_DB1_2.mp4 345IT_DB1_2.mp3  NaN               NaN
345IT_DB1_3.mp4 NaN              NaN               NaN
456IT_DB1.mp4   456IT_DB1.mp3    456IT_DB1.txt     456_DB1.csv"""), sep=r"\s+")

# change from a table to a list, create columns that are the head & tail
df2 = df.rename_axis("col", axis=1).unstack().reset_index(drop=True).dropna().apply(lambda s: {
    "h":s.split(".")[0].split("_")[0],
    "t":s.split(".")[0].split("_")[-1],
    "o":s}).apply(pd.Series).sort_values(["h","t","o"])

# work out ordering of file,  then transform back into a table
df2 = df2.assign(col=df2.groupby(["h","t"])["o"].transform("cumcount") + 1).set_index(["col","h","t"]).unstack(0).reset_index(drop=True).droplevel(0, axis=1)

Output:

	1	2	3	4
0	123IT_DB1_2.mp3	123IT_DB1_2.mp4	123IT_DB_2.txt	123IT_GUY_DB1_2.csv
1	123IT_DB1.mp3	123IT_DB1.mp4	123IT_DB1.txt	123IT_FDG_DB1.csv
2	234IT_DB1_2.mp4	nan	nan	nan
3	234IT_DB1.mp3	234IT_DB1.mp4	234IT_DB1.txt	234IT_FDG_DB1.csv
4	345IT_DB1_2.mp3	345IT_DB1_2.mp4	nan	nan
5	345IT_DB1_3.mp4	nan	nan	nan
6	345IT_DB1.mp3	345IT_DB1.mp4	345IT_DB1.txt	345IT_FDG_DB1.csv
7	456_DB1.csv	nan	nan	nan
8	456IT_DB1.mp3	456IT_DB1.mp4	456IT_DB1.txt	nan
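Note that in the output above the files in each row are ordered alphabetically, so they no longer sit under the column they came from. If that matters, one possible variant (a sketch, not the answer's method) is to remember the source column while reshaping and pivot on it instead of on a running count. It assumes at most one file per (h, t) pair in each column, since `pivot` raises on duplicates, and pandas 1.1 or later for a list-valued `index`. The small sample frame and the name `tidy` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Small hypothetical sample in the same shape as the question's dataframe.
df = pd.DataFrame({
    "mp4": ["123IT_DB1.mp4", np.nan, "123IT_DB1_2.mp4"],
    "mp3": ["123IT_DB1.mp3", "123IT_DB1_2.mp3", np.nan],
    "txt": ["123IT_DB1.txt", np.nan, np.nan],
})

# Wide -> long: one row per file, remembering the column it came from.
tidy = df.melt(var_name="col", value_name="o").dropna(subset=["o"])

# Grouping keys: text before the first "_" and after the last "_" of the stem.
parts = tidy["o"].str.split(".").str[0].str.split("_")
tidy = tidy.assign(h=parts.str[0], t=parts.str[-1])

# Long -> wide: one row per (h, t), each file back under its original column.
out = (
    tidy.pivot(index=["h", "t"], columns="col", values="o")
        .reindex(columns=df.columns)   # restore the original column order
        .reset_index(drop=True)
)
print(out)
```

Rows come out sorted by (h, t), so the `_2` files appear before the `DB1` files here, the same ordering as in the output above.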

Updated: grouping on just the head of the name

# change from a table to a list, create columns that are the head 
df2 = df.rename_axis("col", axis=1).unstack().reset_index(drop=True).dropna().apply(lambda s: {
    "h":s.split(".")[0].split("_")[0],
    "o":s}).apply(pd.Series).sort_values(["h","o"])

# work out ordering of file,  then transform back into a table
df2 = df2.assign(col=df2.groupby(["h"])["o"].transform("cumcount") + 1).set_index(["col","h"]).unstack(0).reset_index(drop=True).droplevel(0, axis=1)

	1	2	3	4	5	6	7	8
0	123IT_DB1.mp3	123IT_DB1.mp4	123IT_DB1.txt	123IT_DB1_2.mp3	123IT_DB1_2.mp4	123IT_DB_2.txt	123IT_FDG_DB1.csv	123IT_GUY_DB1_2.csv
1	234IT_DB1.mp3	234IT_DB1.mp4	234IT_DB1.txt	234IT_DB1_2.mp4	234IT_FDG_DB1.csv	nan	nan	nan
2	345IT_DB1.mp3	345IT_DB1.mp4	345IT_DB1.txt	345IT_DB1_2.mp3	345IT_DB1_2.mp4	345IT_DB1_3.mp4	345IT_FDG_DB1.csv	nan
3	456_DB1.csv	nan	nan	nan	nan	nan	nan	nan
4	456IT_DB1.mp3	456IT_DB1.mp4	456IT_DB1.txt	nan	nan	nan	nan	nan
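The numbering step can be seen in isolation in the following minimal sketch, which assumes a tiny made-up frame with the same `h` and `o` columns. It calls `groupby(...).cumcount()` directly, which numbers the rows of each group from 0, so adding 1 gives the target column position.

```python
import pandas as pd

# Hypothetical sample: file names ("o") with their head ("h").
files = pd.DataFrame({
    "h": ["123IT", "234IT", "123IT", "123IT"],
    "o": ["123IT_DB1.txt", "234IT_DB1.mp4", "123IT_DB1.mp3", "123IT_DB1_2.mp4"],
}).sort_values(["h", "o"])

# Running count within each head: 1, 2, 3, ... restarts for every new head.
files["col"] = files.groupby("h").cumcount() + 1

# One row per head, one column per running count.
wide = files.set_index(["h", "col"])["o"].unstack("col")
print(wide)
```

Because the count restarts for every head, groups with fewer files are padded with NaN on the right after `unstack`.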

solution1
0 2021-02-18 21:29:50
