I have this dataframe:
mp4 mp3 txt csv
123IT_DB1.mp4 123IT_DB1.mp3 123IT_DB1.txt 123IT_FDG_DB1.csv
NaN 123IT_DB1_2.mp3 NaN NaN
123IT_DB1_2.mp4 NaN NaN NaN
NaN NaN 123IT_DB_2.txt NaN
NaN NaN NaN 123IT_GUY_DB1_2.csv
234IT_DB1.mp4 NaN 234IT_DB1.txt 234IT_FDG_DB1.csv
234IT_DB1_2.mp4 234IT_DB1.mp3 NaN NaN
345IT_DB1.mp4 345IT_DB1.mp3 345IT_DB1.txt 345IT_FDG_DB1.csv
345IT_DB1_2.mp4 345IT_DB1_2.mp3 NaN NaN
345IT_DB1_3.mp4 NaN NaN NaN
456IT_DB1.mp4 456IT_DB1.mp3 456IT_DB1.txt 456_DB1.csv
I want to rearrange this dataframe so that all values that share the same prefix (the part before the first underscore) are on the same row. However, if there is more than one value that starts with that prefix, then that row should contain only that element and the rest of the columns should be blank. The resulting output should look like this:
mp4 mp3 txt csv
123IT_DB1.mp4 123IT_DB1.mp3 123IT_DB1.txt 123IT_FDG_DB1.csv
123IT_DB1_2.mp4 123IT_DB1_2.mp3 123IT_DB_2.txt 123IT_2_DB1.csv
234IT_DB1.mp4 234IT_DB1.mp3 234IT_DB1.txt 234IT_FDG_DB1.csv
234IT_DB1_2.mp4 NaN NaN NaN
345IT_DB1.mp4 345IT_DB1.mp3 345IT_DB1.txt 345IT_FDG_DB1.csv
345IT_DB1_2.mp4 345IT_DB1_2.mp3 NaN NaN
345IT_DB1_3.mp4 NaN NaN NaN
456IT_DB1.mp4 456IT_DB1.mp3 456IT_DB1.txt 456_DB1.csv
As you can see, I can't just delete the NaN's because I need some of them to stay. Any help would be much appreciated.
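To get to your target, split each filename into a head (the part before the first underscore) and a tail (the part after the last underscore, with the extension removed), `groupby()` these, then use `cumcount` to get an incremental number for each file within its group.
import io
import pandas as pd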
df = pd.read_csv(io.StringIO("""1 2 3 4
123IT_DB1.mp4 123IT_DB1.mp3 123IT_DB1.txt 123IT_FDG_DB1.csv
NaN 123IT_DB1_2.mp3 NaN NaN
123IT_DB1_2.mp4 NaN NaN NaN
NaN NaN 123IT_DB_2.txt NaN
NaN NaN NaN 123IT_GUY_DB1_2.csv
234IT_DB1.mp4 NaN 234IT_DB1.txt 234IT_FDG_DB1.csv
234IT_DB1_2.mp4 234IT_DB1.mp3 NaN NaN
345IT_DB1.mp4 345IT_DB1.mp3 345IT_DB1.txt 345IT_FDG_DB1.csv
345IT_DB1_2.mp4 345IT_DB1_2.mp3 NaN NaN
345IT_DB1_3.mp4 NaN NaN NaN
456IT_DB1.mp4 456IT_DB1.mp3 456IT_DB1.txt 456_DB1.csv"""), sep=r"\s+")
# change from a table to a list, create columns that are the head & tail
df2 = df.rename_axis("col", axis=1).unstack().reset_index(drop=True).dropna().apply(lambda s: {
"h":s.split(".")[0].split("_")[0],
"t":s.split(".")[0].split("_")[-1],
"o":s}).apply(pd.Series).sort_values(["h","t","o"])
# work out ordering of file, then transform back into a table
df2 = df2.assign(col=df2.groupby(["h","t"])["o"].transform("cumcount") + 1).set_index(["col","h","t"]).unstack(0).reset_index(drop=True).droplevel(0, axis=1)
| | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| 0 | 123IT_DB1_2.mp3 | 123IT_DB1_2.mp4 | 123IT_DB_2.txt | 123IT_GUY_DB1_2.csv |
| 1 | 123IT_DB1.mp3 | 123IT_DB1.mp4 | 123IT_DB1.txt | 123IT_FDG_DB1.csv |
| 2 | 234IT_DB1_2.mp4 | nan | nan | nan |
| 3 | 234IT_DB1.mp3 | 234IT_DB1.mp4 | 234IT_DB1.txt | 234IT_FDG_DB1.csv |
| 4 | 345IT_DB1_2.mp3 | 345IT_DB1_2.mp4 | nan | nan |
| 5 | 345IT_DB1_3.mp4 | nan | nan | nan |
| 6 | 345IT_DB1.mp3 | 345IT_DB1.mp4 | 345IT_DB1.txt | 345IT_FDG_DB1.csv |
| 7 | 456_DB1.csv | nan | nan | nan |
| 8 | 456IT_DB1.mp3 | 456IT_DB1.mp4 | 456IT_DB1.txt | nan |
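The numbering step is the part that decides which output column a file lands in, so it may help to see `cumcount` in isolation. The following is a minimal sketch on a toy subset of the filenames; the frame `files` and the column name `n` are illustrative only and do not appear in the code above.

```python
import pandas as pd

# Toy subset of the filenames; "files" and "n" are hypothetical names.
files = pd.DataFrame(
    {"o": ["123IT_DB1.mp3", "123IT_DB1.mp4", "123IT_DB1_2.mp3", "234IT_DB1.mp4"]}
)

# Head = everything before the first underscore.
files["h"] = files["o"].str.split("_").str[0]

# cumcount numbers the rows inside each group starting at 0, so add 1
# to get a 1-based position that can later serve as a column label.
files["n"] = files.groupby("h").cumcount() + 1

print(files)
```

Each head restarts the count, which is why unstacking on that number yields one row per group and one column per position within the group.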
Alternatively, grouping on the head alone puts every file that shares a prefix on a single row:
# change from a table to a list, create a column that is the head
df2 = df.rename_axis("col", axis=1).unstack().reset_index(drop=True).dropna().apply(lambda s: {
"h":s.split(".")[0].split("_")[0],
"o":s}).apply(pd.Series).sort_values(["h","o"])
# work out ordering of file, then transform back into a table
df2 = df2.assign(col=df2.groupby(["h"])["o"].transform("cumcount") + 1).set_index(["col","h"]).unstack(0).reset_index(drop=True).droplevel(0, axis=1)
| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| 0 | 123IT_DB1.mp3 | 123IT_DB1.mp4 | 123IT_DB1.txt | 123IT_DB1_2.mp3 | 123IT_DB1_2.mp4 | 123IT_DB_2.txt | 123IT_FDG_DB1.csv | 123IT_GUY_DB1_2.csv |
| 1 | 234IT_DB1.mp3 | 234IT_DB1.mp4 | 234IT_DB1.txt | 234IT_DB1_2.mp4 | 234IT_FDG_DB1.csv | nan | nan | nan |
| 2 | 345IT_DB1.mp3 | 345IT_DB1.mp4 | 345IT_DB1.txt | 345IT_DB1_2.mp3 | 345IT_DB1_2.mp4 | 345IT_DB1_3.mp4 | 345IT_FDG_DB1.csv | nan |
| 3 | 456_DB1.csv | nan | nan | nan | nan | nan | nan | nan |
| 4 | 456IT_DB1.mp3 | 456IT_DB1.mp4 | 456IT_DB1.txt | nan | nan | nan | nan | nan |
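Both variants above discard the original column names, so a file no longer sits under its own extension. If the goal is to keep the `mp4`/`mp3`/`txt`/`csv` columns and simply pack each column's values upward within a prefix group, one possible variation is to number the files per (head, column) pair and pivot back. This is only a sketch under that assumption, not the method used above; the names `long`, `ext`, `n` and `out` are made up for illustration, and it uses a shortened copy of the sample data.

```python
import io
import pandas as pd

# Shortened copy of the sample data, keeping the original column names.
raw = """mp4 mp3 txt csv
123IT_DB1.mp4 123IT_DB1.mp3 123IT_DB1.txt 123IT_FDG_DB1.csv
NaN 123IT_DB1_2.mp3 NaN NaN
123IT_DB1_2.mp4 NaN NaN NaN
234IT_DB1.mp4 NaN 234IT_DB1.txt 234IT_FDG_DB1.csv
234IT_DB1_2.mp4 234IT_DB1.mp3 NaN NaN"""
df = pd.read_csv(io.StringIO(raw), sep=r"\s+")

# Table -> long list of (source column, filename), dropping the blanks.
long = df.melt(var_name="ext", value_name="o").dropna(subset=["o"])

# Head = everything before the first underscore.
long["h"] = long["o"].str.split("_").str[0]

# Position of each file within its (head, column) pair, in original row order.
long["n"] = long.groupby(["h", "ext"]).cumcount()

# Long list -> table: one row per (head, position), one column per extension.
out = (
    long.pivot(index=["h", "n"], columns="ext", values="o")
    .reindex(columns=df.columns)  # restore the original column order
    .reset_index(drop=True)
)
out.columns.name = None
print(out)
```

Note that a name such as `456_DB1.csv` has the head `456`, not `456IT`, so with this key it would still end up on its own row; grouping it with the `456IT` files would need a different key (for example the leading digits), which is a choice the question leaves open.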