I have a quite big dataframe containing 3 columns, each of which contains a list. Each list can be arbitrarily long (and is, sometimes, quite long).
a Yes No
[1, 2, 3] ["a", "b"] ["A", "B", "C"]
[7, 11, 6] ["a", "d", "f"] ["C", "H", "L", "Z"]
I want to create a new dataframe which looks like this:
C1 C2 value
1 a 1
1 b 1
2 a 1
2 b 1
...
3 b 1
1 A 0
1 B 0
1 C 0
...
3 C 0
...
7 a 1
7 d 1
...
6 Z 0
The order of the rows does not matter. I'm looking for an efficient way of doing that.
What I'm doing at the moment is the following (I might be complicating things a bit):
new_df = pd.DataFrame(columns=["C1", "C2", "value"])
def compute_pairs(df1, new_df):
perm = list(itertools.product(df1.a, df1.Yes))
for p in perm:
new_df.loc[len(new_df)] = list(p) + [1]
perm = list(itertools.product(df1.a, df1.No))
for p in perm:
df.loc[len(new_df)] = list(p) + [0]
df1.apply(lambda x: compute_pairs(x, new_df), axis=1)
However, the for look is quite slow. I tried to use map but I failed.
Any improvements?
Let's try melt
with explode
:
(df.melt('a',var_name='value',value_name='C')
.explode('C').explode('a')
)
Output:
a value C
0 1 Yes a
0 2 Yes a
0 3 Yes a
0 1 Yes b
0 2 Yes b
0 3 Yes b
1 7 Yes a
1 11 Yes a
1 6 Yes a
1 7 Yes d
1 11 Yes d
1 6 Yes d
1 7 Yes f
1 11 Yes f
1 6 Yes f
2 1 No A
2 2 No A
2 3 No A
2 1 No B
2 2 No B
2 3 No B
2 1 No C
2 2 No C
2 3 No C
3 7 No C
3 11 No C
3 6 No C
3 7 No H
3 11 No H
3 6 No H
3 7 No L
3 11 No L
3 6 No L
3 7 No Z
3 11 No Z
3 6 No Z
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.