Combine two columns in a pandas DataFrame, but in a specific order

Question

For example, I have a dataframe where two of the columns are "Zeroes" and "Ones", which contain only zeroes and ones, respectively. If I combine them into one column, I get all the zeroes first, then all the ones.

I want to combine them so that the elements of the two columns alternate, rather than getting all elements from the first column followed by all elements from the second. So I don't want the result to be [0, 0, 0, 1, 1, 1], I need it to be [0, 1, 0, 1, 0, 1].

I process 100K+ rows of data. What is the fastest or optimal way to achieve this? Thanks in advance!

Answer 1

Try:

import pandas as pd

df = pd.DataFrame({ "zeroes" : [0, 0, 0], "ones":  [1, 1, 1], "some_other" : list("abc")})
res = df[["zeroes", "ones"]].to_numpy().ravel(order="C")
print(res)

Output

[0 1 0 1 0 1]
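
As a rough illustration of why this works: to_numpy() yields a 2-D array with one row per DataFrame row, and the order argument of ravel decides how that array is walked. The sketch below compares the two orders on the same toy data; the variable names are only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"zeroes": [0, 0, 0], "ones": [1, 1, 1]})
arr = df[["zeroes", "ones"]].to_numpy()  # shape (3, 2): one row per DataFrame row

# Row-major ("C") order walks each row in turn, so the two columns interleave.
interleaved = arr.ravel(order="C")
# Column-major ("F") order walks each column in turn, so the columns are concatenated.
concatenated = arr.ravel(order="F")

print(interleaved.tolist())   # [0, 1, 0, 1, 0, 1]
print(concatenated.tolist())  # [0, 0, 0, 1, 1, 1]
```

Since "C" is the default for ravel, passing order="C" explicitly mainly documents the intent.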

Micro-Benchmarks

import pandas as pd
from itertools import chain
df = pd.DataFrame({ "zeroes" : [0] * 10_000, "ones":  [1] * 10_000})
%timeit df[["zeroes", "ones"]].to_numpy().ravel(order="C").tolist()
672 µs ± 8.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit [v for vs in zip(df["zeroes"], df["ones"]) for v in vs]
2.57 ms ± 54 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit list(chain.from_iterable(zip(df["zeroes"], df["ones"]))) 
2.11 ms ± 73 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
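
The %timeit lines above only run in IPython or Jupyter. A minimal sketch of the same comparison with the standard timeit module might look like the following; the helper function names are made up for this example, and the timings will differ from machine to machine.

```python
import timeit
from itertools import chain

import pandas as pd

df = pd.DataFrame({"zeroes": [0] * 10_000, "ones": [1] * 10_000})


def via_ravel():
    return df[["zeroes", "ones"]].to_numpy().ravel(order="C").tolist()


def via_comprehension():
    return [v for vs in zip(df["zeroes"], df["ones"]) for v in vs]


def via_chain():
    return list(chain.from_iterable(zip(df["zeroes"], df["ones"])))


candidates = {
    "ravel": via_ravel,
    "comprehension": via_comprehension,
    "chain": via_chain,
}

# All three must produce the same list before their timings are worth comparing.
expected = via_ravel()
assert all(func() == expected for func in candidates.values())

for name, func in candidates.items():
    # Best of 5 repeats, 20 calls per repeat, reported per call.
    best = min(timeit.repeat(func, number=20, repeat=5)) / 20
    print(f"{name}: {best * 1e3:.3f} ms per call")
```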

Answer 2

Alternatively, you can use numpy.ndarray.flatten(), as shown below:

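import pandas as pd

df = pd.DataFrame({"zeroes": [0, 0, 0], "ones": [1, 1, 1]})
# flatten() returns a copy, in row-major (C) order by default
res = df[["zeroes", "ones"]].to_numpy().flatten()
print(res)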

Benchmark (running on Colab):

df = pd.DataFrame({ "zeroes" : [0] * 10_000_000, "ones":  [1] * 10_000_000})

%timeit df[["zeroes", "ones"]].to_numpy().flatten().tolist()
1 loop, best of 5: 320 ms per loop

%timeit df[["zeroes", "ones"]].to_numpy().ravel(order="C").tolist()
1 loop, best of 5: 322 ms per loop
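
The two timings are nearly identical, possibly because the tolist() conversion dominates both. The documented difference between the two methods is that flatten() always returns a copy, while ravel() returns a view when it can. A small sketch on a plain NumPy array (not a DataFrame) shows this:

```python
import numpy as np

arr = np.array([[0, 1], [0, 1], [0, 1]])  # a freshly built, C-contiguous 2-D array

raveled = arr.ravel()      # returns a view when no copy is needed
flattened = arr.flatten()  # always returns a copy

print(np.shares_memory(arr, raveled))    # True
print(np.shares_memory(arr, flattened))  # False
```

Whether the array returned by to_numpy() is C-contiguous depends on how pandas stores the columns, so ravel(order="C") may still have to copy in the DataFrame case.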

Answer 3

I don't know if this is the optimal solution, but it should solve your case.

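import pandas as pd

# Two columns (labelled 0 and 1): ten zeroes and ten ones
df = pd.DataFrame([[0 for x in range(10)], [1 for x in range(10)]]).T
# Pair the values row by row, then flatten the pairs into one list
l = [[x, y] for x, y in zip(df[0], df[1])]
l = [x for y in l for x in y]
l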

Answer 4

This may help you: Alternate elements of different columns using Pandas

pd.concat(
    [df1, df2], axis=1
).stack().reset_index(1, drop=True).to_frame('C').rename(index='CC{}'.format)
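
The snippet above comes from the linked question and refers to df1 and df2, which are not defined here. A minimal sketch of the same stack() idea applied to the "zeroes" and "ones" columns from this question could look like this (the variable name combined is just for illustration):

```python
import pandas as pd

df = pd.DataFrame({"zeroes": [0, 0, 0], "ones": [1, 1, 1], "some_other": list("abc")})

# stack() moves the column labels into an inner index level, producing a
# Series ordered row by row, so the two columns end up interleaved.
combined = df[["zeroes", "ones"]].stack().reset_index(drop=True)

print(combined.tolist())  # [0, 1, 0, 1, 0, 1]
```

Depending on the pandas version, stack() may drop missing values, so this is worth checking before relying on it for columns that contain NaN.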

Answer 1: score 4, accepted, posted 2021-11-03 11:23:44

Answer 2: score 1, posted 2021-11-03 12:10:35

Answer 3: score 0, posted 2021-11-03 11:23:41

Answer 4: score 0, posted 2021-11-03 11:25:48
