
Stacking select columns as rows in pandas dataframe

Suppose I have df_in below:

import pandas as pd

df_in = pd.DataFrame({'X': ['a', 'b', 'c'], 'A': [1, 0, 0], 'B': [1, 1, 0]})

df_in:

+---+---+---+---+
|   | X | A | B |
+---+---+---+---+
| 0 | a | 1 | 1 |
| 1 | b | 0 | 1 |
| 2 | c | 0 | 0 |
+---+---+---+---+

I want to achieve something like the following:

df_out = pd.DataFrame({'X': ['a', 'a', 'b'], 'Y': ['A', 'B', 'B']})

df_out:

+---+---+---+
|   | X | Y |
+---+---+---+
| 0 | a | A |
| 1 | a | B |
| 2 | b | B |
+---+---+---+

I also have a list containing the columns: l = list(['A', 'B']). The logic is: for each column in df_in that is in l, repeat those observations where the column value == 1, and add the column name to a new column in df_out (Y in the example). In reality there are more columns in df_in and not all of them are in l, which is why I want to solve this without explicit references to columns A, B and X.

NOTE: This is not entirely covered by this answer since, as stated above, there are many columns in reality, and these can be of any type and hold any data, so the solution, df_out, needs to carry along all of the original columns that are not in l (X in this case). In theory, X can also be a binary 0/1 column, but it should only affect the outcome in the same way as A and B if it is included in l. I hope this helps clarify.

Use Index.difference to get all columns that are not in l and pass them to DataFrame.set_index, reshape with DataFrame.stack, keep only the values equal to 1, and finally convert the remaining MultiIndex to a new DataFrame with MultiIndex.to_frame, renaming the last column:

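l = ['A', 'B']

# Every column that is not listed in l becomes part of the index.
c = df_in.columns.difference(l, sort=False).tolist()
# Stack the columns in l into one long Series indexed by (c..., column name).
s = df_in.set_index(c).stack()
# Keep the entries equal to 1 and turn the remaining index into a DataFrame.
# The unnamed stacked level gets the integer label len(c); rename it to 'Y'.
df_out = s[s == 1].index.to_frame(index=False).rename(columns={len(c): 'Y'})
print(df_out)
   X  Y
0  a  A
1  a  B
2  b  B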
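
As an alternative sketch (not part of the original answer), the same reshaping can probably be done with DataFrame.melt, which may read more naturally when there are many identifier columns. The function name stack_selected, the parameter new_col and the temporary column name "_flag" are hypothetical names chosen for illustration; the sketch assumes that no existing column is already called "_flag" or has the name passed as new_col.

```python
import pandas as pd


def stack_selected(df, cols, new_col="Y"):
    """Return one row per (original row, column in cols) where the value is 1.

    Hypothetical helper for illustration; assumes "_flag" and new_col
    are not already column names in df.
    """
    # All columns not listed in cols are carried along unchanged.
    id_cols = [c for c in df.columns if c not in cols]
    # Go from wide to long, keeping the original row labels in the index.
    long = df.melt(
        id_vars=id_cols,
        value_vars=cols,
        var_name=new_col,
        value_name="_flag",
        ignore_index=False,
    )
    # Keep only the cells that were equal to 1, then drop the helper column.
    long = long[long["_flag"] == 1].drop(columns="_flag")
    # melt lists all rows of the first column, then the second, and so on;
    # a stable sort on the original row labels restores row-by-row order.
    return long.sort_index(kind="mergesort").reset_index(drop=True)


df_in = pd.DataFrame({'X': ['a', 'b', 'c'], 'A': [1, 0, 0], 'B': [1, 1, 0]})
print(stack_selected(df_in, ['A', 'B']))
```

For the example df_in this should give the same three rows as df_out. A binary column that is not in the list is treated as an ordinary identifier column, so it is copied through without affecting which rows are kept.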

