Pandas: Conditionally dropping columns based on same values throughout the column in MultiIndex dataframe
I have a dataframe as below:
import pandas as pd

data = {('5105', 'Open'): [1.99,1.98,1.99,2.05,2.15],
        ('5105', 'Adj Close'): [1.92,1.92,1.96,2.07,2.08],
        ('5229', 'Open'): [0.01]*5,
        ('5229', 'Adj Close'): [0.02]*5,
        ('7076', 'Open'): [1.02,1.01,1.01,1.06,1.06],
        ('7076', 'Adj Close'): [0.90,0.92,0.94,0.94,0.95]}
df = pd.DataFrame(data)
   5105            5229            7076
   Open Adj Close  Open Adj Close  Open Adj Close
0  1.99      1.92  0.01      0.02  1.02      0.90
1  1.98      1.92  0.01      0.02  1.01      0.92
2  1.99      1.96  0.01      0.02  1.01      0.94
3  2.05      2.07  0.01      0.02  1.06      0.94
4  2.15      2.08  0.01      0.02  1.06      0.95
As the dataframe above shows, in df['5229'] both the Open and Adj Close columns hold the same value in every row. So I intend to drop it, since it will not be useful in my analysis.
I have two queries:
Since this is conditional dropping, I was wondering whether df.drop still works in this case?
Based on my queries, in my case above, since Open and Adj Close each have the same value throughout the column, I would like to drop df['5229'] entirely.
The expected output is:
   5105            7076
   Open Adj Close  Open Adj Close
0  1.99      1.92  1.02      0.90
1  1.98      1.92  1.01      0.92
2  1.99      1.96  1.01      0.94
3  2.05      2.07  1.06      0.94
4  2.15      2.08  1.06      0.95
Many thanks to those answering the question. To be more precise, I am trying to drop columns from a dataframe of more than 200 columns, on the condition that all the values in a given column are the same.
Try with nunique:
df = df.loc[:,~(df.nunique()==1).values]
Out[125]:
   5105            7076
   Open Adj Close  Open Adj Close
0  1.99      1.92  1.02      0.90
1  1.98      1.92  1.01      0.92
2  1.99      1.96  1.01      0.94
3  2.05      2.07  1.06      0.94
4  2.15      2.08  1.06      0.95
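To make the one-liner above easier to follow, here is the same idea split into steps. This is only a sketch that rebuilds the question's sample data; the names counts, is_constant and result are illustrative and not part of the original answer. One thing to keep in mind: nunique ignores NaN by default, so a column such as [1, NaN, 1] would also count as constant here.

```python
import pandas as pd

# Sample data copied from the question.
data = {('5105', 'Open'): [1.99, 1.98, 1.99, 2.05, 2.15],
        ('5105', 'Adj Close'): [1.92, 1.92, 1.96, 2.07, 2.08],
        ('5229', 'Open'): [0.01] * 5,
        ('5229', 'Adj Close'): [0.02] * 5,
        ('7076', 'Open'): [1.02, 1.01, 1.01, 1.06, 1.06],
        ('7076', 'Adj Close'): [0.90, 0.92, 0.94, 0.94, 0.95]}
df = pd.DataFrame(data)

# Step 1: count distinct values per column (NaN is ignored by default).
counts = df.nunique()

# Step 2: a column is treated as constant when it has exactly one distinct value.
is_constant = counts == 1

# Step 3: keep the non-constant columns; to_numpy() gives a positional boolean mask.
result = df.loc[:, ~is_constant.to_numpy()]

print(result.columns.tolist())
```

Assigning to a new name (result) instead of overwriting df keeps the original frame around for comparison.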
Try this:
df.drop('5229',level=0,axis=1)
Output:
   5105            7076
   Open Adj Close  Open Adj Close
0  1.99      1.92  1.02      0.90
1  1.98      1.92  1.01      0.92
2  1.99      1.96  1.01      0.94
3  2.05      2.07  1.06      0.94
4  2.15      2.08  1.06      0.95
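The call above hard-codes '5229'. df.drop can still be used for conditional dropping if the labels are computed first. Below is a rough sketch, assuming a top-level group should be dropped only when every one of its sub-columns is constant; the names is_constant, ticker_is_constant and tickers_to_drop are illustrative, not from the original answer.

```python
import pandas as pd

# Sample data copied from the question.
data = {('5105', 'Open'): [1.99, 1.98, 1.99, 2.05, 2.15],
        ('5105', 'Adj Close'): [1.92, 1.92, 1.96, 2.07, 2.08],
        ('5229', 'Open'): [0.01] * 5,
        ('5229', 'Adj Close'): [0.02] * 5,
        ('7076', 'Open'): [1.02, 1.01, 1.01, 1.06, 1.06],
        ('7076', 'Adj Close'): [0.90, 0.92, 0.94, 0.94, 0.95]}
df = pd.DataFrame(data)

# True for each (ticker, field) column that holds a single distinct value.
is_constant = df.nunique() == 1

# Collapse to one flag per ticker: True only if all of its fields are constant.
ticker_is_constant = is_constant.groupby(level=0).all()

# Labels on the first column level that should be removed.
tickers_to_drop = ticker_is_constant[ticker_is_constant].index.tolist()

# Same df.drop call as above, but with computed labels instead of a literal.
result = df.drop(columns=tickers_to_drop, level=0)

print(tickers_to_drop)
```

If a ticker had only one constant field, this version would keep the whole ticker; dropping individual constant columns instead is what the nunique mask does.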
We could use unstack + groupby + nunique to get the number of unique values in each column. Then use loc to select only the columns with more than one unique value:
out = df[df.unstack().groupby(level=[0,1]).nunique().loc[lambda x: x!=1].index]
Output:
       5105            7076
  Adj Close  Open Adj Close  Open
0      1.92  1.99      0.90  1.02
1      1.92  1.98      0.92  1.01
2      1.96  1.99      0.94  1.01
3      2.07  2.05      0.94  1.06
4      2.08  2.15      0.95  1.06
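Note that in the output above Adj Close comes before Open, because groupby sorts its keys by default. If the original column order matters, one possible variation (a sketch, not part of the original answer; keep and out are illustrative names) is to use the computed index only as a membership test against df.columns:

```python
import pandas as pd

# Sample data copied from the question.
data = {('5105', 'Open'): [1.99, 1.98, 1.99, 2.05, 2.15],
        ('5105', 'Adj Close'): [1.92, 1.92, 1.96, 2.07, 2.08],
        ('5229', 'Open'): [0.01] * 5,
        ('5229', 'Adj Close'): [0.02] * 5,
        ('7076', 'Open'): [1.02, 1.01, 1.01, 1.06, 1.06],
        ('7076', 'Adj Close'): [0.90, 0.92, 0.94, 0.94, 0.95]}
df = pd.DataFrame(data)

# Same computation as in the answer: labels of columns with more than one value.
# The resulting index is sorted by groupby, so 'Adj Close' precedes 'Open'.
keep = df.unstack().groupby(level=[0, 1]).nunique().loc[lambda x: x != 1].index

# Select by membership rather than by label order, so df's own order is kept.
out = df.loc[:, df.columns.isin(keep)]

print(out.columns.tolist())
```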
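You can try this:
# Iterate over a copy of the labels, since columns are dropped inside the loop.
for a, b in list(df.columns):
    # nunique() == 1 means every value is identical; a duplicated(keep=False)
    # count would also flag columns like [1, 1, 2, 2], which are not constant.
    if df[(a, b)].nunique() == 1:
        df.drop((a, b), axis=1, inplace=True)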
Result:
   5105            7076
   Open Adj Close  Open Adj Close
0  1.99      1.92  1.02      0.90
1  1.98      1.92  1.01      0.92
2  1.99      1.96  1.01      0.94
3  2.05      2.07  1.06      0.94
4  2.15      2.08  1.06      0.95
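For a frame with 200+ columns it may be more convenient to avoid modifying df in place. A non-mutating variant of the loop above could look like the sketch below; drop_constant_columns is a hypothetical helper name, not a pandas function.

```python
import pandas as pd


def drop_constant_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of frame without columns whose values are all identical.

    Hypothetical helper for illustration. Because the test is nunique() > 1,
    all-NaN columns (zero distinct values) are dropped as well.
    """
    keep = [col for col in frame.columns if frame[col].nunique() > 1]
    return frame[keep].copy()


# Sample data copied from the question.
data = {('5105', 'Open'): [1.99, 1.98, 1.99, 2.05, 2.15],
        ('5105', 'Adj Close'): [1.92, 1.92, 1.96, 2.07, 2.08],
        ('5229', 'Open'): [0.01] * 5,
        ('5229', 'Adj Close'): [0.02] * 5,
        ('7076', 'Open'): [1.02, 1.01, 1.01, 1.06, 1.06],
        ('7076', 'Adj Close'): [0.90, 0.92, 0.94, 0.94, 0.95]}
df = pd.DataFrame(data)

result = drop_constant_columns(df)
print(result.columns.tolist())
```

The original df keeps all six columns, so the filter can be re-run or compared against later.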