How can I transform a dataframe in pandas without losing my index?

I need to winsorize two columns in my dataframe of 12 columns.

Say I have columns 'A', 'B', 'C', and 'D', each with a series of values. After I cleaned out some NaN columns, the number of columns dropped from 100 to 80, but they are still indexed up to 100 with gaps (e.g. row 5 is missing).

I want to transform only columns 'A' and 'B' via the winsorize method. To do this, I must convert my columns to a np.array.

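import pandas as pd
import scipy.stats

# df holds columns 'A', 'B', 'C', 'D' (values omitted in the question)
ab_df = df[['A', 'B']]  # a list of labels selects several columns
X = scipy.stats.mstats.winsorize(ab_df.values, limits=0.01)
new_ab_df = pd.DataFrame(X, columns=['A', 'B'])  # the index is rebuilt as 0..n-1 here
# join_axes was removed in pandas 1.0; it is kept as posted in the question
df = pd.concat([df[['C', 'D']], new_ab_df], axis=1, join='inner', join_axes=[df.index])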

When I convert to a np.array and then back to a pd.DataFrame, its len() is correct at 80, but the index has been reset to 0 through 79. How can I ensure that my transformed 'A' and 'B' columns are indexed correctly? I don't think I can use apply(), which would preserve the index order and simply swap out the values. My approach instead creates a transformed copy of my df with only 2 columns and then concats it to the rest of my non-transformed columns.

You can do this in place on the original dataframe.

From the description in your question, it sounds like you are confusing rows and columns (i.e. you first say your dataframe has 12 columns, and then say the number of columns was reduced from 100 to 80).

It is always best to provide a minimal example of your data in the question. Lacking this, here is some data based on my assumptions:

import numpy as np
import scipy.stats
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randn(7, 5), columns=list('ABCDE'))
df.iat[1, 0] = np.nan
df.iat[3, 1] = np.nan
df.iat[5, 2] = np.nan

>>> df
          A         B         C         D         E
0  1.764052  0.400157  0.978738  2.240893  1.867558
1       NaN  0.950088 -0.151357 -0.103219  0.410599
2  0.144044  1.454274  0.761038  0.121675  0.443863
3  0.333674       NaN -0.205158  0.313068 -0.854096
4 -2.552990  0.653619  0.864436 -0.742165  2.269755
5 -1.454366  0.045759       NaN  1.532779  1.469359
6  0.154947  0.378163 -0.887786 -1.980796 -0.347912

I assume you want to exclude any row that contains a NaN and then winsorize the remaining rows.

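# Row selector: rows with no NaN in any column; column selector: 'A' and 'B'.
mask = df.notnull().all(axis=1), ['A', 'B']
# With the default axis=None, the selected values are winsorized jointly, not per column.
df.loc[mask] = scipy.stats.mstats.winsorize(df.loc[mask].values, limits=0.4)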

I applied a high limit (0.4) to the winsorize function so that the results are more obvious on this small dataset.

>>> df
          A         B         C         D         E
0  0.400157  0.400157  0.978738  2.240893  1.867558
1       NaN  0.950088 -0.151357 -0.103219  0.410599
2  0.378163  0.400157  0.761038  0.121675  0.443863
3  0.333674       NaN -0.205158  0.313068 -0.854096
4  0.378163  0.400157  0.864436 -0.742165  2.269755
5 -1.454366  0.045759       NaN  1.532779  1.469359
6  0.378163  0.378163 -0.887786 -1.980796 -0.347912
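Note that in the output above, columns 'A' and 'B' end up sharing the same two clipping values, because the selected block is winsorized as one flattened array. If each column should be clipped against its own distribution instead, a minimal sketch (assuming that is what is wanted, which the question does not state) passes axis=0 to winsorize. The names rows, cols, and original_index are introduced here for illustration only.

```python
import numpy as np
import pandas as pd
from scipy.stats import mstats

# Same sample data as in the answer above.
np.random.seed(0)
df = pd.DataFrame(np.random.randn(7, 5), columns=list('ABCDE'))
df.iat[1, 0] = np.nan
df.iat[3, 1] = np.nan
df.iat[5, 2] = np.nan

original_index = df.index.copy()   # kept only to check the index afterwards
rows = df.notnull().all(axis=1)    # True for rows without any NaN
cols = ['A', 'B']

# axis=0 winsorizes each column separately instead of the flattened block.
clipped = mstats.winsorize(df.loc[rows, cols].to_numpy(), limits=0.4, axis=0)

# Write the plain ndarray back into the same rows and columns;
# the index labels of df are never touched.
df.loc[rows, cols] = np.asarray(clipped)
```

Rows that contain a NaN keep their original values, and the index is unchanged because the assignment goes through df.loc rather than through a newly built DataFrame.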

Just assign the new values to the existing columns.

X = scipy.stats.mstats.winsorize(ab_df.values, limits=0.01)
df.loc[:, ['A', 'B']] = X
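To make the snippet above concrete, here is a self-contained sketch on hypothetical data with gaps in the index (the 10-row frame, the dropped labels 3 and 5, limits=0.2, and axis=0 are all assumptions chosen for illustration). It shows two ways to keep the original labels: passing index= when rebuilding the DataFrame, or assigning the array straight back to the existing columns.

```python
import numpy as np
import pandas as pd
from scipy.stats import mstats

# Hypothetical data: 10 rows, then drop two so the index has gaps.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(10, 4)), columns=list('ABCD'))
df = df.drop(index=[3, 5])          # index is now 0, 1, 2, 4, 6, 7, 8, 9

ab_df = df[['A', 'B']]
# axis=0 (an assumption here) winsorizes each column on its own.
X = mstats.winsorize(ab_df.to_numpy(), limits=0.2, axis=0)

# Option 1: rebuild a DataFrame but reuse the original index labels,
# so a later concat aligns row by row instead of on 0..n-1.
new_ab_df = pd.DataFrame(np.asarray(X), columns=['A', 'B'], index=ab_df.index)
result = pd.concat([df[['C', 'D']], new_ab_df], axis=1)

# Option 2: overwrite the two columns in place; no concat is needed.
df[['A', 'B']] = np.asarray(X)
```

Both options leave the gapped index intact. Option 2 is shorter and avoids building an intermediate frame.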
