
Shuffling/permuting a DataFrame in pandas

What's a simple and efficient way to shuffle a dataframe in pandas, by rows or by columns? I.e. how to write a function shuffle(df, n, axis=0) that takes a dataframe, a number of shuffles n, and an axis (axis=0 is rows, axis=1 is columns) and returns a copy of the dataframe that has been shuffled n times.

Edit: key is to do this without destroying the row/column labels of the dataframe. If you just shuffle df.index, that loses all that information. I want the resulting df to be the same as the original except with the order of rows or order of columns different.

Edit2: My question was unclear. When I say shuffle the rows, I mean shuffle each row independently. So if you have two columns a and b, I want each row shuffled on its own, so that you don't have the same associations between a and b as you do if you just re-order each row as a whole. Something like:

for 1...n:
  for each col in df: shuffle column
return new_df

But hopefully more efficient than naive looping. This does not work for me:

def shuffle(df, n, axis=0):
        shuffled_df = df.copy()
        for k in range(n):
            shuffled_df.apply(np.random.shuffle(shuffled_df.values),axis=axis)
        return shuffled_df

df = pandas.DataFrame({'A':range(10), 'B':range(10)})
shuffle(df, 5)

Use numpy's random.permutation function:

In [1]: df = pd.DataFrame({'A':range(10), 'B':range(10)})

In [2]: df
Out[2]:
   A  B
0  0  0
1  1  1
2  2  2
3  3  3
4  4  4
5  5  5
6  6  6
7  7  7
8  8  8
9  9  9


In [3]: df.reindex(np.random.permutation(df.index))
Out[3]:
   A  B
0  0  0
5  5  5
6  6  6
3  3  3
8  8  8
7  7  7
9  9  9
1  1  1
2  2  2
4  4  4
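
The same idea carries over to columns. Below is a minimal sketch, assuming a hypothetical helper name shuffle_labels and a seed parameter (neither comes from the answer above). It swaps reindex for positional indexing with iloc, so whole rows or whole columns move together and every label stays attached to its data:

```python
import numpy as np
import pandas as pd


def shuffle_labels(df, axis=0, seed=None):
    """Return a reordered copy: whole rows (axis=0) or whole columns (axis=1)."""
    rng = np.random.default_rng(seed)
    # Permute positions rather than labels, so duplicate labels are not a problem.
    order = rng.permutation(df.shape[axis])
    return df.iloc[order] if axis == 0 else df.iloc[:, order]


df = pd.DataFrame({'A': range(10), 'B': range(10, 20)})
rows_shuffled = shuffle_labels(df, axis=0, seed=0)
cols_shuffled = shuffle_labels(df, axis=1, seed=0)
```

Note that this keeps the association between A and B within each row, which matches the first edit of the question but not the second.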

Sampling is random, so just sample the entire data frame.

df.sample(frac=1)

In [16]: def shuffle(df, n=1, axis=0):
    ...:     df = df.copy()
    ...:     for _ in range(n):
    ...:         df.apply(np.random.shuffle, axis=axis)
    ...:     return df
    ...:     

In [17]: df = pd.DataFrame({'A':range(10), 'B':range(10)})

In [18]: shuffle(df)

In [19]: df
Out[19]: 
   A  B
0  8  5
1  1  7
2  7  3
3  6  2
4  3  4
5  0  1
6  9  0
7  4  6
8  2  8
9  5  9
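
Depending on the pandas version and settings (for instance with Copy-on-Write enabled), shuffling in place inside apply may leave the frame untouched, because the function can receive objects that do not share memory with it. A minimal sketch that does not rely on in-place mutation, assuming the hypothetical name shuffle_each_column and a seed parameter that are not part of the answer above:

```python
import numpy as np
import pandas as pd


def shuffle_each_column(df, n=1, seed=None):
    """Shuffle every column independently, n times; row labels are kept."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    for _ in range(n):
        for name in out.columns:  # assumes column names are unique
            # permutation() returns a shuffled copy; assigning an array
            # keeps the existing index and the column's dtype.
            out[name] = rng.permutation(out[name].to_numpy())
    return out


df = pd.DataFrame({'A': range(10), 'B': range(10)}, index=list('abcdefghij'))
out = shuffle_each_column(df, n=3, seed=0)
```

Working column by column keeps each column's dtype, at the cost of a Python-level loop over the columns.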

You can use sklearn.utils.shuffle() (requires sklearn 0.16.1 or higher to support Pandas data frames):

# Generate data
import pandas as pd
df = pd.DataFrame({'A':range(5), 'B':range(5)})
print('df: {0}'.format(df))

# Shuffle Pandas data frame
import sklearn.utils
df = sklearn.utils.shuffle(df)
print('\n\ndf: {0}'.format(df))

outputs:

df:    A  B
0  0  0
1  1  1
2  2  2
3  3  3
4  4  4


df:    A  B
1  1  1
0  0  0
3  3  3
4  4  4
2  2  2

Then you can use df.reset_index() to reset the index column, if need be:

df = df.reset_index(drop=True)
print('\n\ndf: {0}'.format(df))

outputs:

df:    A  B
0  1  1
1  0  0
2  3  3
3  4  4
4  2  2

A simple solution in pandas is to use the sample method independently on each column. Use apply to iterate over each column:

df = pd.DataFrame({'a':[1,2,3,4,5,6], 'b':[1,2,3,4,5,6]})
df

   a  b
0  1  1
1  2  2
2  3  3
3  4  4
4  5  5
5  6  6

df.apply(lambda x: x.sample(frac=1).values)

   a  b
0  4  2
1  1  6
2  6  5
3  5  3
4  2  4
5  3  1

You must use .values so that you return a numpy array and not a Series, or else the returned Series will align to the original DataFrame, not changing a thing:

df.apply(lambda x: x.sample(frac=1))

   a  b
0  1  1
1  2  2
2  3  3
3  4  4
4  5  5
5  6  6
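
The alignment effect can be checked in isolation. A small sketch (the random_state value and the column names as_series and as_array are arbitrary choices for illustration): assigning the sampled Series aligns on the index and restores the original order, while assigning its underlying array keeps the shuffled order.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 6], name='a')
sampled = s.sample(frac=1, random_state=0)  # same values, shuffled index order

df = pd.DataFrame({'a': s})
df['as_series'] = sampled            # aligned on index: original order comes back
df['as_array'] = sampled.to_numpy()  # no index to align on: shuffled order is kept
```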

From the docs, use sample():

In [79]: s = pd.Series([0,1,2,3,4,5])

# When no arguments are passed, returns 1 row.
In [80]: s.sample()
Out[80]: 
0    0
dtype: int64

# One may specify either a number of rows:
In [81]: s.sample(n=3)
Out[81]: 
5    5
2    2
4    4
dtype: int64

# Or a fraction of the rows:
In [82]: s.sample(frac=0.5)
Out[82]: 
5    5
4    4
1    1
dtype: int64
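
For shuffling rather than subsampling, frac=1 returns every row in random order. A short sketch on a DataFrame, where the frame contents and the random_state value are illustrative assumptions:

```python
import pandas as pd

df = pd.DataFrame({'A': range(5), 'B': range(5, 10)})

rows = df.sample(frac=1, random_state=42)          # reorder whole rows, labels kept
cols = df.sample(frac=1, axis=1, random_state=42)  # reorder whole columns instead
fresh = rows.reset_index(drop=True)                # optional: renumber rows 0..n-1
```

Rows and columns move as whole units here, so values in the same row stay together.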

I resorted to adapting @root's answer slightly and using the raw values directly. Of course, this means you lose the ability to do fancy indexing, but it works perfectly for just shuffling the data.

In [1]: import numpy

In [2]: import pandas

In [3]: df = pandas.DataFrame({"A": range(10), "B": range(10)})    

In [4]: %timeit df.apply(numpy.random.shuffle, axis=0)
1000 loops, best of 3: 406 µs per loop

In [5]: %%timeit
   ...: for view in numpy.rollaxis(df.values, 1):
   ...:     numpy.random.shuffle(view)
   ...: 
10000 loops, best of 3: 22.8 µs per loop

In [6]: %timeit df.apply(numpy.random.shuffle, axis=1)
1000 loops, best of 3: 746 µs per loop

In [7]: %%timeit
   ...: for view in numpy.rollaxis(df.values, 0):
   ...:     numpy.random.shuffle(view)
   ...: 
10000 loops, best of 3: 23.4 µs per loop

Note that numpy.rollaxis brings the specified axis to the first dimension and then lets us iterate over arrays with the remaining dimensions, i.e., if we want to shuffle along the first dimension (columns), we need to roll the second dimension to the front, so that we apply the shuffling to views over the first dimension.

In [8]: numpy.rollaxis(df.values, 0).shape
Out[8]: (10, 2) # we can iterate over 10 arrays with shape (2,) (rows)

In [9]: numpy.rollaxis(df.values, 1).shape
Out[9]: (2, 10) # we can iterate over 2 arrays with shape (10,) (columns)

Your final function then uses a trick to bring the result in line with the expectation for applying a function to an axis:

def shuffle(df, n=1, axis=0):     
    df = df.copy()
    axis = int(not axis) # pandas.DataFrame is always 2D
    for _ in range(n):
        for view in numpy.rollaxis(df.values, axis):
            numpy.random.shuffle(view)
    return df
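
Whether df.values returns a view or a copy depends on the frame's dtypes and the pandas version, so shuffling views in place may not always reach the frame. A sketch of an alternative that works on an explicit copy and rebuilds the frame, assuming NumPy 1.20 or later for Generator.permuted; the name shuffle_independent and the seed parameter are hypothetical:

```python
import numpy as np
import pandas as pd


def shuffle_independent(df, n=1, axis=0, seed=None):
    """axis=0 shuffles within each column, axis=1 within each row."""
    rng = np.random.default_rng(seed)
    values = df.to_numpy(copy=True)  # mixed dtypes become a single common dtype
    for _ in range(n):
        # permuted() shuffles every 1-D slice along `axis` independently.
        values = rng.permuted(values, axis=axis)
    # Reattach the original labels to the shuffled values.
    return pd.DataFrame(values, index=df.index, columns=df.columns)


df = pd.DataFrame({'A': range(10), 'B': range(10, 20)})
by_col = shuffle_independent(df, n=2, axis=0, seed=0)
by_row = shuffle_independent(df, n=2, axis=1, seed=0)
```

Here the axis argument already has the same meaning as in DataFrame.apply, so no axis flip is needed.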

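This might be more useful when you want your index shuffled.

import random

def shuffle(df):
    index = list(df.index)
    random.shuffle(index)
    # .ix was removed in pandas 1.0; .loc selects rows by label
    df = df.loc[index]
    # reset_index returns a new frame, so the result must be kept;
    # drop=True discards the old labels (omit it to keep them as a column)
    return df.reset_index(drop=True)

It selects a new df using the shuffled index, then resets the index.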

I know the question is about a pandas df, but if the shuffle occurs by row (column order changed, row order unchanged), then the column names no longer matter and it could be interesting to use an np.array instead; np.apply_along_axis() is then what you are looking for.

If that is acceptable then this would be helpful; note that it is easy to switch the axis along which the data is shuffled.

If your pandas data frame is named df, maybe you can:

  1. get the values of the dataframe with values = df.values,
  2. create an np.array from values,
  3. apply the method shown below to shuffle the np.array by row or column,
  4. recreate a new (shuffled) pandas df from the shuffled np.array.

Original array

a = np.array([[10, 11, 12], [20, 21, 22], [30, 31, 32],[40, 41, 42]])
print(a)
[[10 11 12]
 [20 21 22]
 [30 31 32]
 [40 41 42]]

Keep row order, shuffle columns within each row

print(np.apply_along_axis(np.random.permutation, 1, a))
[[11 12 10]
 [22 21 20]
 [31 30 32]
 [40 41 42]]

Keep column order, shuffle rows within each column

print(np.apply_along_axis(np.random.permutation, 0, a))
[[40 41 32]
 [20 31 42]
 [10 11 12]
 [30 21 22]]

Original array is unchanged

print(a)
[[10 11 12]
 [20 21 22]
 [30 31 32]
 [40 41 42]]
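
Applied to a DataFrame, the four steps listed earlier might look like the sketch below. The column names a, b, c are made up for the example, and passing index and columns back in step 4 is an assumption about how to keep the labels:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [10, 20, 30, 40],
                   'b': [11, 21, 31, 41],
                   'c': [12, 22, 32, 42]})

values = df.values  # steps 1-2: df.values is already an np.ndarray
# step 3: shuffle within each row (use 0 instead of 1 to shuffle within each column)
shuffled = np.apply_along_axis(np.random.permutation, 1, values)
# step 4: rebuild a DataFrame, reattaching the original labels
new_df = pd.DataFrame(shuffled, index=df.index, columns=df.columns)
```

After a within-row shuffle the column labels no longer describe the values beneath them, which is the caveat raised at the start of this answer.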

Here is a workaround I found if you want to shuffle only a subset of the DataFrame:

shuffle_to_index = 20
df = pd.concat([df.iloc[np.random.permutation(range(shuffle_to_index))], df.iloc[shuffle_to_index:]])
