Shuffling/permuting a DataFrame in pandas
What's a simple and efficient way to shuffle a dataframe in pandas, by rows or by columns? I.e., how to write a function shuffle(df, n, axis=0) that takes a dataframe, a number of shuffles n, and an axis (axis=0 is rows, axis=1 is columns) and returns a copy of the dataframe that has been shuffled n times.
Edit: the key is to do this without destroying the row/column labels of the dataframe. If you just shuffle df.index, that loses all that information. I want the resulting df to be the same as the original except with the order of rows or order of columns different.
Edit2: My question was unclear. When I say shuffle the rows, I mean shuffle each row independently. So if you have two columns a and b, I want each row shuffled on its own, so that you don't have the same associations between a and b as you do if you just re-order each row as a whole. Something like:
for 1...n:
    for each col in df: shuffle column
return new_df
But hopefully more efficient than naive looping. This does not work for me:
import numpy as np
import pandas

def shuffle(df, n, axis=0):
    shuffled_df = df.copy()
    for k in range(n):
        shuffled_df.apply(np.random.shuffle(shuffled_df.values),axis=axis)
    return shuffled_df

df = pandas.DataFrame({'A':range(10), 'B':range(10)})
shuffle(df, 5)
Use numpy's random.permutation function:
In [1]: df = pd.DataFrame({'A':range(10), 'B':range(10)})
In [2]: df
Out[2]:
A B
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
7 7 7
8 8 8
9 9 9
In [3]: df.reindex(np.random.permutation(df.index))
Out[3]:
A B
0 0 0
5 5 5
6 6 6
3 3 3
8 8 8
7 7 7
9 9 9
1 1 1
2 2 2
4 4 4
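The same idea extends to reordering whole columns. The following is a minimal sketch, not part of the original answer: the helper name shuffle_labels and its seed parameter are made up for illustration, and it uses positional indexing with iloc so that it also works when the index contains duplicate labels.

```python
import numpy as np
import pandas as pd

def shuffle_labels(df, axis=0, seed=None):
    """Return a copy of df with whole rows (axis=0) or whole columns (axis=1) reordered.

    Hypothetical helper: each row/column keeps its label and its contents.
    """
    rng = np.random.default_rng(seed)
    if axis == 0:
        # Permute row positions; every row stays intact.
        return df.iloc[rng.permutation(len(df))]
    # Permute column positions; every column stays intact.
    return df.iloc[:, rng.permutation(df.shape[1])]

df = pd.DataFrame({'A': range(10), 'B': range(10, 20)})
rows_shuffled = shuffle_labels(df, axis=0, seed=0)
cols_shuffled = shuffle_labels(df, axis=1, seed=0)
```

Note that this keeps the association between A and B within each row, so it answers the original question but not the "Edit2" variant.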
Sampling is random, so just sample the entire data frame.
df.sample(frac=1)
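A short sketch of how this might be used in practice. The random_state values and the variable names are arbitrary choices for illustration; axis=1 samples columns instead of rows.

```python
import pandas as pd

df = pd.DataFrame({'A': range(5), 'B': range(5, 10)})

# Shuffle whole rows; the original index labels travel with their rows.
shuffled_rows = df.sample(frac=1, random_state=0)

# Shuffle whole columns instead.
shuffled_cols = df.sample(frac=1, axis=1, random_state=0)

# Optionally discard the old labels and renumber from 0.
renumbered = shuffled_rows.reset_index(drop=True)
```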
In [16]: def shuffle(df, n=1, axis=0):
    ...:     df = df.copy()
    ...:     for _ in range(n):
    ...:         df.apply(np.random.shuffle, axis=axis)
    ...:     return df
    ...:
In [17]: df = pd.DataFrame({'A':range(10), 'B':range(10)})
In [18]: shuffle(df)
In [19]: df
Out[19]:
A B
0 8 5
1 1 7
2 7 3
3 6 2
4 3 4
5 0 1
6 9 0
7 4 6
8 2 8
9 5 9
You can use sklearn.utils.shuffle() (requires sklearn 0.16.1 or higher to support pandas data frames):
# Generate data
import pandas as pd
df = pd.DataFrame({'A':range(5), 'B':range(5)})
print('df: {0}'.format(df))
# Shuffle Pandas data frame
import sklearn.utils
df = sklearn.utils.shuffle(df)
print('\n\ndf: {0}'.format(df))
outputs:
df: A B
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
df: A B
1 1 1
0 0 0
3 3 3
4 4 4
2 2 2
Then you can use df.reset_index() to reset the index column, if needed:
df = df.reset_index(drop=True)
print('\n\ndf: {0}'.format(df))
outputs:
df: A B
0 1 1
1 0 0
2 4 4
3 2 2
4 3 3
A simple solution in pandas is to use the sample method independently on each column. Use apply to iterate over each column:
df = pd.DataFrame({'a':[1,2,3,4,5,6], 'b':[1,2,3,4,5,6]})
df
a b
0 1 1
1 2 2
2 3 3
3 4 4
4 5 5
5 6 6
df.apply(lambda x: x.sample(frac=1).values)
a b
0 4 2
1 1 6
2 6 5
3 5 3
4 2 4
5 3 1
You must use .values so that you return a numpy array and not a Series, or else the returned Series will align to the original DataFrame, not changing a thing:
df.apply(lambda x: x.sample(frac=1))
a b
0 1 1
1 2 2
2 3 3
3 4 4
4 5 5
5 6 6
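To make the per-column shuffle reproducible, one option is to pass a shared random state into sample. This is a sketch under assumptions rather than part of the answer above: the seed 0 and the name rs are arbitrary, and a single RandomState object is reused so that each column draws a different permutation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6], 'b': [1, 2, 3, 4, 5, 6]})

# One shared generator: its state advances from column to column,
# so the columns are shuffled independently of each other.
rs = np.random.RandomState(0)

# .values drops the Series index, so nothing is re-aligned to the original order.
shuffled = df.apply(lambda col: col.sample(frac=1, random_state=rs).values)
```

The row and column labels of shuffled are the same as those of df; only the values inside each column have moved.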
From the docs, use sample():
In [79]: s = pd.Series([0,1,2,3,4,5])
# When no arguments are passed, returns 1 row.
In [80]: s.sample()
Out[80]:
0 0
dtype: int64
# One may specify either a number of rows:
In [81]: s.sample(n=3)
Out[81]:
5 5
2 2
4 4
dtype: int64
# Or a fraction of the rows:
In [82]: s.sample(frac=0.5)
Out[82]:
5 5
4 4
1 1
dtype: int64
I resorted to adapting @root's answer slightly and using the raw values directly. Of course, this means you lose the ability to do fancy indexing, but it works perfectly for just shuffling the data.
In [1]: import numpy
In [2]: import pandas
In [3]: df = pandas.DataFrame({"A": range(10), "B": range(10)})
In [4]: %timeit df.apply(numpy.random.shuffle, axis=0)
1000 loops, best of 3: 406 µs per loop
In [5]: %%timeit
   ...: for view in numpy.rollaxis(df.values, 1):
   ...:     numpy.random.shuffle(view)
...:
10000 loops, best of 3: 22.8 µs per loop
In [6]: %timeit df.apply(numpy.random.shuffle, axis=1)
1000 loops, best of 3: 746 µs per loop
In [7]: %%timeit
   ...: for view in numpy.rollaxis(df.values, 0):
   ...:     numpy.random.shuffle(view)
...:
10000 loops, best of 3: 23.4 µs per loop
Note that numpy.rollaxis brings the specified axis to the first dimension and then lets us iterate over arrays with the remaining dimensions, i.e., if we want to shuffle along the first dimension (columns), we need to roll the second dimension to the front, so that we apply the shuffling to views over the first dimension.
In [8]: numpy.rollaxis(df.values, 0).shape
Out[8]: (10, 2) # we can iterate over 10 arrays with shape (2,) (rows)
In [9]: numpy.rollaxis(df.values, 1).shape
Out[9]: (2, 10) # we can iterate over 2 arrays with shape (10,) (columns)
Your final function then uses a trick to bring the result in line with the expectation for applying a function to an axis:
def shuffle(df, n=1, axis=0):
    df = df.copy()
    axis = int(not axis) # pandas.DataFrame is always 2D
    for _ in range(n):
        for view in numpy.rollaxis(df.values, axis):
            numpy.random.shuffle(view)
    return df
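The function above relies on df.values being a writable view of the frame's data, which is not guaranteed (for example with mixed dtypes, or in pandas versions with copy-on-write). A variant that avoids this is sketched below; it is an assumption-laden alternative, not the answer's method. The name shuffle_independent and the seed parameter are hypothetical, it needs NumPy 1.20 or later for Generator.permuted, and it assumes all columns share one dtype (otherwise the values are converted to object).

```python
import numpy as np
import pandas as pd

def shuffle_independent(df, n=1, axis=0, seed=None):
    """Shuffle each column (axis=0) or each row (axis=1) independently.

    Hypothetical helper: returns a new DataFrame with the original labels.
    """
    rng = np.random.default_rng(seed)
    values = df.to_numpy(copy=True)  # explicit copy, never a view of df
    for _ in range(n):
        # permuted() shuffles every 1-D slice along `axis` on its own.
        values = rng.permuted(values, axis=axis)
    return pd.DataFrame(values, index=df.index, columns=df.columns)

df = pd.DataFrame({'A': range(10), 'B': range(10, 20)})
by_column = shuffle_independent(df, n=3, axis=0, seed=1)
by_row = shuffle_independent(df, n=1, axis=1, seed=1)
```

Shuffling n times gives a result with the same distribution as shuffling once, so n mainly exists here to match the signature asked for in the question.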
This might be more useful when you want your index shuffled.
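import random

def shuffle(df):
    index = list(df.index)
    random.shuffle(index)
    df = df.loc[index]  # .ix has been removed from pandas; .loc selects by label
    df = df.reset_index(drop=True)  # reset_index returns a new frame, so assign it
    return df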
It selects a new df using the shuffled index, then resets the index.
I know the question is for a pandas df, but in the case where the shuffle occurs by row (column order changed, row order unchanged), the column names do not matter anymore and it could be interesting to use an np.array instead; then np.apply_along_axis() will be what you are looking for.
If that is acceptable then this would be helpful. Note that it is easy to switch the axis along which the data is shuffled.
If your pandas data frame is named df, maybe you can:
1. get the values of the dataframe with values = df.values,
2. create an np.array from values,
3. apply the method shown below to shuffle the np.array by row or column,
4. recreate a new (shuffled) pandas df from the shuffled np.array.
a = np.array([[10, 11, 12], [20, 21, 22], [30, 31, 32],[40, 41, 42]])
print(a)
[[10 11 12]
[20 21 22]
[30 31 32]
[40 41 42]]
print(np.apply_along_axis(np.random.permutation, 1, a))
[[11 12 10]
[22 21 20]
[31 30 32]
[40 41 42]]
print(np.apply_along_axis(np.random.permutation, 0, a))
[[40 41 32]
[20 31 42]
[10 11 12]
[30 21 22]]
print(a)
[[10 11 12]
[20 21 22]
[30 31 32]
[40 41 42]]
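The last step, rebuilding a DataFrame from the shuffled array, is not shown above. A minimal sketch follows; the column names x, y and z are made up for illustration, and the original labels are reattached even though, as noted, they no longer describe the values once each row has been shuffled.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [10, 20, 30, 40],
                   'y': [11, 21, 31, 41],
                   'z': [12, 22, 32, 42]})

values = df.to_numpy()

# Shuffle within each row (axis=1); apply_along_axis returns a new array.
shuffled = np.apply_along_axis(np.random.permutation, 1, values)

# Rebuild a DataFrame, reusing the original index and column labels.
new_df = pd.DataFrame(shuffled, index=df.index, columns=df.columns)
```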
Here is a workaround I found if you only want to shuffle a subset of the DataFrame:
shuffle_to_index = 20
df = pd.concat([df.iloc[np.random.permutation(range(shuffle_to_index))], df.iloc[shuffle_to_index:]])
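Wrapped as a function, the same workaround might look like the sketch below. The name shuffle_head and the seed parameter are hypothetical, and the function assumes the subset to shuffle is the first stop rows.

```python
import numpy as np
import pandas as pd

def shuffle_head(df, stop, seed=None):
    """Shuffle the first `stop` rows of df and leave the remaining rows in place.

    Hypothetical helper wrapping the concat-based workaround.
    """
    rng = np.random.default_rng(seed)
    head = df.iloc[rng.permutation(stop)]  # permuted positions 0..stop-1
    tail = df.iloc[stop:]                  # untouched remainder
    return pd.concat([head, tail])

df = pd.DataFrame({'A': range(10), 'B': range(10)})
partly_shuffled = shuffle_head(df, 4, seed=0)
```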