
Shuffle DataFrame rows

I have the following DataFrame:

    Col1  Col2  Col3  Type
0      1     2     3     1
1      4     5     6     1
...
20     7     8     9     2
21    10    11    12     2
...
45    13    14    15     3
46    16    17    18     3
...

The DataFrame is read from a CSV file. All rows which have Type 1 are on top, followed by the rows with Type 2, followed by the rows with Type 3, etc.

I would like to shuffle the order of the DataFrame's rows so that all Types are mixed. A possible result could be:

    Col1  Col2  Col3  Type
0      7     8     9     2
1     13    14    15     3
...
20     1     2     3     1
21    10    11    12     2
...
45     4     5     6     1
46    16    17    18     3
...

How can I achieve this?

The idiomatic way to do this with Pandas is to use the .sample method of your DataFrame to sample all rows without replacement:

df.sample(frac=1)

The frac keyword argument specifies the fraction of rows to return in the random sample, so frac=1 means return all rows (in random order).

Note: If you wish to shuffle your DataFrame in-place and reset the index, you could do e.g.

df = df.sample(frac=1).reset_index(drop=True)

Here, specifying drop=True prevents .reset_index from creating a column containing the old index entries.
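As a minimal sketch of the shuffle-and-reset pattern described above: the toy frame and the random_state value are assumptions added for illustration, not part of the original answer.

```python
import pandas as pd

# Toy frame mirroring the question: rows arrive grouped by Type.
df = pd.DataFrame({
    "Col1": [1, 4, 7, 10, 13, 16],
    "Type": [1, 1, 2, 2, 3, 3],
})

# frac=1 samples every row without replacement, i.e. a random permutation.
# random_state is optional; it is set here only so the run is repeatable.
shuffled = df.sample(frac=1, random_state=0).reset_index(drop=True)

print(shuffled)
```

Dropping reset_index(drop=True) keeps the original row labels attached to the shuffled rows, which can be handy for tracing rows back to the CSV.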

Follow-up note: Although it may not look like the above operation is in-place, python/pandas is smart enough not to do another malloc for the shuffled object. That is, even though the reference object has changed (by which I mean id(df_old) is not the same as id(df_new)), the underlying C object is still the same. To show that this is indeed the case, you could run a simple memory profiler:

$ python3 -m memory_profiler .\test.py
Filename: .\test.py

Line #    Mem usage    Increment   Line Contents
================================================
     5     68.5 MiB     68.5 MiB   @profile
     6                             def shuffle():
     7    847.8 MiB    779.3 MiB       df = pd.DataFrame(np.random.randn(100, 1000000))
     8    847.9 MiB      0.1 MiB       df = df.sample(frac=1).reset_index(drop=True)

You can simply use sklearn for this:

from sklearn.utils import shuffle
df = shuffle(df)

You can shuffle the rows of a DataFrame by indexing with a shuffled index. For this, you can e.g. use np.random.permutation (but np.random.choice is also a possibility):

In [12]: df = pd.read_csv(StringIO(s), sep=r"\s+")

In [13]: df
Out[13]: 
    Col1  Col2  Col3  Type
0      1     2     3     1
1      4     5     6     1
20     7     8     9     2
21    10    11    12     2
45    13    14    15     3
46    16    17    18     3

In [14]: df.iloc[np.random.permutation(len(df))]
Out[14]: 
    Col1  Col2  Col3  Type
46    16    17    18     3
45    13    14    15     3
20     7     8     9     2
0      1     2     3     1
1      4     5     6     1
21    10    11    12     2

If you want to keep the index numbered 0, 1, ..., n-1 as in your example, you can simply reset the index: df_shuffled.reset_index(drop=True)
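The permutation approach can be sketched as a standalone script. The small frame, the seeded np.random.default_rng generator, and the name df_renumbered are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# A few rows with non-contiguous labels, as in the transcript above.
df = pd.DataFrame(
    {"Col1": [1, 4, 7, 10], "Type": [1, 1, 2, 2]},
    index=[0, 1, 20, 21],
)

rng = np.random.default_rng(0)      # seeded only for repeatability (assumption)
order = rng.permutation(len(df))    # random ordering of positions 0..n-1

df_shuffled = df.iloc[order]                        # old labels travel with their rows
df_renumbered = df_shuffled.reset_index(drop=True)  # fresh 0..n-1 index

print(df_shuffled)
print(df_renumbered)
```

Indexing by position with iloc works regardless of what the index labels are, so it also handles duplicate labels.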

TL;DR: np.random.shuffle(ndarray) can do the job.
So, in your case:

np.random.shuffle(DataFrame.values)

DataFrame, under the hood, uses a NumPy ndarray as its data holder. (You can check this in the DataFrame source code.)

So if you use np.random.shuffle(), it shuffles the array along the first axis of a multi-dimensional array. But the index of the DataFrame remains unshuffled.

Though, there are some points to consider.

  • The function returns None. If you want to keep a copy of the original object, you have to make it before you pass the object to the function.
  • sklearn.utils.shuffle(), as user tj89 suggested, can take a random_state along with other options to control the output. You may want that for development purposes.
  • sklearn.utils.shuffle() is faster, but it WILL SHUFFLE the axis info (index, column) of the DataFrame along with the ndarray it contains.
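The first point can be checked on a plain ndarray. This sketch deliberately avoids DataFrame.values, because whether that attribute exposes a writable view of the frame's data depends on the dtypes and the pandas version; the array contents and the seed are assumptions for illustration.

```python
import numpy as np

np.random.seed(0)                  # seeded only for repeatability (assumption)

nd = np.arange(12).reshape(4, 3)   # rows: [0 1 2], [3 4 5], [6 7 8], [9 10 11]
backup = nd.copy()                 # copy first if the original order is still needed

result = np.random.shuffle(nd)     # reorders the rows of nd in place

print(result)                      # None
print(nd)
```

Only the order of the rows (the first axis) changes; the values inside each row stay together.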

Benchmark result

between sklearn.utils.shuffle() and np.random.shuffle().

ndarray

nd = sklearn.utils.shuffle(nd)

0.10793248389381915 sec. 8x faster

np.random.shuffle(nd)

0.8897626010002568 sec

DataFrame

df = sklearn.utils.shuffle(df)

0.3183923360193148 sec. 3x faster

np.random.shuffle(df.values)

0.9357550159329548 sec

Conclusion: If it is okay for the axis info (index, column) to be shuffled along with the ndarray, use sklearn.utils.shuffle(). Otherwise, use np.random.shuffle().

Code used:

import timeit
setup = '''
import numpy as np
import pandas as pd
import sklearn
nd = np.random.random((1000, 100))
df = pd.DataFrame(nd)
'''

timeit.timeit('nd = sklearn.utils.shuffle(nd)', setup=setup, number=1000)
timeit.timeit('np.random.shuffle(nd)', setup=setup, number=1000)
timeit.timeit('df = sklearn.utils.shuffle(df)', setup=setup, number=1000)
timeit.timeit('np.random.shuffle(df.values)', setup=setup, number=1000)

(I don't have enough reputation to comment this on the top post, so I hope someone else can do that for me.) There was a concern raised that the first method:

df.sample(frac=1)

either made a deep copy or just changed the DataFrame. I ran the following code:

print(hex(id(df)))
print(hex(id(df.sample(frac=1))))
print(hex(id(df.sample(frac=1).reset_index(drop=True))))

and my results were:

0x1f8a784d400
0x1f8b9d65e10
0x1f8b9d65b70

which means the method is not returning the same object, as was suggested in the last comment. So this method does indeed make a shuffled copy.

It is also useful, if you use this for machine learning and always want to separate out the same data, to fix the seed:
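A slightly more direct check than comparing id values is to hold both objects and compare them with `is`, then confirm the original frame is untouched. This is a sketch with a made-up frame and an arbitrary random_state.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [5, 6, 7, 8]})
original = df.copy()               # snapshot to compare against afterwards

shuffled = df.sample(frac=1, random_state=1)

print(shuffled is df)              # False: sample returns a new DataFrame object
print(df.equals(original))         # True: the source frame keeps its order
```

Holding references to both frames matters here, since the id of a temporary object can be reused once that object has been garbage collected.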

df.sample(n=len(df), random_state=42)

This makes sure that your random choice is always reproducible.

The following could be one way:
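To illustrate the reproducibility claim, the sketch below shuffles the same frame twice with the same random_state and compares the results. The frame and the variable names first and second are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"x": range(10)})

# Two independent calls with the same seed.
first = df.sample(frac=1, random_state=42)
second = df.sample(frac=1, random_state=42)

# equals compares values and index labels, so this checks the row order too.
print(first.equals(second))
```

Without random_state, two such calls would generally produce different orders.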

dataframe = dataframe.sample(frac=1, random_state=42).reset_index(drop=True)

where

frac=1 means all rows of the DataFrame

random_state=42 fixes the seed, so every execution produces the same order

reset_index(drop=True) reinitializes the index of the shuffled DataFrame

AFAIK the simplest solution is:

df_shuffled = df.reindex(np.random.permutation(df.index))
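A small sketch of the reindex approach, assuming the index labels are unique (reindex raises an error on duplicate labels). The frame below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Unique, non-contiguous labels like those in the question.
df = pd.DataFrame({"Col1": [1, 7, 13]}, index=[0, 20, 45])

# Permute the labels, then look the rows up again in that new order.
df_shuffled = df.reindex(np.random.permutation(df.index))

print(df_shuffled)
```

Because rows are looked up by label, each row keeps its original label and values after the shuffle.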

Shuffle the pandas data frame by taking an array of positions (in this case the index), randomizing its order, and setting that array as the index of the data frame. Then sort the data frame by index. The result is your shuffled data frame:

import random
import pandas as pd

df = pd.DataFrame({"a":[1,2,3,4],"b":[5,6,7,8]})
index = list(range(len(df)))  # positions 0..n-1
random.shuffle(index)         # randomize the positions in place
df.set_index([index]).sort_index()

Output (one possible result):

    a   b
0   2   6
1   1   5
2   3   7
3   4   8

Insert your data frame in place of mine in the code above.

Here is another way:

df['rnd'] = np.random.rand(len(df))
df = df.sort_values(by='rnd').drop('rnd', axis=1)

Shuffle the DataFrame using sample() by passing the frac parameter. Save the shuffled DataFrame to a new variable.

new_variable = DataFrame.sample(frac=1)

I propose this:

for x in df.columns:
    # Re-seeding before each column is meant to apply the same permutation
    # to every column, so the values of each row stay together.
    np.random.seed(42)
    np.random.shuffle(df[x].values)

In my test with a column of arbitrary-length strings (with dtype: object), it was 30x faster than @haku's answer, presumably because it avoids creating a copy, which may be expensive.

My variant was about 3x faster than the accepted answer by @Kris, which also does not seem to avoid a copy (based on the RES column in Linux top).

Notice: the technical posts on this site are licensed under CC BY-SA 4.0. If you repost them, please credit this site's URL or the original source.
