
How to get a random sample in a MultiIndex pandas DataFrame?

I have a DataFrame that is indexed by two variables: NAME and date. NAME is some sort of bizarre ID, and date is... a date.

The data is very large, and I would like to inspect it for several randomly chosen values of NAME.

That is,

  1. pick a random NAME among the possible ones
  2. inspect the data for this NAME, ordered by time.

I don't know how to do that. I see that we can use get_level_values, but I don't have a specific NAME in mind; I just want to draw random samples many times.

Any help appreciated! Thanks!

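import random
import string

import numpy as np
import pandas as pd

# Build a sample frame: 10 random 17-character IDs, one date each.
alphabet = string.ascii_uppercase + string.ascii_lowercase + string.digits
df = pd.DataFrame(data={'NAME': [''.join(random.choice(alphabet) for _ in range(17)) for _ in range(10)],
            'Date': pd.date_range('1/01/2016', periods=10),
            'Whatever': np.random.randint(20, 50, 10)},
                  columns=['NAME', 'Date', 'Whatever']).set_index(['NAME', 'Date'])

# Pick one NAME at random, then keep its rows with a boolean mask on the NAME level.
# (get_loc can return an int, a slice, or a mask, so comparing its result to True is fragile.)
name = np.random.choice(df.index.get_level_values('NAME').unique())
random_df = df[df.index.get_level_values('NAME') == name].sort_index(level='Date')
print(random_df)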

This gives a df that looks like this:

                              Whatever
NAME              Date                
xg71zOEQVOEfCZ2ne 2016-01-01        35
qLCXuEerCXi6gmF1Y 2016-01-02        26
0vDe7x8TIb5FRv7hV 2016-01-03        40
Ddc6FGKBdtcLqT53O 2016-01-04        31
IYcrKG9pjt7mHH3qn 2016-01-05        44
lAWObNTC8yXPMY3v5 2016-01-06        49
k90QWdPc5qFSCFi1c 2016-01-07        22
BWQoHo8lUyEwK9Nuf 2016-01-08        42
Xt0bxUerTan0i1eGw 2016-01-09        22
tc7PYCzpyGmYLbnxu 2016-01-10        46

and a random_df that looks like this:

                              Whatever
NAME              Date                
IYcrKG9pjt7mHH3qn 2016-01-05        44
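
In the example frame above each NAME has a single date, so the "ordered by time" step is not visible. The following is a minimal sketch of the same idea wrapped in a function that can be called repeatedly. The helper name `inspect_random_name`, the demo IDs (`id_a`, `id_b`, `id_c`) and the seeded generator are assumptions made for illustration only; they are not part of the original post.

```python
import numpy as np
import pandas as pd


def inspect_random_name(frame, rng):
    """Return all rows for one randomly chosen NAME, ordered by Date."""
    names = frame.index.get_level_values("NAME").unique().to_numpy()
    chosen = rng.choice(names)
    mask = frame.index.get_level_values("NAME") == chosen
    return frame[mask].sort_index(level="Date")


# Hypothetical demo data: 3 IDs with 4 dates each, rows deliberately shuffled.
rng = np.random.default_rng(0)
index = pd.MultiIndex.from_product(
    [["id_a", "id_b", "id_c"], pd.date_range("2016-01-01", periods=4)],
    names=["NAME", "Date"],
)
demo = pd.DataFrame({"Whatever": rng.integers(20, 50, len(index))}, index=index)
demo = demo.sample(frac=1, random_state=0)  # shuffle the row order

sample = inspect_random_name(demo, rng)
print(sample)
```

Each call draws a new NAME from the same generator, so calling `inspect_random_name(demo, rng)` in a loop gives a fresh slice to look at every time.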

You could forget about your MultiIndex and just use isin with sample:

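import random

# Flatten the index so NAME becomes an ordinary column.
df = df.reset_index()
# Keep every row whose NAME is among 5 randomly chosen distinct names.
df[df['NAME'].isin(random.sample(list(df['NAME'].unique()), 5))]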
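
If resetting the index is undesirable, the same isin idea can probably be applied to the index level directly. This is a rough sketch under that assumption: the demo frame, the `id_0` to `id_9` labels and the seed are made up for illustration, while `Index.isin` and `Generator.choice` are standard pandas and NumPy calls.

```python
import numpy as np
import pandas as pd

# Hypothetical demo data: 10 IDs with 3 dates each.
rng = np.random.default_rng(1)
index = pd.MultiIndex.from_product(
    [[f"id_{i}" for i in range(10)], pd.date_range("2016-01-01", periods=3)],
    names=["NAME", "Date"],
)
demo = pd.DataFrame({"Whatever": rng.integers(20, 50, len(index))}, index=index)

# Draw 5 distinct NAMEs, then filter on the NAME level without resetting the index.
names = demo.index.get_level_values("NAME").unique().to_numpy()
chosen = rng.choice(names, size=5, replace=False)
subset = demo[demo.index.get_level_values("NAME").isin(chosen)]
print(subset.sort_index())
```

The result keeps the (NAME, Date) MultiIndex, so later level-based operations such as `sort_index(level='Date')` still apply.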
