python & pandas - Split large dataframe into multiple dataframes and plot diagrams

My situation is similar to this case. I'm working on a project with a large dataframe of about half a million rows, and about 2000 users are involved. (I got this number by calling value_counts() on a column called NoUsager.)

I'd like to split the dataframe into several arrays/dataframes for plotting afterwards ("several" meaning one array/dataframe per user). I got the list of users like this:

df.sort_values(by='NoUsager',inplace=True)
df.set_index(keys=['NoUsager'],drop=False,inplace=True)
users = df['NoUsager'].unique().tolist()

I know the next step is a loop that generates the smaller dataframes, but I have no idea how to write it. I combined the code above with the approach from that case, but it did not give me a solution.

What should I do?


EDIT

I want both a histogram and a boxplot of the dataframe. With the answer provided, I already have a boxplot of all NoUsager values. But with this much data, the boxplot is too small to read, so I'd like to split the dataframe by NoUsager and plot the parts separately. The diagrams I'd like to have:

  1. boxplot, column=DureeService, by=NoUsager
  2. boxplot, column=DureeService, by=Weekday
  3. histogram, for every Weekday, by=DureeService

I hope it is well explained this time.

DataType:

          Weekday NoUsager Periods  Sens  DureeService
DataType   string  string  string string datetime.time

Sample of DataFrame:

Weekday NoUsager Periods Sens DureeService
Lun 000001 Matin + 00:00:05 
Lun 000001 Matin + 00:00:04 
Mer 000001 Matin + 00:00:07 
Dim 000001 Soir  - 00:00:02 
Lun 000001 Matin + 00:00:07 
Jeu 000001 Soir  - 00:00:04 
Lun 000001 Matin + 00:00:07 
Lun 000001 Soir  - 00:00:04 
Dim 000001 Matin + 00:00:05 
Lun 000001 Matin + 00:00:03 
Mer 000001 Matin + 00:00:04 
Ven 000001 Soir  - 00:00:03 
Mar 000001 Matin + 00:00:03 
Lun 000001 Soir  - 00:00:04 
Lun 000001 Matin + 00:00:04 
Mer 000002 Soir  - 00:00:04 
Jeu 000003 Matin + 00:00:50 
Mer 000003 Soir  - 00:06:51 
Mer 000003 Soir  - 00:00:08 
Mer 000003 Soir  - 00:00:10 
Jeu 000003 Matin + 00:12:35 
Lun 000004 Matin + 00:00:05 
Dim 000004 Matin + 00:00:05 
Lun 000004 Matin + 00:00:05 
Lun 000004 Matin + 00:00:05 

What bothers me is that none of these columns is numeric, so the values have to be converted every time.

Thanks in advance!

[g for _, g in df.groupby('NoUsager')] gives you a list of dataframes, where each dataframe contains the rows of one unique NoUsager. But I think what you need is something like:

for k, g in df.groupby('NoUsager'):
    g.plot(kind=..., x=..., y=...)  # fill in the plot kind and the columns to plot
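
The loop leaves the plot arguments open, and DureeService holds datetime.time values, which cannot be plotted as numbers directly. A minimal sketch of one way to handle both, assuming the column really contains datetime.time objects as stated in the question. The small sample frame, the helper to_seconds and the column name DureeSeconds are made up for illustration:

```python
import datetime as dt

import pandas as pd

# Tiny stand-in for the real data (values taken from the sample in the question).
df = pd.DataFrame({
    "Weekday": ["Lun", "Mer", "Mer", "Jeu", "Jeu"],
    "NoUsager": ["000001", "000001", "000002", "000003", "000003"],
    "DureeService": [dt.time(0, 0, 5), dt.time(0, 0, 7), dt.time(0, 0, 4),
                     dt.time(0, 0, 50), dt.time(0, 12, 35)],
})


def to_seconds(t):
    """Convert a datetime.time to the number of seconds since midnight."""
    return t.hour * 3600 + t.minute * 60 + t.second


# Convert once, up front, so nothing has to be converted again per plot.
df["DureeSeconds"] = df["DureeService"].map(to_seconds)

# groupby yields (key, sub-frame) pairs; no prior sorting or indexing is needed.
summary = {}
for user, g in df.groupby("NoUsager"):
    summary[user] = g["DureeSeconds"].median()
    # With matplotlib installed, a per-user plot could go here, for example:
    # g["DureeSeconds"].plot(kind="hist", title=user)

print(summary)
```

Converting the column once before grouping means every sub-frame already carries the numeric values.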

No need to sort first. You may try this with your original DataFrame:

# import third-party libraries
import pandas as pd
import numpy as np
# Define a function that takes the DataFrame and returns a dictionary
def splitting_dataframe(df):
    d = {}                                   # Define an empty dictionary
    nousager = np.unique(df.NoUsager.values) # Get the list of unique NoUsager values
    for NU in nousager:                      # Loop over the NoUsager list
        d[NU] = df[df.NoUsager == NU]        # I guess this line is what you want most
    return d                                 # Return the dictionary
dictionary = splitting_dataframe(df)  # Call the function

After this, you can get the DataFrame for a specific NoUsager with:

dictionary[target_NoUsager]
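
A comparable dictionary can probably be built with groupby in a single pass, instead of filtering the whole frame once per user. A sketch on made-up rows; the name frames is only illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "NoUsager": ["000001", "000001", "000002", "000003"],
    "Weekday": ["Lun", "Mer", "Mer", "Jeu"],
})

# One sub-frame per user, keyed by the NoUsager value.
frames = {user: g for user, g in df.groupby("NoUsager")}

# A single user's rows can also be fetched without building the dictionary.
one_user = df.groupby("NoUsager").get_group("000001")

print(sorted(frames))
print(len(one_user))
```

Both forms keep the original row index of each sub-frame, so rows can still be traced back to the full DataFrame.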

Hope this helps.


EDIT

If you want to do a box plot, have you tried:

df.boxplot(column='DureeService', by='NoUsager')

directly? More information here: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.boxplot.html


EDIT

If you want a boxplot for several selected 'NoUsager':

targets = [...]  # replace with the selected NoUsager values
mask = df['NoUsager'].isin(targets)  # True for rows belonging to a selected user
df[mask].boxplot(column='DureeService', by='NoUsager')

If you want a histogram for a selected 'NoUsager':

df[df['NoUsager'] == target_NoUsager].hist(column='DureeService')

If you still need to separate them, @Psidom's first line is good enough.
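
For the two Weekday diagrams requested in the question, one possible reading can be sketched as follows. It assumes matplotlib is installed and that DureeService has already been converted to a numeric column (here a made-up DureeSeconds column on made-up rows). The Agg backend is chosen only so the script runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "Weekday": ["Lun", "Lun", "Mer", "Mer", "Jeu", "Jeu"],
    "NoUsager": ["000001", "000004", "000001", "000003", "000001", "000003"],
    "DureeSeconds": [5, 5, 7, 8, 4, 50],
})

# Diagram 2: one box per weekday.
ax = df.boxplot(column="DureeSeconds", by="Weekday")

# Diagram 3: one histogram panel per weekday.
axes = df.hist(column="DureeSeconds", by="Weekday")

n_panels = df["Weekday"].nunique()
plt.close("all")  # release the figures once they are no longer needed
```

Grouping by Weekday keeps the number of boxes and panels at seven or fewer, which stays readable even with half a million rows.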
