
Data cleaning a Python dataframe

I have a Python (pandas) dataframe with 1408 rows of data. My goal is to take the largest and smallest numbers in a given week, note the weekdays on which they occurred, and compare them with the numbers on those same weekdays in the following week. Essentially, I want to look at quintiles (since there are 5 days in a business week), specifically ranks 1 and 5, and see how they change from week to week. I also want to build a CDF of the numbers associated with each weekday.

  1. To clean the data, I need to remove 18 weeks in total from it: every week in the dataframe that contains a holiday, plus the entire week following each holiday week.

  2. After this, I think I should insert a column in the dataframe that labels every date in the file with its weekday, Monday through Friday (there are 6 years of data). The reason for the Monday-Friday labels is so that I can sort the numbers associated with each day of the week in ascending order and query by day of the week.
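For concreteness, here is a rough sketch of what steps 1 and 2 could look like in pandas. It assumes the dataframe has a DatetimeIndex of business days and a single `Value` column; the sample data, the `holidays` list, and the `week_start` helper are all made up for illustration and are not part of my real data.

```python
import pandas as pd

# Hypothetical data: one value per business day over five weeks.
idx = pd.bdate_range("2012-08-27", "2012-09-28")
df = pd.DataFrame({"Value": range(len(idx))}, index=idx)

# Hypothetical holiday list; replace with the real holiday dates.
holidays = pd.to_datetime(["2012-09-03"])


def week_start(dates):
    """Return the Monday (at midnight) of the week each date falls in."""
    dates = pd.DatetimeIndex(dates).normalize()
    return dates - pd.to_timedelta(dates.weekday, unit="D")


# Step 1: drop every holiday week and the week that follows it.
holiday_weeks = week_start(holidays)
bad_weeks = holiday_weeks.union(holiday_weeks + pd.Timedelta(weeks=1))
clean = df[~week_start(df.index).isin(bad_weeks)].copy()

# Step 2: label each remaining row with its weekday name.
clean["Weekday"] = clean.index.day_name()
```

Keying weeks by their Monday date rather than by week number avoids mixing up the same week number from different years, which matters with 6 years of data.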

Methodological suggestions on either 1. or 2., or both, would be immensely appreciated.

Thank you!

#2 seems like it's best tackled with a combination of df.groupby() and apply() on the resulting GroupBy object. Perhaps an example is the best way to explain.

Given a dataframe:

In [53]: df
Out[53]: 
            Value
2012-08-01     61
2012-08-02     52
2012-08-03     89
2012-08-06     44
2012-08-07     35
2012-08-08     98
2012-08-09     64
2012-08-10     48
2012-08-13    100
2012-08-14     95
2012-08-15     14
2012-08-16     55
2012-08-17     58
2012-08-20     11
2012-08-21     28
2012-08-22     95
2012-08-23     18
2012-08-24     81
2012-08-27     27
2012-08-28     81
2012-08-29     28
2012-08-30     16
2012-08-31     50

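In [54]: def rankdays(df):
  .....:    # Skip partial weeks (fewer than five business days)
  .....:    if len(df) != 5:
  .....:        return pandas.Series(dtype=float)
  .....:    # .values avoids index alignment between dates and weekday numbers
  .....:    return pandas.Series(df.Value.rank().values, index=df.index.weekday)
  .....: 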

In [52]: df.groupby(lambda x: x.week).apply(rankdays).unstack()
Out[52]: 
    0  1  2  3  4
32  2  1  5  4  3
33  5  4  1  2  3
34  1  3  5  2  4
35  2  5  3  1  4
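
To get from the rank table to the week-to-week comparison asked about in the question, one possible approach is to pivot the data into a week-by-weekday table and look up the following week's value on the day of each weekly high and low. The sketch below is only an illustration under some assumptions: it rebuilds the sample data shown above, the `week` and `weekday` columns and the `pick` helper are names made up here, and it keys weeks by ISO week number, which would need the year added for multi-year data.

```python
import pandas as pd

# Sample values from the example above (Wed 2012-08-01 through Fri 2012-08-31).
idx = pd.bdate_range("2012-08-01", "2012-08-31")
values = [61, 52, 89, 44, 35, 98, 64, 48, 100, 95, 14, 55,
          58, 11, 28, 95, 18, 81, 27, 81, 28, 16, 50]
df = pd.DataFrame({"Value": values}, index=idx)

# One row per ISO week, one column per weekday (0 = Monday ... 4 = Friday).
df["week"] = df.index.isocalendar()["week"].astype(int)
df["weekday"] = df.index.weekday
table = df.pivot(index="week", columns="weekday", values="Value")
table = table.dropna()  # keep only full five-day weeks

# Rank the five days within each week (1 = smallest, 5 = largest).
ranks = table.rank(axis=1)

# Weekday on which each week's high and low occurred.
high_day = table.idxmax(axis=1)
low_day = table.idxmin(axis=1)

# Row w of next_week holds the values of the next row in the table.
# If weeks were dropped, that is not necessarily the next calendar week.
next_week = table.shift(-1)


def pick(frame, days):
    """For each row of frame, select the column named in days."""
    return pd.Series([frame.at[w, d] for w, d in days.items()],
                     index=days.index)


summary = pd.DataFrame({
    "high_day": high_day,
    "high": table.max(axis=1),
    "high_next": pick(next_week, high_day),
    "low_day": low_day,
    "low": table.min(axis=1),
    "low_next": pick(next_week, low_day),
})
```

On the sample data, week 32 peaks on Wednesday with 98, and the following Wednesday's value is 14. The last week has no successor, so its `high_next` and `low_next` entries are NaN.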
