
Python pandas: subclass Series and DataFrames

I want to add new methods and attributes to pandas Series and DataFrames. Here is a very simplistic example: I want a method that counts the number of times the difference between a row and the previous one is not 1.

Here is what I have so far, by subclassing the pandas objects:

import pandas as pd

class Serie(pd.Series):

    def gaps(self):
        return (self.diff().fillna(1) != 1).sum()

class DataSet(pd.DataFrame):

    _constructor_sliced = Serie
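For comparison, the pandas subclassing guide also describes a `_constructor` property, which controls the type returned by operations on the object itself. The following is only a sketch of how the two classes above might look with those hooks filled in; the property-based style is an assumption taken from that guide, not something the snippet above requires.

```python
import pandas as pd


class Serie(pd.Series):

    @property
    def _constructor(self):
        # Operations such as diff() hand back a Serie instead of a plain Series.
        return Serie

    @property
    def _constructor_expanddim(self):
        # Used when a Serie is expanded into a frame, e.g. by to_frame().
        return DataSet

    def gaps(self):
        # Count the rows whose difference to the previous row is not 1.
        return (self.diff().fillna(1) != 1).sum()


class DataSet(pd.DataFrame):

    @property
    def _constructor(self):
        # Operations such as df[["A"]] hand back a DataSet.
        return DataSet

    @property
    def _constructor_sliced(self):
        # Selecting a single column hands back a Serie.
        return Serie


df = DataSet({'A': [1, 2, 4], 'B': [3, 2, 1]})
column = df['A']
n_gaps = column.gaps()
```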

But based on this answer, it seems that I can do this instead:

def gaps(self):
    return (self.diff().fillna(1) != 1).sum()

pd.Series.gaps = gaps

It seems to work equally well!

In[1]: df = pd.DataFrame({'A':[1,2,4], 'B':[3,2,1]})
In[2]: df.A.gaps()
Out[2]: 1

Now the question: what is the best practice for such situations? The second option seems much simpler than subclassing, but I may be missing something... Are there caveats to doing that? Or maybe there are other options I missed.

A very simple solution would be to just use a function like your gaps function, but renaming "self" to "serie":

def gaps(serie):
    return (serie.diff().fillna(1) != 1).sum()
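As a rough usage sketch (the DataFrame is the one from the question; `Series.pipe` and `DataFrame.apply` are standard pandas methods, but using them here is just one possible style), the plain function can be called directly, chained, or applied to every column:

```python
import pandas as pd


def gaps(serie):
    # Count the rows whose difference to the previous row is not 1.
    return (serie.diff().fillna(1) != 1).sum()


df = pd.DataFrame({'A': [1, 2, 4], 'B': [3, 2, 1]})

direct = gaps(df['A'])        # plain function call
piped = df['A'].pipe(gaps)    # same call, written in method-chaining style
per_column = df.apply(gaps)   # one result per column, returned as a Series
```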

It keeps concerns cleanly separated (your code vs. pandas code).

It's more readable: you don't need to understand a lot of things to understand how it works. It's just simple.

It's less surprising: a dev on your team may spend some time searching for the gaps() documentation in the pandas Series documentation without finding it, only to discover a few hours later that someone (you) monkey patched it.

It's also the shortest solution.

It avoids using "private" members like _constructor_sliced, whose name may change in the future and break your implementation.

It avoids future conflicts: what if the next release of pandas includes a gaps method in the Series object? It won't directly break, but it will bite a developer on your team who wants to use the "now well known .gaps()", is not aware that you "changed it", and has a hard time debugging why it does not work according to the documentation.
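To make the conflict scenario concrete, here is a small simulation. `FutureSeries` is a hypothetical stand-in for a later pandas release that ships its own gaps method (no such method is known to exist); the patch is applied to that stand-in rather than to pd.Series, so nothing in pandas itself is modified.

```python
import pandas as pd


class FutureSeries(pd.Series):
    """Hypothetical stand-in for a pandas release with a built-in gaps()."""

    def gaps(self):
        # Pretend this is the documented library behaviour.
        return "library behaviour"


def gaps(self):
    # The project's own definition, as in the question.
    return (self.diff().fillna(1) != 1).sum()


s = FutureSeries([1, 2, 4])
before = s.gaps()

# Same assignment pattern as `pd.Series.gaps = gaps`:
# the library's method is replaced without any warning.
FutureSeries.gaps = gaps
after = s.gaps()
```

After the assignment, the already-created object follows the patched definition, and anyone reading the library documentation would expect the other behaviour.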
