
Pandas: How to filter a Series

After calling groupby('name') and applying mean() to another column, I have a Series like this:

name
383      3.000000
663      1.000000
726      1.000000
737      9.000000
833      8.166667

Could anyone please show me how to filter out the rows with a mean value of 1.000000? Thank you, I greatly appreciate your help.

In [5]:

import pandas as pd

test = {
383:    3.000000,
663:    1.000000,
726:    1.000000,
737:    9.000000,
833:    8.166667
}

s = pd.Series(test)
s = s[s != 1]
s
Out[0]:
383    3.000000
737    9.000000
833    8.166667
dtype: float64
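For completeness, here is a minimal sketch of the whole pipeline from the question, starting from raw rows. The DataFrame contents and the column name score are assumptions made up for illustration; only the name column comes from the question. Because group means are floats, the sketch also shows a tolerance-based comparison with numpy.isclose, which may be safer when a mean is only approximately 1.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: 'score' is an assumed column name.
df = pd.DataFrame({
    "name": [383, 383, 663, 726, 737, 833, 833],
    "score": [2.0, 4.0, 1.0, 1.0, 9.0, 8.0, 9.0],
})

# Mean of 'score' per 'name' gives a Series indexed by name.
means = df.groupby("name")["score"].mean()

# Exact comparison, as in the answer above.
filtered = means[means != 1]

# Tolerance-based comparison for means that are only approximately 1.
filtered_close = means[~np.isclose(means, 1.0)]

print(filtered)
```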

From pandas version 0.18+, a Series can also be filtered as follows:

test = {
383:    3.000000,
663:    1.000000,
726:    1.000000,
737:    9.000000,
833:    8.166667
}

pd.Series(test).where(lambda x : x!=1).dropna()

See: http://pandas.pydata.org/pandas-docs/version/0.18.1/whatsnew.html#method-chaininng-improvements
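One caveat with the where(...).dropna() pattern: where replaces the rejected values with NaN, so the following dropna also removes any NaN that was already in the Series. The sketch below uses made-up data with one pre-existing NaN to show the difference from plain boolean indexing.

```python
import numpy as np
import pandas as pd

# Illustrative data: 726 is already NaN before any filtering.
s = pd.Series({383: 3.0, 663: 1.0, 726: np.nan, 737: 9.0})

masked = s.where(lambda x: x != 1)  # 663 becomes NaN, 726 stays NaN
via_where = masked.dropna()         # drops both 663 and 726
via_mask = s[s != 1]                # drops only 663, since NaN != 1 is True

print(via_where)
print(via_mask)
```

If missing values must survive the filter, boolean indexing is the safer choice.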

As DACW pointed out, there are method-chaining improvements in pandas 0.18.1 that do what you are looking for very nicely.

Rather than using .where, you can pass your function to either the .loc indexer or the Series indexer [] and avoid the call to .dropna:

test = pd.Series({
383:    3.000000,
663:    1.000000,
726:    1.000000,
737:    9.000000,
833:    8.166667
})

test.loc[lambda x : x!=1]

test[lambda x: x!=1]

Similar behavior is supported on the DataFrame and NDFrame classes.
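As a rough illustration of the DataFrame case, .loc also accepts a callable there; the callable receives the DataFrame and should return a boolean mask. The column name x is an assumption chosen for this sketch.

```python
import pandas as pd

# Same values as above, stored in a DataFrame column named 'x' (assumed name).
df = pd.DataFrame(
    {"x": [3.0, 1.0, 1.0, 9.0, 8.166667]},
    index=[383, 663, 726, 737, 833],
)

# The lambda receives the whole DataFrame and returns a boolean Series.
result = df.loc[lambda d: d["x"] != 1]

print(result)
```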

A fast way of doing this is to rebuild the Series, using numpy to slice the underlying arrays. See timings below.

mask = s.values != 1
pd.Series(s.values[mask], s.index[mask])

383    3.000000
737    9.000000
833    8.166667
dtype: float64

[Figure: naive timing]
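Since the timing image did not survive, here is a sketch of how the comparison could be reproduced with the standard timeit module. The data size, the random values, and the helper names boolean_index and numpy_rebuild are assumptions; actual numbers depend on the machine and the pandas version, so none are quoted here.

```python
import timeit

import numpy as np
import pandas as pd

# Assumed benchmark data: 100,000 floats drawn from {0, 1, 2, 3, 4}.
rng = np.random.default_rng(0)
s = pd.Series(rng.integers(0, 5, size=100_000).astype(float))


def boolean_index(series):
    # Plain pandas boolean indexing.
    return series[series != 1]


def numpy_rebuild(series):
    # Slice the underlying arrays and rebuild the Series.
    mask = series.values != 1
    return pd.Series(series.values[mask], series.index[mask])


# Both approaches should produce the same result.
same = boolean_index(s).equals(numpy_rebuild(s))

for fn in (boolean_index, numpy_rebuild):
    seconds = timeit.timeit(lambda: fn(s), number=20)
    print(f"{fn.__name__}: {seconds / 20 * 1e6:.1f} us per call")
```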

Another way is to first convert to a DataFrame and use the query method (assuming you have numexpr installed):

import pandas as pd

test = {
383:    3.000000,
663:    1.000000,
726:    1.000000,
737:    9.000000,
833:    8.166667
}

s = pd.Series(test)
s.to_frame(name='x').query("x != 1")
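Note that query returns a DataFrame. If a Series is wanted back, one option is to select the column again afterwards, as in this small sketch (the column name x is the same placeholder used above).

```python
import pandas as pd

s = pd.Series({383: 3.0, 663: 1.0, 726: 1.0, 737: 9.0, 833: 8.166667})

# Round trip: Series -> one-column DataFrame -> query -> back to a Series.
filtered = s.to_frame(name="x").query("x != 1")["x"]

print(filtered)
```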

If you like chained operations, you can also use the compress function (note that Series.compress was deprecated in pandas 0.24 and removed in 1.0, so this only works on older versions):

test = pd.Series({
383:    3.000000,
663:    1.000000,
726:    1.000000,
737:    9.000000,
833:    8.166667
})

test.compress(lambda x: x != 1)

# 383    3.000000
# 737    9.000000
# 833    8.166667
# dtype: float64
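On pandas versions where compress is not available, the same chained style can be written with .loc and a callable. The sketch below filters an intermediate result (the doubling step is just an arbitrary example) without storing it in a temporary variable.

```python
import pandas as pd

test = pd.Series({383: 3.0, 663: 1.0, 726: 1.0, 737: 9.0, 833: 8.166667})

# Transform, then filter the intermediate result in a single chain.
# The lambda receives the doubled Series, which has no name of its own.
result = test.mul(2).loc[lambda x: x != 2]

print(result)
```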

In my case I had a pandas Series where the values are tuples of characters:

Out[67]
0    (H, H, H, H)
1    (H, H, H, T)
2    (H, H, T, H)
3    (H, H, T, T)
4    (H, T, H, H)

Therefore I could use indexing to filter the Series, but to create the index I needed apply. My condition is "find all tuples which have exactly one 'H'".

series_of_tuples[series_of_tuples.apply(lambda x: x.count('H')==1)]

I admit it is not "chainable" (i.e. notice I repeat series_of_tuples twice; you must store any temporary Series in a variable so you can call apply(...) on it).

There may also be other methods (besides .apply(...)) which can operate elementwise to produce a Boolean index.
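Two such alternatives, sketched on a small made-up Series of tuples, are Series.map and a plain list comprehension. Both evaluate the condition once per element and yield a Boolean mask that can be used for indexing.

```python
import pandas as pd

# Illustrative data: a Series whose values are tuples of characters.
series_of_tuples = pd.Series([
    ("H", "H", "H", "H"),
    ("H", "H", "H", "T"),
    ("H", "T", "T", "T"),
    ("T", "T", "H", "T"),
    ("T", "T", "T", "T"),
])

# Series.map applies the function to each element.
mask_map = series_of_tuples.map(lambda t: t.count("H") == 1)

# A list comprehension builds the same mask as a plain list of bools.
mask_list = [t.count("H") == 1 for t in series_of_tuples]

by_map = series_of_tuples[mask_map]
by_list = series_of_tuples[mask_list]

print(by_map)
```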

Many other answers (including the accepted answer) use chainable functions like:

  • .compress()
  • .where()
  • .loc[]
  • []

These accept callables (lambdas) which are applied to the Series, not to the individual values in the Series!

Therefore my Series of tuples behaved strangely when I tried to use the above condition / callable / lambda with any of the chainable functions, like .loc[]:

series_of_tuples.loc[lambda x: x.count('H')==1]

Produces the error:

KeyError: 'Level H must be same as name (None)'

I was very confused, but it seems to be calling the Series.count function, i.e. series_of_tuples.count(...), which is not what I wanted.
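A small sketch can make this visible: the callable passed to .loc is handed the whole Series, so the elementwise test has to happen inside it, for example via apply. The helper name condition and the seen list are made up for this demonstration, and the data is illustrative.

```python
import pandas as pd

series_of_tuples = pd.Series([
    ("H", "H", "H", "H"),
    ("H", "T", "T", "T"),
    ("T", "T", "H", "T"),
])

seen = []  # records the type of whatever .loc passes to the callable


def condition(x):
    seen.append(type(x).__name__)
    # x is the whole Series, so apply the per-tuple test elementwise.
    return x.apply(lambda t: t.count("H") == 1)


result = series_of_tuples.loc[condition]

print(seen)
print(result)
```

Written this way the filter stays chainable, because series_of_tuples is only named once.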

I admit that an alternative data structure may be better:

  • A Category datatype?
  • A DataFrame (each element of the tuple becomes a column)
  • A Series of strings (just concatenate the tuples together):

This creates a Series of strings (i.e. by concatenating each tuple; joining the characters in the tuple into a single string):

series_of_tuples.apply(''.join)

So I can then use the chainable Series.str.count:

series_of_tuples.apply(''.join).str.count('H')==1
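The expression above only produces a Boolean Series. To actually select the matching tuples, that mask still has to be used as an index. A minimal sketch on illustrative data, including a chained variant that passes a callable to .loc:

```python
import pandas as pd

series_of_tuples = pd.Series([
    ("H", "H", "H", "H"),
    ("H", "T", "T", "T"),
    ("T", "T", "H", "T"),
    ("T", "T", "T", "T"),
])

# Build the mask from the joined strings, then index the original Series.
mask = series_of_tuples.apply("".join).str.count("H") == 1
filtered = series_of_tuples[mask]

# Chained variant: the lambda receives the Series and returns the mask.
chained = series_of_tuples.loc[
    lambda s: s.apply("".join).str.count("H") == 1
]

print(filtered)
```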
