Is there a query method or similar for pandas Series (pandas.Series.query())?
The pandas.DataFrame.query() method is very useful for (pre/post-)filtering data when loading or plotting. It comes in particularly handy for method chaining.
I often find myself wanting to apply the same logic to a pandas.Series, e.g. after calling a method such as df.value_counts, which returns a pandas.Series.
Let's assume there is a huge table with the columns Player, Game, Points, and I want to plot a histogram of the players who scored 3 points more than 14 times. I first have to sum the points of each player (groupby -> agg), which will return a Series of ~1000 players and their overall points. Applying the .query logic, it would look something like this:
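import random
import pandas as pd

df = pd.DataFrame({
    'Points': [random.choice([1, 3]) for x in range(100)],
    'Player': [random.choice(["A", "B", "C"]) for x in range(100)]})

# Wished-for API: pandas.Series has no .query() method, so this does not run.
(df
 .query("Points == 3")
 .Player.value_counts()
 .query("> 14")
 .hist())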
The only solutions I find force me to do an unnecessary assignment and break the method chaining:
points_series = (df
                 .query("Points == 3")
                 .groupby("Player").size())
points_series[points_series > 100].hist()
Method chaining and the query method both help keep the code legible, whereas subsetting and filtering can get messy quite quickly.
# just to make my point :)
series_bestplayers_under_100[series_prefiltered_under_100 > 0].shape
Please help me out of my dilemma! Thanks
IIUC you can add query("Points > 100"):
# np.inf is used here because the np.Inf alias was removed in NumPy 2.0
df = pd.DataFrame({'Points': [50, 20, 38, 90, 0, np.inf],
                   'Player': ['a', 'a', 'a', 's', 's', 's']})
print (df)
Player Points
0 a 50.000000
1 a 20.000000
2 a 38.000000
3 s 90.000000
4 s 0.000000
5 s inf
points_series = df.query("Points < inf").groupby("Player").agg({"Points": "sum"})['Points']
print (points_series)
a = points_series[points_series > 100]
print (a)
Player
a 108.0
Name: Points, dtype: float64
# parentheses are needed so the chain can span several lines
points_series = (df.query("Points < inf")
                   .groupby("Player")
                   .agg({"Points": "sum"})
                   .query("Points > 100"))
print (points_series)
Points
Player
a 108.0
Another solution is Selection By Callable:
# parentheses are needed so the chain can span several lines
points_series = (df.query("Points < inf")
                   .groupby("Player")
                   .agg({"Points": "sum"})['Points']
                   .loc[lambda x: x > 100])
print (points_series)
Player
a 108.0
Name: Points, dtype: float64
Edited answer for the edited question:
np.random.seed(1234)
df = pd.DataFrame({
'Points': [np.random.choice([1,3]) for x in range(100)],
'Player': [np.random.choice(["A","B","C"]) for x in range(100)]})
print (df.query("Points == 3").Player.value_counts().loc[lambda x: x > 15])
C 19
B 16
Name: Player, dtype: int64
print (df.query("Points == 3").groupby("Player").size().loc[lambda x: x > 15])
Player
B 16
C 19
dtype: int64
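As a rough check that the callable passed to .loc behaves like the assignment-plus-mask version from the question, here is a minimal sketch. The toy data and the threshold of 1 are assumptions chosen so the counts can be verified by hand; they are not from the original post.

```python
import pandas as pd

# Hypothetical toy data: player A scores 3 points three times, B twice, C once.
df = pd.DataFrame({
    "Points": [3, 3, 3, 1, 3, 3, 1, 3],
    "Player": ["A", "A", "A", "A", "B", "B", "B", "C"],
})

# Chained version: filter rows, count per player, then filter the Series
# with a callable so no intermediate variable is needed.
chained = (
    df.query("Points == 3")
      .groupby("Player")
      .size()
      .loc[lambda s: s > 1]
)

# Equivalent version with an intermediate assignment and a boolean mask.
counts = df.query("Points == 3").groupby("Player").size()
masked = counts[counts > 1]

print(chained)
```

The callable receives the Series produced by the previous step of the chain, which is why it can stand in for the named intermediate.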
Why not convert from Series to DataFrame, do the querying, and then convert back?
df["Points"] = df["Points"].to_frame().query('Points > 100')["Points"]
Here, .to_frame() converts to a DataFrame, while the trailing ["Points"] converts back to a Series.
The method .query() can then be used consistently whether the pandas object has one column or several.
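The round trip could be wrapped in a small helper so it fits into a chain. This is only a sketch: query_series and its column parameter are hypothetical names introduced for illustration, not part of pandas.

```python
import pandas as pd

def query_series(series, expr, column="value"):
    """Hypothetical helper: filter a Series with a DataFrame.query expression.

    The Series is turned into a one-column DataFrame whose column is named
    `column`, so `expr` must refer to that name.
    """
    out = series.to_frame(name=column).query(expr)[column]
    out.name = series.name  # restore the original Series name
    return out

# Assumed example data, not taken from the original post.
s = pd.Series([5, 20, 108, 150], index=["a", "b", "c", "d"], name="Points")

result = query_series(s, "value > 100")
print(result)
```

Because the helper takes the Series as its first argument, it can also be called as s.pipe(query_series, "value > 100") inside a longer chain.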
Instead of query, you can use pipe:
s.pipe(lambda x: x[x>0]).pipe(lambda x: x[x<10])
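A self-contained version of the pipe approach might look like the following. The sample values and the between_exclusive function are assumptions made for this sketch; pipe simply passes the Series as the first argument to whatever function it is given.

```python
import pandas as pd

# Assumed sample data for illustration.
s = pd.Series([-3, 0, 4, 9, 10, 25])

def between_exclusive(series, low, high):
    """Hypothetical filter: keep values strictly between low and high."""
    return series[(series > low) & (series < high)]

# A named function with keyword arguments forwarded by pipe.
result = s.pipe(between_exclusive, low=0, high=10)

# The two-step lambda form from the answer gives the same rows.
two_step = s.pipe(lambda x: x[x > 0]).pipe(lambda x: x[x < 10])

print(result)
```

A named function keeps the chain readable when the same filter is reused, while the lambda form is shorter for one-off conditions.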