Faster way to run COUNTIFS in Python
I previously asked how to do a COUNTIFS in Python across multiple data frames, just as you can do COUNTIFS across separate worksheets in Excel. Somebody gave me a very creative answer:
python pandas countifs using multiple criteria AND multiple data frames
Thank you for that, @AlexG. I tried it, and it worked superbly:
import pandas as pd
import numpy as np
import matplotlib as plt

# import the data
students = pd.read_csv("Student Detail stump.csv")
exams = pd.read_csv("Exam Detail stump.csv")

# get data parameters
student_info = students[['Student Number', 'Enrollment Date', 'Detail Date']].values

# prepare an empty list to hold the results
N_exams_passed = []

# count records in data set according to parameters
for s_id, s_enroll, s_qual in student_info:
    N_exams_passed.append(len(exams[(exams['Student Number'] == s_id) &
                                    (exams['Exam Grade Date'] >= s_enroll) &
                                    (exams['Exam Grade Date'] <= s_qual) &
                                    (exams['Exam Grade'] >= 70)])
                          )

# add the results to the original data set
students['Exams Passed'] = N_exams_passed
However, it only worked well on small data sets. When I ran it on data with hundreds of thousands of rows, it didn't finish even overnight. It doesn't seem very pythonic.
In SQL you can do this in seconds with a correlated subquery, like this:
SELECT
s.*,
(SELECT COUNT(e.[Exam Grade])
FROM
exams AS e
WHERE
e.[Exam Grade] >= 65
AND e.[Student Number] = s.[Student Number]
AND e.[Exam Grade Date] >= s.[Enrollment Date]
AND e.[Exam Grade Date] <= s.[Detail Date])
AS ExamsPassed
FROM
students AS s;
How do I reproduce such a correlated subquery in pandas or some other pythonic way?
Here are the data frames:
#Students
Student Number Enroll Date Detail Date
1 1/1/2016 2/1/2016
1 1/1/2016 3/1/2016
2 2/1/2016 3/1/2016
3 3/1/2016 4/1/2016
#Exams
Student Number Exam Date Exam Grade
1 1/1/2016 50
1 1/15/2016 80
1 1/28/2016 90
1 2/5/2016 100
1 3/5/2016 80
1 4/5/2016 40
2 2/2/2016 85
2 2/3/2016 10
2 2/4/2016 100
The final data frame should look like this, with a count of 'Passed Exams' at the end:
#FinalResult
Student Number Enroll Date Detail Date Passed Exams
1 1/1/2016 2/1/2016 2
1 1/1/2016 3/1/2016 3
2 2/1/2016 3/1/2016 2
3 3/1/2016 4/1/2016 0
If I understand the structure of your dataframes correctly, I'd suggest merging the two dataframes and then performing the task on the merged data using numpy.where.
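import numpy as np

# pair every exam with each row of its student (one row per Detail Date)
exams = exams.merge(students, on='Student Number', how='left')

# flag exams that fall inside the date window and meet the passing grade
exams['Passed'] = np.where(
    (exams['Exam Grade Date'] >= exams['Enrollment Date']) &
    (exams['Exam Grade Date'] <= exams['Detail Date']) &
    (exams['Exam Grade'] >= 70),
    1, 0)

# sum the flags per student row and attach the counts to students
students = students.merge(
    exams.groupby(['Student Number', 'Detail Date'])['Passed'].sum().reset_index(),
    left_on=['Student Number', 'Detail Date'],
    right_on=['Student Number', 'Detail Date'],
    how='left')

# students with no exams get NaN from the left merge; count them as 0
students['Passed'] = students['Passed'].fillna(0).astype('int')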
Note: you'll need to make sure the date columns are properly stored as datetimes (you can use pandas.to_datetime to do this).
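As a rough sketch of that conversion: the two-row frame below is made up for illustration, and the month/day/year format string is an assumption based on how the sample dates in the question look. If the dates stay as strings, comparisons such as >= are done alphabetically rather than chronologically.

```python
import pandas as pd

# Hypothetical frame with dates still stored as strings, as read_csv would return them
students = pd.DataFrame({
    'Student Number': [1, 2],
    'Enrollment Date': ['1/1/2016', '2/1/2016'],
    'Detail Date': ['2/1/2016', '3/1/2016'],
})

# Convert each date column in place; the format assumes month/day/year
for col in ['Enrollment Date', 'Detail Date']:
    students[col] = pd.to_datetime(students[col], format='%m/%d/%Y')

print(students.dtypes)
```

The same loop would apply to the date column of the exams frame before any comparison is made.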
numpy.where creates a new array whose values are one thing (1 in the example above) where the conditions you specify are met and another (0) where they aren't.
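A tiny illustration of that behaviour, using made-up arrays rather than the real data (the names grades and in_window are hypothetical):

```python
import numpy as np

grades = np.array([50, 80, 90, 40])               # made-up exam grades
in_window = np.array([True, True, False, True])   # made-up "date is in range" flags

# 1 where both conditions hold, 0 everywhere else
passed = np.where((grades >= 70) & in_window, 1, 0)

print(passed)        # [0 1 0 0]
print(passed.sum())  # 1
```

Because the result is an array of 0s and 1s, summing it gives the number of rows that met every condition, which is what the groupby step relies on.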
The line exams.groupby(['Student Number', 'Detail Date'])['Passed'].sum() produces a series in which the index is Student Number and Detail Date and the values are the counts of passed exams corresponding to that Student Number and Detail Date combination. The reset_index() call turns it into a dataframe for merging.
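To check the approach end to end, here is a self-contained sketch that rebuilds the sample data from the question and runs the merge, numpy.where and groupby steps on it. It assumes the column names used in the question's loop ('Enrollment Date', 'Exam Grade Date', 'Exam Grade'), which differ slightly from the headers in the sample tables, and it uses separate names (merged, counts, result) so the inputs are not overwritten.

```python
import numpy as np
import pandas as pd

fmt = '%m/%d/%Y'  # assumed month/day/year format of the sample dates

students = pd.DataFrame({
    'Student Number': [1, 1, 2, 3],
    'Enrollment Date': pd.to_datetime(['1/1/2016', '1/1/2016', '2/1/2016', '3/1/2016'], format=fmt),
    'Detail Date': pd.to_datetime(['2/1/2016', '3/1/2016', '3/1/2016', '4/1/2016'], format=fmt),
})
exams = pd.DataFrame({
    'Student Number': [1, 1, 1, 1, 1, 1, 2, 2, 2],
    'Exam Grade Date': pd.to_datetime(
        ['1/1/2016', '1/15/2016', '1/28/2016', '2/5/2016', '3/5/2016',
         '4/5/2016', '2/2/2016', '2/3/2016', '2/4/2016'], format=fmt),
    'Exam Grade': [50, 80, 90, 100, 80, 40, 85, 10, 100],
})

# Pair every exam with each row of its student
merged = exams.merge(students, on='Student Number', how='left')

# Flag exams inside the date window with a passing grade
merged['Passed'] = np.where(
    (merged['Exam Grade Date'] >= merged['Enrollment Date']) &
    (merged['Exam Grade Date'] <= merged['Detail Date']) &
    (merged['Exam Grade'] >= 70),
    1, 0)

# Count the flags per (student, detail date) and attach them to students
counts = merged.groupby(['Student Number', 'Detail Date'])['Passed'].sum().reset_index()
result = students.merge(counts, on=['Student Number', 'Detail Date'], how='left')

# Student 3 has no exams, so the left merge leaves NaN there
result['Passed'] = result['Passed'].fillna(0).astype(int)

print(result['Passed'].tolist())  # [2, 3, 2, 0]
```

The counts match the 'Passed Exams' column in the expected result from the question. Note that the intermediate merge holds one row per exam per matching student row, so memory use grows with the number of repeated student rows.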