Faster way to run countifs in Python

I previously asked how to do a COUNTIFS in Python across multiple data frames, just as you can run COUNTIFS across separate worksheets in Excel. Somebody gave me a very creative answer:

python pandas countifs using multiple criteria AND multiple data frames

Thank you for that, @AlexG. I tried it, and it worked superbly:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

#import the data
students = pd.read_csv("Student Detail stump.csv")
exams = pd.read_csv("Exam Detail stump.csv")

#get data parameters
student_info = students[['Student Number', 'Enrollment Date', 'Detail Date']].values

#prepare an empty list to hold the results
N_exams_passed = []

#count records in data set according to parameters
for s_id, s_enroll, s_qual in student_info:
    # rows for this student, inside the date window, with a passing grade
    matches = exams[(exams['Student Number'] == s_id) &
                    (exams['Exam Grade Date'] >= s_enroll) &
                    (exams['Exam Grade Date'] <= s_qual) &
                    (exams['Exam Grade'] >= 70)]
    N_exams_passed.append(len(matches))

#add the results to the original data set
students['Exams Passed'] = N_exams_passed

However, it only worked effectively on small data sets. When I ran it on data with hundreds of thousands of rows, it had not finished even after running overnight. It doesn't seem very pythonic.

In SQL you can do this in seconds with a correlated subquery, like this:

SELECT
    s.*,
    (SELECT COUNT(e.[Exam Grade])
     FROM
         exams AS e
     WHERE
         e.[Exam Grade] >= 65
         AND e.[Student Number] = s.[Student Number]
         AND e.[Exam Grade Date] >= s.[Enrollment Date]
         AND e.[Exam Grade Date] <= s.[Detail Date])
    AS ExamsPassed
FROM
    students AS s;

How do I reproduce such a correlated subquery in pandas, or in some other pythonic way?

Here are the data frames:

 #Students
 Student Number Enroll Date Detail Date
 1              1/1/2016    2/1/2016
 1              1/1/2016    3/1/2016
 2              2/1/2016    3/1/2016
 3              3/1/2016    4/1/2016

 #Exams
 Student Number Exam Date   Exam Grade
 1              1/1/2016    50
 1              1/15/2016   80
 1              1/28/2016   90
 1              2/5/2016    100
 1              3/5/2016    80
 1              4/5/2016    40
 2              2/2/2016    85
 2              2/3/2016    10
 2              2/4/2016    100

The final data frame should look like this, with a count of 'Passed Exams' at the end:

 #FinalResult
 Student Number Enroll Date Detail Date Passed Exams
 1              1/1/2016    2/1/2016    2
 1              1/1/2016    3/1/2016    3
 2              2/1/2016    3/1/2016    2
 3              3/1/2016    4/1/2016    0

If I understand the structure of your dataframes correctly, I'd suggest merging the two dataframes and then performing the task on the merged data using numpy.where.

import numpy as np

# attach each student's date window to every exam row for that student
exams = exams.merge(students, on='Student Number', how='left')

# flag exams that fall inside the window and meet the passing grade
exams['Passed'] = np.where(
    (exams['Exam Grade Date'] >= exams['Enrollment Date']) &
    (exams['Exam Grade Date'] <= exams['Detail Date']) &
    (exams['Exam Grade'] >= 70),
    1, 0)

# count passes per (student, detail date) and merge the counts back
students = students.merge(
    exams.groupby(['Student Number', 'Detail Date'])['Passed'].sum().reset_index(),
    on=['Student Number', 'Detail Date'],
    how='left')

# students with no exams get NaN from the left merge; treat that as 0
students['Passed'] = students['Passed'].fillna(0).astype('int')

Note: you'll need to make sure the date columns are properly stored as datetimes (you can use pandas.to_datetime to do this).
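A minimal sketch of that conversion follows. The three rows are made up for illustration, and the month/day/year format string is an assumption based on how the dates appear in the question. If the dates were left as strings, >= and <= would compare them as text rather than as dates.

```python
import pandas as pd

# Hypothetical sample rows; dates are month/day/year strings as in the question.
exams = pd.DataFrame({
    "Student Number": [1, 1, 2],
    "Exam Grade Date": ["1/15/2016", "2/5/2016", "2/2/2016"],
    "Exam Grade": [80, 100, 85],
})

# Parse the strings into datetime64 values so comparisons work on real dates.
exams["Exam Grade Date"] = pd.to_datetime(
    exams["Exam Grade Date"], format="%m/%d/%Y"
)
```

The same call would be applied to the Enrollment Date and Detail Date columns of the students frame before any comparison.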

numpy.where creates a new array whose values are one thing (1 in the example above) where the conditions you specify are met and another (0) where they aren't.
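As a small illustration of that behaviour on its own, here is numpy.where applied to a made-up series of grades with the same pass mark of 70:

```python
import numpy as np
import pandas as pd

# Hypothetical grades, not taken from the question's data.
grades = pd.Series([50, 80, 90, 40])

# 1 where the condition holds, 0 where it does not.
passed = np.where(grades >= 70, 1, 0)
print(passed)  # [0 1 1 0]
```

Because the whole column is evaluated in one vectorized call, there is no Python-level loop over rows.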

The expression exams.groupby(['Student Number', 'Detail Date'])['Passed'].sum() produces a Series whose index is the (Student Number, Detail Date) pair and whose values are the counts of passed exams for that combination. reset_index() turns it into a dataframe for merging.
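To check the approach end to end, the sketch below rebuilds the question's sample data and runs the same merge, flag, group, and merge-back steps. The column names 'Enrollment Date', 'Exam Grade Date', and 'Exam Grade' are assumed to match the code in the question (the sample tables label them slightly differently), and `merged`, `counts`, and `result` are illustrative variable names.

```python
import numpy as np
import pandas as pd

def to_dates(values):
    # Month/day/year strings, as written in the question's tables.
    return pd.to_datetime(values, format="%m/%d/%Y")

students = pd.DataFrame({
    "Student Number": [1, 1, 2, 3],
    "Enrollment Date": to_dates(["1/1/2016", "1/1/2016", "2/1/2016", "3/1/2016"]),
    "Detail Date": to_dates(["2/1/2016", "3/1/2016", "3/1/2016", "4/1/2016"]),
})
exams = pd.DataFrame({
    "Student Number": [1, 1, 1, 1, 1, 1, 2, 2, 2],
    "Exam Grade Date": to_dates([
        "1/1/2016", "1/15/2016", "1/28/2016", "2/5/2016", "3/5/2016",
        "4/5/2016", "2/2/2016", "2/3/2016", "2/4/2016",
    ]),
    "Exam Grade": [50, 80, 90, 100, 80, 40, 85, 10, 100],
})

# One row per (exam, student window) pair for the same student.
merged = exams.merge(students, on="Student Number", how="left")
merged["Passed"] = np.where(
    (merged["Exam Grade Date"] >= merged["Enrollment Date"])
    & (merged["Exam Grade Date"] <= merged["Detail Date"])
    & (merged["Exam Grade"] >= 70),
    1, 0)

# Series indexed by (Student Number, Detail Date), then flattened for merging.
counts = (merged.groupby(["Student Number", "Detail Date"])["Passed"]
          .sum()
          .reset_index())

result = students.merge(counts, on=["Student Number", "Detail Date"], how="left")
result["Passed"] = result["Passed"].fillna(0).astype(int)
print(result["Passed"].tolist())  # [2, 3, 2, 0]
```

Student 3 has no exam rows, so the left merge yields NaN for that row, which fillna(0) turns into a count of zero. Note that the intermediate merge has one row per exam per student window, so memory use grows if a student has many windows.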
