How do I binary search a pandas DataFrame for a combination of column values?

Sorry if this is a simple question that the pandas documentation explains, but I've tried searching for how to do this and haven't had any luck.

I have a pandas DataFrame with several columns, and I want to look up a particular row using binary search, since my dataset is big and I'll be doing a lot of searches.

My data looks like this:

Name           Course   Week  Grade
-------------  -------  ----  -----
Homer Simpson  MATH001  1     97
Homer Simpson  MATH001  3     85
Homer Simpson  CSCI100  1     89
John McGuirk   MATH001  2     78
John McGuirk   CSCI100  1     100
John McGuirk   CSCI100  2     96

I want to be able to search my data quickly for a specific combination of name, course, and week. Each distinct combination of name, course, and week will have either zero or one row in the dataset. If there is no row for the combination I'm searching for, I want my search to return 0.

For instance, I would like to search for the value (John McGuirk, CSCI100, 1).

Is there a built-in way to do this, or do I have to write my own binary search?

Update:

I tried doing this using the built-in way that was suggested by one of the commenters below. I also tried a custom binary search written for my specific data, and another custom binary search that uses recursion so it can handle a different set of columns than my specific example.

The data frame for these tests contains 10,000 rows. I put the timings below. Both binary searches performed better than using a boolean mask with [...] to get rows. I'm far from a Python expert, so I'm not sure how well optimized my code is.

# Load data
import math
import time

import pandas as pd

file = 'grades.xlsx'
df = pd.read_excel(file)

# This was suggested by one of the commenters below
def get_grade(name, course, week):
    mask = (df.name.values == name) & (df.course.values == course) & (df.week.values == week)
    row = df[mask]
    if not row.empty:
        return row.grade.values[0]
    else:
        return 0

# Binary search that is specific to my particular data
def get_grade_binary_search(name, course, week):
    lower = 0
    upper = len(df.index) - 1

    while lower <= upper:
        mid = math.floor((lower + upper) / 2)

        row_name = df.iat[mid, 0]            
        if name < row_name:
            upper = mid - 1
        elif name > row_name:
            lower = mid + 1
        else:
            row_course = df.iat[mid, 1]
            if course < row_course:
                upper = mid - 1
            elif course > row_course:
                lower = mid + 1
            else:
                row_week = df.iat[mid, 2]
                if week < row_week:
                    upper = mid - 1
                elif week > row_week:
                    lower = mid + 1
                else:
                    return df.iat[mid, 3]

    return 0    

# General purpose binary search
def get_grade_binary_search_recursive(search_value):
    lower = 0
    upper = len(df.index) - 1

    while lower <= upper:
        mid = math.floor((lower + upper) / 2)

        comparison = compare(search_value, 0, mid)

        if comparison < 0:
            upper = mid - 1
        elif comparison > 0:
            lower = mid + 1
        else:
            return df.iat[mid, len(search_value)]

    # No row matched: return 0, consistent with the other search functions
    return 0

# Utility method
def compare(search_value, search_column_index, df_value_index):      
    if search_column_index >= len(search_value):
        return 0

    if search_value[search_column_index] < df.iat[df_value_index, search_column_index]:
        return -1
    elif search_value[search_column_index] > df.iat[df_value_index, search_column_index]:
        return 1
    else:
        return compare(search_value, search_column_index + 1, df_value_index)

Here are the timings. I also printed the sum of the returned values from each search to verify that the same rows are getting returned.

# Non binary search
sum_of_grades = 0
start = time.time()   
for week in range(first_week, last_week + 1):
    for name in names:
        for course in courses:
            val = get_grade(name, course, week)
            sum_of_grades += val                
end = time.time()    
print('elapsed time: ', end - start)
print('sum of grades: ', sum_of_grades)

elapsed time: 26.130020141601562

sum of grades: 498724

# Binary search specific to this data
sum_of_grades = 0
start = time.time()    
for week in range(first_week, last_week + 1):
    for name in names:
        for course in courses:
            val = get_grade_binary_search(name, course, week)
            sum_of_grades += val

end = time.time()    
print('elapsed time: ', end - start)
print('sum of grades: ', sum_of_grades)

elapsed time: 4.4506165981292725

sum of grades: 498724

# Binary search with recursion
sum_of_grades = 0
start = time.time()
for week in range(first_week, last_week + 1):
    for name in names:
        for course in courses:
            val = get_grade_binary_search_recursive([name, course, week])
            sum_of_grades += val           
end = time.time()    
print('elapsed time: ', end - start)
print('sum_of_grades: ', sum_of_grades)

elapsed time: 7.559535264968872

sum_of_grades: 498724

Pandas has searchsorted.
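As a minimal sketch of the call on a single sorted column: `Series.searchsorted` returns the position where a value would be inserted to keep the order, so a membership check still has to compare the element at that position. The `weeks` Series and the `contains_sorted` helper are made-up names for illustration, not part of the question's code.

```python
import pandas as pd

# Hypothetical column, already sorted ascending (searchsorted assumes this).
weeks = pd.Series([1, 2, 3, 5, 8])


def contains_sorted(sorted_series, value):
    """Return True if value occurs in an ascending-sorted Series."""
    # Leftmost position where value could be inserted while keeping order.
    pos = sorted_series.searchsorted(value)
    # The value is present only if that position holds an equal element.
    return bool(pos < len(sorted_series) and sorted_series.iat[pos] == value)


print(weeks.searchsorted(5))
print(contains_sorted(weeks, 5))
print(contains_sorted(weeks, 4))
```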

From the Notes:

Binary search is used to find the required insertion points.
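For the three-column lookup in the question, one possible approach is to sort the frame by the key columns once and then narrow a row range with `searchsorted` one column at a time. This is a sketch under assumptions, not something taken from the pandas documentation: the column names follow the table in the question (`Name`, `Course`, `Week`, `Grade`), and `lookup_grade` with its parameters is a hypothetical helper.

```python
import pandas as pd

# Sample rows copied from the question's table.
df = pd.DataFrame(
    {
        "Name": ["Homer Simpson"] * 3 + ["John McGuirk"] * 3,
        "Course": ["MATH001", "MATH001", "CSCI100", "MATH001", "CSCI100", "CSCI100"],
        "Week": [1, 3, 1, 2, 1, 2],
        "Grade": [97, 85, 89, 78, 100, 96],
    }
)

KEY_COLS = ["Name", "Course", "Week"]

# Binary search needs sorted data, so sort by the key columns once up front.
df = df.sort_values(KEY_COLS).reset_index(drop=True)


def lookup_grade(frame, key, key_cols=KEY_COLS, value_col="Grade", default=0):
    """Look up value_col for a key tuple in a frame sorted by key_cols.

    Returns default when no row matches the key.
    """
    lo, hi = 0, len(frame)  # candidate rows are frame.iloc[lo:hi]
    for col, target in zip(key_cols, key):
        window = frame[col].iloc[lo:hi]
        # Rows equal to target inside the window lie in [left, right).
        left = window.searchsorted(target, side="left")
        right = window.searchsorted(target, side="right")
        lo, hi = lo + left, lo + right
        if lo == hi:  # empty range: the combination is not present
            return default
    return frame[value_col].iat[lo]


print(lookup_grade(df, ("John McGuirk", "CSCI100", 1)))
print(lookup_grade(df, ("Homer Simpson", "CSCI100", 2)))
```

Within the rows that share the earlier key values, each later column is itself sorted, which is what lets the range shrink column by column. Slicing a Series on every call has some overhead, so extracting the columns with `to_numpy()` beforehand may be worth measuring if lookups dominate the run time.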
