
Pandas vectorization: Cumulative sum based on JSON files

I'm trying to sum a score based on values in a DataFrame and two JSON files. I have a minimal example and a minimal solution, but it needs to be vectorized somehow: in the real case there are over a million rows, and it took ~40 min to run through 1% of them.

My first.json file is:

{
    "variables" : {
        "var_1": {
            "values": [
                {
                    "lb": 1.0,
                    "b_cumul": 0.04
                },
                {
                    "lb": 3.0,
                    "b_cumul": 0.28
                }
            ]
        },
        "var_2": {
            "values": [
                {
                    "lb": 0,
                    "b_cumul": -0.09
                },
                {
                    "lb": 1,
                    "b_cumul": 0.14
                },
                {
                    "lb": 4,
                    "b_cumul": 0.03
                }
            ]
        },
        "var_4": {
            "values": [
                {
                    "lb": "1",
                    "b_cumul": 0.06
                }
            ]
        }
    }
}

My second.json file is:

{
    "variables" : {
        "var_1": {
            "values": [
                {
                    "lb": 1.0,
                    "b_cumul": -0.15
                },
                {
                    "lb": 2.0,
                    "b_cumul": 0.06
                },
                {
                    "lb": 4.0,
                    "b_cumul": 0.02
                },
                {
                    "lb": 5.0,
                    "b_cumul": 0.15
                }
            ]
        },
        "var_3": {
            "values": [
                {
                    "lb": 0.0,
                    "b_cumul": 0.12
                },
                {
                    "lb": 2.0,
                    "b_cumul": 0.25
                }
            ]
        },
        "var_6": {
            "values": [
                {
                    "lb": 0.0,
                    "b_cumul": -0.16
                },
                {
                    "lb": 1.0,
                    "b_cumul": -0.06
                }
            ]
        }
    }
}

This is our initial DataFrame:

import pandas as pd
# setup initial test-data
usage = ['first', 'second', 'first', 'second', 'second', 'second', 'first']
var_1 = [-1, -1, 0, 1, 3, 8, 2]
var_2 = [1, 3, -1, 0, 9, 2, 1]
var_3 = [0, 1, 0, -1, -1, 42, 3]
df = pd.DataFrame({'usage': usage, 'var_1': var_1, 'var_2': var_2, 'var_3': var_3})

The variables var_1, var_2, var_3 available in df decide which variables we want to look at in the JSON files. The scores should be cumulative sums retrieved from the JSON files, depending on the values in df.

Looking at my first row, I have (var_1=-1, var_2=1, var_3=0). Since usage='first' for this row, I need to check in first.json what scores these variables correspond to. var_3 does not exist in first.json, so this variable contributes a score of 0. var_1=-1, so this also contributes 0. var_2=1, so here we need to look in first.json and sum the scores for entries with lb values <=1, which in this case is -0.09+0.14=0.05. So we want to add this information to the DataFrame via df.loc[0, 'score_sum'] = 0 + 0 - 0.09 + 0.14.
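As a quick sanity check of that rule (a standalone sketch, not part of the original code), the first-row score for var_2 can be reproduced directly from the first.json entries:

```python
# Entries for var_2, copied from first.json.
first_var_2 = [
    {"lb": 0, "b_cumul": -0.09},
    {"lb": 1, "b_cumul": 0.14},
    {"lb": 4, "b_cumul": 0.03},
]

value = 1  # var_2 in the first row of df
# Cumulative score: sum b_cumul over all entries whose lb <= value.
score = sum(e["b_cumul"] for e in first_var_2 if e["lb"] <= value)
print(round(score, 2))  # 0.05
```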

I have solved this with the code below, but as mentioned this is very inefficient and does not scale to a larger df.

import json
from numbers import Number

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# return score based on given value and score ranges
def calculate_score(value: Number, score_ranges: pd.DataFrame) -> Number:
    score_ranges.lb = pd.to_numeric(score_ranges.lb)
    if score_ranges.lb.min() > value:
        return 0
    score_sum = score_ranges.loc[(score_ranges.lb <= value), 'b_cumul'].sum()
    return score_sum

# read relevant data from json files
models = {key: [] for key in df['usage'].unique()}
for path in models:
    with open(f"{path}.json") as f:
        all_variables = json.load(f)['variables']
    relevant_variables = [x for x in df.columns if x in all_variables]
    for var in relevant_variables:
        models[path].append({var: all_variables[var]['values']})

# calculate scores
df['score_sum'] = np.nan
for index, row in df.iterrows():
    score = 0
    m = row['usage']
    for var in models[m]:
        var_name = list(var)[0]
        value = row[var_name] 
        if value == -1:
            score += 0
        elif value > -1:
            score += calculate_score(value, pd.DataFrame(var[var_name]))
    df.loc[index, 'score_sum'] = score

By running this code and then printing df, we notice that the first row has score_sum=0.05 as intended. We import seaborn and plt at the top because we want to run sns.distplot(df['score_sum']) at the end and save the figure.

EDIT: As requested, see below for a screenshot of the total resulting DataFrame. And just to clarify: for the second row, usage='second', which means we use second.json; however, we don't have var_2 in this JSON, so var_2=3 just adds score_sum += 0, but var_3=1 adds score_sum += 0.12.

[screenshot of the resulting DataFrame]

Solved by looping through the models first, as defined by df['usage'].unique(), which has one entry per JSON file. Then we loop through each relevant variable in the JSON, then through each incremental value for the given variable. We create masks inside each loop and start from the lowest value, since we want to add the values cumulatively.

df['new_scores'] = 0
for model in models:
    for var in models[model]:
        var_name = list(var)[0]
        json_values = pd.DataFrame(var[var_name])
        json_values = json_values.set_index('lb')
        for value in json_values.index:
            mask = (df[var_name] >= float(value)) & (df['usage'] == model)
            df.loc[mask, 'new_scores'] += json_values.loc[value, 'b_cumul']

When comparing the column from the old solution with the column from the new solution, the values are exactly the same. For the original DataFrame of ~6M rows, the runtime went from ~4 h to iterate through 5% of the DataFrame down to ~70 s to run the whole script.
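An alternative vectorized formulation (my own sketch, not from the post) precomputes the running sum of b_cumul per variable, sorted by lb, and then uses pd.merge_asof to pick, for each row, the running sum at the largest lb not exceeding the row's value. The `models` dict below is a hand-copied subset of the two JSON files, just to keep the example self-contained:

```python
import pandas as pd

# Hand-copied subset of the two JSON files, keyed as {model: {var: entries}}.
models = {
    "first": {
        "var_2": [{"lb": 0, "b_cumul": -0.09},
                  {"lb": 1, "b_cumul": 0.14},
                  {"lb": 4, "b_cumul": 0.03}],
    },
    "second": {
        "var_3": [{"lb": 0.0, "b_cumul": 0.12},
                  {"lb": 2.0, "b_cumul": 0.25}],
    },
}

df = pd.DataFrame({"usage": ["first", "second"],
                   "var_2": [1, 3],
                   "var_3": [0, 1]})

df["score_sum"] = 0.0
for model, variables in models.items():
    in_model = df["usage"] == model
    for var_name, entries in variables.items():
        lookup = (pd.DataFrame(entries)
                  .astype({"lb": float})
                  .sort_values("lb"))
        # Running sum, so each lb maps directly to its cumulative score.
        lookup["cumul"] = lookup["b_cumul"].cumsum()
        left = (df.loc[in_model, [var_name]]
                .astype(float)
                .sort_values(var_name)   # merge_asof needs sorted keys
                .reset_index())          # keep original row labels
        # merge_asof matches each value to the last lb <= value;
        # values below the smallest lb get NaN, i.e. a 0 score.
        matched = pd.merge_asof(left, lookup[["lb", "cumul"]],
                                left_on=var_name, right_on="lb")
        df.loc[matched["index"], "score_sum"] += matched["cumul"].fillna(0).to_numpy()
```

With this subset, row 0 (usage='first', var_2=1) gets -0.09+0.14=0.05 and row 1 (usage='second', var_3=1) gets 0.12, matching the rule from the question. This does one merge per (model, variable) pair instead of one mask per lb value, which may help when a variable has many thresholds.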
