Pandas vectorization: Cumulative sum based on JSON files
I'm trying to sum a score based on values in a DataFrame and two JSON files. I have a minimal example and a minimal solution, but this needs to be vectorized somehow, because in the real case there are over a million rows, and it took ~40 min to run through 1% of the rows.
My first.json file is:
{
  "variables": {
    "var_1": {
      "values": [
        {
          "lb": 1.0,
          "b_cumul": 0.04
        },
        {
          "lb": 3.0,
          "b_cumul": 0.28
        }
      ]
    },
    "var_2": {
      "values": [
        {
          "lb": 0,
          "b_cumul": -0.09
        },
        {
          "lb": 1,
          "b_cumul": 0.14
        },
        {
          "lb": 4,
          "b_cumul": 0.03
        }
      ]
    },
    "var_4": {
      "values": [
        {
          "lb": "1",
          "b_cumul": 0.06
        }
      ]
    }
  }
}
My second.json file is:
{
  "variables": {
    "var_1": {
      "values": [
        {
          "lb": 1.0,
          "b_cumul": -0.15
        },
        {
          "lb": 2.0,
          "b_cumul": 0.06
        },
        {
          "lb": 4.0,
          "b_cumul": 0.02
        },
        {
          "lb": 5.0,
          "b_cumul": 0.15
        }
      ]
    },
    "var_3": {
      "values": [
        {
          "lb": 0.0,
          "b_cumul": 0.12
        },
        {
          "lb": 2.0,
          "b_cumul": 0.25
        }
      ]
    },
    "var_6": {
      "values": [
        {
          "lb": 0.0,
          "b_cumul": -0.16
        },
        {
          "lb": 1.0,
          "b_cumul": -0.06
        }
      ]
    }
  }
}
This is our initial DataFrame:
import pandas as pd
# setup initial test-data
usage = ['first', 'second', 'first', 'second', 'second', 'second', 'first']
var_1 = [-1, -1, 0, 1, 3, 8, 2]
var_2 = [1, 3, -1, 0, 9, 2, 1]
var_3 = [0, 1, 0, -1, -1, 42, 3]
df = pd.DataFrame({'usage': usage, 'var_1': var_1, 'var_2': var_2, 'var_3': var_3})
The variables var_1, var_2, var_3 that are available in df decide which variables we want to look at in the JSON files. The scores should be cumulative sums retrieved from the JSON files, depending on the values in df.
Looking at my first row, I have (var_1=-1, var_2=1, var_3=0). Since usage='first' for this same row, I need to check in first.json what scores these variables correspond to. var_3 does not exist in first.json, so this variable gives a score of 0. var_1=-1, so this also gives a score of 0. var_2=1, so here we need to look in first.json and get the scores for all values with lb <= 1, which in this case is -0.09+0.14=0.05. So we want to add this information to the DataFrame with df.loc[0, 'score_sum'] = 0 + 0 - 0.09 + 0.14.
I have solved this with the code below, but as mentioned before this is very inefficient and does not work for a larger df.
import json
from numbers import Number

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# return score based on given value and score ranges
def calculate_score(value: Number, score_ranges: pd.DataFrame) -> Number:
    score_ranges.lb = pd.to_numeric(score_ranges.lb)
    if score_ranges.lb.min() > value:
        return 0
    score_sum = score_ranges.loc[(score_ranges.lb <= value), 'b_cumul'].sum()
    return score_sum

# read relevant data from json files
models = {key: [] for key in df['usage'].unique()}
for path in models:
    with open(f"{path}.json") as f:
        all_variables = json.load(f)['variables']
    relevant_variables = [x for x in df.columns if x in all_variables]
    for var in relevant_variables:
        models[path].append({var: all_variables[var]['values']})

# calculate scores
df['score_sum'] = np.nan
for index, row in df.iterrows():
    score = 0
    m = row['usage']
    for var in models[m]:
        var_name = list(var)[0]
        value = row[var_name]
        if value == -1:
            score += 0
        elif value > -1:
            score += calculate_score(value, pd.DataFrame(var[var_name]))
    df.loc[index, 'score_sum'] = score
By running this code and then printing df, we notice that the first row has score_sum=0.05 as intended. We import seaborn and pyplot at the top because we want to run sns.distplot(df['score_sum']) at the end and save the figure.
EDIT: As requested, see below for a screenshot of the total resulting DataFrame. And just to clarify: for the second row, usage='second', which means we use second.json; however, we don't have var_2 in this JSON, so var_2=3 just adds score_sum += 0, but var_3=1 adds score_sum += 0.12.
Solved by looping through the models first, as defined by df['usage'].unique(), which has an equivalent number of JSON files. Then we loop through each relevant variable in the JSON, then through each incremental value for the given variable. We create masks inside each loop and start from the lowest value, since we want to add the values cumulatively.
df['new_scores'] = 0
for model in models:
    for var in models[model]:
        var_name = list(var)[0]
        json_values = pd.DataFrame(var[var_name])
        json_values = json_values.set_index('lb')
        for value in json_values.index:
            mask = (df[var_name] >= float(value)) & (df['usage'] == model)
            df.loc[mask, 'new_scores'] += json_values.loc[value, 'b_cumul']
When comparing the column from the old solution with the column from the new solution, the values are exactly the same. For the original DataFrame of ~6M rows, it went from taking ~4h to iterate through 5% of the DataFrame to ~70s to run through the whole script.
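For reference, a compressed, self-contained sketch of that comparison on the example data. The JSON contents are inlined as Python literals; var_4 and var_6 are omitted since they are not columns of df, and the row-wise scorer is a simplified equivalent of the question's code, not a copy:

```python
import pandas as pd

# inline the two JSON files from the question (only variables present in df)
models_raw = {
    "first": {
        "var_1": [(1.0, 0.04), (3.0, 0.28)],
        "var_2": [(0.0, -0.09), (1.0, 0.14), (4.0, 0.03)],
    },
    "second": {
        "var_1": [(1.0, -0.15), (2.0, 0.06), (4.0, 0.02), (5.0, 0.15)],
        "var_3": [(0.0, 0.12), (2.0, 0.25)],
    },
}

df = pd.DataFrame({
    "usage": ['first', 'second', 'first', 'second', 'second', 'second', 'first'],
    "var_1": [-1, -1, 0, 1, 3, 8, 2],
    "var_2": [1, 3, -1, 0, 9, 2, 1],
    "var_3": [0, 1, 0, -1, -1, 42, 3],
})

# row-wise (slow) scoring: sum b_cumul over all entries with lb <= value
def slow_score(row):
    total = 0.0
    for var, pairs in models_raw[row["usage"]].items():
        for lb, b in pairs:
            if row[var] >= lb:
                total += b
    return total

df["score_sum"] = df.apply(slow_score, axis=1)

# vectorized scoring with boolean masks, as in the answer
df["new_scores"] = 0.0
for model, variables in models_raw.items():
    for var, pairs in variables.items():
        for lb, b in pairs:
            mask = (df[var] >= lb) & (df["usage"] == model)
            df.loc[mask, "new_scores"] += b

# both approaches agree, e.g. row 0 gets -0.09 + 0.14 = 0.05
assert (df["score_sum"].round(6) == df["new_scores"].round(6)).all()
print(round(float(df.loc[0, "new_scores"]), 2))  # 0.05
```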