Optimizing interpolation of values in Pandas

Question

I have been struggling with an optimization problem in Pandas.

I developed a script that applies a computation to every row of a relatively small DataFrame (a few thousand rows, a few dozen columns). I relied heavily on the apply() function, which was obviously a poor choice in most cases.

After a round of optimization, only one method remains that takes a long time and for which I haven't found an easy solution:

Basically, my dataframe contains a list of video viewing statistics, with the number of people who watched the video up to each quartile (how many watched 0%, 25%, 50%, etc.), such as:

video_name  video_length  video_0  video_25  video_50  video_75  video_100
video_1                6     1000       500       300       250          5
video_2               30     1000       500       300       250          5

I am trying to interpolate the statistics to be able to answer "how many people would have watched each quartile of the video if it lasted X seconds?"

Right now my function takes the dataframe and a "new_length" parameter, and calls apply() on each row.

The function that handles each row computes the time marks for each quartile (so 0, 7.5, 15, 22.5 and 30 for the 30s video), and the time marks for each quartile given the new length (so to reduce the 30s video to 6s, the new time marks would be 0, 1.5, 3, 4.5 and 6). I build a dataframe containing the time marks as index, and the stats as values in the first column:

index (time marks)  view_stats
0                         1000
7.5                        500
15                         300
22.5                       250
30                           5
1.5                        NaN
3                          NaN
4.5                        NaN

I then call DataFrame.interpolate(method="index") to fill the NaN values.
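For readers who want to reproduce the per-row step described above, here is a minimal sketch for the 30s video cut to 6s. It uses a Series rather than the one-column dataframe from the question, and the variable names (old_marks, new_marks, stats) are illustrative assumptions, not taken from the original script.

```python
import pandas as pd

# Quartile time marks of the original 30s video and of the 6s version.
old_marks = [0, 7.5, 15, 22.5, 30]
new_marks = [0, 1.5, 3, 4.5, 6]

# View counts indexed by the original time marks.
stats = pd.Series([1000, 500, 300, 250, 5], index=old_marks, dtype=float)

# Add the new time marks as NaN entries (Index.union returns a sorted index),
# then fill them by linear interpolation on the index values.
combined = stats.reindex(stats.index.union(new_marks))
filled = combined.interpolate(method="index")

# Keep only the values at the new time marks.
new_stats = filled.loc[new_marks]
print(new_stats)
```

The values at the new marks come out at roughly 1000, 900, 800, 700 and 600, since all new marks fall between the old 0% and 25% marks. Building and interpolating one such object per row is what makes the apply()-based approach slow.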

It works and gives me the result I expect, but it takes a whopping 11s for a 3k-row dataframe, and I believe this is due to the use of apply() combined with the creation of a new dataframe to interpolate the data for each row.

Is there an obvious way to achieve the same result "in place", e.g. directly on the original dataframe, avoiding the apply / new-dataframe approach?

EDIT: The expected output when calling the function with 6 as the new length parameter would be:

video_name  video_length  video_0  video_25  video_50  video_75  video_100  new_video_0  new_video_25  new_video_50  new_video_75  new_video_100
video_1                6     1000       500       300       250          5         1000           500           300           250              5
video_2                6     1000       500       300       250          5         1000           900           800           700            600

The first row would be untouched because the video is already 6s long. In the second row, the video would be cut from 30s to 6s, so the new quartiles would be at 0, 1.5, 3, 4.5 and 6s, and the stats would be interpolated between 1000 and 500, which were the values at the old 0% and 25% time marks.

EDIT2: I do not care if I need to add temporary columns; time is an issue, memory is not.

As a reference, this is my code:

import math

import pandas

def get_value(marks, asset, mark_index) -> int:
  value = marks["count"][asset["new_length_marks"][mark_index]]
  if isinstance(value, pandas.Series):
    res = value.iloc[0]
  else:
    res = value
  return math.ceil(res)

def length_update_row(row, assets, **kwargs):
  asset_name = row["asset_name"]
  asset = assets[asset_name]
  # assets is a dict containing the list of files and the old and "new" video marks
  # pre-calculated

  marks = pandas.DataFrame(data=[int(row["video_start"]), int(row["video_25"]), int(row["video_50"]), int(row["video_75"]), int(row["video_completed"])],
                            columns=["count"],
                            index=asset["old_length_marks"])
    
  marks = marks.combine_first(pandas.DataFrame(data=float("nan"), columns=["count"], index=asset["new_length_marks"][1:]))
  marks = marks.interpolate(method="index")
    
  row["video_25"] = get_value(marks, asset, 1)
  row["video_50"] = get_value(marks, asset, 2)
  row["video_75"] = get_value(marks, asset, 3)
  row["video_completed"] = get_value(marks, asset, 4)
  
  return row
  

def length_update_stats(report: pandas.DataFrame,
                 assets: dict) -> pandas.DataFrame:
  new_report = report.apply(lambda row: length_update_row(row, assets), axis=1)
  return new_report

Answer 1

IIUC, you could use np.interp:

import numpy as np
import pandas as pd

# df is the dataframe from the question
# get the old x values
xs = df['video_length'].values[:, None] * [0, 0.25, 0.50, 0.75, 1]

# the corresponding y values
ys = df.iloc[:, 2:].values

# note that 6 is the new length, repeated once per row of df
nxs = np.full(len(df), 6)[:, None] * [0, 0.25, 0.50, 0.75, 1]

res = pd.DataFrame(data=np.array([np.interp(nxi, xi, yi) for nxi, xi, yi in zip(nxs, xs, ys)]), columns="new_" + df.columns[2:] )

print(res)

Output

   new_video_0  new_video_25  new_video_50  new_video_75  new_video_100
0       1000.0         500.0         300.0         250.0            5.0
1       1000.0         900.0         800.0         700.0          600.0
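To make the list comprehension above easier to follow, here is a hedged single-row sketch of what each np.interp call does for video_2. The array names are illustrative assumptions; only np.interp itself comes from the answer.

```python
import numpy as np

# Original quartile time marks and view counts for the 30s video.
old_marks = np.array([0, 7.5, 15, 22.5, 30])
views = np.array([1000, 500, 300, 250, 5])

# Quartile time marks for the new 6s length.
new_marks = 6 * np.array([0, 0.25, 0.50, 0.75, 1])

# np.interp(x, xp, fp) evaluates the piecewise-linear function through
# the points (xp, fp) at each x; xp must be increasing.
row_result = np.interp(new_marks, old_marks, views)
print(row_result)
```

Because np.interp works on plain arrays, no intermediate dataframe is built per row, which is where most of the time saving comes from.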

And then concat across the second axis:

output = pd.concat((df, res), axis=1)
print(output)

Output (concat)

  video_name  video_length  video_0  ...  new_video_50  new_video_75  new_video_100
0    video_1             6     1000  ...         300.0         250.0            5.0
1    video_2            30     1000  ...         800.0         700.0          600.0

[2 rows x 12 columns]
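If the Python-level loop over rows is still too slow, the interpolation can also be written with array operations only. This is a sketch that is not part of the answer above: it assumes every row has exactly five equally spaced marks (0%, 25%, 50%, 75%, 100%) and a non-zero video_length, and the function name interpolate_quartiles is hypothetical.

```python
import numpy as np
import pandas as pd

def interpolate_quartiles(df, new_length, stat_cols):
    """Linearly interpolate quartile stats to a new video length, all rows at once."""
    fractions = np.array([0, 0.25, 0.50, 0.75, 1.0])
    ys = df[stat_cols].to_numpy(dtype=float)             # shape (n_rows, 5)
    old_len = df["video_length"].to_numpy(dtype=float)   # shape (n_rows,)

    # Position of each new time mark measured in old quartile steps (0..4).
    # Clipping mirrors np.interp, which holds the edge values outside the range.
    pos = (new_length * fractions)[None, :] / old_len[:, None] * 4
    pos = np.clip(pos, 0, 4)

    # Index of the old mark to the left of each position, and the weight
    # of the old mark to its right.
    left = np.minimum(np.floor(pos).astype(int), 3)
    weight = pos - left

    rows = np.arange(len(df))[:, None]
    res = ys[rows, left] * (1 - weight) + ys[rows, left + 1] * weight
    return pd.DataFrame(res, columns=["new_" + c for c in stat_cols], index=df.index)

# Sample data from the question.
df = pd.DataFrame({
    "video_name": ["video_1", "video_2"],
    "video_length": [6, 30],
    "video_0": [1000, 1000],
    "video_25": [500, 500],
    "video_50": [300, 300],
    "video_75": [250, 250],
    "video_100": [5, 5],
})
stat_cols = ["video_0", "video_25", "video_50", "video_75", "video_100"]
res = interpolate_quartiles(df, 6, stat_cols)
print(pd.concat((df, res), axis=1))
```

This trades generality for speed: it relies on the marks being evenly spaced, whereas np.interp accepts arbitrary increasing x values. It would be worth timing both versions on the real 3k-row data before choosing one.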