
given 10 functions y=a+bx and 1000's of (x,y) data points rounded to ints, how to derive 10 best (a,b) tuples?

We build software that audits fees charged by banks to merchants that accept credit and debit cards. Our customers want us to tell them if the card processor is overcharging them. Per-transaction credit card fees are calculated like this:

fee = fixed + variable*transaction_price

A "fee scheme" is the pair of (fixed, variable) used by a group of credit cards, eg "MasterCard business debit gold cards issued by First National Bank of Hollywood". We believe there are fewer than 10 different fee schemes in use at any time, but we aren't getting a complete or current list of fee schemes from our partners. (Yes, I know that some "fee schemes" are more complicated than the equation above because of caps and other gotchas, but our transactions are known to have only a + bx schemes in use.)

Here's the problem we're trying to solve: we want to use per-transaction data about fees to derive the fee schemes in use. Then we can compare that list to the fee schemes that each customer should be using according to their bank.

The data we get about each transaction is a data tuple: (card_id, transaction_price, fee).

transaction_price and fee are in integer cents. The bank rolls over fractional cents for each transaction until the cumulative amount exceeds one cent, and then a "rounding cent" is attached to the fee of that transaction. We cannot predict which transaction the "rounding cent" will be attached to.
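To make the rounding-cent behavior concrete, here is a short simulation (a sketch in Python; the 0.25% variable rate below is made up for illustration):

```python
def apply_fees(prices, fixed, variable):
    """Simulate per-transaction billing with rounding-cent carryover.

    prices and fixed are in integer cents; variable is a fraction
    (e.g. 0.0025 for 0.25%). The fractional part of each fee is
    carried forward, and a "rounding cent" is attached to whichever
    transaction pushes the carry over one cent.
    """
    fees = []
    carry = 0.0
    for price in prices:
        exact = fixed + variable * price  # exact fee, fractional cents allowed
        fee = int(exact)                  # whole cents actually billed
        carry += exact - fee              # accumulate the fractional remainder
        if carry >= 1.0:                  # carry spilled over: add a rounding cent
            fee += 1
            carry -= 1.0
        fees.append(fee)
    return fees
```

For example, `apply_fees([150, 150, 150, 150], fixed=8, variable=0.0025)` bills four identical transactions as 8, 8, 9, 8 cents: the third transaction picks up the rounding cent, much like card 34567 in the sample data below.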

card_id identifies a group of cards that share the same fee scheme. In a typical day of 10,000 transactions, there may be several hundred unique card_id's. Multiple card_id's will share a fee scheme.

The data we get looks like this, and what we want to figure out is the last two columns.

card_id    transaction_price       fee        fixed        variable
=======================================================================
12345      200                     22         ?            ?
67890      300                     21         ?            ?
56789      150                      8         ?            ?
34567      150                      8         ?            ?
34567      150    "rounding cent"-> 9         ?            ?
34567      150                      8         ?            ?

The end result we want is a short list like this with 10 or fewer entries showing the fee schemes that best fit our data. Like this:

fee_scheme_id       fixed     variable
======================================
1                      22            0
2                      21            0
3                       ?            ?
4                       ?            ?
...

The average fee is about 8 cents. This means the rounding cents have a huge impact and the derivation above requires a lot of data.

The average transaction is 125 cents. Transaction prices are always on 5-cent boundaries.

We want a short list of fee schemes that "fit" 98%+ of the 3,000+ transactions each customer gets each day. If that's not enough data to achieve 98% confidence, we can use multiple days of data.

Because of the rounding cents applied somewhat arbitrarily to each transaction, this isn't a simple algebra problem. Instead, it's a kind of statistical clustering exercise that I'm not sure how to solve.

Any suggestions for how to approach this problem? The implementation can be in C# or T-SQL, whichever makes the most sense given the algorithm.

Hough transform

Consider your problem in image terms: if you were to plot your input data on a diagram of price vs. fee, each scheme's entries would form a straight line (with the rounding cents being noise). Consider the density map of your plot as an image, and the task is reduced to finding straight lines in an image. Which is just the job of the Hough transform.

You would essentially approach this by plotting one line for each transaction into a diagram of possible fixed fee versus possible variable fee, adding the values of lines where they cross. At the points of real fee schemes, many lines will intersect and form a large local maximum. By detecting this maximum, you find your fee scheme, and even a degree of importance for the fee scheme.

This approach will surely work, but might take some time depending on the resolution you want to achieve. If computation time proves to be an issue, remember that a Voronoi diagram of a coarse Hough space can be used as a classifier - and once you have classified your points into fee schemes, simple linear regression solves your problem.
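A sketch of that parameter-space accumulator in Python with NumPy (the grid ranges for the variable rate and fixed fee are guesses you would tune to your data):

```python
import numpy as np

def hough_fee_schemes(points, n_top=10,
                      var_grid=np.arange(0.0, 0.05, 0.0005),
                      max_fixed=100):
    """Vote in (fixed, variable) parameter space.

    Each transaction (price, fee) defines the line
    fixed = fee - variable * price; we rasterize that line over a
    grid of candidate variable rates. Bins where many transaction
    lines cross are the likely fee schemes.
    """
    acc = np.zeros((max_fixed + 1, len(var_grid)), dtype=int)
    for price, fee in points:
        fixed = np.rint(fee - var_grid * price).astype(int)
        ok = (fixed >= 0) & (fixed <= max_fixed)   # keep in-range bins only
        acc[fixed[ok], np.nonzero(ok)[0]] += 1     # one vote per grid column
    # return the n_top strongest (fixed, variable) candidates
    flat = np.argsort(acc, axis=None)[::-1][:n_top]
    rows, cols = np.unravel_index(flat, acc.shape)
    return [(int(r), float(var_grid[c])) for r, c in zip(rows, cols)]
```

For instance, three transactions all billed 22 cents at widely spread prices vote most strongly for a flat scheme: `hough_fee_schemes([(200, 22), (2000, 22), (5000, 22)], n_top=1)` returns `[(22, 0.0)]`.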

Considering that a processing query's storage requirements are of the same order of magnitude as a day's worth of transaction data, I assume that such storage is not a problem, so:

  • First pass: Group the transactions for each card_id by transaction_price, keeping card_id, transaction_price and the average fee. This can easily be done in SQL. This assumes there are no outliers - but you can catch those after this stage if required. The resulting number of rows is guaranteed to be no higher than the number of raw data points.

  • Second pass: Per group, walk these new data points (with a cursor or in C#) and calculate the average value of b. Again, any outliers can be caught after this stage if desired.

  • Third pass: Per group, calculate the average value of a, now that b is known. This is basic SQL. Outliers can be handled as always.

If you decide to do the second step in a cursor you can stuff all that into a stored procedure.

Different card_id groups that use the same fee scheme can now be coalesced (sorry if this is the wrong word, I'm not a native English speaker) into fee schemes by rounding a and b to a sane precision and grouping again.
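The three passes plus the final coalescing step might look like this in Python (a sketch rather than the SQL/cursor version; the rounding precisions are arbitrary):

```python
from collections import defaultdict
from statistics import mean

def derive_schemes(transactions, a_round=1, b_round=4):
    """transactions: iterable of (card_id, transaction_price, fee) in cents."""
    # Pass 1: group by (card_id, transaction_price) and average the fee,
    # so the rounding cents wash out.
    groups = defaultdict(list)
    for card_id, price, fee in transactions:
        groups[(card_id, price)].append(fee)
    points = defaultdict(list)  # card_id -> [(price, avg_fee)]
    for (card_id, price), fees in groups.items():
        points[card_id].append((price, mean(fees)))

    # Pass 2: per card_id, estimate b as the average slope between
    # consecutive distinct price points.
    schemes = {}
    for card_id, pts in points.items():
        pts.sort()
        slopes = [(f2 - f1) / (p2 - p1)
                  for (p1, f1), (p2, f2) in zip(pts, pts[1:]) if p2 != p1]
        b = mean(slopes) if slopes else 0.0

        # Pass 3: with b known, a is the average of fee - b * price.
        a = mean(f - b * p for p, f in pts)
        schemes[card_id] = (round(a, a_round), round(b, b_round))

    # Coalesce card_ids whose rounded (a, b) coincide into one fee scheme.
    return sorted(set(schemes.values()))
```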

The Hough transform is the most general answer, though I don't know how one would implement it in SQL (rather than pulling the data out and processing it in a general purpose language of your choice).

Alas, the naive version is known to be slow if you have a lot of input data (1000 points is kinda medium sized) and if you want high precision results (it scales as size_of_the_input / (rho_precision * theta_precision)).

There is a faster approach based on 2^n-trees, but there are few implementations out on the web to just plug in. (I recently did one in C++ as a testbed for a project I'm involved in. Maybe I'll clean it up and post it somewhere.)


If there is some additional order to the data you may be able to do better (i.e. do the line segments form a piecewise function?).


Naive Hough transform

Define an accumulator in (theta, rho) space spanning [-pi, pi) and [0, max(hypotenuse(x, y))] as a 2D array.

Foreach point in the input data
   Foreach bin in theta
      find the distance rho of the altitude from the origin to
      a line through (x,y) making angle theta with the horizontal
      rho = x cos(theta) + y sin(theta)
      and increment the bin (theta,rho) in the accumulator
Find the maximum bin in the accumulator; this
represents the most line-like structure in the data
if (theta != 0) {a = rho/sin(theta); b = -1/tan(theta);}
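A direct, runnable version of that pseudocode (a Python sketch; the bin counts are arbitrary and would need tuning for real precision):

```python
import math

def naive_hough(points, n_theta=360, n_rho=400):
    """Return (a, b) of the strongest line y = a + b*x among the points."""
    max_rho = max(math.hypot(x, y) for x, y in points)
    acc = [[0] * n_rho for _ in range(n_theta)]
    for x, y in points:
        for t in range(n_theta):
            theta = -math.pi + t * (2 * math.pi / n_theta)
            # distance from the origin to the line through (x, y)
            # making angle theta with the horizontal
            rho = x * math.cos(theta) + y * math.sin(theta)
            if 0 <= rho < max_rho:
                acc[t][int(rho / max_rho * n_rho)] += 1
    # the maximum bin is the most line-like structure in the data
    t_best, r_best = max(
        ((t, r) for t in range(n_theta) for r in range(n_rho)),
        key=lambda tr: acc[tr[0]][tr[1]])
    theta = -math.pi + t_best * (2 * math.pi / n_theta)
    rho = (r_best + 0.5) * max_rho / n_rho  # bin-center estimate
    if abs(math.sin(theta)) < 1e-9:
        return None  # (near-)vertical line: not a valid fee scheme
    return rho / math.sin(theta), -1 / math.tan(theta)
```

On three points lying on the flat line y = 10 the recovered intercept lands within one rho bin of 10 and the slope is essentially zero; the precision is limited by the bin widths, which is exactly why the smoothing/sub-bin fitting mentioned below helps.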

Reliably getting multiple lines out of a single pass takes a little more bookkeeping, but it is not significantly harder.

You can improve the result a little by smoothing the data near the candidate peaks and fitting to get sub-bin precision, which should be faster than using smaller bins and should pick up the effect of the "rounding" cents fairly smoothly.

You're looking at the rounding cent as a significant source of noise in your calculations, so I'd focus on minimizing the noise due to that issue. The easiest way to do this IMO is to increase the sample size.

Instead of viewing your data as thousands of y = mx + b (+ rounding), group your data into larger subsets:

If you combine X transactions with the same card_id and look at this as (sum of X fees) = (variable rate)*(sum of X transaction prices) + X*(base rate) (+ rounding), the rounding noise will likely fall by the wayside.

Get enough groups of size X and you should be able to come up with a pretty close representation of the real numbers.
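A sketch of that idea in Python: sum batches of X transactions from one card group, then fit (sum of fees) = variable * (sum of prices) + X * fixed by ordinary least squares over the batch sums (the batch size x is a knob to tune; 50 below is a guess):

```python
from itertools import islice

def estimate_scheme(transactions, x=50):
    """transactions: iterable of (price, fee) pairs, in cents, for one
    group of cards believed to share a scheme. Returns (fixed, variable)."""
    # Sum the prices and fees of each batch of x transactions;
    # the rounding cents average out inside each batch.
    batches = []
    it = iter(transactions)
    while True:
        chunk = list(islice(it, x))
        if len(chunk) < x:
            break  # drop the final partial batch
        batches.append((sum(p for p, f in chunk), sum(f for p, f in chunk)))

    # Ordinary least squares over the batch sums:
    # sum_fees = variable * sum_prices + x * fixed
    n = len(batches)
    mean_p = sum(sp for sp, sf in batches) / n
    mean_f = sum(sf for sp, sf in batches) / n
    variable = (sum((sp - mean_p) * (sf - mean_f) for sp, sf in batches) /
                sum((sp - mean_p) ** 2 for sp, sf in batches))
    fixed = (mean_f - variable * mean_p) / x
    return fixed, variable
```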


