
Calculate variance of frequencies when dataset does not contain entries of frequency zero

I have a dataset with three fields: id, feature and frequency. For a given group of ids, I want to find out which feature has the largest spread of frequencies. The goal is that if I split the group of ids into two sub-groups at the median frequency of that feature, the two sub-groups are as different from each other as possible and yet roughly equal in size.

My first thought was to calculate the variance of the frequencies for each feature and use the feature with the highest variance.

Given a database table that looks something like this:

id | feature | frequency
---+---------+-------------
 0 | 0       | 1
 0 | 1       | 1
 0 | 2       | 0
 1 | 0       | 2
 1 | 1       | 2
 1 | 2       | 0
 2 | 0       | 3
 2 | 1       | 3
 2 | 2       | 8
 3 | 0       | 4
 3 | 1       | 8
 3 | 2       | 10
 4 | 0       | 5
 4 | 1       | 10
 4 | 2       | 12
  • Feature 0 has frequencies of 1, 2, 3, 4, 5
  • Feature 1 has frequencies of 1, 2, 3, 8, 10
  • Feature 2 has frequencies of 0, 0, 8, 10, 12

We can see that feature 2 has the biggest spread and that its median, 8, makes a nice split point (0, 0 and 8 in one group, 10 and 12 in the other).
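To make "biggest spread" concrete, here is a small sketch that computes the sample variance of each feature from the values in the table above. It uses Python's `statistics.variance` (the n - 1 denominator); whether the database's own `variance` function uses the same definition is an assumption.

```python
from statistics import variance

# Frequencies per feature, read off the table above (ids 0 to 4).
frequencies = {
    0: [1, 2, 3, 4, 5],
    1: [1, 2, 3, 8, 10],
    2: [0, 0, 8, 10, 12],
}

# Sample variance (n - 1 denominator) for each feature.
spread = {feature: variance(values) for feature, values in frequencies.items()}

# The feature whose frequencies are most spread out.
widest = max(spread, key=spread.get)

print(spread)
print(widest)
```

With these numbers feature 2 comes out on top, which matches the eyeball reading of the table.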

I can find the feature with the highest variance using the following SQL query:

SELECT feature, variance(frequency) as f FROM Dataset WHERE id IN (<list of ids>) GROUP BY feature ORDER BY f DESC LIMIT 1;

This works fine, but has one flaw. My dataset is sparse (most entries have a frequency of zero), and storing the zero-frequency entries in the database is expensive, both in space and in the time it takes to insert them. Therefore my actual table looks something like this:

id | feature | frequency
---+---------+-------------
 0 | 0       | 1
 0 | 1       | 1
 1 | 0       | 2
 1 | 1       | 2
 2 | 0       | 3
 2 | 1       | 3
 2 | 2       | 8
 3 | 0       | 4
 3 | 1       | 8
 3 | 2       | 10
 4 | 0       | 5
 4 | 1       | 10
 4 | 2       | 12

The SQL query above no longer returns the correct result, because the zero-frequency entries have to be included to calculate the correct variance. My SQL skills aren't good enough to figure out a (performant) query that gets around this limitation.

My next thought was to calculate the maximum entropy instead, but that takes neither the actual frequency values nor the number of times each value occurs into account, only the number of distinct values. Unless I'm misunderstanding the entropy formula.
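The size of the error is easy to see on the example data. The sketch below computes the variance once from the stored rows only and once with a zero padded in for every id that has no row. The helper names `stored_values` and `padded_values` are made up for this illustration.

```python
from statistics import variance

# Stored (non-zero) rows only, as (id, feature, frequency), from the sparse table above.
rows = [
    (0, 0, 1), (0, 1, 1),
    (1, 0, 2), (1, 1, 2),
    (2, 0, 3), (2, 1, 3), (2, 2, 8),
    (3, 0, 4), (3, 1, 8), (3, 2, 10),
    (4, 0, 5), (4, 1, 10), (4, 2, 12),
]
ids = sorted({row_id for row_id, _, _ in rows})
features = sorted({feature for _, feature, _ in rows})


def stored_values(feature):
    """Frequencies that are physically present for a feature."""
    return [freq for _, feat, freq in rows if feat == feature]


def padded_values(feature):
    """Stored frequencies plus one zero for every id that has no row."""
    present = {row_id: freq for row_id, feat, freq in rows if feat == feature}
    return [present.get(row_id, 0) for row_id in ids]


# Variance over stored rows only: what the plain GROUP BY query effectively sees.
naive = {f: variance(stored_values(f)) for f in features}
# Variance with the omitted zeros restored.
correct = {f: variance(padded_values(f)) for f in features}

print(naive, max(naive, key=naive.get))
print(correct, max(correct, key=correct.get))
```

On this data, ignoring the zeros shrinks the variance of feature 2 enough that feature 1 is picked instead.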
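A quick sketch of why entropy is a poor fit here, assuming the usual Shannon definition applied to the empirical distribution of frequency values. Two features with very different spreads get the same score, because only the pattern of repeats matters and the magnitudes never enter the formula. The two value lists are illustrative only.

```python
from collections import Counter
from math import log2


def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of the values."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * log2(c / total) for c in counts.values())


def max_entropy(values):
    """Upper bound log2(k), where k is the number of distinct values."""
    return log2(len(set(values)))


spread_out = [0, 0, 8, 10, 12]  # wide range of frequencies
bunched_up = [0, 0, 1, 2, 3]    # narrow range, same pattern of repeats

print(shannon_entropy(spread_out), shannon_entropy(bunched_up))
print(max_entropy(spread_out), max_entropy(bunched_up))
```

Shannon entropy does react to how often each value repeats, but the maximum entropy depends on the number of distinct values alone, and neither notices that 12 is further from 0 than 3 is.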

So my questions are:

  1. Is there a way to do this in SQL?
  2. If not, is there a way of "adjusting" the calculated variance to account for the number of zero entries? (Assume I know how many zero entries were omitted.)
  3. If yes, is there a way of doing this in a single SQL query as above? (Again, assume I know beforehand how many zero entries were omitted.)
  4. If neither is possible, is there a way of using entropy and adjusting for the actual values?
  5. Is there some other measure (e.g. kurtosis?) that I should consider? Are there any that can easily be adjusted for missing zero entries?
  6. Or any other suggestions or alternative solutions?
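On questions 2 and 3, one possible adjustment (a sketch, not something confirmed elsewhere on this page): omitted zeros add nothing to the sum S of the frequencies or to the sum Q of their squares, so if the group contains N ids the sample variance is (Q - S*S/N) / (N - 1), and only N has to be supplied from outside. The example runs this as a single query against an in-memory SQLite database. The table name `Dataset` follows the question, while the use of SQLite and the choice of sample rather than population variance are assumptions.

```python
import sqlite3

# Sparse table from the question: zero-frequency rows are not stored.
rows = [
    (0, 0, 1), (0, 1, 1),
    (1, 0, 2), (1, 1, 2),
    (2, 0, 3), (2, 1, 3), (2, 2, 8),
    (3, 0, 4), (3, 1, 8), (3, 2, 10),
    (4, 0, 5), (4, 1, 10), (4, 2, 12),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Dataset (id INTEGER, feature INTEGER, frequency INTEGER)")
conn.executemany("INSERT INTO Dataset VALUES (?, ?, ?)", rows)

ids = [0, 1, 2, 3, 4]  # the group of ids being split
n = len(ids)           # N counts every id in the group, stored or not
placeholders = ",".join("?" * n)

# Sample variance from sums only: (Q - S*S/N) / (N - 1).
# The 1.0 factor forces floating-point division.
query = f"""
SELECT feature,
       (SUM(frequency * frequency)
        - SUM(frequency) * SUM(frequency) * 1.0 / ?) / (? - 1) AS f
FROM Dataset
WHERE id IN ({placeholders})
GROUP BY feature
ORDER BY f DESC
LIMIT 1
"""
best_feature, best_variance = conn.execute(query, [n, n, *ids]).fetchone()
print(best_feature, best_variance)
```

Two caveats: the group needs at least two ids, and a feature with no stored rows at all for the group never shows up in the result (its variance would be zero anyway). With very large frequencies the subtraction can lose precision, so this form is best treated as a ranking aid.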

With respect to filling in the gaps in your table, you can use a "helper" temp table containing the valid list of features, CROSS JOIN it against your ids, and UNION the missing zero-frequency rows onto the stored data. The "how" really depends on the database you are using. For example, suppose you have a table named "helper" with three rows (one for each of your three features). This then might work:

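-- Rows that are actually stored.
select id, feature, frequency
from have
union
-- One zero-frequency row for every (id, feature) pair that is not stored.
select b.id
     , a.feature
     , 0 as frequency
from helper a
cross join (select distinct id from have) b
where not exists (
   select 1 from have b1
   where b1.id = b.id
     and b1.feature = a.feature
   )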

Here is an SQLFiddle.
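For anyone who wants to try this locally, here is a rough sketch that runs a gap-filling query of the same union / cross join / not exists shape against an in-memory SQLite database and then computes the variance per feature in Python. The tables `have` and `helper` follow the names used above. SQLite and the Python-side variance step are assumptions made for the demo, since SQLite has no built-in variance aggregate.

```python
import sqlite3
from collections import defaultdict
from statistics import variance

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE have (id INTEGER, feature INTEGER, frequency INTEGER)")
conn.execute("CREATE TABLE helper (feature INTEGER)")

# Sparse data from the question, plus the list of valid features.
conn.executemany("INSERT INTO have VALUES (?, ?, ?)", [
    (0, 0, 1), (0, 1, 1),
    (1, 0, 2), (1, 1, 2),
    (2, 0, 3), (2, 1, 3), (2, 2, 8),
    (3, 0, 4), (3, 1, 8), (3, 2, 10),
    (4, 0, 5), (4, 1, 10), (4, 2, 12),
])
conn.executemany("INSERT INTO helper VALUES (?)", [(0,), (1,), (2,)])

# Stored rows, plus a zero row for each (id, feature) pair that is missing.
filled = conn.execute("""
    select id, feature, frequency from have
    union
    select b.id, a.feature, 0 as frequency
    from helper a
    cross join have b
    where not exists (
        select 1 from have b1
        where b1.id = b.id and b1.feature = a.feature
    )
""").fetchall()

# Group the filled-in frequencies by feature and take the sample variance.
by_feature = defaultdict(list)
for _id, feature, frequency in filled:
    by_feature[feature].append(frequency)

spread = {feature: variance(values) for feature, values in by_feature.items()}
best = max(spread, key=spread.get)
print(len(filled), spread, best)
```

The filled result has one row per id and feature, so the variance now sees the zeros that were never stored.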
