
Retrieve histogram from MSSQL table using Java

I want to implement a Java application that can connect to any SQL Server instance and load any table from it. For each table, I want to create a histogram based on some arbitrary columns.

For example, if I have this table:

name   profit
------------
name1   12
name2   14
name3   18
name4   13

I can create a histogram with a bin size of 4, based on the min and max values of the profit column, and count the number of records in each bin.

The result is:

profit    count
---------------
12-16     3
16-20     1

My current solution is to retrieve all the data for the required columns, then construct the bins and group the records using the Java stream collector Collectors.groupingBy.
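To make the in-memory approach concrete, here is a minimal sketch of binning with Collectors.groupingBy, assuming the histogram column has already been extracted as a list of integers. The class name InMemoryHistogram and the method histogram are hypothetical names used only for illustration. Bins start at the minimum value, and each value maps to the lower bound of its bin.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical helper, not part of the original post.
class InMemoryHistogram {
    // Maps each value to the lower bound of its bin and counts values per bin.
    // Bins start at the minimum value, so the first bin is [min, min + binSize).
    static Map<Integer, Long> histogram(List<Integer> values, int binSize) {
        if (values.isEmpty()) {
            return new TreeMap<>();
        }
        int min = values.stream().mapToInt(Integer::intValue).min().getAsInt();
        return values.stream().collect(Collectors.groupingBy(
                v -> min + ((v - min) / binSize) * binSize, // lower bound of the bin
                TreeMap::new,                               // keep bins sorted by lower bound
                Collectors.counting()));
    }

    public static void main(String[] args) {
        // Profit values from the sample table above.
        Map<Integer, Long> bins = histogram(List.of(12, 14, 18, 13), 4);
        bins.forEach((low, count) ->
                System.out.println(low + "-" + (low + 4) + "  " + count));
    }
}
```

This still needs every row in memory, so it only illustrates the grouping step; it does not address the cost of transferring a large table.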

I'm not sure whether my solution is efficient, so I would like help finding a better algorithm, especially when there is a large number of records (for example, by taking advantage of SQL Server features or other frameworks).

Can I use a better algorithm for this problem?

Edit 1: assume my SQL result is in a List<Object[]> named data.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Builds a grouping key from every column except the one at `index`.
private static String mySimpleHash(Object[] row, int index) {
    StringBuilder hash = new StringBuilder();
    for (int i = 0; i < row.length; i++) {
        if (i != index) {
            hash.append(row[i]).append(":");
        }
    }
    return hash.toString();
}

// data holds the SQL result (List<Object[]>); index is the column used for the histogram.
List<Object[]> histogramData = new ArrayList<>();
final Map<String, List<Object[]>> map = data.stream().collect(
        Collectors.groupingBy(row -> mySimpleHash(row, index)));
for (final Map.Entry<String, List<Object[]>> entry : map.entrySet()) {
    List<Object[]> group = entry.getValue();
    // Assumption: the output row is a copy of the group's first row,
    // with the histogram column replaced by the group's record count.
    Object[] newRow = group.get(0).clone();
    newRow[index] = (double) group.size();
    histogramData.add(newRow);
}

As you suspected, performing the aggregation after pulling all the data out of SQL Server becomes very expensive as the number of rows in your tables increases. You can simply do the aggregation within SQL.

Depending on how you define your histogram bins, this is either trivial or requires some work. In your case, the requirement that the lowest bin start at the minimum value needs a little setup, as opposed to binning from 0. See the sample below. The inner query maps each value to a bin number; the outer query aggregates and computes the bin boundaries.

CREATE TABLE Test (
    Name varchar(max) NOT NULL,
    Profit int NOT NULL
)

INSERT Test(Name, Profit)
VALUES
('name1', 12),
('name2', 14),
('name3', 18),
('name4', 13)

DECLARE @minValue int = (SELECT MIN(Profit) FROM Test)
DECLARE @binSize int = 4

SELECT
    (@minValue + @binSize * Bin) AS BinLow,
    (@minValue + @binSize * Bin) + @binSize - 1 AS BinHigh,
    COUNT(*) AS Count
FROM (
    SELECT
    ((Profit - @minValue) / @binSize) AS Bin
    FROM
    Test
) AS t
GROUP BY Bin


| BinLow | BinHigh | Count |
|--------|---------|-------|
|     12 |      15 |     3 |
|     16 |      19 |     1 |

http://sqlfiddle.com/#!18/d093c/9
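For the Java side, a rough sketch of running this aggregation over JDBC might look like the following. It is a variation on the query above, not the exact same text: the minimum is computed in a cross-joined subquery so the whole thing is a single statement, and each row is mapped directly to its bin's lower bound. The class SqlHistogram and its methods quote, buildQuery and histogram are hypothetical names. The sketch assumes an integer column, an unqualified table name (no schema prefix), and identifiers that come from trusted metadata rather than raw user input. It has not been run against a live SQL Server instance here.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original answer.
class SqlHistogram {
    // Bracket-quotes a SQL Server identifier, doubling any closing bracket.
    static String quote(String identifier) {
        return "[" + identifier.replace("]", "]]") + "]";
    }

    // Builds a single-statement binning query for an integer column.
    // Identifiers cannot be bound as JDBC parameters, so they are quoted and concatenated.
    static String buildQuery(String table, String column, int binSize) {
        if (binSize <= 0) {
            throw new IllegalArgumentException("binSize must be positive");
        }
        String t = quote(table);
        String c = quote(column);
        return "SELECT b.BinLow, b.BinLow + " + binSize + " - 1 AS BinHigh, COUNT(*) AS Cnt"
             + " FROM (SELECT m.MinValue + ((s." + c + " - m.MinValue) / " + binSize + ") * "
             + binSize + " AS BinLow"
             + " FROM " + t + " AS s"
             + " CROSS JOIN (SELECT MIN(" + c + ") AS MinValue FROM " + t + ") AS m) AS b"
             + " GROUP BY b.BinLow ORDER BY b.BinLow";
    }

    // Runs the query and returns one {binLow, binHigh, count} triple per non-empty bin.
    static List<long[]> histogram(Connection conn, String table, String column, int binSize)
            throws SQLException {
        List<long[]> bins = new ArrayList<>();
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(buildQuery(table, column, binSize))) {
            while (rs.next()) {
                bins.add(new long[] {
                        rs.getLong("BinLow"), rs.getLong("BinHigh"), rs.getLong("Cnt")});
            }
        }
        return bins;
    }
}
```

With this shape, only one row per non-empty bin crosses the network, regardless of how many rows the table holds. A call such as SqlHistogram.histogram(conn, "Test", "Profit", 4) would be the entry point, given an open java.sql.Connection.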


 