How do I parse a specific data file and cluster its contents? (Java)

Question

I have a file like the following:

150 event4
160 event4
160 event0
170 event4
175 event4
180 event4
190 event4
192 event3
195 event4
----------
----------

The first column is the time, in milliseconds, at which the corresponding event occurred. So event4 occurred at 150 milliseconds.

I have the following tasks to do:

Iterate through the lines one by one.
If the gap between consecutive events is less than 80 milliseconds, they belong to the sequence of a single activity.

For example:

100 event4
120 event5 
140 event6
200 event4

All of them have consecutive differences of not more than 80 milliseconds. If the difference is more than 80 milliseconds, the current sequence has ended and a new sequence has started. My goal is to cluster the sequences and, for each cluster, report how many times each event occurred. So, in the following example, in cluster 1 event4 occurred 2 times, event5 1 time and event6 1 time; in the second cluster event4 occurred 3 times and event5 1 time.

100 event4
120 event5 
140 event6
200 event4

300 event4
320 event4 
340 event4
400 event5

What I am doing now is this:

I make a list of strings. I parse the file and measure the gap between lines; if it is less than 80 milliseconds, I add the line to the list.
When I find an event with a gap of more than 80 milliseconds, I stop adding and create a new list for the next sequence.
After collecting all the sequences in separate lists, I traverse the lists to count the occurrences of each event.
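The steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: the class and method names (SequenceSplitter, split, count) are made up for this example, every non-blank line is assumed to have the form "<time> <event>", and a gap strictly greater than 80 ms is assumed to start a new sequence (the description leaves the case of exactly 80 ms open).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not taken from the original post.
class SequenceSplitter {
    // Assumption: a gap strictly greater than this starts a new sequence.
    static final long MAX_GAP_MS = 80;

    /** Splits lines of the form "<timeMs> <eventName>" into sequences. */
    static List<List<String>> split(List<String> lines) {
        List<List<String>> sequences = new ArrayList<>();
        List<String> current = null;
        long previousTime = 0;
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // ignore blank lines
            }
            String[] parts = trimmed.split("\\s+");
            long time = Long.parseLong(parts[0]);
            // Start a new list for the first event or after a large gap.
            if (current == null || time - previousTime > MAX_GAP_MS) {
                current = new ArrayList<>();
                sequences.add(current);
            }
            current.add(parts[1]);
            previousTime = time;
        }
        return sequences;
    }

    /** Counts how often each event name occurs in one sequence. */
    static Map<String, Integer> count(List<String> sequence) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String event : sequence) {
            counts.merge(event, 1, Integer::sum);
        }
        return counts;
    }
}
```

Calling split on the lines of the file and then count on each returned list gives one map of event counts per sequence. Neither the number of sequences nor the set of event names has to be known in advance, because both the outer list and the maps grow on demand.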

I don't know whether this is an efficient approach. I have a couple of problems:

I do not know how many clusters of sequences there are, so the number of lists I need for storing the clusters is not fixed.
The event names are not fixed. They can be event1 to event100, or event1 to event45. So the number of variables needed to store the event counts is not fixed either.

So, do you have any better ideas?

Answer 1

This is not what one would call "clustering" in science; it is just grouping or aggregation. You aggregate events unless they are separated by too much time.

As for the approach, you are pursuing the canonical one. You can't do better than linear time unless your data is already in a complex database index. As long as it is a text file, there is no way except to read it linearly.
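To illustrate the linear read, here is a minimal single-pass sketch that keeps only the counts of each group in memory instead of the lines themselves. The names StreamingAggregator and aggregate are invented for this example, and it assumes that a gap strictly greater than maxGapMs starts a new group and that lines which are not of the form "<time> <event>" (blank lines, separator lines) can be skipped.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the original answer.
class StreamingAggregator {
    /** Reads the source once and returns one event-count map per group. */
    static List<Map<String, Integer>> aggregate(Reader source, long maxGapMs) {
        List<Map<String, Integer>> groups = new ArrayList<>();
        Map<String, Integer> current = null;
        long previousTime = 0;
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.trim().split("\\s+");
                if (parts.length != 2) {
                    continue; // assumption: skip blank or malformed lines
                }
                long time;
                try {
                    time = Long.parseLong(parts[0]);
                } catch (NumberFormatException e) {
                    continue; // assumption: skip lines without a numeric time
                }
                // Open a new group for the first event or after a large gap.
                if (current == null || time - previousTime > maxGapMs) {
                    current = new HashMap<>();
                    groups.add(current);
                }
                current.merge(parts[1], 1, Integer::sum);
                previousTime = time;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return groups;
    }
}
```

For a real file, the Reader could come from java.nio.file.Files.newBufferedReader; in that case the path is whatever your data file is called. Each line is touched exactly once, so the running time is linear in the size of the file.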

As for the data structures, there is nothing wrong with organizing the data as an ArrayList<ArrayList<String>> or ArrayList<HashMap<String, Integer>>, as the event IDs are strings. The memory requirements should be moderate, and this should scale to about a gigabyte of data. If you run into memory problems, try maintaining a HashSet<String> to keep only one copy of each event string, and convert the time to a numerical data type. You should then be able to load several GB, as long as the number of distinct events is small enough.
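A rough sketch of the memory-saving idea, with one deviation that should be stated plainly: java.util.Set has no method that returns the instance it already stores, so this example uses a HashMap<String, String> as the pool instead of a HashSet<String> (String.intern() would be another option). The class names EventPool, TimedEvent and CompactLoader are made up for illustration, and every non-blank line is assumed to have the form "<time> <event>".

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical pool that hands out one shared String per event name.
class EventPool {
    private final Map<String, String> pool = new HashMap<>();

    /** Returns the shared instance that is equal to the given name. */
    String canonical(String name) {
        String existing = pool.putIfAbsent(name, name);
        return existing != null ? existing : name;
    }
}

// Hypothetical compact record: numeric time plus a shared name reference.
class TimedEvent {
    final long timeMs;
    final String name;

    TimedEvent(long timeMs, String name) {
        this.timeMs = timeMs;
        this.name = name;
    }
}

class CompactLoader {
    /** Parses lines of the form "<timeMs> <eventName>" into compact records. */
    static List<TimedEvent> load(List<String> lines) {
        EventPool pool = new EventPool();
        List<TimedEvent> events = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // ignore blank lines
            }
            String[] parts = trimmed.split("\\s+");
            // The time becomes a long; the name is replaced by the pooled copy.
            events.add(new TimedEvent(Long.parseLong(parts[0]), pool.canonical(parts[1])));
        }
        return events;
    }
}
```

With this layout each event costs one small object holding a long and a reference, and the text of every distinct event name is stored only once, however often it occurs in the file.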

Actually, I don't see any major challenge here.

Solution 1
1 2012-09-12 06:38:23
