How to parse a specific data file and cluster its contents? (Java)
I have a file like the following:
150 event4
160 event4
160 event0
170 event4
175 event4
180 event4
190 event4
192 event3
195 event4
----------
----------
The first column is the time, in milliseconds, at which the corresponding event occurred, so event4 occurred at 150 milliseconds.
I have the following tasks to do:
Iterate through the lines one by one.
If the gap between consecutive events is less than 80 milliseconds, they belong to a sequence of a single activity.
For example:
100 event4
120 event5
140 event6
200 event4
All of them have consecutive differences of no more than 80 milliseconds. A difference of more than 80 milliseconds means that the current sequence has ended and a new sequence has started.
My goal is to cluster the sequences and, for each cluster, report the number of occurrences of each event. So, in the following example, in cluster 1 event4 occurred 2 times, and event5 and event6 occurred 1 time each. In the second cluster, event4 occurred 3 times and event5 1 time.
100 event4
120 event5
140 event6
200 event4
300 event4
320 event4
340 event4
400 event5
What I am doing now is that,
I don't know whether this is an efficient approach or not. I have certain problems.
So, do you guys have any better ideas?
This is not what one would call "clustering" in science; it is just grouping or aggregation. You aggregate events unless they are separated by too much time.
As for the approach, you are pursuing the canonical one. You can't do better than linear time unless your data is already in a complex database index. As long as it is a text file, there is no way except to read it linearly.
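To make the linear pass concrete, here is a minimal sketch of the grouping described in the question. It is not the asker's code (which is not shown), and the class name EventGrouper, the method name group, and the constant MAX_GAP_MS are hypothetical. It assumes that a gap strictly greater than 80 milliseconds starts a new group (the question is not fully consistent about the boundary case) and that lines that are not of the form "time event" can be skipped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class EventGrouper {
    // Assumption: a gap strictly greater than this value starts a new group.
    static final long MAX_GAP_MS = 80;

    /** Reads "time event" lines once, in order, and returns one count map per group. */
    static List<Map<String, Integer>> group(Iterable<String> lines) {
        List<Map<String, Integer>> groups = new ArrayList<>();
        Map<String, Integer> current = null;
        long previousTime = 0;
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length != 2) {
                continue; // skip blank lines and separators such as "----------"
            }
            long time;
            try {
                time = Long.parseLong(parts[0]);
            } catch (NumberFormatException e) {
                continue; // skip lines whose first column is not a number
            }
            // Start a new group on the first event or after a large gap.
            if (current == null || time - previousTime > MAX_GAP_MS) {
                current = new LinkedHashMap<>();
                groups.add(current);
            }
            current.merge(parts[1], 1, Integer::sum); // count this event in the group
            previousTime = time;
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "100 event4", "120 event5", "140 event6", "200 event4",
                "300 event4", "320 event4", "340 event4", "400 event5");
        List<Map<String, Integer>> groups = group(sample);
        for (int i = 0; i < groups.size(); i++) {
            System.out.println("Group " + (i + 1) + ": " + groups.get(i));
        }
        // Prints:
        // Group 1: {event4=2, event5=1, event6=1}
        // Group 2: {event4=3, event5=1}
    }
}
```

Because only the current group's counts and the previous timestamp are needed at each step, the input can be fed line by line (for example from a BufferedReader) without holding the whole file in memory.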
As for the data structures, there is nothing wrong with organizing it as an ArrayList<ArrayList<String>> or ArrayList<HashMap<String, Integer>>, as the event IDs are strings. The memory requirements should be moderate, and this should scale up to about a gigabyte of input. If you run into memory problems, try maintaining a HashSet<String> to keep only one copy of each event string, and convert the time to a numerical data type. You should then be able to load several GB, provided there are few enough distinct events.
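The memory-saving idea can be sketched roughly as follows. The class CompactEvents and its methods are hypothetical names for illustration. One deliberate deviation from the suggestion above: a HashSet<String> can tell you that an equal string is already stored but cannot hand back the stored instance, so this sketch uses a HashMap<String, String> as the pool instead. Times are kept in a long[] rather than as strings.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

final class CompactEvents {
    // Pool mapping each distinct event name to the single instance we keep.
    private final Map<String, String> pool = new HashMap<>();
    private long[] times = new long[16];
    private String[] names = new String[16];
    private int size = 0;

    /** Parses one "time event" line and stores it; malformed lines are ignored. */
    void add(String line) {
        String[] parts = line.trim().split("\\s+");
        if (parts.length != 2) {
            return;
        }
        long time;
        try {
            time = Long.parseLong(parts[0]);
        } catch (NumberFormatException e) {
            return;
        }
        if (size == times.length) { // grow both arrays together
            times = Arrays.copyOf(times, size * 2);
            names = Arrays.copyOf(names, size * 2);
        }
        times[size] = time;
        // Reuse the pooled instance if this name was seen before.
        names[size] = pool.computeIfAbsent(parts[1], k -> k);
        size++;
    }

    int size() { return size; }
    long timeAt(int i) { return times[i]; }
    String nameAt(int i) { return names[i]; }
    int distinctNames() { return pool.size(); }

    public static void main(String[] args) {
        CompactEvents events = new CompactEvents();
        for (String line : new String[] {"150 event4", "160 event4", "160 event0"}) {
            events.add(line);
        }
        System.out.println(events.size() + " events, "
                + events.distinctNames() + " distinct names");
        // Prints: 3 events, 2 distinct names
    }
}
```

With this layout each stored event costs one long plus one reference, and the string data itself is held once per distinct event name. String.intern() is a built-in alternative to the explicit pool.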
Actually, I don't see any major challenge here.