How to parse a specific data file and cluster its contents? Java

I have a file like the following:

150 event4
160 event4
160 event0
170 event4
175 event4
180 event4
190 event4
192 event3
195 event4
----------
----------

The first column is the time, in milliseconds, at which the corresponding event occurred. So event4 occurred at 150 milliseconds.

I have the following tasks to do:

  1. Iterate through the lines one by one.

  2. If the gap between consecutive events is less than 80 milliseconds, they belong to the sequence of a single activity.

For example:

100 event4
120 event5 
140 event6
200 event4

All of them have consecutive differences of no more than 80 milliseconds. A difference of more than 80 milliseconds means the current sequence has ended and a new sequence has started. My goal is to cluster the sequences and, for each cluster, report how many times each event occurred. So, in the following example, in cluster 1 event4 occurred 2 times, and event5 and event6 occurred 1 time each. In the second cluster event4 occurred 3 times and event5 1 time.

100 event4
120 event5 
140 event6
200 event4

300 event4
320 event4 
340 event4
400 event5

What I am doing now is this:

  1. I make a list of strings. I parse the file and measure the gap between lines; if it is less than 80 milliseconds, I add the line to the list.
  2. When I find an event with a gap of more than 80 milliseconds, I stop adding and create a new list for the next sequence.
  3. After having all the sequences in different lists, I traverse the lists to count the occurrences of each event.
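
The three steps above can be sketched roughly as follows. This is only an illustration of the described approach, not the asker's actual code: the class and method names (`TwoPhaseGrouping`, `splitIntoSequences`, `countEvents`) are made up, every non-blank line is assumed to have the form `<time> <event>`, and a gap of exactly 80 milliseconds is assumed to stay in the same sequence, since the question leaves that case open.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class TwoPhaseGrouping {
    // Assumption: a gap strictly greater than this starts a new sequence.
    static final long MAX_GAP_MS = 80;

    // Steps 1 and 2: split the lines into sequences wherever the gap is too large.
    static List<List<String>> splitIntoSequences(List<String> lines) {
        List<List<String>> sequences = new ArrayList<>();
        List<String> current = null;
        long previousTime = 0;
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines
            }
            long time = Long.parseLong(trimmed.split("\\s+")[0]);
            if (current == null || time - previousTime > MAX_GAP_MS) {
                current = new ArrayList<>();
                sequences.add(current);
            }
            current.add(trimmed);
            previousTime = time;
        }
        return sequences;
    }

    // Step 3: count how often each event name occurs in one sequence.
    static Map<String, Integer> countEvents(List<String> sequence) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted by event name
        for (String line : sequence) {
            String event = line.split("\\s+")[1];
            counts.merge(event, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            "100 event4", "120 event5 ", "140 event6", "200 event4",
            "300 event4", "320 event4 ", "340 event4", "400 event5");
        for (List<String> sequence : splitIntoSequences(lines)) {
            System.out.println(countEvents(sequence));
        }
        // prints {event4=2, event5=1, event6=1}
        // then   {event4=3, event5=1}
    }
}
```

Because the outer list and the per-sequence maps grow on demand, neither the number of sequences nor the set of event names has to be known in advance.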

I don't know whether this is an efficient approach. I have some problems:

  • I do not know how many clusters of sequences there are, so the number of lists I need to store the clusters is not fixed.
  • The event names are not fixed. They can be event1 to event100, or event1 to event45. So the number of variables used to store the event counts is not fixed either.

So, do you have any better ideas?

It's not what one would call "clustering" in science, but just grouping or aggregation. You aggregate events unless they are separated by too much time.

As for the approach, you are pursuing the canonical one. You can't do better than linear time unless your data is already in a complex database index. As long as it is a text file, there is no way except to read it linearly.
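
A single linear pass could look roughly like the sketch below, which counts events per group while reading, so the raw lines never need to be stored. The names `StreamingGrouping` and `countPerSequence` are hypothetical. The sketch assumes that a gap strictly greater than 80 ms starts a new group, and that lines with fewer than two fields (blank lines, or the `----------` lines in the sample) can be skipped.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class StreamingGrouping {
    // Assumption: a gap strictly greater than this starts a new group.
    static final long MAX_GAP_MS = 80;

    // Reads the source once, line by line, and returns one count map per group.
    static List<Map<String, Integer>> countPerSequence(Reader source) {
        List<Map<String, Integer>> result = new ArrayList<>();
        Map<String, Integer> current = null;
        long previousTime = 0;
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.trim().split("\\s+");
                if (parts.length < 2) {
                    continue; // blank line or separator, not "<time> <event>"
                }
                long time = Long.parseLong(parts[0]);
                if (current == null || time - previousTime > MAX_GAP_MS) {
                    current = new HashMap<>();
                    result.add(current);
                }
                current.merge(parts[1], 1, Integer::sum);
                previousTime = time;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        String sample = "100 event4\n120 event5 \n140 event6\n200 event4\n"
                      + "300 event4\n320 event4 \n340 event4\n400 event5\n";
        List<Map<String, Integer>> groups = countPerSequence(new StringReader(sample));
        for (int i = 0; i < groups.size(); i++) {
            System.out.println("group " + (i + 1) + ": " + groups.get(i));
        }
    }
}
```

For a real file, a `java.io.FileReader` on the file's path can be passed instead of the `StringReader` used here.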

As for the data structures, there is nothing wrong with organizing it as an ArrayList<ArrayList<String>> or ArrayList<HashMap<String, Integer>>, as the event IDs are strings. The memory requirements should be moderate, and this should scale up to a gigabyte. If you are running into memory problems, try maintaining a HashSet<String> to keep only one copy of each event string, and convert the time to a numeric data type. You should then be able to load several GB, provided there are few enough distinct events.
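
The memory-saving idea can be sketched as below. One deliberate deviation from the text: `HashSet<String>` has no method that returns the instance it already stores, so this sketch swaps in a `HashMap<String, String>` that maps each name to its first-seen instance (`String.intern()` would be another option). The names `EventNamePool`, `canonical`, and `Event` are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

class EventNamePool {
    // Maps each event name to the first String instance seen for it.
    private final Map<String, String> pool = new HashMap<>();

    // Returns the shared instance, so equal names are stored only once.
    String canonical(String name) {
        String existing = pool.putIfAbsent(name, name);
        return existing != null ? existing : name;
    }

    int size() {
        return pool.size();
    }

    // One parsed line: the time as a primitive long, the name as a shared String.
    static final class Event {
        final long timeMs;
        final String name;

        Event(long timeMs, String name) {
            this.timeMs = timeMs;
            this.name = name;
        }
    }

    // Parses a "<time> <event>" line (assumed format) into an Event.
    Event parse(String line) {
        String[] parts = line.trim().split("\\s+");
        return new Event(Long.parseLong(parts[0]), canonical(parts[1]));
    }

    public static void main(String[] args) {
        EventNamePool pool = new EventNamePool();
        Event a = pool.parse("150 event4");
        Event b = pool.parse("160 event4");
        System.out.println(a.name == b.name); // true: both refer to one String object
        System.out.println(pool.size());      // 1
    }
}
```

With this layout, each stored event costs one `long` plus one reference, and the string data itself is held only once per distinct event name.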

Actually, I don't see any major challenge here.

