
Data structure choice for a specific file processing need - Java

I looked up questions similar to mine, but I am looking for an optimal solution within the constraints of Java's built-in data structures.

I have two plain text files. File1 has a list of usernames, while file2 has Twitter posts from those users and others. The Twitter posts are simply shoved into the file as plain text.

For each user, if there exists a post, I have to pull all the distinct hashtags used in the post(s) (assume hashtags are integers and each post is confined to one line).

Here is my choice of data structure:

Map<String, LinkedHashSet<Integer>> usernames = new HashMap<>();

My approach to the problem:

  1. Read file1 to populate the usernames keys, putting the default value as null.
  2. Read file2 sequentially, something like post = file2.readLine().
  3. If a username in the post is found among the HashMap keys, add all hashtags discovered in the post to the value Set.

Do this approach and the data structures picked sound reasonable for a million users (file1) and, say, ten million posts (file2)?
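For what it's worth, the three steps above might be sketched roughly like this. The post format (author name first on the line, hashtags written as #123) is an assumption, since the question doesn't specify one, and `HashtagCollector` is just a hypothetical name:

```java
import java.util.*;
import java.util.regex.*;

public class HashtagCollector {

    // Assumption: hashtags are integers written as "#123" inside a post.
    private static final Pattern HASHTAG = Pattern.compile("#(\\d+)");

    public static Map<String, Set<Integer>> collect(List<String> usernames,
                                                    List<String> posts) {
        // Step 1: one map entry per username, default value null
        // (sets are created lazily, so users who never post cost no Set).
        Map<String, Set<Integer>> byUser = new HashMap<>();
        for (String u : usernames) {
            byUser.put(u, null);
        }

        // Steps 2-3: scan each post line; assume the author's username
        // is the first whitespace-delimited token on the line.
        for (String post : posts) {
            String author = post.split("\\s+", 2)[0];
            if (!byUser.containsKey(author)) {
                continue; // post from a user not in file1
            }
            Set<Integer> tags = byUser.get(author);
            if (tags == null) {
                tags = new HashSet<>();
                byUser.put(author, tags);
            }
            Matcher m = HASHTAG.matcher(post);
            while (m.find()) {
                tags.add(Integer.parseInt(m.group(1))); // Set deduplicates
            }
        }
        return byUser;
    }
}
```

The Set handles the "distinct" requirement for free; the main open question at this scale is whether the whole map fits in your heap.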

I'd say that you're reinventing the wheel. Why worry about building an in-memory relational data model of your own when there are excellent, fast, capable, mature, robust, and free Java relational databases available?

If I were to do this, I'd simply write a program to read the data in from the text files and then insert it into my database. I recommend HSQLDB. Apache Derby is also available, as is SQLite if used with a separately available JDBC driver.

The RDBMS takes care of the searching, storing, and data-mapping for you. It would likely be far more robust and performant than any solution you tried to roll on your own.

If I were to use HSQLDB for this project, the DDL I would write would look something like this:

CREATE CACHED TABLE Users (
    user_id       INTEGER       GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
    :
    :

);

CREATE CACHED TABLE Tweets (
    tweet_id      INTEGER       GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
    user_id       INTEGER       NULL,
    :
    :

    CONSTRAINT    twe_fk_user   FOREIGN KEY ( user_id ) REFERENCES Users ( user_id )
);

CREATE CACHED TABLE Tags ( 
    tag_id      INTEGER         GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
    :
    :

);

CREATE CACHED TABLE Tweet_Tag_Bridge (
    tweet_id     INTEGER       NULL,
    tag_id       INTEGER       NULL,

    CONSTRAINT   bridge_pk     PRIMARY KEY ( tweet_id, tag_id ),
    CONSTRAINT   brid_fk_twe   FOREIGN KEY ( tweet_id ) REFERENCES Tweets ( tweet_id ),
    CONSTRAINT   brid_fk_tag   FOREIGN KEY ( tag_id )  REFERENCES Tags ( tag_id )
);

Table Tweets is mapped to have a many-to-one relationship with Users (a user may have many tweets), and tweets have a many-to-many relationship with tags via the bridge table, Tweet_Tag_Bridge. The composite primary key in the bridge table assures that tags are unique for any individual tweet (i.e., no tweet should have more than one of any tag).
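With that model in place, the original question ("all distinct hashtags used by a given user") becomes a single join query. Something like the following, where user_name and tag_value are hypothetical column names standing in for the elided, non-key columns in the DDL above:

```sql
-- user_name and tag_value are assumed column names; the DDL above
-- elides the non-key columns of Users and Tags.
SELECT DISTINCT t.tag_value
FROM Users u
JOIN Tweets tw          ON tw.user_id = u.user_id
JOIN Tweet_Tag_Bridge b ON b.tweet_id = tw.tweet_id
JOIN Tags t             ON t.tag_id   = b.tag_id
WHERE u.user_name = 'someuser';
```

The DISTINCT is technically redundant here for a single tweet (the bridge table's primary key already prevents duplicate tags per tweet), but it is needed when the same tag appears across several of the user's tweets.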

You may want to use a TreeSet<Integer> instead of a LinkedHashSet<Integer>; it will use less memory (since it has no load factor).

Disclaimer: the technical posts on this site follow the CC BY-SA 4.0 license; if you need to repost, please credit this site or the original source. For any questions contact: yoyou2525@163.com.
