
data structure choice for a specific file processing need - java

I looked up questions similar to mine, but I am looking for an optimal solution within the constraints of Java's built-in data structures.

I have two plain text files. file1 has a list of usernames, while file2 has Twitter posts from those users and others. The posts are stored as plain text in the file.

For each user, if there exists a post, I have to pull all the distinct hashtags used in the post(s) (assume hashtags are integers and each post is confined to one line).

Here is my choice of data structure

Map<String, LinkedHashSet<Integer>> usernames = new HashMap<>();

My approach to the problem

  1. Read file1 to populate the keys of usernames, with null as the default value.
  2. Read file2 sequentially, something like post = file2.readLine().
  3. If a username in the post is found among the HashMap keys, add all the hashtags discovered in that post to the value Set (see the sketch after this list).
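
Here is a minimal sketch of what I have in mind. It assumes hashtags appear in a post as #<integer> and that the username is the first token on each line; the real format may differ, so treat the parsing as a placeholder.

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HashtagCollector {
    // Assumed hashtag format: '#' followed by digits.
    private static final Pattern HASHTAG = Pattern.compile("#(\\d+)");

    public static void main(String[] args) throws IOException {
        Map<String, LinkedHashSet<Integer>> usernames = new HashMap<>();

        // Step 1: read file1 and populate the keys, default value null.
        try (BufferedReader file1 = Files.newBufferedReader(Paths.get("file1.txt"))) {
            String user;
            while ((user = file1.readLine()) != null) {
                usernames.put(user.trim(), null);
            }
        }

        // Steps 2 and 3: read file2 line by line; if the post's username is a
        // known key, add every hashtag found in the post to that user's set.
        try (BufferedReader file2 = Files.newBufferedReader(Paths.get("file2.txt"))) {
            String post;
            while ((post = file2.readLine()) != null) {
                String user = post.split("\\s+", 2)[0]; // assumed: username is the first token
                if (!usernames.containsKey(user)) {
                    continue;
                }
                LinkedHashSet<Integer> tags = usernames.get(user);
                if (tags == null) {
                    tags = new LinkedHashSet<>();
                    usernames.put(user, tags);
                }
                Matcher m = HASHTAG.matcher(post);
                while (m.find()) {
                    tags.add(Integer.valueOf(m.group(1)));
                }
            }
        }
    }
}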

Do this approach and the chosen data structures sound reasonable for a million users (file1) and, say, 10 million posts (file2)?

I'd say that you're reinventing the wheel. Why worry about making an in-memory relational data model of your own when there are excellent, fast, capable, mature, robust, and free Java relational databases available?

If I were to do this, I'd simply write a program to read in the data from the text files and then insert it into my database. I recommend HSQLDB. Apache Derby is also available, as is SQLite if used with a separately available JDBC driver.

The RDBMS takes care of the searching, storing, and data mapping for you. It would likely be far more robust and performant than any solution you rolled on your own.

If I were to use HSQLDB for this project, the DDL I would write would look something like this:

CREATE CACHED TABLE Users (
    user_id       INTEGER       GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
    :
    :

);

CREATE CACHED TABLE Tweets (
    tweet_id      INTEGER       GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
    user_id       INTEGER       NULL,
    :
    :

    CONSTRAINT    twe_fk_user   FOREIGN KEY ( user_id ) REFERENCES Users ( user_id )
);

CREATE CACHED TABLE Tags ( 
    tag_id      INTEGER         GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
    :
    :

);

CREATE CACHED TABLE Tweet_Tag_Bridge (
    tweet_id     INTEGER       NULL,
    tag_id       INTEGER       NULL,

    CONSTRAINT   bridge_pk     PRIMARY KEY ( tweet_id, tag_id ),
    CONSTRAINT   brid_fk_twe   FOREIGN KEY ( tweet_id ) REFERENCES Tweets ( tweet_id ),
    CONSTRAINT   brid_fk_tag   FOREIGN KEY ( tag_id )  REFERENCES Tags ( tag_id )
);

The Tweets table has a many-to-one relationship with Users (a user may have many tweets), and Tweets has a many-to-many relationship with Tags via the bridge table, Tweet_Tag_Bridge. The composite primary key on the bridge table ensures that tags are unique for any individual tweet (i.e., no tweet carries the same tag more than once).
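
To sketch the query side with plain JDBC: assuming the tables above already exist and have been loaded from the text files (the in-memory JDBC URL, credentials, and example user_id below are just placeholders), pulling the distinct tags for one user is a single join.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class TagQuery {
    public static void main(String[] args) throws SQLException {
        // Placeholder connection: an in-memory HSQLDB database named "tweets".
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hsqldb:mem:tweets", "SA", "")) {

            // The "distinct hashtags per user" requirement becomes one join
            // over the schema shown above.
            String sql =
                "SELECT DISTINCT t.user_id, b.tag_id " +
                "FROM Tweets t " +
                "JOIN Tweet_Tag_Bridge b ON b.tweet_id = t.tweet_id " +
                "WHERE t.user_id = ?";

            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setInt(1, 42); // example user_id
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getInt("user_id") + " -> " + rs.getInt("tag_id"));
                    }
                }
            }
        }
    }
}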

You might want to use TreeSet<Integer> instead of LinkedHashSet<Integer>; it will use less memory (since it has no load factor).
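
In code, that is just a change to the declared value type:

Map<String, TreeSet<Integer>> usernames = new HashMap<>();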
