I looked at the questions similar to mine, but I am looking for an optimal solution within the constraints of Java's built-in data structures.
I have two plain-text files. file1 contains a list of usernames; file2 contains Twitter posts from those users and from others. The posts are simply dumped into the file as plain text.
For each user who has at least one post, I have to pull all the distinct hashtags used in that user's post(s) (assume hashtags are integers and each post is confined to one line).
Here is my choice of data structure:
Map<String, LinkedHashSet<Integer>> usernames = new HashMap<>();
My approach to the problem
Do this approach and the chosen data structure sound reasonable for a million users (file1) and, say, 10 million posts (file2)?
I'd say that you're reinventing the wheel. Why worry about building an in-memory relational data model of your own when there are excellent, fast, capable, mature, robust, and free Java relational databases available?
If I were to do this, I'd simply write a program to read in the data from the text files and then insert it into a database. I recommend HSQLDB. Apache Derby is also available, as is SQLite if used with a separately available JDBC driver.
The RDBMS takes care of the searching, storing, and data mapping for you. It would likely be far more robust and more performant than any solution you tried to roll on your own.
If I were to use HSQLDB for this project, the DDL that I would write would look something like this:
CREATE CACHED TABLE Users (
user_id INTEGER GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
:
:
);
CREATE CACHED TABLE Tweets (
tweet_id INTEGER GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
user_id INTEGER NULL,
:
:
CONSTRAINT twe_fk_user FOREIGN KEY ( user_id ) REFERENCES Users ( user_id )
);
CREATE CACHED TABLE Tags (
tag_id INTEGER GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
:
:
);
CREATE CACHED TABLE Tweet_Tag_Bridge (
tweet_id INTEGER NOT NULL,
tag_id INTEGER NOT NULL,
CONSTRAINT bridge_pk PRIMARY KEY ( tweet_id, tag_id ),
CONSTRAINT brid_fk_twe FOREIGN KEY ( tweet_id ) REFERENCES Tweets ( tweet_id ),
CONSTRAINT brid_fk_tag FOREIGN KEY ( tag_id ) REFERENCES Tags ( tag_id )
);
The Tweets table has a many-to-one relationship with Users (a user may have many tweets), and Tweets has a many-to-many relationship with Tags via the bridge table, Tweet_Tag_Bridge. The composite primary key on the bridge table ensures that tags are unique within any individual tweet (i.e., no tweet can have the same tag more than once).
You might want to use TreeSet<Integer> instead of LinkedHashSet<Integer>: it will use less memory (because it has no load factor).
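As a rough illustration of that suggestion, here is a minimal sketch that keeps the map from the question but gives each user a TreeSet. The question does not show the layout of file2, so the sketch assumes that the first whitespace-delimited token on each post line is the username and that hashtags look like #123. The class and method names (HashtagIndex, loadUsers, collectTags) are made up for this example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name and layout are illustrative only.
final class HashtagIndex {
    // ASSUMPTION: a hashtag is '#' followed by digits, e.g. "#42".
    // Values outside the int range would make Integer.parseInt throw.
    private static final Pattern TAG = Pattern.compile("#(\\d+)");

    /** Reads one username per line (file1) and gives each an empty tag set. */
    static Map<String, Set<Integer>> loadUsers(Reader file1) throws IOException {
        Map<String, Set<Integer>> tagsByUser = new HashMap<>();
        BufferedReader in = new BufferedReader(file1);
        for (String line; (line = in.readLine()) != null; ) {
            String name = line.trim();
            if (!name.isEmpty()) {
                tagsByUser.put(name, new TreeSet<>());
            }
        }
        return tagsByUser;
    }

    /** Streams the posts (file2) one line at a time; only the tags are kept. */
    static void collectTags(Reader file2, Map<String, Set<Integer>> tagsByUser)
            throws IOException {
        BufferedReader in = new BufferedReader(file2);
        for (String line; (line = in.readLine()) != null; ) {
            // ASSUMPTION: the first token of a post line is the username.
            String[] parts = line.trim().split("\\s+", 2);
            Set<Integer> tags = tagsByUser.get(parts[0]);
            if (tags == null || parts.length < 2) {
                continue; // author not listed in file1, or the line has no body
            }
            Matcher m = TAG.matcher(parts[1]);
            while (m.find()) {
                tags.add(Integer.parseInt(m.group(1))); // the set drops duplicates
            }
        }
    }
}
```

You would call loadUsers with a reader over file1 and then collectTags with a reader over file2. Keep in mind that a TreeSet iterates in ascending numeric order, whereas a LinkedHashSet iterates in insertion order, so the swap only makes sense if you do not need the order in which tags first appeared.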