Parsing a text file (large dataset) in Java

I have a text file (a movie reviews database) in which each line looks like this:

product/productId: B00004CK40   review/userId: A39IIHQF18YGZA   review/profileName: C. A. M. Salas  review/helpfulness: 0/0 review/score: 4.0   review/time: 1175817600 review/summary: Reliable comedy review/text: Nice script, well acted comedy, and a young Nicolette Sheridan. Cusak is in top form.

I want to parse this file in order to retrieve:

  • product/productId
  • review/userId
  • review/profileName
  • review/helpfulness
  • review/score
  • review/time
  • review/summary
  • review/text

This information will later be encapsulated using the MovieReview and Movie classes.

public class MovieReview {

    private Movie movie;
    private String userId;
    private String profileName;
    private String helpfulness;
    private Date timestamp;
    private String summary;
    private String review;
...

Can anyone suggest a correct and efficient way to parse this file (large dataset)?

Thanks.

If it's a large dataset, you'll want to avoid loading the entire list into memory at once. I'd probably solve this with a handler that is called once for each row:

public interface MovieReviewHandler {
    void handle(MovieReview review);
}

Then you could parse as follows:

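import java.io.BufferedReader;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MovieReviewParser {
    // Compiled once and reused for every line.
    private static final Pattern REGEX = Pattern.compile(
            "product/productId:(.*)review/userId:(.*)review/profileName:(.*)"); // add other fields

    public void parse(BufferedReader reader, MovieReviewHandler handler) throws IOException {
        String line;
        // Read one line at a time so the whole file is never held in memory.
        while ((line = reader.readLine()) != null) {
            Matcher matcher = REGEX.matcher(line);
            if (!matcher.matches()) throw new RuntimeException("Unexpected line format: " + line);
            MovieReview review = new MovieReview();
            // Assumes MovieReview exposes these fields; use setters if they are private.
            review.productId = matcher.group(1).trim();
            review.userId = matcher.group(2).trim();
            review.profileName = matcher.group(3).trim();
            // etc

            handler.handle(review);
        }
    }
}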
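
To make the "add other fields" step concrete, here is a minimal self-contained sketch of the same line-by-line idea covering all eight labels from the sample line. It is only one possible way to do it. The names ReviewLineParser, ParsedReview, parseLine and LABELS are hypothetical, and the question's MovieReview class is not reused because its body is elided. The sketch assumes that every line contains all eight labels in the order shown, that review/time is a Unix timestamp in seconds, and that no field value contains the label of the field that follows it.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.Date;
import java.util.function.Consumer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReviewLineParser {
    // Labels in the order they appear in the sample line (assumed fixed).
    static final String[] LABELS = {
        "product/productId", "review/userId", "review/profileName", "review/helpfulness",
        "review/score", "review/time", "review/summary", "review/text"
    };

    // One capture group per label, built once and reused for every line.
    static final Pattern LINE = buildPattern();

    private static Pattern buildPattern() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < LABELS.length; i++) {
            sb.append(Pattern.quote(LABELS[i]));
            // Reluctant groups stop at the next label; the last group takes the rest.
            sb.append(i < LABELS.length - 1 ? ":\\s*(.*?)\\s*" : ":\\s*(.*)");
        }
        return Pattern.compile(sb.toString());
    }

    /** Hypothetical value holder standing in for the question's MovieReview. */
    static final class ParsedReview {
        final String productId, userId, profileName, helpfulness, summary, text;
        final double score;
        final Date time;

        ParsedReview(Matcher m) {
            productId = m.group(1);
            userId = m.group(2);
            profileName = m.group(3);
            helpfulness = m.group(4);
            score = Double.parseDouble(m.group(5));
            // Assumption: review/time holds seconds since the Unix epoch.
            time = new Date(Long.parseLong(m.group(6)) * 1000L);
            summary = m.group(7);
            text = m.group(8);
        }
    }

    /** Parses a single line, or throws if it does not have the expected shape. */
    static ParsedReview parseLine(String line) {
        Matcher m = LINE.matcher(line.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Unexpected line format: " + line);
        }
        return new ParsedReview(m);
    }

    /** Streams the input line by line, handing each review to the handler. */
    static void parse(BufferedReader reader, Consumer<ParsedReview> handler) throws IOException {
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.trim().isEmpty()) continue; // assumption: blank lines can be skipped
            handler.accept(parseLine(line));
        }
    }

    public static void main(String[] args) throws IOException {
        String sample = "product/productId: B00004CK40   review/userId: A39IIHQF18YGZA   "
            + "review/profileName: C. A. M. Salas  review/helpfulness: 0/0 review/score: 4.0   "
            + "review/time: 1175817600 review/summary: Reliable comedy "
            + "review/text: Nice script, well acted comedy, and a young Nicolette Sheridan. "
            + "Cusak is in top form.";
        parse(new BufferedReader(new StringReader(sample)),
              r -> System.out.println(r.profileName + " scored " + r.productId + ": " + r.score));
    }
}
```

For a real file, the StringReader would be replaced by something like Files.newBufferedReader(path), and the lambda would build a MovieReview or write to a database. Since only one line and one parsed object are alive at a time, memory use stays flat regardless of file size.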
