Parsing .csv file using Java 8 Stream

I have a .csv file full of data on over 500 companies. Each row in the file holds a particular company's dataset. I need to parse this file and extract data from each row to call 4 different web services.

The first line of the .csv file contains the column names. I am trying to write a method that takes a String parameter naming one of the column titles found in the .csv file.

Based on this parameter, I want the method to parse the file using Java 8's Stream functionality and return a list of the values in that column, one per row/company.

I feel like I am making it more complicated than it needs to be, but I cannot think of a more efficient way to achieve my goal.

Any thoughts or ideas would be greatly appreciated.

Searching Stack Overflow, I found the following post, which is similar but not quite the same: Parsing a CSV file for a unique row using the new Java 8 Streams API

    public static List<String> getData(String titleToSearchFor) throws IOException{
    Path path = Paths.get("arbitoryPath");
    int titleIndex;
    String retrievedData = null;
    List<String> listOfData = null;

    if(Files.exists(path)){ 
        try(Stream<String> lines = Files.lines(path)){
            List<String> columns = lines
                    .findFirst()
                    .map((line) -> Arrays.asList(line.split(",")))
                    .get();

            titleIndex = columns.indexOf(titleToSearchFor);

            List<List<String>> values = lines
                    .skip(1)
                    .map(line -> Arrays.asList(line.split(",")))
                    .filter(list -> list.get(titleIndex) != null)
                    .collect(Collectors.toList());

            String[] line = (String[]) values.stream().flatMap(l -> l.stream()).collect(Collectors.collectingAndThen(
                    Collectors.toList(), 
                    list -> list.toArray()));
            String value = line[titleIndex];
            if(value != null && value.trim().length() > 0){
                retrievedData = value;
            }
            listOfData.add(retrievedData);
        }
    }
    return listOfTitles;
}

Thanks

You should not reinvent the wheel; use a common CSV parser library instead. For example, you can just use Apache Commons CSV.

It will handle a lot of things for you and is much more readable. There is also OpenCSV, which is even more powerful and comes with annotation-based mappings to data classes.

try (Reader reader = Files.newBufferedReader(Paths.get("file.csv"));
     CSVParser csvParser = new CSVParser(reader, CSVFormat.DEFAULT
             .withFirstRecordAsHeader())) {
    for (CSVRecord csvRecord : csvParser) {
        // Access a value by its column name
        String name = csvRecord.get("MyColumn");
        // (..)
    }
}
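To illustrate the kind of thing a parser library handles, here is a minimal sketch of two cases where a plain line.split(",") goes wrong: a quoted field that contains a comma, and trailing empty fields. The class QuotedFieldDemo and its quoteAwareSplit helper are hypothetical names made up for this illustration; they are not part of Apache Commons CSV or OpenCSV, and the helper deliberately ignores fields that span several lines.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class QuotedFieldDemo {

    // Naive approach: splits inside quoted fields and drops trailing empty fields.
    static List<String> naiveSplit(String line) {
        return Arrays.asList(line.split(","));
    }

    // Hypothetical minimal splitter that honours double quotes and "" escapes.
    // It does not handle quoted fields that contain line breaks.
    static List<String> quoteAwareSplit(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"' && i + 1 < line.length() && line.charAt(i + 1) == '"') {
                    current.append('"'); // "" inside a quoted field is a literal quote
                    i++;
                } else if (c == '"') {
                    inQuotes = false;    // closing quote
                } else {
                    current.append(c);
                }
            } else if (c == '"') {
                inQuotes = true;         // opening quote
            } else if (c == ',') {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString()); // last field, possibly empty
        return fields;
    }

    public static void main(String[] args) {
        String line = "\"Acme, Inc.\",London,42";
        System.out.println(naiveSplit(line).size());      // 4 pieces, the name is cut in two
        System.out.println(quoteAwareSplit(line).size()); // 3 fields
    }
}
```

Even this small helper is already longer than the library call above, which is roughly the point of the advice.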

Edit: Anyway, if you really want to do it on your own, take a look at this example.

I managed to shorten your snippet a bit.

If I understand you correctly, you need all values of a particular column. The name of that column is given.

The idea is the same, but I improved the file reading (the file is read only once) and removed code duplication (like line.split(",")) and unnecessary wrapping in a List (Collectors.toList()).

// read lines once (lines, asList and toList are static imports
// from Files, Arrays and Collectors)
List<String[]> lines = lines(path).map(l -> l.split(","))
                                  .collect(toList());

// find the title index
int titleIndex = lines.stream()
                      .findFirst()
                      .map(header -> asList(header).indexOf(titleToSearchFor))
                      .orElse(-1);

// collect needed values
return lines.stream()
            .skip(1)
            .map(row -> row[titleIndex])
            .collect(toList());
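One thing to watch in the snippet above: when the title is not found, titleIndex is -1 and row[titleIndex] throws an ArrayIndexOutOfBoundsException; a row with fewer columns than the header fails in the same way. A possible guarded variant is sketched below. The class name ColumnReader, the extra path parameter, the choice of IllegalArgumentException, and the decision to skip short rows are all assumptions made for this illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class ColumnReader {

    // Returns every value of the named column, excluding the header row.
    static List<String> getData(Path path, String titleToSearchFor) throws IOException {
        List<String[]> rows;
        // Files.lines keeps the file open until the stream is closed,
        // so the stream is wrapped in try-with-resources.
        try (Stream<String> lines = Files.lines(path)) {
            rows = lines.map(line -> line.split(","))
                        .collect(Collectors.toList());
        }

        // Look the title up in the header row, if there is one.
        int titleIndex = rows.isEmpty()
                ? -1
                : Arrays.asList(rows.get(0)).indexOf(titleToSearchFor);
        if (titleIndex < 0) {
            // Assumption: an unknown column is treated as a caller error.
            throw new IllegalArgumentException("No such column: " + titleToSearchFor);
        }

        return rows.stream()
                   .skip(1)                                // skip the header
                   .filter(row -> row.length > titleIndex) // assumption: ignore short rows
                   .map(row -> row[titleIndex])
                   .collect(Collectors.toList());
    }
}
```

Whether short rows should be skipped or reported depends on how clean the input file is; skipping them silently is only one option.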

I've got 2 tips not related to the issue:

1. You have hardcoded the file path; it's better to move the value to a constant or add a method parameter.
2. You could move the main part out of the if clause by checking the opposite condition, !Files.exists(path), and throwing an exception.
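The two tips could look roughly like this. It is only a sketch: the class name CsvSource, the constant DEFAULT_CSV with its "companies.csv" value, and the choice of NoSuchFileException are assumptions for illustration, not something taken from the question.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

final class CsvSource {

    // Tip 1: the location lives in one named constant (hypothetical value)
    // and callers can still pass a different path as a parameter.
    static final Path DEFAULT_CSV = Paths.get("companies.csv");

    // Tip 2: check the opposite condition first and fail fast,
    // so the main logic is no longer nested inside an if block.
    static List<String> readLines(Path path) throws IOException {
        if (!Files.exists(path)) {
            throw new NoSuchFileException(path.toString());
        }
        return Files.readAllLines(path);
    }
}
```

A call such as CsvSource.readLines(CsvSource.DEFAULT_CSV) then either returns the lines or fails with a clear exception, instead of quietly returning nothing when the file is missing.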

1) You cannot invoke multiple terminal operations on a Stream.
You invoke two of them: findFirst() to retrieve the column names and then collect() to collect the line values. The second terminal operation invoked on the Stream will throw an exception (an IllegalStateException).

2) Instead of Stream<String> lines = Files.lines(path), which reads all lines into a Stream, you should proceed in two steps by using Files.readAllLines(), which returns a List of String.
Use the first element to retrieve the column names and use the whole list to retrieve the value of each line matching the criteria.

3) You split the retrieval into multiple small steps that can be shortened into a single stream pipeline that iterates over all lines, keeps only those matching the criteria, and collects them.

It would give something like:

public static List<String> getData(String titleToSearchFor) throws IOException {
    Path path = Paths.get("arbitoryPath");

    if (Files.exists(path)) {
        List<String> lines = Files.readAllLines(path);

        List<String> columns = Arrays.asList(lines.get(0)
                                                  .split(","));

        int titleIndex = columns.indexOf(titleToSearchFor);

        List<String> values = lines.stream()
                                   .skip(1)
                                   .map(line -> Arrays.asList(line.split(",")))
                                   .map(list -> list.get(titleIndex))
                                   .filter(Objects::nonNull)
                                   .filter(s -> s.trim()
                                                 .length() > 0)
                                   .collect(Collectors.toList());

        return values;
    }

    return new ArrayList<>();

}
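Point 1 can be reproduced in isolation with an in-memory list instead of a file. This is a small sketch; the class StreamReuseDemo and its method names are made up for the demonstration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class StreamReuseDemo {

    // Returns true if reusing a stream after a terminal operation is rejected.
    static boolean secondTerminalOperationFails(List<String> source) {
        Stream<String> lines = source.stream();
        lines.findFirst(); // first terminal operation consumes the stream
        try {
            // The stream is already consumed, so this throws IllegalStateException.
            lines.skip(1).collect(Collectors.toList());
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    // Asking the list for a fresh stream works as often as needed.
    static List<String> rowsWithoutHeader(List<String> source) {
        return source.stream().skip(1).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> csv = Arrays.asList("name,age", "IBM,140", "Burger King,76");
        System.out.println(secondTerminalOperationFails(csv)); // true
        System.out.println(rowsWithoutHeader(csv).size());     // 2
    }
}
```

This is why reading into a List first helps: the list can produce a new stream for the header lookup and another one for the values.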

As usual, you should use Jackson! Check out the docs.

If you want Jackson to use the first line as header info:

public class CsvExample {
    public static void main(String[] args) throws IOException {
        String csv = "name,age\nIBM,140\nBurger King,76";
        CsvSchema bootstrapSchema = CsvSchema.emptySchema().withHeader();
        ObjectMapper mapper = new CsvMapper();
        MappingIterator<Map<String, String>> it = mapper.readerFor(Map.class).with(bootstrapSchema).readValues(csv);
        List<Map<String, String>> maps = it.readAll();
    }
}

Or you can define your schema as a Java object:

public class CsvExample {
    private static class Pojo {
        private final String name;
        private final int age;

        @JsonCreator
        public Pojo(@JsonProperty("name") String name, @JsonProperty("age") int age) {
            this.name = name;
            this.age = age;
        }

        @JsonProperty("name")
        public String getName() {
            return name;
        }

        @JsonProperty("age")
        public int getAge() {
            return age;
        }
    }

    public static void main(String[] args) throws IOException {
        String csv = "name,age\nIBM,140\nBurger King,76";
        CsvSchema bootstrapSchema = CsvSchema.emptySchema().withHeader();
        ObjectMapper mapper = new CsvMapper();
        MappingIterator<Pojo> it = mapper.readerFor(Pojo.class).with(bootstrapSchema).readValues(csv);
        List<Pojo> pojos = it.readAll();
    }
}
