
Reading chunks of a text file with a Java 8 Stream

Java 8 has a way to create a Stream from the lines of a file, in which case forEach steps through the file line by line. I have a text file with the following format:

bunch of lines with text
$$$$
bunch of lines with text
$$$$

I need to get each set of lines that goes before $$$$ into a single element in the Stream.

In other words, I need a Stream of Strings, where each String contains the content that precedes a $$$$ delimiter.

What is the best way (with minimum overhead) to do this?

I couldn't come up with a solution that processes the lines lazily. I'm not sure if this is possible.

My solution produces an ArrayList. If you have to use a Stream, simply call stream() on it.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiConsumer;

public class DelimitedFile {
    public static void main(String[] args) throws IOException {
        List<String> lines = lines(Paths.get("delimited.txt"), "$$$$");
        for (int i = 0; i < lines.size(); i++) {
            System.out.printf("%d:%n%s%n", i, lines.get(i));
        }
    }

    public static List<String> lines(Path path, String delimiter) throws IOException {
        return Files.lines(path)
                .collect(ArrayList::new, new BiConsumer<ArrayList<String>, String>() {
                    boolean add = true;

                    @Override
                    public void accept(ArrayList<String> lines, String line) {
                        if (delimiter.equals(line)) {
                            add = true;
                        } else {
                            if (add) {
                                lines.add(line);
                                add = false;
                            } else {
                                int i = lines.size() - 1;
                                lines.set(i, lines.get(i) + '\n' + line);
                            }
                        }
                    }
                }, ArrayList::addAll);
    }
}

File content:

bunch of lines with text
bunch of lines with text2
bunch of lines with text3
$$$$
2bunch of lines with text
2bunch of lines with text2
$$$$
3bunch of lines with text
3bunch of lines with text2
3bunch of lines with text3
3bunch of lines with text4
$$$$

Output:

0:
bunch of lines with text
bunch of lines with text2
bunch of lines with text3
1:
2bunch of lines with text
2bunch of lines with text2
2:
3bunch of lines with text
3bunch of lines with text2
3bunch of lines with text3
3bunch of lines with text4

Edit:

I've finally come up with a solution which lazily generates the Stream :

public static Stream<String> lines(Path path, String delimiter) throws IOException {
    Stream<String> lines = Files.lines(path);
    Iterator<String> iterator = lines.iterator();
    return StreamSupport.stream(Spliterators.spliteratorUnknownSize(new Iterator<String>() {
        String nextLine;

        @Override
        public boolean hasNext() {
            if (nextLine != null) {
                return true;
            }
            while (iterator.hasNext()) {
                String line = iterator.next();
                if (!delimiter.equals(line)) {
                    nextLine = line;
                    return true;
                }
            }
            lines.close();
            return false;
        }

        @Override
        public String next() {
            if (!hasNext()) {
                throw new NoSuchElementException();
            }
            StringBuilder sb = new StringBuilder(nextLine);
            nextLine = null;
            while (iterator.hasNext()) {
                String line = iterator.next();
                if (delimiter.equals(line)) {
                    break;
                }
                sb.append('\n').append(line);
            }
            return sb.toString();
        }
    }, Spliterator.ORDERED | Spliterator.NONNULL | Spliterator.IMMUTABLE), false);
}

This is actually (coincidentally) very similar to the implementation of BufferedReader.lines() (which is used internally by Files.lines(Path)). It may be less overhead to skip both of those methods and instead use Files.newBufferedReader(Path) and BufferedReader.readLine() directly.
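As a sketch of that last idea (the class and method names here are made up for illustration), the chunks can be built with readLine() directly, without any intermediate line Stream:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class ReadLineChunks {
    // Accumulates lines into delimiter-separated chunks using readLine()
    // directly, avoiding the intermediate per-line Stream entirely.
    public static List<String> chunks(BufferedReader reader, String delimiter) throws IOException {
        List<String> chunks = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean empty = true;
        String line;
        while ((line = reader.readLine()) != null) {
            if (delimiter.equals(line)) {
                if (!empty) {             // close the current chunk
                    chunks.add(current.toString());
                    current.setLength(0);
                    empty = true;
                }
            } else {
                if (!empty) current.append('\n');
                current.append(line);
                empty = false;
            }
        }
        if (!empty) chunks.add(current.toString()); // trailing chunk without delimiter
        return chunks;
    }

    public static void main(String[] args) throws IOException {
        // In-memory input for the demo; a real file would use Files.newBufferedReader(path)
        String data = "a\nb\n$$$$\nc\n$$$$\n";
        System.out.println(chunks(new BufferedReader(new StringReader(data)), "$$$$"));
    }
}
```

For a file, only the reader construction changes: pass Files.newBufferedReader(path) in a try-with-resources block.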

You could try

List<String> list = new ArrayList<>();
try (Stream<String> stream = Files.lines(Paths.get(fileName))) {
    list = stream
            .filter(line -> !line.equals("$$$$"))
            .collect(Collectors.toList());
} catch (IOException e) {
    e.printStackTrace();
}

A similar, shorter answer already exists, but the following is type-safe and uses no extra state:

    Path path = Paths.get("... .txt");
    try {
        List<StringBuilder> glist = Files.lines(path, StandardCharsets.UTF_8)
                .collect(() -> new ArrayList<StringBuilder>(),
                        (list, line) -> {
                            if (list.isEmpty() || list.get(list.size() - 1).toString().endsWith("$$$$\n")) {
                                list.add(new StringBuilder());
                            }
                            list.get(list.size() - 1).append(line).append('\n');
                        },
                        (list1, list2) -> {
                            if (!list1.isEmpty() && !list1.get(list1.size() - 1).toString().endsWith("$$$$\n")
                                    && !list2.isEmpty()) {
                                // Merge last of list1 and first of list2:
                                list1.get(list1.size() - 1).append(list2.remove(0).toString());
                            }
                            list1.addAll(list2);
                        });
        glist.forEach(sb -> System.out.printf("------------------%n%s%n", sb));
    } catch (IOException ex) {
        Logger.getLogger(App.class.getName()).log(Level.SEVERE, null, ex);
    }

Instead of .endsWith("$$$$\n") it would be better to match the delimiter as a complete line, so that a content line that merely ends in $$$$ is not mistaken for a delimiter. Note that String.matches must cover the entire string:

.matches("(?s)(.*\n)?\\$\\$\\$\\$\n")

Here is a solution based on this previous work:

public class ChunkSpliterator extends Spliterators.AbstractSpliterator<List<String>> {
    private final Spliterator<String> source;
    private final Predicate<String> delimiter;
    private final Consumer<String> getChunk;
    private List<String> current;

    ChunkSpliterator(Spliterator<String> lineSpliterator, Predicate<String> mark) {
        super(lineSpliterator.estimateSize(), ORDERED|NONNULL);
        source=lineSpliterator;
        delimiter=mark;
        getChunk=s -> {
            if(current==null) current=new ArrayList<>();
            current.add(s);
        };
    }
    public boolean tryAdvance(Consumer<? super List<String>> action) {
        while(current==null || !delimiter.test(current.get(current.size()-1)))
            if(!source.tryAdvance(getChunk)) return lastChunk(action);
        current.remove(current.size()-1);
        action.accept(current);
        current=null;
        return true;
    }
    private boolean lastChunk(Consumer<? super List<String>> action) {
        if(current==null) return false;
        action.accept(current);
        current=null;
        return true;
    }

    public static Stream<List<String>> toChunks(
        Stream<String> lines, Predicate<String> splitAt, boolean parallel) {
        return StreamSupport.stream(
            new ChunkSpliterator(lines.spliterator(), splitAt),
            parallel);
    }
}

which you can use like

try(Stream<String> lines=Files.lines(pathToYourFile)) {
    ChunkSpliterator.toChunks(
        lines,
        Pattern.compile("^\\Q$$$$\\E$").asPredicate(),
        false)
    /* chain your stream operations, e.g.
    .forEach(s -> { s.forEach(System.out::print); System.out.println(); })
     */;
}
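A self-contained demo of this spliterator (the ChunkDemo wrapper class and the sample input are made up here), joining each chunk back into a single String:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.function.Consumer;
import java.util.function.Predicate;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

public class ChunkDemo {
    // Same spliterator as above: gathers lines until a delimiter line is seen,
    // then emits the collected lines (minus the delimiter) as one chunk.
    static class ChunkSpliterator extends Spliterators.AbstractSpliterator<List<String>> {
        private final Spliterator<String> source;
        private final Predicate<String> delimiter;
        private final Consumer<String> getChunk;
        private List<String> current;

        ChunkSpliterator(Spliterator<String> lineSpliterator, Predicate<String> mark) {
            super(lineSpliterator.estimateSize(), ORDERED | NONNULL);
            source = lineSpliterator;
            delimiter = mark;
            getChunk = s -> {
                if (current == null) current = new ArrayList<>();
                current.add(s);
            };
        }

        @Override
        public boolean tryAdvance(Consumer<? super List<String>> action) {
            while (current == null || !delimiter.test(current.get(current.size() - 1)))
                if (!source.tryAdvance(getChunk)) return lastChunk(action);
            current.remove(current.size() - 1);  // drop the delimiter line
            action.accept(current);
            current = null;
            return true;
        }

        private boolean lastChunk(Consumer<? super List<String>> action) {
            if (current == null) return false;
            action.accept(current);
            current = null;
            return true;
        }
    }

    public static void main(String[] args) {
        Stream<String> lines = Stream.of("a", "b", "$$$$", "c", "$$$$");
        List<String> chunks = StreamSupport.stream(
                new ChunkSpliterator(lines.spliterator(),
                    Pattern.compile("^\\Q$$$$\\E$").asPredicate()), false)
            .map(chunk -> String.join("\n", chunk))
            .collect(Collectors.toList());
        System.out.println(chunks);
    }
}
```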

You can use a Scanner as an iterator and create the stream from it:

private static Stream<String> recordStreamOf(Readable source) {
    Scanner scanner = new Scanner(source);
    // useDelimiter takes a regex, and '$' is a metacharacter, so quote it
    scanner.useDelimiter(Pattern.quote("$$$$"));
    return StreamSupport
        .stream(Spliterators.spliteratorUnknownSize(scanner, Spliterator.ORDERED | Spliterator.NONNULL), false)
        .onClose(scanner::close);
}

This will preserve the newlines in the chunks for further filtering or splitting.
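A quick usage sketch of this approach (reading from an in-memory StringReader for brevity; note that useDelimiter takes a regex, so the literal delimiter must be quoted):

```java
import java.io.StringReader;
import java.util.List;
import java.util.Scanner;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

public class ScannerChunks {
    static Stream<String> recordStreamOf(Readable source) {
        Scanner scanner = new Scanner(source);
        // Pattern.quote makes the delimiter a literal; a bare "$$$$" would
        // be parsed as four regex end-of-line anchors and never match.
        scanner.useDelimiter(Pattern.quote("$$$$"));
        return StreamSupport
            .stream(Spliterators.spliteratorUnknownSize(scanner,
                Spliterator.ORDERED | Spliterator.NONNULL), false)
            .onClose(scanner::close);
    }

    public static void main(String[] args) {
        String data = "a\nb\n$$$$\nc\nd\n$$$$\n";
        List<String> chunks = recordStreamOf(new StringReader(data))
            .map(String::trim)             // drop the newlines around each delimiter
            .filter(s -> !s.isEmpty())     // drop the empty trailing record
            .collect(Collectors.toList());
        System.out.println(chunks);
    }
}
```

Reading a real file would just replace the StringReader with a Reader from Files.newBufferedReader(path).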
