
Java 8 stream map check previous elements

I have a question. I have a big text file that I'm currently reading; I want to build a list of the words in it and also find specific pairs in it.

An example of my dataset is:

A random text file . I am <pair-starter> first second <pair-ender> and it goes on and on,
and hopefully it ends .

Now I read the file with a stream like this:

List<String> words = Files.lines(Paths.get(filename), Charset.forName("UTF-8"))
                     .map(line -> line.split("[\\s]+"))
                     .flatMap(Arrays::stream)
                     .filter(this::filterPunctuation) // removes standalone punctuation, e.g. the dot in the example
                     .map(this::removePunctuation)    // removes attached punctuation, e.g. the comma
                     // I think the method should be added here
                     .filter(this::removePairSpesifics) // drops the pair starter and ender tokens
                     .collect(Collectors.toList());

With this code I get the clean words: a list containing "A", "random", "text", "file", "I", "am", "first", "second", "and", "it", "goes", "on", "and", "on", "and", "hopefully", "it", "ends". But I also want a HashMap that holds the pairs, and I wonder whether that is possible by adding one more method to the stream above. I couldn't find anything close to what I want on Google. Thanks in advance.

The method I have in mind is roughly:

private boolean pairStarted = false;
private String previous;                                    // last element seen by the stream
private final Map<String, String> pairs = new HashMap<>();  // my previously constructed map

private String addToHashMap(String element) {
    if ("<pair-starter>".equals(previous)) {
        pairStarted = true;
    } else if (pairStarted && !"<pair-ender>".equals(element)) {
        pairs.put(previous, element);                       // record the first-second pair
    } else if ("<pair-ender>".equals(element)) {
        pairStarted = false;
    }
    previous = element;
    return element;
}
// This method does not change the list, since it returns every element unchanged,
// but as a side effect it adds the first-second pair to the map.
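For reference, a minimal sketch of how such a method might be slotted into the pipeline at the spot marked by the comment above (this placement is only my guess, and a stateful side effect like this would only behave predictably on a sequential stream):

List<String> words = Files.lines(Paths.get(filename), Charset.forName("UTF-8"))
                     .map(line -> line.split("[\\s]+"))
                     .flatMap(Arrays::stream)
                     .filter(this::filterPunctuation)
                     .map(this::removePunctuation)
                     .map(this::addToHashMap)           // side effect: records the first-second pairs into the map
                     .filter(this::removePairSpesifics) // then drop the pair starter and ender tokens
                     .collect(Collectors.toList());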

My current solution is:

List<String> words = Files.lines(Paths.get(filename), Charset.forName("UTF-8"))
                     .map(line -> line.split("[\\s]+"))
                     .flatMap(Arrays::stream)
                     .filter(this::filterPunctuation)
                     .map(this::removePunctuation)
                     .collect(Collectors.toList()); // not using removePairSpesifics here,
                                                    // because I still need to check for the markers
for (int i = words.size() - 1; i >= 0; i--) {
    if (words.get(i).equals("<pair-ender>")) {         // scan from the end so removals don't shift unvisited indices
        pairs.put(words.get(i - 2), words.get(i - 1)); // "first" and "second" sit right before the ender
        words.remove(i);                               // remove "<pair-ender>" first (the higher index)
        words.remove(i - 3);                           // then remove "<pair-starter>"
        i -= 3;                                        // continue just below the removed pair
    }
}

What I want to learn is whether this can be solved in the same stream that reads the values into the list.

At first, I tried to separate the splitting into two stages (first on the pair markers, then on whitespace), and it worked out quite well:

public void split(Stream<String> lines)
{
    Pattern pairFinder = Pattern.compile("<pair-starter|pair-ender>");
    Pattern spaceFinder = Pattern.compile("[\\s]+");

    Map<String, String> pairs = new HashMap<>();

    List<String> words = lines.flatMap(pairFinder::splitAsStream).flatMap(pairOrNoPair -> {
        if (pairOrNoPair.startsWith(">") && pairOrNoPair.endsWith("<"))
        {
            pairOrNoPair = pairOrNoPair.replaceAll("> +| +<", "");

            String[] pair = spaceFinder.split(pairOrNoPair);
            pairs.put(pair[0], pair[1]);
            return Arrays.stream(pair);
        }
        else
        {
            return spaceFinder.splitAsStream(pairOrNoPair.trim());
        }
    })
                              .filter(this::filterPunctuation) // This removes the dot in example
                              .map(this::removePunctuation) // This removes the comma
                              .collect(Collectors.toList());

    System.out.println(words);
    System.out.println(pairs);
}

// Output
// [A, random, text, file, I, am, first, second, and, it, goes, on, and, on, and, hopefully, it, ends]
// {first=second}

boolean filterPunctuation(String s)
{
    return !s.matches("[,.?!]");
}

String removePunctuation(String s)
{
    return s.replaceAll("[,.?!]", "");
}
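As a usage sketch, assuming the same filename variable and UTF-8 file as in the question, the method can be fed straight from the file; the try-with-resources block closes the underlying file handle:

try (Stream<String> lines = Files.lines(Paths.get(filename), StandardCharsets.UTF_8)) {
    split(lines);
}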

What happens here? First, we split the line into pair and non-pair chunks. For each chunk, we check whether it is a pair. If so, we remove the markers and add the pair to the map. In either case, we split the chunk by spaces, flatten it, and proceed word by word.
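To make the first step concrete, this is roughly what the pairFinder split produces for the example sentence:

// pairFinder splits on "<pair-starter" or "pair-ender>", so
// "I am <pair-starter> first second <pair-ender> and it goes on"
// becomes three chunks:
//   "I am "              -> no markers, split by spaces as usual
//   "> first second <"   -> starts with ">" and ends with "<", so it is treated as a pair
//   " and it goes on"    -> no markers, split by spaces as usual
// The pair chunk then loses its "> " and " <" via replaceAll, leaving "first second".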

But this implementation only handles the input line by line.


To tackle pairs that span multiple lines, we can try a custom Collector. Look at this rather quick-and-dirty attempt:

String t1 = "I am <pair-starter> first second <pair-ender>, <pair-starter> and";
String t2 = " hopefully <pair-ender> it ends .";
split(Stream.of(t1, t2));

public void split(Stream<String> lines)
{
    PairResult result = lines.flatMap(Pattern.compile("[\\s]+")::splitAsStream)
                             .map(word -> word.replaceAll("[,.?!]", ""))
                             .filter(word -> !word.isEmpty())
                             .collect(new PairCollector());

    System.out.println(result.words);
    System.out.println(result.pairs);
}

// Output
// [I, am, first, second, and, hopefully, it, ends]
// {and=hopefully, first=second}

class PairCollector
    implements Collector<String, PairResult, PairResult>
{
    @Override
    public Supplier<PairResult> supplier()
    {
        return PairResult::new;
    }

    @Override
    public BiConsumer<PairResult, String> accumulator()
    {
        return (result, word) -> {
            if ("<pair-starter>".equals(word))
            {
                result.inPair = true;
            }
            else if ("<pair-ender>".equals(word))
            {
                if (result.inPair)
                {
                    result.pairs.put(result.words.get(result.words.size() - 2),
                                     result.words.get(result.words.size() - 1));
                    result.inPair = false;
                }
                else
                {
                    // starter must be in another result, keep ender for combiner
                    result.words.add(word);
                }
            }
            else
            {
                result.words.add(word);
            }
        };
    }

    @Override
    public BinaryOperator<PairResult> combiner()
    {
        return (result1, result2) -> {
            // add completed pairs
            result1.pairs.putAll(result2.pairs);

            // use accumulator to finish split pairs
            BiConsumer<PairResult, String> acc = accumulator();
            result2.words.forEach(word2 -> acc.accept(result1, word2));

            return result1;
        };
    }

    @Override
    public Function<PairResult, PairResult> finisher()
    {
        return Function.identity();
    }

    @Override
    public Set<Characteristics> characteristics()
    {
        return new HashSet<>(Arrays.asList(Characteristics.IDENTITY_FINISH));
    }
}

class PairResult
{
    public boolean                   inPair;
    public final List<String>        words = new ArrayList<>();
    public final Map<String, String> pairs = new HashMap<>();
}
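For completeness, the collector and its result class only rely on standard JDK types, so the imports are roughly:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.BiConsumer;
import java.util.function.BinaryOperator;
import java.util.function.Function;
import java.util.function.Supplier;
import java.util.stream.Collector;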

This collector accepts the stream word by word and keeps a bit of internal state to track pairs. It should even work for parallel streams, combining the separately accumulated chunks of words into one result.
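A minimal usage sketch, again assuming a filename variable for the input file; with a parallel stream, the combiner is what stitches pairs that were split across chunks back together:

try (Stream<String> lines = Files.lines(Paths.get(filename), StandardCharsets.UTF_8)) {
    split(lines.parallel());
}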
