How can I read all of the paragraphs of a text into a list?

I'm trying to break a text into its various paragraphs. I did find this question and this question. However, I've already figured out how to detect the paragraphs. I'm having trouble saving them.

One morning, when Gregor Samsa woke from troubled dreams, he found
himself transformed in his bed into a horrible vermin.  He lay on
his armour-like back, and if he lifted his head a little he could
see his brown belly, slightly domed and divided by arches into stiff
sections.  The bedding was hardly able to cover it and seemed ready
to slide off any moment.  His many legs, pitifully thin compared
with the size of the rest of him, waved about helplessly as he
looked.

"What's happened to me?" he thought.  It wasn't a dream.  His room,
a proper human room although a little too small, lay peacefully
between its four familiar walls.  A collection of textile samples

The text above would be counted as two paragraphs. Below is the function that I am using for paragraph detection.

public List<Paragraph> findParagraph(List<String> originalBook)
{
    List<Paragraph> paragraphs = new LinkedList<Paragraph>();
    List<String> sentences = new LinkedList<String>();


    for(int i=0;i<originalBook.size();i++)
    {
        //if it isn't a blank line
        //don't count I,II symbols
        if(!originalBook.get(i).equalsIgnoreCase("") & originalBook.get(i).length()>2)
        {
            sentences.add(originalBook.remove(i));

            //if the line ahead of where you are is a blank line you've reached the end of the paragraph
            if(i < originalBook.size()-1)
            {
                if(originalBook.get(i+1).equalsIgnoreCase("") )
                {
                    Paragraph paragraph = new Paragraph();
                    List<String> strings = sentences;
                    paragraph.setSentences(strings);
                    paragraphs.add(paragraph);
                    sentences.clear();
                }
            }
        }

    }

    return paragraphs;
}

And this is the class that defines my Paragraph:

public class Paragraph
{

    private List<String> sentences;

    public Paragraph()
    {
        super();
    }


    public List<String> getSentences() {
        return sentences;
    }

    public void setSentences(List<String> sentences) {
        this.sentences = sentences;
    }

}

I'm able to detect the paragraphs fine, but I'm clearing all of the sentences and I'm getting a list that only contains the last paragraph. I've been trying to think of a solution and I haven't been able to come up with one. Can anybody offer any advice?

I've tried to be as thorough as possible in my explanation. I can add more details if necessary.

The issue is in this block:

Paragraph paragraph = new Paragraph();
List<String> strings = sentences; // <-- !!!!!
paragraph.setSentences(strings);
paragraphs.add(paragraph);
sentences.clear();

You use the same object that sentences points to for all your paragraphs, so in the end all your Paragraph objects will point to the same List<String>. Thus, any change you make to sentences will alter that single List<String>, and the changes will be seen across all your Paragraph objects, as they all refer to the same instance.

It's a little as if sentences were a balloon: what you're doing is giving all your Paragraph objects a string leading to that balloon (plus another string leading back to sentences). If one of those objects (or the sentences reference) decides to follow the string and pop the balloon, everyone will see the change.
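The shared-reference effect is easy to reproduce in isolation. The following is only a minimal sketch; the class name AliasingDemo, the method sharedReferences, and the sample string are made up for illustration and are not part of the question's code.

```java
import java.util.LinkedList;
import java.util.List;

// Hypothetical demo class, not part of the question's code.
final class AliasingDemo {

    // Returns two references that point at one and the same list object.
    static List<List<String>> sharedReferences() {
        List<String> sentences = new LinkedList<>();
        sentences.add("first paragraph, line 1");

        List<String> a = sentences; // copies the reference, not the list
        List<String> b = sentences; // the same object again

        List<List<String>> holders = new LinkedList<>();
        holders.add(a);
        holders.add(b);

        sentences.clear(); // empties the single list that a and b both refer to
        return holders;
    }

    public static void main(String[] args) {
        List<List<String>> holders = sharedReferences();
        System.out.println(holders.get(0) == holders.get(1)); // true: same instance
        System.out.println(holders.get(0).isEmpty());         // true: cleared through the alias
    }
}
```

Assigning a list variable never copies the list, so clear() on any one of the references empties it for all of them.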

The solution is simple. Skip sentences.clear() and simply use List<String> strings = new LinkedList<>() instead of List<String> strings = sentences, so that each paragraph gets a fresh list that you fill with that paragraph's lines. That way, all your Paragraph objects will have distinct List<String> objects that hold their sentences, and changes you make to any one of them will be independent of the others. If you do that, you can skip declaring sentences at the beginning of the method too.
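One possible way to apply this is sketched below. It is not the only way to structure the method, and it makes a few assumptions that go beyond the question: whitespace-only lines count as blank, the input list is iterated rather than modified with remove(i), and the final paragraph is flushed even when the text does not end with a blank line. The names ParagraphFinder, findParagraphs, and toParagraph are hypothetical; Paragraph mirrors the class from the question.

```java
import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;

// Sketch of the fix; ParagraphFinder is a hypothetical wrapper class.
final class ParagraphFinder {

    static List<Paragraph> findParagraphs(List<String> originalBook) {
        List<Paragraph> paragraphs = new LinkedList<>();
        List<String> sentences = new LinkedList<>();

        for (String line : originalBook) {
            if (line.trim().isEmpty()) {
                // Blank line: close the current paragraph, if it has any lines.
                if (!sentences.isEmpty()) {
                    paragraphs.add(toParagraph(sentences));
                    sentences = new LinkedList<>(); // fresh list instead of clear()
                }
            } else if (line.length() > 2) {
                // Same filter as the question: skip short lines such as "I" or "II".
                sentences.add(line);
            }
        }
        // The text may not end with a blank line, so flush the last paragraph.
        if (!sentences.isEmpty()) {
            paragraphs.add(toParagraph(sentences));
        }
        return paragraphs;
    }

    private static Paragraph toParagraph(List<String> sentences) {
        Paragraph paragraph = new Paragraph();
        paragraph.setSentences(sentences);
        return paragraph;
    }

    public static void main(String[] args) {
        List<String> book = Arrays.asList("line one", "line two", "", "line three");
        for (Paragraph p : findParagraphs(book)) {
            System.out.println(p.getSentences());
        }
    }
}

// Same shape as the Paragraph class from the question.
class Paragraph {
    private List<String> sentences;

    public List<String> getSentences() {
        return sentences;
    }

    public void setSentences(List<String> sentences) {
        this.sentences = sentences;
    }
}
```

Reassigning sentences to a new LinkedList leaves the list already stored in the previous Paragraph untouched, which is the whole point of the fix.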

You can make your code cleaner and more efficient by reading the text line by line, rather than calculating indices and nesting multiple if statements.

Sample:

Scanner scan = new Scanner(new File("text.txt"));
String parag = "";

while(scan.hasNextLine())
{
    String s = scan.nextLine();
    if(s.trim().length() != 0)
        parag += s + "\n"; //new sentence
    else
    {
        System.out.println(parag); //new paragraph
        parag = "";
    }
}

System.out.println(parag); //last paragraph
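Since the question asks for a list rather than printed output, the same loop can collect the paragraphs instead. This is a hedged sketch, not the answerer's code: the class ScannerParagraphs and the method readParagraphs are hypothetical names, a StringBuilder replaces the string concatenation, and runs of consecutive blank lines are assumed to separate paragraphs without producing empty ones.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical helper that collects paragraphs instead of printing them.
final class ScannerParagraphs {

    static List<String> readParagraphs(Scanner scan) {
        List<String> paragraphs = new ArrayList<>();
        StringBuilder current = new StringBuilder();

        while (scan.hasNextLine()) {
            String line = scan.nextLine();
            if (!line.trim().isEmpty()) {
                current.append(line).append('\n'); // another line of the same paragraph
            } else if (current.length() > 0) {
                paragraphs.add(current.toString()); // blank line ends the paragraph
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            paragraphs.add(current.toString()); // last paragraph, no trailing blank line
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        // A Scanner over a String keeps the example self-contained;
        // new Scanner(new File("text.txt")) would work the same way.
        try (Scanner scan = new Scanner("first line\nsecond line\n\n\nthird line")) {
            System.out.println(readParagraphs(scan).size()); // prints 2
        }
    }
}
```

Each paragraph is stored as its own String, so nothing is shared between entries of the returned list.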
