
Text file to XML file (problem with the structure)

I want to convert a text file to an XML file with a specific structure. I want to separate the text into paragraphs and group these paragraphs into chapters. For example, every chapter should have 3 paragraphs. The root element of the XML is called "Book".

To give you one more example, I have this text file:

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.

Velit scelerisque in dictum non consectetur a erat. Sit amet justo donec enim diam vulputate. Id aliquet lectus proin nibh nisl condimentum id venenatis a.

Eget gravida cum sociis natoque penatibus et magnis dis. Habitant morbi tristique senectus et netus et. Interdum consectetur libero id faucibus nisl tincidunt eget nullam.

I want an XML file that includes one chapter with these 3 paragraphs.

Here is my code:

Chapter class:

@Data
@AllArgsConstructor
@NoArgsConstructor
public class Chapter {

    private String paragraph;
    private List<String> sentence;
    private List<String> words;
}

My main code:

public static void main(String[] args) {
    String textInputFile = "xml_files/sample.txt";
    String xmlFileOutput = "xml_files/sample.xml";

    // try-with-resources closes both the scanner and the output stream
    try (Scanner inputfile = new Scanner(new File(textInputFile));
         FileOutputStream outXML = new FileOutputStream(xmlFileOutput)) {
        convertToXml(inputfile, outXML);
    }
    catch (Exception e) {
        // report failures instead of silently swallowing them
        e.printStackTrace();
    }
}

private static void  convertToXml(Scanner inputfile, FileOutputStream outXML) throws XMLStreamException {
    XMLOutputFactory output = XMLOutputFactory.newInstance();
    XMLStreamWriter writer = output.createXMLStreamWriter(outXML);
    writer.writeStartDocument("utf-8", "1.0");
    writer.writeCharacters("\n");
    // <book> (root element)
    writer.writeStartElement("book");
    // one <Chapter> per input line
    while (inputfile.hasNext()){
        String line = inputfile.nextLine();
        Chapter chapter = getChapter(line);
        writer.writeCharacters("\n\t");
        writer.writeStartElement("Chapter");
        writer.writeCharacters("\n\t\t");
        writer.writeStartElement("Paragraph");
        writer.writeCharacters(chapter.getParagraph()+"");
        writer.writeEndElement();
        writer.writeCharacters("\n\t\t");
        writer.writeStartElement("Sentence");
        writer.writeCharacters(chapter.getSentence()+"");
        writer.writeEndElement();
        writer.writeCharacters("\n\t");
        writer.writeEndElement();
    }
    writer.writeCharacters("\n");
    writer.writeEndElement();
    writer.writeEndDocument();
}

private static Chapter getChapter(String line){
    // a line returned by Scanner.nextLine() never contains a line break,
    // so the whole line is stored as the paragraph
    String[] sentences = line.split("(?<=(?<![A-Z])\\.)");
    Chapter chapter = new Chapter();
    chapter.setParagraph(line);
    chapter.setSentence(List.of(sentences));
    return chapter;
}

I'm counting the sentences of each paragraph in the code above, but I don't have any problem there.

My output:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>

<book>

<Chapter Paragraph="Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.">
<Paragraph> Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.</Paragraph>
<Sentence>[Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.]
</Chapter>

   <Chapter Paragraph="" Sentences="[]">
        <Paragraph/>
        <Sentences>[]</Sentences>
    </Chapter>

<Chapter Paragraph="Velit scelerisque in dictum non consectetur a erat. Sit amet justo donec enim diam vulputate. Id aliquet lectus proin nibh nisl condimentum id venenatis a.">
<Paragraph> Velit scelerisque in dictum non consectetur a erat. Sit amet justo donec enim diam vulputate. Id aliquet lectus proin nibh nisl condimentum id venenatis a.</Paragraph>
<Sentence>[Velit scelerisque in dictum non consectetur a erat , Sit amet justo donec enim diam vulputate, Id aliquet lectus proin nibh nisl condimentum id venenatis a.]
</Chapter>

  (...)        

</book>

In the second chapter you can see that the paragraph and the sentence are empty. How can I prevent these empty values from being printed (I get a chapter with values, and the next chapter is always empty)? My second question is: how can I have several paragraphs in one chapter? For example, I want every chapter to include 3 paragraphs. Imagine that I have a text file with 10000 lines and I want to structure it into an XML file.

First question: please notice that your input has empty lines (line breaks) between the Lorem Ipsum paragraphs. Scanner.nextLine() returns these lines too. To avoid adding a Chapter for each of them, which then results in an empty <Sentences/> in the output, what about adding

if (line.isEmpty()) {
    continue;
}

to your loop after the inputfile.nextLine()?
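As a minimal, self-contained sketch of that check: the class name NonBlankLines and the helper readParagraphs are made up for illustration, and isBlank() assumes Java 11 or later (it also skips lines that contain only whitespace, which isEmpty() would let through). The sketch uses hasNextLine() as the natural partner of nextLine().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class NonBlankLines {

    // Returns every input line that contains more than whitespace.
    // (Hypothetical helper, not part of the original code.)
    public static List<String> readParagraphs(Scanner input) {
        List<String> paragraphs = new ArrayList<>();
        while (input.hasNextLine()) {
            String line = input.nextLine();
            if (line.isBlank()) { // Java 11+; skips "" and whitespace-only lines
                continue;
            }
            paragraphs.add(line);
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        // Sample text with an empty line and a whitespace-only line in between
        String text = "First paragraph.\n\nSecond paragraph.\n   \nThird paragraph.\n";
        List<String> paragraphs = readParagraphs(new Scanner(text));
        System.out.println(paragraphs.size()); // prints 3
    }
}
```

With the file from the question, each of the three Lorem Ipsum paragraphs would be kept and the separating empty lines dropped, so no empty Chapter is created for them.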

Second question: what about something like

private static void convertToXml(Scanner inputfile, FileOutputStream outXML) throws XMLStreamException {
    List<Chapter> chapters = new ArrayList<Chapter>();

    {
        Chapter chapter = null;

        while (inputfile.hasNext()) {
            String line = inputfile.nextLine();

            if (line.isEmpty()) {
                continue;
            }

            String[] sentences = line.split("(?<=(?<![A-Z])\\.)");

            if (chapter == null) {
                chapter = new Chapter();
            }

            chapter.getParagraph().add(line);
            chapter.getSentence().addAll(List.of(sentences));

            if (chapter.getParagraph().size() >= 3) {
                chapters.add(chapter);
                chapter = null;
            }
        }

        if (chapter != null) {
            chapters.add(chapter);
        }
    }

    XMLOutputFactory output = XMLOutputFactory.newInstance();
    XMLStreamWriter writer = output.createXMLStreamWriter(outXML);
    writer.writeStartDocument("utf-8", "1.0");
    writer.writeCharacters("\n");
    writer.writeStartElement("book");
    writer.writeCharacters("\n");

    for (Chapter chapter : chapters) {
        writer.writeCharacters("\t");
        writer.writeStartElement("Chapter");
        writer.writeCharacters("\n");

        for (String paragraph : chapter.getParagraph()) {
            writer.writeCharacters("\t\t");
            writer.writeStartElement("Paragraph");
            writer.writeCharacters(paragraph);
            writer.writeEndElement();
            writer.writeCharacters("\n");
        }

        writer.writeCharacters("\t\t");
        writer.writeStartElement("Sentence");
        writer.writeCharacters(chapter.getSentence()+"");
        writer.writeEndElement();
        writer.writeCharacters("\n\t");
        writer.writeEndElement();
        writer.writeCharacters("\n");
    }

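    writer.writeEndElement();
    writer.writeEndDocument();
    // push buffered output to the stream; close() releases the writer
    // but does not close the underlying FileOutputStream
    writer.flush();
    writer.close();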
}

with a Chapter.java like

public class Chapter {

    private List<String> paragraph = new ArrayList<String>();
    private List<String> sentence = new ArrayList<String>();

    public List<String> getParagraph() {
        return paragraph;
    }

    public List<String> getSentence() {
        return sentence;
    }
}

and getChapter() no longer needed (or you may put the plain-text file reading and the XML output generation into separate methods, etc.)?

Please be aware that with my proposal you keep all the Chapter objects and paragraph strings in memory. If you want to avoid this, you can merge input file processing and output generation back together. I only separated the two to better illustrate how to collect the paragraphs. You could just as easily write out a Chapter as soon as it has collected 3 paragraphs, plus once more after the loop (in case a remaining Chapter object has not been written out yet), and not grow a List<Chapter>.
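A rough sketch of that streaming variant could look like the following. It is only an illustration under some assumptions: the class name StreamingBookWriter, the constant PARAGRAPHS_PER_CHAPTER and the helper closeChapter are invented here, the root element is written as "Book" as described in the question, and the Sentence element is left out to keep the example short.

```java
import java.io.OutputStream;
import java.util.Scanner;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

public class StreamingBookWriter {

    // Assumption taken from the question: 3 paragraphs per chapter
    static final int PARAGRAPHS_PER_CHAPTER = 3;

    public static void convertToXml(Scanner input, OutputStream out) throws XMLStreamException {
        XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out, "UTF-8");
        writer.writeStartDocument("UTF-8", "1.0");
        writer.writeCharacters("\n");
        writer.writeStartElement("Book");
        writer.writeCharacters("\n");

        int inChapter = 0; // paragraphs written to the currently open <Chapter>
        while (input.hasNextLine()) {
            String line = input.nextLine();
            if (line.isEmpty()) {
                continue; // skip the empty lines between paragraphs
            }
            if (inChapter == 0) { // open a new chapter lazily
                writer.writeCharacters("\t");
                writer.writeStartElement("Chapter");
                writer.writeCharacters("\n");
            }
            writer.writeCharacters("\t\t");
            writer.writeStartElement("Paragraph");
            writer.writeCharacters(line);
            writer.writeEndElement();
            writer.writeCharacters("\n");
            inChapter++;
            if (inChapter == PARAGRAPHS_PER_CHAPTER) {
                closeChapter(writer);
                inChapter = 0;
            }
        }
        if (inChapter > 0) { // last chapter with fewer than 3 paragraphs
            closeChapter(writer);
        }
        writer.writeEndElement(); // </Book>
        writer.writeEndDocument();
        writer.flush();
        writer.close(); // does not close the underlying stream
    }

    // Hypothetical helper: writes the closing </Chapter> tag with indentation.
    private static void closeChapter(XMLStreamWriter writer) throws XMLStreamException {
        writer.writeCharacters("\t");
        writer.writeEndElement();
        writer.writeCharacters("\n");
    }

    public static void main(String[] args) throws XMLStreamException {
        String text = "One.\n\nTwo.\n\nThree.\n\nFour.\n";
        convertToXml(new Scanner(text), System.out);
    }
}
```

Only a counter is kept between lines, so memory use should stay roughly constant regardless of whether the input has 10 or 10000 lines.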
