
javax.swing.text.ElementIterator weird behavior

I'm getting weird behavior with javax.swing.text.ElementIterator. It never shows all elements, and it shows a different number of elements depending on which type of ParserCallback I use. The test below was done with the website in my profile, but it can be reproduced with any other large HTML file.

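// some imports shown in case it's an import mixup
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Enumeration;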
import javax.swing.text.AttributeSet;
import javax.swing.text.BadLocationException;
import javax.swing.text.ChangedCharSetException;
import javax.swing.text.Element;
import javax.swing.text.ElementIterator;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.StyleConstants;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTML.Tag;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.HTMLEditorKit.Parser;
import javax.swing.text.html.HTMLEditorKit.ParserCallback;
import javax.swing.text.html.parser.ParserDelegator;

// Shows what's in an element, recursively
public void printElement(HTMLDocument htmlDoc, Element element)
        throws BadLocationException
{
    AttributeSet attributes = element.getAttributes();
    System.out.println("element: '" + element.toString().trim() + "', name: '" + element.getName() + "', children: " + element.getElementCount() + ", attributes: " + attributes.getAttributeCount() + ", leaf: " + element.isLeaf());
    Enumeration attrEnum = attributes.getAttributeNames();
    while (attrEnum.hasMoreElements())
    {
        Object attr = attrEnum.nextElement();
        System.out.println("\tAttribute: '" + attr + "', Val: '" + attributes.getAttribute(attr) + "'");
        if (attr == StyleConstants.NameAttribute
                && attributes.getAttribute(StyleConstants.NameAttribute) == HTML.Tag.CONTENT)
        {
            int startOffset = element.getStartOffset();
            int endOffset = element.getEndOffset();
            int length = endOffset - startOffset;
            System.out.printf("\t\tContent (%d-%d): '%s'\n", startOffset, endOffset, htmlDoc.getText(startOffset, length).trim());
        }
    }
    for (int i = 0; i < element.getElementCount(); i++)
    {
        Element child = element.getElement(i);
        printElement(htmlDoc, child);
    }
}

public void tryParse(String filename) 
        throws FileNotFoundException, IOException, BadLocationException
{
    BufferedReader in = new BufferedReader(new InputStreamReader(new FileInputStream(filename)));

    Parser parser = new ParserDelegator();
    HTMLEditorKit htmlKit = new HTMLEditorKit();
    HTMLDocument htmlDoc = (HTMLDocument) htmlKit.createDefaultDocument();
    ParserCallback callback2 = htmlDoc.getReader(0);
    ParserCallback callback1 =
            new HTMLEditorKit.ParserCallback()
            {
            };

    parser.parse(in, callback2, true);
    ElementIterator iterator = new ElementIterator(htmlDoc);
    Element element;
    while ((element = iterator.next()) != null)
        printElement(htmlDoc, element);
    in.close();
}

In the test above, the results vary depending on whether I use callback1 or callback2. Even weirder, if I fill the callbacks with the appropriate functions and have them output something, they show that the parser does handle the whole website, but the ElementIterator still doesn't have it all.

I've also tried to use htmlKit.read() instead of parser.parse(), but it still doesn't work.

Although I'm now getting my desired results by using the parser callback functions (not shown here), I still wonder why ElementIterator doesn't work as expected, in case I need it later. Does anyone here have experience with ElementIterator and can explain this?
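For readers unfamiliar with the callback route mentioned above, here is a minimal sketch of what such a callback could look like. The original callback functions are not shown in this post, so the class name `CallbackSketch`, the `collect` helper, and the choice of recording only start tags and text are all assumptions made for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper, not the code from the question.
final class CallbackSketch {

    /** Records every start tag and text chunk the parser reports. */
    static List<String> collect(String html) throws IOException {
        final List<String> events = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                events.add("start:" + tag);
            }

            @Override
            public void handleText(char[] data, int pos) {
                events.add("text:" + new String(data));
            }
        };
        // The third argument tells the parser to ignore charset directives.
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return events;
    }

    public static void main(String[] args) throws IOException {
        for (String event : collect("<html><body><p>Hello</p></body></html>")) {
            System.out.println(event);
        }
    }
}
```

Because the callback receives events directly from the parser, it does not depend on the state of any HTMLDocument, which would fit the observation that the callbacks see the whole page.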

Update: Complete Java source uploaded here: http://home.snafu.de/tilman/tmp/Main.java

Using the approach seen here, I haven't noticed the problem you describe. I added a println(), and all the elements seem to be there.

Addendum: I'm not sure how your tryParse() fails, but your printElement() seems to work from my main():

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Enumeration;
import javax.swing.text.AttributeSet;
import javax.swing.text.BadLocationException;
import javax.swing.text.Element;
import javax.swing.text.ElementIterator;
import javax.swing.text.StyleConstants;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

/** @see https://stackoverflow.com/questions/2882782 */
public class NewMain {

    public static void main(String args[]) throws Exception {
        HTMLEditorKit htmlKit = new HTMLEditorKit();
        HTMLDocument htmlDoc = (HTMLDocument) htmlKit.createDefaultDocument();
        htmlKit.read(new BufferedReader(new FileReader("test.html")), htmlDoc, 0);
        ElementIterator iterator = new ElementIterator(htmlDoc);
        Element element;
        while ((element = iterator.next()) != null) {
            printElement(htmlDoc, element);
        }
    }
    private static void printElement(HTMLDocument htmlDoc, Element element)
        throws BadLocationException {
        AttributeSet attrSet = element.getAttributes();
        System.out.println(""
            + "Element: '" + element.toString().trim()
            + "', name: '" + element.getName()
            + "', children: " + element.getElementCount()
            + ", attributes: " + attrSet.getAttributeCount()
            + ", leaf: " + element.isLeaf());
        Enumeration attrNames = attrSet.getAttributeNames();
        while (attrNames.hasMoreElements()) {
            Object attr = attrNames.nextElement();
            System.out.println("  Attribute: '" + attr + "', Value: '"
                + attrSet.getAttribute(attr) + "'");
            Object tag = attrSet.getAttribute(StyleConstants.NameAttribute);
            if (attr == StyleConstants.NameAttribute
                && tag == HTML.Tag.CONTENT) {
                int startOffset = element.getStartOffset();
                int endOffset = element.getEndOffset();
                int length = endOffset - startOffset;
                System.out.printf("    Content (%d-%d): '%s'\n", startOffset,
                    endOffset, htmlDoc.getText(startOffset, length).trim());
            }
        }
    }
}
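Two differences between this version and the code in the question may be worth checking; neither is confirmed as the cause. First, ElementIterator already walks the whole element tree, so a printElement() that also recurses into children would print nested elements more than once. Second, as far as I know, HTMLEditorKit.read() calls flush() on the reader returned by getReader() once parsing finishes, while a direct parser.parse() call leaves that to the caller. The sketch below compares element counts with and without the flush; the class name `FlushSketch` and the `countElements` helper are made up for this example.

```java
import java.io.StringReader;
import javax.swing.text.ElementIterator;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper for comparing the two parsing paths.
final class FlushSketch {

    /** Parses html into a fresh HTMLDocument and counts the elements seen by ElementIterator. */
    static int countElements(String html, boolean flush) throws Exception {
        HTMLEditorKit kit = new HTMLEditorKit();
        HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
        HTMLEditorKit.ParserCallback reader = doc.getReader(0);
        new ParserDelegator().parse(new StringReader(html), reader, true);
        if (flush) {
            // Pushes any element specs still buffered in the reader into the document.
            reader.flush();
        }
        ElementIterator iterator = new ElementIterator(doc);
        int count = 0;
        while (iterator.next() != null) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        String html = "<html><head><title>t</title></head>"
            + "<body><p>first</p><p>second</p></body></html>";
        System.out.println("without flush: " + countElements(html, false));
        System.out.println("with flush:    " + countElements(html, true));
    }
}
```

If the two counts differ for your file, adding a flush() call on the callback after parser.parse() in tryParse() would be a reasonable thing to try.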
