jsoup throwing IndexOutOfBoundsException on element.remove()

I am writing a script that cleans up web pages. This includes iterating through all the tags (elements) and checking each one against certain rules:

    for (Element element : document.select("*")) {
        if (element == null) {
            continue;
        }

        // RULE1..RULE4 stand for the actual boolean conditions
        if (RULE1) {
            element.remove();
        } else if (RULE2) {
            element.remove();
        } else if (RULE3) {
            element.remove();
        } else if (RULE4) {
            element.remove();
        }
    }

I have tested this on dozens of pages without a problem. Today I hit a web page that throws java.lang.IndexOutOfBoundsException:

Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 3, Size: 1
    at java.util.ArrayList.rangeCheck(ArrayList.java:653)
    at java.util.ArrayList.remove(ArrayList.java:492)
    at org.jsoup.nodes.Node.removeChild(Node.java:423)
    at org.jsoup.nodes.Node.remove(Node.java:266)

My guess is that at some point the code tries to remove an element that has already been removed, but I can't tell how or why this would happen.

Any ideas?

Thanks.

Edit 1: Rule causing the break

I found the rule that is causing the code to fail. One of the rules doesn't actually remove the element but resets its text:

        else if ( matches junk text ) {
            String match = getMatchingJunk ( element.ownText() );
            if ( match.length()  < JUNK_TEXT_ELEMENT_REMOVAL_THRESH ) {
                element.text( removeSmallest(element.ownText(), match) ); // <= causing error
                continue;
            }

            element.remove();

        }

If I remove the line element.text( removeSmallest(element.ownText(), match) ), the error disappears.
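One possible explanation, consistent with the "Index: 3, Size: 1" message in the stack trace: jsoup documents `Element.text(String)` as clearing any existing contents of the element, so child elements that `document.select("*")` collected earlier may no longer be in the parent's child list when the loop reaches them. The sketch below models that situation with a hypothetical `MiniNode` class. It is a stand-in written for illustration, not jsoup's implementation, and the cached `siblingIndex` and removal by index are assumptions inferred from the trace.

```java
import java.util.ArrayList;
import java.util.List;

class StaleChildDemo {
    /** Hypothetical stand-in for a DOM node; NOT jsoup's implementation. */
    static class MiniNode {
        final String label;
        MiniNode parent;
        int siblingIndex; // cached position in parent.children (assumption)
        final List<MiniNode> children = new ArrayList<>();

        MiniNode(String label) { this.label = label; }

        void append(MiniNode child) {
            child.parent = this;
            child.siblingIndex = children.size();
            children.add(child);
        }

        /** Replaces all children with a single text node; old children keep a stale parent link. */
        void text(String value) {
            children.clear();
            append(new MiniNode("#text:" + value));
        }

        /** Removes this node from its parent using the cached index. */
        void remove() {
            parent.children.remove(siblingIndex);
        }
    }

    /** Returns true if removing a stale child throws IndexOutOfBoundsException. */
    static boolean reproducesFailure() {
        MiniNode div = new MiniNode("div");
        for (int i = 0; i < 4; i++) {
            div.append(new MiniNode("span" + i));
        }
        // Snapshot taken before any mutation, like iterating a select("*") result.
        List<MiniNode> snapshot = new ArrayList<>(div.children);
        div.text("cleaned"); // the child list now holds one text node
        try {
            snapshot.get(3).remove(); // cached index 3 no longer exists
            return false;
        } catch (IndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("stale remove fails: " + reproducesFailure());
    }
}
```

If this is indeed the cause, any approach that postpones the `element.text(...)` call until the traversal has finished would sidestep the stale references.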

The code seems to work if I purge junk text in two phases. It looks a bit repetitive and hackish, so there is probably a better way to do it:

1st phase: Collect all junk

        Map <String, Element> junks = new HashMap <String, Element>();
        for (Element element :  document.select("*") ) {
            ...

            if () {
                ...
            }

            else if ( matches junk text ) {
                String match = getMatchingJunk ( element.ownText() );
                if ( match.length()  < JUNK_TEXT_ELEMENT_REMOVAL_THRESH ) {
                    //element.text( removeSmallest(element.ownText(), match) ); // <= causing error
                    junks.put(element.ownText(), element); // key: the element's own text
                    continue;
                }

                element.remove();

            }
        }

2nd phase: Purge junk

    if ( !junks.isEmpty() ) {
        for(Map.Entry<String,Element> ent : junks.entrySet()){

            String match = getMatchingJunk (ent.getKey()); // this looks repetitive. probably there's a better way to do it
            if ( match.length()  < JUNK_TEXT_ELEMENT_REMOVAL_THRESH ) {
                ent.getValue().text( removeSmallest(ent.getKey(), match) ); // purge junk

            }
        } // end for
    } // end if
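The two phases above could be written more compactly by recording each pending edit during the traversal and applying the edits afterwards, which avoids recomputing the match in the second phase. The following is a minimal sketch of that collect-then-apply pattern only. `Item` is a hypothetical stand-in for jsoup's `Element`, and `stripJunk` is a placeholder rule invented for illustration, not the `getMatchingJunk`/`removeSmallest` helpers from the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class DeferredEdits {
    /** Hypothetical stand-in for an element with mutable text; not a jsoup class. */
    static class Item {
        String text;
        Item(String text) { this.text = text; }
    }

    /** Placeholder junk rule (assumption): strips a trailing "[ad]" marker. */
    static String stripJunk(String s) {
        return s.endsWith("[ad]") ? s.substring(0, s.length() - 4).trim() : s;
    }

    static void clean(List<Item> items) {
        List<Runnable> pending = new ArrayList<>();
        // Phase 1: traverse without mutating; only remember what to change.
        for (Item item : items) {
            String cleaned = stripJunk(item.text);
            if (!cleaned.equals(item.text)) {
                pending.add(() -> item.text = cleaned);
            }
        }
        // Phase 2: apply the recorded edits once the traversal is finished.
        for (Runnable edit : pending) {
            edit.run();
        }
    }

    public static void main(String[] args) {
        List<Item> items = Arrays.asList(new Item("Hello [ad]"), new Item("World"));
        clean(items);
        for (Item item : items) {
            System.out.println(item.text);
        }
    }
}
```

Because the list holds one entry per element, two elements with identical own text are both cleaned, whereas a map keyed by the text keeps only the last element stored under that key.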
