I am writing a script that cleans up web pages. This includes iterating through all the tags (elements) and checking against certain rules:
for (Element element : document.select("*")) {
    if (element == null) {
        continue;
    }
    if (RULE1) {
        element.remove();
    } else if (RULE2) {
        element.remove();
    } else if (RULE3) {
        element.remove();
    } else if (RULE4) {
        element.remove();
    }
}
I have tested this on tens of pages without a problem. Today I hit a web page that throws java.lang.IndexOutOfBoundsException:
Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 3, Size: 1
at java.util.ArrayList.rangeCheck(ArrayList.java:653)
at java.util.ArrayList.remove(ArrayList.java:492)
at org.jsoup.nodes.Node.removeChild(Node.java:423)
at org.jsoup.nodes.Node.remove(Node.java:266)
My guess is that at some point the code tries to remove an element that has already been removed, but I can't tell how or why this would happen.
Any idea?
Thanks.
Edit 1: Rule causing break
I found the rule that causes the code to fail. One of the rules doesn't actually remove the element but resets its text:
else if ( matches junk text ) {
    String match = getMatchingJunk(element.ownText());
    if (match.length() < JUNK_TEXT_ELEMENT_REMOVAL_THRESH) {
        element.text(removeSmallest(element.ownText(), match)); // <= causing error
        continue;
    }
    element.remove();
}
If I remove the line element.text( removeSmallest(element.ownText(), match) ), the error disappears.
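One plausible explanation, sketched here with plain java.util collections rather than jsoup itself: if setting an element's text replaces all of its children, any former child that is still waiting in the select("*") result keeps a stale position, and removing it later indexes past the end of the parent's new, shorter child list. ToyNode and its methods are hypothetical stand-ins written for this illustration, not jsoup's actual classes. The assumption that jsoup behaves this way fits the "Index: 3, Size: 1" in the stack trace but is not confirmed here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a DOM node; NOT jsoup's implementation.
class ToyNode {
    final String name;
    ToyNode parent;
    int siblingIndex;                          // position recorded when the node was attached
    final List<ToyNode> children = new ArrayList<>();

    ToyNode(String name) { this.name = name; }

    void append(ToyNode child) {
        child.parent = this;
        child.siblingIndex = children.size();
        children.add(child);
    }

    // Assumed behaviour of a text setter: drop every child, add one text child.
    // The dropped children still point at this node and keep their old index.
    void resetText(String text) {
        children.clear();
        append(new ToyNode("#text:" + text));
    }

    // Removes this node from its parent using the remembered position.
    void remove() {
        parent.children.remove(siblingIndex);
    }
}

class StaleIndexDemo {
    // Returns true if removing a former child after resetText() throws.
    static boolean removalFailsAfterReset() {
        ToyNode div = new ToyNode("div");
        List<ToyNode> snapshot = new ArrayList<>();   // plays the role of select("*")
        snapshot.add(div);
        for (int i = 0; i < 4; i++) {
            ToyNode span = new ToyNode("span" + i);
            div.append(span);
            snapshot.add(span);
        }
        div.resetText("cleaned");                     // div now has exactly one child
        try {
            snapshot.get(4).remove();                 // span3 still remembers index 3
            return false;
        } catch (IndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(removalFailsAfterReset());
    }
}
```

If this is indeed the cause, it would also mean that resetting the text discards the element's child elements, which may not be intended when only the element's own text contains junk.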
The code seems to work if I purge the junk text in two phases, but it looks repetitive and hackish. There is probably a better way to do it:
1st phase: Collect all junk
Map<String, Element> junks = new HashMap<String, Element>();
for (Element element : document.select("*")) {
    ...
    if ( ... ) {
        ...
    }
    else if ( matches junk text ) {
        String elOwnText = element.ownText();
        String match = getMatchingJunk(elOwnText);
        if (match.length() < JUNK_TEXT_ELEMENT_REMOVAL_THRESH) {
            // element.text( removeSmallest(element.ownText(), match) ); // <= causing error
            junks.put(elOwnText, element); // defer the text change until the loop is done
            continue;
        }
        element.remove();
    }
}
2nd phase: Purge junk
if (!junks.isEmpty()) {
    for (Map.Entry<String, Element> ent : junks.entrySet()) {
        String match = getMatchingJunk(ent.getKey()); // this looks repetitive; there is probably a better way to do it
        if (match.length() < JUNK_TEXT_ELEMENT_REMOVAL_THRESH) {
            ent.getValue().text(removeSmallest(ent.getKey(), match)); // purge junk
        }
    } // end for
} // end if
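One way to avoid calling getMatchingJunk twice might be to queue the finished edit during the first phase and simply run it in the second. The sketch below shows that pattern in plain Java. FakeElement, stripJunk and the "[ad]" marker are hypothetical stand-ins for jsoup's Element and the helpers above, so the example runs without the library. Using a list instead of a map keyed by own text also keeps two elements with identical text from overwriting each other.

```java
import java.util.ArrayList;
import java.util.List;

class DeferredEditsDemo {
    // Hypothetical stand-in for an element that only carries its own text.
    static class FakeElement {
        String text;
        FakeElement(String text) { this.text = text; }
    }

    // Hypothetical junk rule: the junk is the literal marker "[ad]".
    static String stripJunk(String ownText) {
        return ownText.replace("[ad]", "").trim();
    }

    static List<String> clean(List<FakeElement> elements) {
        List<Runnable> pending = new ArrayList<>();

        // Phase 1: inspect only; compute the replacement once and queue the edit.
        for (FakeElement el : elements) {
            if (el.text.contains("[ad]")) {
                String cleaned = stripJunk(el.text);
                pending.add(() -> el.text = cleaned);
            }
        }

        // Phase 2: the iteration is over, so the queued edits can be applied.
        for (Runnable edit : pending) {
            edit.run();
        }

        List<String> result = new ArrayList<>();
        for (FakeElement el : elements) {
            result.add(el.text);
        }
        return result;
    }

    public static void main(String[] args) {
        List<FakeElement> page = new ArrayList<>();
        page.add(new FakeElement("Hello [ad]"));
        page.add(new FakeElement("Hello [ad]"));   // same text twice: both get cleaned
        page.add(new FakeElement("World"));
        System.out.println(clean(page));
    }
}
```

With real jsoup elements, the queued action would be the element.text(...) call from the rule above; whether that is the right call depends on whether the element's child elements should survive.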