
Android: jsoup parser very slow on KitKat

jsoup seems to parse much more slowly on KitKat than on anything before KitKat. I'm not sure if it's the ART runtime, but after running a speed test on a parsing method I found it to be about 5 times slower, and I have no idea why.

This part of my code runs in the doInBackground method of an AsyncTask.

    JsoupParser parser = new JsoupParser();
    parser.setPath(String.valueOf(application.getCacheDir()));

    Collection<Section> allSections = eguide.getSectionMap().values();
    for (Section section : allSections) {
         parser.createNewAssetList();
         parser.setContent(section.color, section.name, section.text, section.slug);
         if (!TextUtils.isEmpty(section.text)) {
            section.text = parser.setWebViewStringContent();
            section.assets = parser.getAssets();
            for (Asset asset : section.assets)
                asset.heading = section.heading;
         }
    } 

I wrote this ages ago and it's probably not very efficient. It sets up the parser and loads a list of Section objects; for each one it parses the HTML, extracting tables and images into a list of Asset objects, which are then attached to the original Section object.

This is my parser class.

public class JsoupParser{

private List<Asset> assets;
private int assetCount;
private String slug,name,color,path;
private Document doc;

public JsoupParser() {
    assetCount = 0;
    assets = new ArrayList<Asset>();
}

public void setPath(String path) {
    this.path = path;
}

public void setContent(String color, String name, String text, String slug){
    this.color = color;
    this.name = name;
    this.slug = slug;
    doc = Jsoup.parse(text);
}

public void createNewAssetList(){
    assetCount = 0;
    assets = new ArrayList<Asset>();
}

public String setWebViewStringContent() {

    addScriptsAndDivTags();

    //parse images
    Elements images  = doc.select("img[src]");
    parseImages(images);

    //parse tables
    Elements tableTags = doc.select("table");
    parseTables(tableTags);

    return doc.toString();
}

private void addScriptsAndDivTags() {

    Element bodyReference = doc.select("body").first(); //grab head and body ref's
    Element headReference = doc.select("head").first();

    Element new_body = doc.createElement("body");
    //wrap content in extra div and add accordion id
    bodyReference.tagName("div");
    bodyReference.attr("id", "accordion");
    new_body.appendChild(bodyReference);
    headReference.after(new_body);
}

private void parseTables(Elements tableTags) {
    if (tableTags != null) {
        int count = 1;
        for (Element table : tableTags) {
            Asset item = new Asset();
            item.setContent(table.toString());
            item.setColor(color);
            item.id = (int) Math.ceil(Math.random() * 10000);
            item.isAsset=1;
            item.keywords = table.attr("keywords");
            String linkHref = table.attr("table_name");
            item.slug = "t_" + slug + " " + count ;
            if(!TextUtils.isEmpty(linkHref)){
               item.name = linkHref;
            }
            else{
               item.name ="Table-" + (assetCount + 1) + " in " + name;
            }
            // replace tables
            String inline = table.attr("inline");
            String button = ("<p>Dummy Button</p>");

            if(!TextUtils.isEmpty(inline)&& inline.contentEquals("false") || TextUtils.isEmpty(inline) )
            {
              table.replaceWith(new DataNode(button, ""));
            }
            else{
                Element div = doc.createElement("div");
                div.attr("class","inlineTableWrapper");
                div.attr("onclick", "window.location ='table://"+item.slug+"';");
                table.replaceWith(div);
                div.appendChild(table);
            }
            assets.add(item);
            assetCount++;
            count++;
        }
    }
}

private void parseImages(Elements images) {
    for (Element image : images) {
        Asset item = new Asset();

        String slug = image.attr("src");
        //remove first forward slash from slug to account for img:// protocol in image linking
        // startsWith avoids an exception when the src attribute is empty
        if(slug.startsWith("/"))
            slug = slug.substring(1);
        image.attr("src", path +"/images/" + slug.substring(slug.lastIndexOf("/")+1, slug.length()));
        image.attr("style", "px; border:1px solid #000000;");
        String image_name = image.attr("image_name");
        if(!TextUtils.isEmpty(image_name)){
           item.name = image_name;
        }
        else{
           item.name ="Image " + (assetCount + 1) + " in " + name;
        }

        // replace images
        String inline = image.attr("inline");

        String button = ("<p>Dummy Button</p>");
        item.setContent(image.toString()+"<br/><br/><br/><br/>");
        if(!TextUtils.isEmpty(inline)&& inline.contentEquals("false"))
        {
            image.replaceWith(new DataNode(button, ""));
        }
        else{
           image.attr("onclick", "window.location ='img://"+slug+"';");
        }

        item.keywords = image.attr("keywords");
        item.setColor(color);
        item.id = (int) Math.ceil(Math.random() * 10000);
        item.slug = slug;
        item.isAsset =2;
        assets.add(item);
        assetCount++;
    }
}

public String getName() {
    return name;
}

public List<Asset> getAssets() {
    return assets;
}
}

Again, it's probably not very efficient, but so far I have been unable to find out why it takes such a performance hit on KitKat. Any information would be greatly appreciated. Thanks!

Update (Apr. 7, 2015): The author of jsoup incorporated my suggestion into the main trunk. It now checks for ASCII or UTF encoding and skips the canEncode() call, which is slow on Android 4.4 and 5, so just update your jsoup source tree and build again, or pull his latest jar.

Earlier comments and explanation of the issue: I found what the problem was, at least in my app. The Entities.java module of jsoup has an escape() function, used e.g. by the Element.outerHtml() call for all text nodes. Among other things, it tests, for each character of each text node, whether it can be encoded with the current encoder:

 if (encoder.canEncode(c))
    accum.append(c);
 else...

The canEncode() call is extremely slow on Android KitKat and Lollipop. As my HTML output is only in UTF-8, and Unicode can encode virtually any character, this check is not necessary. I changed it by adding this test at the beginning of the escape() function:

boolean encIsUnicode = encoder.charset().name().toUpperCase().startsWith("UTF-");

then later, when the test is needed:

if (encIsUnicode || encoder.canEncode(c))
    accum.append(c);
else ...
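
For readers who want to see the whole idea in one place, here is a minimal, self-contained sketch of the shortcut. It is not jsoup's actual Entities.escape() code: the class name EscapeSketch, the method signature, and the numeric-reference fallback are assumptions made for illustration, and it only uses the standard java.nio.charset API.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.util.Locale;

// Hypothetical, simplified stand-in for an escape() routine; not jsoup's real code.
final class EscapeSketch {

    static String escape(String text, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder();
        // Computed once per call: UTF-* charsets are treated as able to encode any character.
        boolean encIsUnicode = charset.name().toUpperCase(Locale.ROOT).startsWith("UTF-");

        StringBuilder accum = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // The short-circuit means canEncode() is never called for UTF-* output.
            if (encIsUnicode || encoder.canEncode(c)) {
                accum.append(c);
            } else {
                // Assumed fallback: a decimal character reference.
                // Surrogate pairs are not handled specially in this sketch.
                accum.append("&#").append((int) c).append(';');
            }
        }
        return accum.toString();
    }
}
```

With UTF-8 output the text passes through unchanged; with US-ASCII output, a character such as U+00E9 falls through to the fallback branch.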

Now my app works like a charm on KitKat and Lollipop too - what previously took 10 seconds, now takes less than 1 second. I issued a pull request to the main jsoup repository with this change and a few smaller optimizations I made. Not sure if the jsoup author will merge it. If you want, check my fork at:

https://github.com/gregko/jsoup
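
To check whether canEncode() is also the bottleneck on your own device, a rough timing loop like the following may help. This is only a sketch: the class and method names are made up for illustration, and a single System.nanoTime() measurement is a crude indicator, not a proper benchmark.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical helper for a rough per-character canEncode() timing.
final class CanEncodeTiming {

    // Returns the elapsed nanoseconds spent calling canEncode() on every char of text.
    static long timeCanEncode(String text, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder();
        int encodable = 0;
        long start = System.nanoTime();
        for (int i = 0; i < text.length(); i++) {
            if (encoder.canEncode(text.charAt(i))) {
                encodable++;
            }
        }
        long elapsed = System.nanoTime() - start;
        // Use the counter so the loop cannot be discarded as dead code.
        if (encodable < 0) {
            throw new IllegalStateException("unreachable");
        }
        return elapsed;
    }
}
```

Running it over the text of one of your sections on a pre-KitKat device and on a KitKat device should show whether the difference comes from this call.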

If you work with some other encoding(s) which you know in advance, you could add your own tests (e.g. check whether the character is ASCII) to avoid the costly canEncode(c) call.
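
One possible shape for such a test is sketched below. The EncodeCheck class and its Mode enum are hypothetical names, not part of jsoup; the sketch picks a cheap check once, based on the charset name, and only falls back to canEncode() for charsets it does not recognize.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.util.Locale;

// Hypothetical helper: chooses a cheap encodability test per charset.
final class EncodeCheck {

    private enum Mode { ASCII, UNICODE, FALLBACK }

    private final Mode mode;
    private final CharsetEncoder encoder;

    EncodeCheck(Charset charset) {
        String name = charset.name().toUpperCase(Locale.ROOT);
        if (name.equals("US-ASCII")) {
            mode = Mode.ASCII;
        } else if (name.startsWith("UTF-")) {
            mode = Mode.UNICODE;
        } else {
            mode = Mode.FALLBACK;
        }
        encoder = charset.newEncoder();
    }

    boolean canEncode(char c) {
        switch (mode) {
            case ASCII:
                return c < 0x80;   // 7-bit range only
            case UNICODE:
                return true;       // same assumption as the UTF- shortcut above
            default:
                return encoder.canEncode(c); // slow path for unrecognized charsets
        }
    }
}
```

Create one EncodeCheck per output charset and reuse it for every character, so the charset name is inspected only once.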

Greg

You use a lot of string concatenation, which may be a killer for large amounts of data:

item.name ="Table-" + (assetCount + 1) + " in " + name;

According to the post "Is it always a bad idea to use + to concatenate strings", you should avoid concatenating in loops, which is the case with your code. How about:

item.name = String.format("Table-%s in %s", assetCount + 1, name);
