
Parsing HTML with issues in an Android app

I'm trying to parse the HTML of a web page in my Android app using Jsoup, and I have run into a problem with this specific page: http://techmvs.technion.ac.il/cics/wmn/wmngrad?ORD=1

This string:

<!doctype html public "-//ietf//dtd html 3.0//en">:

appears in the response headers section even though it is obviously not a header. The problem shows up when I execute the Jsoup connection with the following line:

Response r = Jsoup.connect("http://techmvs.technion.ac.il/cics/wmn/wmngrad?ORD=1").followRedirects(true).execute();

Jsoup appears to treat this bad header as a header with an empty name and an empty value, which causes an exception. Here is the stack trace:

W: java.lang.IllegalArgumentException: Header name must not be empty
W:     at org.jsoup.helper.Validate.notEmpty(Validate.java:102)
W:     at org.jsoup.helper.HttpConnection$Base.header(HttpConnection.java:292)
W:     at org.jsoup.helper.HttpConnection$Response.processResponseHeaders(HttpConnection.java:828)
W:     at org.jsoup.helper.HttpConnection$Response.setupFromConnection(HttpConnection.java:772)
W:     at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:569)
W:     at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:540)
W:     at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:227)
W:     at gavi_anna_netanel.com.madomes.ug_login.GradesParser$GradesFetcher.getGradesList(GradesParser.java:48)
W:     at gavi_anna_netanel.com.madomes.ug_login.GradesParser$GradesFetcher.doInBackground(GradesParser.java:32)
W:     at gavi_anna_netanel.com.madomes.ug_login.GradesParser$GradesFetcher.doInBackground(GradesParser.java:28)
W:     at android.os.AsyncTask$2.call(AsyncTask.java:287)
W:     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:305)
W:     at java.util.concurrent.FutureTask.run(FutureTask.java:137)
W:     at android.os.AsyncTask$SerialExecutor$1.run(AsyncTask.java:230)
W:     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1076)
W:     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:569)
W:     at java.lang.Thread.run(Thread.java:856)

It is worth noting that the same code did work as a plain Java project (although the bad header appeared in the HTML there as well).

Is there a way to tell Jsoup to ignore bad headers and connect to the URL anyway? If not, is there another client that will not fail on Android because of this bad header?

Thanks

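Have you tried using the XML parser instead of the HTML parser, or parseBodyFragment() instead of parse()? The two cannot be combined in a single call, because Jsoup.parseBodyFragment() accepts only the HTML and a base URI. To use the XML parser, pass it to Jsoup.parse():

Document doc = Jsoup.parse(html, "", Parser.xmlParser());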
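
The stack trace in the question shows the exception being thrown in processResponseHeaders, that is, while Jsoup is still handling the connection and before any parsing happens. One possible workaround is therefore to download the body yourself and hand only the HTML string to Jsoup. The following is a minimal sketch, not a tested fix for this particular server: the class name RawPageFetcher, the helper names readAll and fetchHtml, the timeout values, and the charset argument are all assumptions made for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper class; none of these names come from Jsoup or the question.
class RawPageFetcher {

    // Reads the whole stream into memory and decodes it with the given charset.
    public static String readAll(InputStream in, String charsetName) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return buffer.toString(charsetName);
    }

    // Downloads the response body without ever looking at the header names,
    // so a malformed header line is not validated here.
    public static String fetchHtml(String url, String charsetName) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setInstanceFollowRedirects(true);
        conn.setConnectTimeout(10000); // assumed value, in milliseconds
        conn.setReadTimeout(10000);    // assumed value, in milliseconds
        InputStream in = null;
        try {
            in = conn.getInputStream();
            return readAll(in, charsetName);
        } finally {
            if (in != null) {
                in.close();
            }
            conn.disconnect();
        }
    }
}
```

The returned string can then be passed to Jsoup.parse(html, baseUri), which only sees the document text and never the response headers. On Android this still has to run off the main thread, for example inside the same doInBackground method that appears in the stack trace, and the charset should be whatever the page actually uses.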
