How can I fully parse HTML without a third-party library?
I am puzzled by this question. I can parse HTML in the following way:
package org.owls.parser.html;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class HTMLParser {
    public static String getHTTPStringsFromWeb(String urlStr) throws Exception {
        StringBuilder sb = new StringBuilder();
        URL url = new URL(urlStr);
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        if (con.getResponseCode() == HttpURLConnection.HTTP_OK) {
            // try-with-resources closes the reader even if readLine() throws
            try (BufferedReader br = new BufferedReader(
                    new InputStreamReader(con.getInputStream()))) {
                String line;
                while ((line = br.readLine()) != null) {
                    sb.append(line);
                }
            }
        }
        return sb.toString();
    }
}
This code works well, but there is a problem: it cannot retrieve dynamic content that is produced by AJAX calls.
So I want to get the full, rendered page. Is that possible?
People talk about jsoup, but I want to know whether there is any way to do this natively.
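For what it's worth, once the raw HTML string is in hand, the JDK itself ships a basic SGML-style parser (`javax.swing.text.html.parser.ParserDelegator` with an `HTMLEditorKit.ParserCallback`) that can walk tags and attributes with no third-party dependency. A minimal sketch (the class name and the sample HTML are just for illustration):

```java
import java.io.StringReader;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class NativeHtmlParseDemo {
    // Returns the href of the first <a> tag found in the HTML, or null.
    static String extractFirstHref(String html) throws Exception {
        final String[] href = {null};
        HTMLEditorKit.ParserCallback cb = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                if (t == HTML.Tag.A && href[0] == null) {
                    Object v = a.getAttribute(HTML.Attribute.HREF);
                    if (v != null) {
                        href[0] = v.toString();
                    }
                }
            }
        };
        // true = ignore any charset declaration inside the document
        new ParserDelegator().parse(new StringReader(html), cb, true);
        return href[0];
    }

    public static void main(String[] args) throws Exception {
        String html = "<html><body><a href=\"http://example.com\">link</a></body></html>";
        System.out.println(extractFirstHref(html));
    }
}
```

Note this parser only sees the static markup, so it has the same limitation as the fetch code above: anything injected later by JavaScript is invisible to it.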
Thanks :D
There is an inherent problem in what you are trying to do: you need a web browser/environment to execute the AJAX requests. Reading the pages into a string and looking for URLs is not enough; the scripts may be doing something special with the data that you won't be able to reproduce.
You will have to use something like PhantomJS, which can load and render pages in a headless environment.
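If you want to stay in Java, one option is to drive PhantomJS as an external process. This is a rough sketch, assuming the phantomjs binary is on your PATH and a hypothetical `render.js` script that loads the URL, waits for the AJAX calls to settle, and prints `page.content` to stdout:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.Arrays;
import java.util.List;

public class PhantomRunner {
    // Command line for PhantomJS; "render.js" is a hypothetical script
    // that prints the fully rendered DOM of the given URL.
    static List<String> buildCommand(String scriptPath, String url) {
        return Arrays.asList("phantomjs", scriptPath, url);
    }

    // Launches PhantomJS and captures whatever it prints (the rendered HTML).
    static String run(String scriptPath, String url) throws Exception {
        Process p = new ProcessBuilder(buildCommand(scriptPath, url))
                .redirectErrorStream(true)
                .start();
        StringBuilder sb = new StringBuilder();
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = br.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        p.waitFor();
        return sb.toString();
    }

    public static void main(String[] args) {
        try {
            System.out.println(run("render.js", "http://example.com"));
        } catch (Exception e) {
            // phantomjs not installed, or render.js missing
            System.out.println("phantomjs not available: " + e.getMessage());
        }
    }
}
```

The rendered HTML you get back can then be fed to whatever parsing approach you prefer, since by that point the AJAX content is baked into the markup.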