
How to extract contents of <tr> tags from an HTML document using regex?

I have a document that contains data about every country. Every table row is one country:

<tr>
    <td class="td-flag"><a href="/afghanistan"><img alt="Flag of Afghanistan"  src="//flags.fmcdn.net/data/flags/mini/af.png" width="30" height="20" /></a></td>
    <td class="td-country"><a href="/afghanistan">Afghanistan</a></td>
    <td class="td-capital">Kabul</td>
    <td class="td-population">25,500,100</td>
    <td class="td-area">652,090&nbsp;km<sup>2</sup></td>
</tr>

I am trying to extract the link to the flag, the name of the country, the capital and the population. First, though, I need to insert every table row into a Vector separately, so I need to extract the contents of every <tr>content</tr>.

Question: how do I extract the contents of every <tr> in the HTML document? I get no matches at all:

try {
    BufferedReader br = new BufferedReader(new FileReader("./data/countries.txt"));
    StringBuilder sb = new StringBuilder();
    String line;
    while ((line = br.readLine()) != null) {
        sb.append(line + '\n');
    }
    br.close();

    ArrayList<String> tableRows = new ArrayList<String>();
    Pattern p = Pattern.compile(" <tr>(\\w+)</tr> ", Pattern.MULTILINE);
    Matcher m = p.matcher(sb);
    while (m.find()) {
        System.out.println("match"); // it never prints, thus there are no matches
        tableRows.add(m.group());
    }
    System.out.println(tableRows.size()); // THE SIZE is 0
    for (String tr : tableRows) {
        System.out.println(tr);
    }
} catch (Exception e) {
    e.printStackTrace();
}

There are much simpler ways to extract data from an HTML file, notably XPath and jQuery.

Regex works too, but it is more error-prone than the technologies mentioned above.
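If you do stay with regex, here is a minimal sketch of a pattern that matches whole rows. The pattern in the question cannot match for three reasons: \w+ only accepts letters, digits and underscores (no whitespace, no < or >), the literal spaces around <tr> and </tr> must be present exactly, and Pattern.MULTILINE only changes how ^ and $ behave. The class and method names below (TrRegexSketch, extractRows) are hypothetical, and the sketch assumes rows are not nested inside other rows.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the question's code.
class TrRegexSketch {
    // DOTALL lets '.' cross line breaks; '.*?' is reluctant, so each match
    // stops at the nearest </tr> instead of running on to the last one.
    private static final Pattern ROW = Pattern.compile(
            "<tr\\b[^>]*>(.*?)</tr>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    static List<String> extractRows(String html) {
        List<String> rows = new ArrayList<>();
        Matcher m = ROW.matcher(html);
        while (m.find()) {
            // group(1) is the content between the opening and closing tags
            rows.add(m.group(1).trim());
        }
        return rows;
    }

    public static void main(String[] args) {
        // Assumed sample input: two short rows in the question's style.
        String html = "<table>\n"
                + "<tr>\n    <td class=\"td-capital\">Kabul</td>\n</tr>\n"
                + "<tr>\n    <td class=\"td-capital\">Tirana</td>\n</tr>\n"
                + "</table>";
        for (String row : extractRows(html)) {
            System.out.println(row);
        }
    }
}
```

In the question's code, this would mean passing sb to extractRows instead of compiling the original pattern. Nested tables or a </tr> inside a comment would still confuse it, which is the usual argument for a real parser.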

++ Edit ++

  • XPath example

I have to admit, XPath is quite new to me, so the following code isn't the most optimized, but it will give you a quick idea of how it works. You can practice using XPath in your browser's console. Open your HTML page and wrap your expression with $x(EXPRESSION);.

$x("//tr/td[@class='td-flag']/a/@href") will render: Array [ href="/afghanistan" ]
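The same XPath expression can also be evaluated from Java with the JDK's built-in javax.xml.xpath API. This is a minimal sketch, assuming the markup is well-formed XML; real-world HTML often is not (the &nbsp; entity in the sample row, for instance, is rejected by a plain XML parser, so the sample below leaves the area cell out). The names XPathSketch and flagLinks are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper illustrating the XPath expression from the answer.
class XPathSketch {
    static String[] flagLinks(String xml) {
        try {
            // Parse the (well-formed) markup into a DOM tree.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Select the href attribute node of every flag link.
            NodeList nodes = (NodeList) xpath.evaluate(
                    "//tr/td[@class='td-flag']/a/@href", doc, XPathConstants.NODESET);
            String[] links = new String[nodes.getLength()];
            for (int i = 0; i < nodes.getLength(); i++) {
                links[i] = nodes.item(i).getNodeValue();
            }
            return links;
        } catch (Exception e) {
            // Malformed input or a bad expression ends up here.
            throw new IllegalArgumentException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        // Assumed sample input: one row wrapped in a single root element.
        String xml = "<table><tr>"
                + "<td class=\"td-flag\"><a href=\"/afghanistan\">"
                + "<img alt=\"Flag of Afghanistan\" src=\"//flags.fmcdn.net/data/flags/mini/af.png\" /></a></td>"
                + "<td class=\"td-country\"><a href=\"/afghanistan\">Afghanistan</a></td>"
                + "</tr></table>";
        for (String link : flagLinks(xml)) {
            System.out.println(link);
        }
    }
}
```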

  • jQuery example

If you have never used jQuery before, you can also play with it in your browser's console. It is essentially a JavaScript library whose sole purpose is to simplify code.

$(".td-flag a").prop("href") will render "file:///afghanistan" (the relative link resolved against a page opened from a local file)

I used your code snippet above with just one tr element, but obviously you have more tr elements, so the expressions above return arrays. Also, put an id attribute on your table element for easy and safe access ;-)

Adding on to the jQuery answer, there is also jsoup, which allows you to do jQuery-style queries in Java:

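import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

Document doc = Jsoup.connect("<your url here>").get();
Elements rows = doc.select("tr");
for (Element row : rows) {
    // getElementsByClass returns Elements; text() gives their combined text
    String country = row.getElementsByClass("td-country").text();
    // etc.
}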

If your document is online, I would suggest a tool like import.io to create an API specific to your use case.

The response is in JSON format, and it's pretty easy to work with using jQuery.

I prefer to use import.io when I have to work with tabular data on the web, rather than creating some sort of parser based on DOM elements.

You can always use jQuery and save all the data in JSON format. You will need to write a JavaScript parser that pulls the data out of the rest of the document; you then export the information you collected as JSON so you can use it everywhere.

// defining variables
var flag = $('td.td-flag img').prop('src');
var country = $('td.td-country a').html();
var capital = $('td.td-capital').html();
var population = $('td.td-population').html();
var area = $('td.td-area').html();

This is only part of the parser: it extracts the data for a single row. If you have multiple rows, you will need to run a foreach loop (each in JavaScript) over all the table rows, read each one (using the variables defined above), and at the end store them as an array or export them as JSON.
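The loop-and-export step described above can also be sketched in Java, the language of the question, rather than jQuery. This is only an illustration under loud assumptions: every name (CountryJsonSketch, parse, toJson) is hypothetical, the field patterns are tied to the exact markup of the sample row and are brittle, the area cell is skipped, and the JSON is built by hand with minimal escaping instead of a JSON library. Note that the flag value is the raw src attribute, not the resolved URL that jQuery's prop('src') returns.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: loop over rows, collect fields, export as JSON.
class CountryJsonSketch {
    private static final Pattern ROW =
            Pattern.compile("<tr\\b[^>]*>(.*?)</tr>", Pattern.DOTALL);
    // One pattern per field, assuming the attribute order of the sample row.
    private static final Pattern FLAG =
            Pattern.compile("<img\\b[^>]*\\bsrc=\"([^\"]*)\"");
    private static final Pattern COUNTRY =
            Pattern.compile("class=\"td-country\"><a\\b[^>]*>([^<]*)</a>");
    private static final Pattern CAPITAL =
            Pattern.compile("class=\"td-capital\">([^<]*)<");
    private static final Pattern POPULATION =
            Pattern.compile("class=\"td-population\">([^<]*)<");

    // Returns the first captured group, or an empty string if nothing matches.
    private static String first(Pattern p, String row) {
        Matcher m = p.matcher(row);
        return m.find() ? m.group(1).trim() : "";
    }

    static List<Map<String, String>> parse(String html) {
        List<Map<String, String>> countries = new ArrayList<>();
        Matcher rows = ROW.matcher(html);
        while (rows.find()) {
            String row = rows.group(1);
            // LinkedHashMap keeps the fields in insertion order.
            Map<String, String> country = new LinkedHashMap<>();
            country.put("flag", first(FLAG, row));
            country.put("country", first(COUNTRY, row));
            country.put("capital", first(CAPITAL, row));
            country.put("population", first(POPULATION, row));
            countries.add(country);
        }
        return countries;
    }

    static String toJson(List<Map<String, String>> countries) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < countries.size(); i++) {
            if (i > 0) sb.append(',');
            sb.append('{');
            int j = 0;
            for (Map.Entry<String, String> e : countries.get(i).entrySet()) {
                if (j++ > 0) sb.append(',');
                sb.append('"').append(escape(e.getKey())).append("\":\"")
                  .append(escape(e.getValue())).append('"');
            }
            sb.append('}');
        }
        return sb.append(']').toString();
    }

    // Minimal escaping: backslashes and double quotes only, no control characters.
    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    public static void main(String[] args) {
        // Assumed sample input: a shortened row in the question's format.
        String html = "<tr>\n"
                + "<td class=\"td-country\"><a href=\"/afghanistan\">Afghanistan</a></td>\n"
                + "<td class=\"td-capital\">Kabul</td>\n"
                + "</tr>";
        System.out.println(toJson(parse(html)));
    }
}
```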
