
How to extract contents of <tr> tags from html document using regex?

I have a document which contains data about every country. Every table row is one country:

<tr>
    <td class="td-flag"><a href="/afghanistan"><img alt="Flag of Afghanistan"  src="//flags.fmcdn.net/data/flags/mini/af.png" width="30" height="20" /></a></td>
    <td class="td-country"><a href="/afghanistan">Afghanistan</a></td>
    <td class="td-capital">Kabul</td>
    <td class="td-population">25,500,100</td>
    <td class="td-area">652,090&nbsp;km<sup>2</sup></td>
</tr>

I am trying to extract the link to the flag, the name of the country, the capital and the population, but first I need to put every table row into a Vector separately, so I need to extract the contents of every <tr>content</tr>.

Question: How to extract contents of every <tr> in the html document? I have no matches at all:

try {
            BufferedReader br = new BufferedReader(new FileReader("./data/countries.txt"));
            StringBuilder sb = new StringBuilder();
            String line;
            while ((line = br.readLine()) != null) {
                sb.append(line + '\n');
            }
            br.close();

            ArrayList<String> tableRows = new ArrayList<String>();
            Pattern p = Pattern.compile(" <tr>(\\w+)</tr> ", Pattern.MULTILINE);
            Matcher m = p.matcher(sb);
            while (m.find()) {
                System.out.println("match");//it never prints thus there are no matches
                tableRows.add(m.group());
            }
            System.out.println(tableRows.size());//THE SIZE is 0
            for (String tr : tableRows) {
                System.out.println(tr);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }

There are much simpler ways to extract data from an HTML file, notably XPath and jQuery.

Regex works too but is more prone to error than the technologies aforementioned.
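If you do stay with regex, the pattern in the question cannot match for three reasons: it requires a literal space before <tr> and after </tr>, \w+ only matches letters, digits and underscores (not whitespace or the nested <td> tags), and Pattern.MULTILINE only changes how ^ and $ behave. A minimal sketch of a pattern that does match follows; the class name TableRowExtractor and the method extractRows are made up for illustration, and it assumes rows are never nested and the tag is written in lowercase.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original question.
class TableRowExtractor {
    // DOTALL lets '.' match line breaks, so a row may span several lines.
    // The reluctant ".*?" stops at the first closing </tr>.
    // "\\b[^>]*" also accepts rows that carry attributes, e.g. <tr class="x">.
    private static final Pattern ROW =
            Pattern.compile("<tr\\b[^>]*>(.*?)</tr>", Pattern.DOTALL);

    static List<String> extractRows(String html) {
        List<String> rows = new ArrayList<>();
        Matcher m = ROW.matcher(html);
        while (m.find()) {
            // group(1) is the text between the opening and closing tag
            rows.add(m.group(1).trim());
        }
        return rows;
    }

    public static void main(String[] args) {
        String html = "<table>\n"
                + "<tr>\n"
                + "    <td class=\"td-country\"><a href=\"/afghanistan\">Afghanistan</a></td>\n"
                + "    <td class=\"td-capital\">Kabul</td>\n"
                + "</tr>\n"
                + "</table>";
        for (String row : extractRows(html)) {
            System.out.println(row);
        }
    }
}
```

In the question's code this would replace the Pattern.compile line, with m.group(1) used instead of m.group() if only the inner content is wanted.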

++ Edit ++

  • XPath example

I have to admit, XPath is quite new to me so the following code isn't the most optimized, but it will give you a quick idea of how it works. You can practice using XPath in your browser's console. Open your HTML page and wrap your expression with $x(EXPRESSION); .

$x("//tr/td[@class='td-flag']/a/@href") will render : Array [ href="/afghanistan" ]
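The same expression can also be evaluated from Java with the javax.xml.xpath package that ships with the JDK. This is only a sketch under a strong assumption: the markup must be well-formed XML with a single root element, which the sample row happens to be once &nbsp; is replaced by a numeric entity. Real-world HTML often is not, and then a lenient HTML parser is the safer route. The class name FlagLinkXPath and the method flagLinks are invented for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: assumes the input is well-formed XML.
class FlagLinkXPath {
    static List<String> flagLinks(String xhtml) {
        try {
            // &nbsp; is an HTML entity that a plain XML parser rejects,
            // so swap in the equivalent numeric character reference.
            String xml = xhtml.replace("&nbsp;", "&#160;");
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList hrefs = (NodeList) xpath.evaluate(
                    "//tr/td[@class='td-flag']/a/@href", doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < hrefs.getLength(); i++) {
                result.add(hrefs.item(i).getNodeValue()); // attribute value
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String table = "<table><tr>"
                + "<td class=\"td-flag\"><a href=\"/afghanistan\">"
                + "<img alt=\"Flag of Afghanistan\" src=\"//flags.fmcdn.net/data/flags/mini/af.png\" /></a></td>"
                + "<td class=\"td-area\">652,090&nbsp;km<sup>2</sup></td>"
                + "</tr></table>";
        System.out.println(flagLinks(table));
    }
}
```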

  • jQuery example

If you have never used jQuery before, you can also play with it in your browser's console. It's pretty much a JavaScript library with a sole purpose of code simplification.

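$(".td-flag a").prop("href") will render "file:///afghanistan" when the page is opened from a local file, because the browser resolves the link against the page URL; .attr("href") returns the raw attribute value "/afghanistan".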

I used your code snippet above with just one tr element, but your document obviously has more, so the selectors above will match one element per row. Also, put an id attribute on your table element for easy and safe access ;-)

Adding on to the jQuery answer, there is also jsoup, which allows you to run jQuery-style queries in Java:

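import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

Document doc = Jsoup.connect("<your url here>").get();
Elements rows = doc.select("tr");
for (Element row : rows) {
    // getElementsByClass returns every match; text() joins their text content
    String country = row.getElementsByClass("td-country").text();
    // etc.
}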

If the document is available online, I would suggest using a tool like import.io to create an API specific to your use case.

The response is in JSON format, and it's pretty easy to work with that using jQuery.

I prefer import.io when I have to work with tabular data on the web, rather than writing some sort of parser based on DOM elements.

You can always use jQuery and save all the data in JSON format. You will need to write a JavaScript parser that picks the data out of the rest of the document; you then store the extracted values as JSON so you can use them everywhere.

// defining variables
var flag = $('td.td-flag img').prop('src');
var country = $('td.td-country a').html();
var capital = $('td.td-capital').html();
var population = $('td.td-population').html();
var area = $('td.td-area').html();

Now this is only part of the parser: it extracts the data for a single row. If you have multiple rows, you will need to run a loop (each in jQuery) over all the table rows, read each one using the selectors above scoped to that row, and at the end collect the results in an array or export them as JSON.
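For readers who want to stay in Java, as in the question, the same loop-over-every-row idea can be sketched with the standard library alone. Everything named here (CountryRowParser, parse, the map keys) is hypothetical, and the patterns assume the exact markup shown in the question: lowercase tags, double-quoted attributes, and a td-* class as the only attribute on each cell. Any deviation from that layout would need a real HTML parser.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: one map per table row, keyed by the part of the
// cell's class name that follows "td-" (flag, country, capital, ...).
class CountryRowParser {
    private static final Pattern ROW =
            Pattern.compile("<tr\\b[^>]*>(.*?)</tr>", Pattern.DOTALL);
    private static final Pattern CELL =
            Pattern.compile("<td class=\"td-([\\w-]+)\">(.*?)</td>", Pattern.DOTALL);
    private static final Pattern IMG_SRC =
            Pattern.compile("<img\\b[^>]*\\bsrc=\"([^\"]*)\"");
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    static List<Map<String, String>> parse(String html) {
        List<Map<String, String>> countries = new ArrayList<>();
        Matcher row = ROW.matcher(html);
        while (row.find()) {
            Map<String, String> fields = new LinkedHashMap<>();
            Matcher cell = CELL.matcher(row.group(1));
            while (cell.find()) {
                String name = cell.group(1);
                String inner = cell.group(2);
                if (name.equals("flag")) {
                    // For the flag cell keep the image URL, not the text
                    Matcher src = IMG_SRC.matcher(inner);
                    if (src.find()) {
                        fields.put(name, src.group(1));
                    }
                } else {
                    // Drop nested tags and turn &nbsp; into a plain space
                    String text = TAG.matcher(inner).replaceAll("");
                    fields.put(name, text.replace("&nbsp;", " ").trim());
                }
            }
            if (!fields.isEmpty()) {
                countries.add(fields);
            }
        }
        return countries;
    }

    public static void main(String[] args) {
        String html = "<tr>\n"
                + "    <td class=\"td-flag\"><a href=\"/afghanistan\"><img alt=\"Flag of Afghanistan\""
                + "  src=\"//flags.fmcdn.net/data/flags/mini/af.png\" width=\"30\" height=\"20\" /></a></td>\n"
                + "    <td class=\"td-country\"><a href=\"/afghanistan\">Afghanistan</a></td>\n"
                + "    <td class=\"td-capital\">Kabul</td>\n"
                + "    <td class=\"td-population\">25,500,100</td>\n"
                + "</tr>";
        for (Map<String, String> country : parse(html)) {
            System.out.println(country);
        }
    }
}
```

The resulting list of maps can then be handed to whatever JSON library you already use.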
