Parse HTML table without IDs or CSS selectors in Node.js

This data is from an old system and the output is as is. We cannot add CSS selectors or IDs. Most of the examples online for Node.js parsing involve tables, rows, and cells that have some ID or CSS class, but so far I haven't run into anything that can help parse the page below. This includes examples for JSDOM (AFAIK).

What I would like is to extract each of the rows into [fileName, link, size, dateTime] tuples on which I can then run some queries, like what was the latest timestamp in the group, etc., and then extract the filename and link. I was thinking of using YQL. The alternating table row attributes also make it a bit challenging. I'm new to Node.js, so some of the terminology might be wrong. Any help will be appreciated.

Thanks.

<html>
<body>
    <table width="100%" cellspacing="0" cellpadding="5" align="center">
        <tr> 
        <td align="left"><font size="+1"><strong>Filename</strong></font></td>
        <td align="center"><font size="+1"><strong>Size</strong></font></td>
        <td align="right"><font size="+1"><strong>Last Modified</strong></font></td>
        </tr>
        <tr>
        <td align="left">&nbsp;&nbsp;
        <a href="/path_to_file.csv"><tt>file1.csv</tt></a></td>
        <td align="right"><tt>86.6 kb</tt></td>
        <td align="right"><tt>Fri, 21 Mar 2014 21:00:19 GMT</tt></td>
        </tr>
        <tr bgcolor="#eeeeee">
        <td align="left">&nbsp;&nbsp;
        <a href="/path_to_file.csv"><tt>file2.csv</tt></a></td>
        <td align="right"><tt>20.7 kb</tt></td>
        <td align="right"><tt>Fri, 21 Mar 2014 21:00:19 GMT</tt></td>
        </tr>
        <tr>
        <td align="left">&nbsp;&nbsp;
        <a href="/path_to_file.xml"><tt>file1.xml</tt></a></td>
        <td align="right"><tt>266.5 kb</tt></td>
        <td align="right"><tt>Fri, 21 Mar 2014 21:00:19 GMT</tt></td>
        </tr>
        <tr bgcolor="#eeeeee">
        <td align="left">&nbsp;&nbsp;
        <a href="/path_to_file.xml"><tt>file2.xml</tt></a></td>
        <td align="right"><tt>27.2 kb</tt></td>
        <td align="right"><tt>Fri, 21 Mar 2014 21:00:19 GMT</tt></td>
        </tr>
    </table>
</body>
</html>

Answer (thanks @Enragedmrt):

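    var cheerio = require('cheerio');

    var html = '';
    // 'data' can fire several times with partial chunks, so buffer them first
    res.on('data', function(chunk) {
        html += chunk.toString();
    });

    // Parse once the whole response has arrived
    res.on('end', function() {
        var $ = cheerio.load(html);
        var rows = [];
        $('tr').each(function(i, tr) {
            if (i === 0) return; // skip the header row

            var children = $(tr).children();
            var fileItem = children.eq(0);
            // the <a> is the first child element of the first cell
            var linkItem = fileItem.children().eq(0);
            var lastModifiedItem = children.eq(2);

            var row = {
                "Filename": fileItem.text().trim(),
                "Link": linkItem.attr("href"),
                "LastModified": lastModifiedItem.text().trim()
            };
            rows.push(row);
            console.log(row);
        });
    });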
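
Once the rows are collected, the "latest timestamp in the group" query from the question can be done in plain JavaScript, without YQL. The following is only a sketch under assumptions: the sample rows are hypothetical (one timestamp is altered so the groups differ), `latestPerGroup` is an illustrative helper name, and grouping by file extension is a guess at what "group" means.

```javascript
// Hypothetical sample rows in the same shape as the `row` objects built above.
// The second timestamp is altered for illustration so the groups differ.
const rows = [
  { Filename: 'file1.csv', Link: '/path_to_file.csv', LastModified: 'Fri, 21 Mar 2014 21:00:19 GMT' },
  { Filename: 'file2.csv', Link: '/path_to_file.csv', LastModified: 'Sat, 22 Mar 2014 08:15:00 GMT' },
  { Filename: 'file1.xml', Link: '/path_to_file.xml', LastModified: 'Fri, 21 Mar 2014 21:00:19 GMT' },
];

// Group by file extension (an assumed grouping) and keep the newest row per group.
function latestPerGroup(items) {
  const latest = new Map();
  for (const item of items) {
    const dot = item.Filename.lastIndexOf('.');
    const group = dot === -1 ? '' : item.Filename.slice(dot + 1);
    const time = Date.parse(item.LastModified); // NaN if the string is not parseable
    if (Number.isNaN(time)) continue;           // ignore rows without a usable date
    const current = latest.get(group);
    if (!current || time > current.time) {
      latest.set(group, { time, row: item });
    }
  }
  return latest;
}

const result = latestPerGroup(rows);
for (const [group, { row }] of result) {
  console.log(group, row.Filename, row.Link, row.LastModified);
}
```

Comparing the numeric values from `Date.parse` avoids comparing date strings lexically, which would sort by weekday name rather than by time.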

I would suggest using Cheerio over JSDOM as it's significantly faster and more lightweight. That said, you'll need a for-each loop that grabs the 'tr' elements and then their 'td' elements. Here's a rough example (my Node.js/Cheerio is rusty, but if you dig around in the jQuery docs you can find some decent examples):

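var data = [];
$('tr').each(function(i, tr){
    var children = $(this).children();
    // children[n] is a raw node without .text(); .eq(n) returns a wrapped element
    var row = {
        "Filename": children.eq(0).text().trim(),
        "Size": children.eq(1).text().trim(),
        "Last Modified": children.eq(2).text().trim()
    };
    data.push(row);
});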

I don't know JSDom, but it sounds like it can parse an HTML document into a DOM (Document Object Model). From there it should be entirely possible to loop through the nodes and recognise them by tag name, attributes, or position in the document, even if they don't have IDs.

Googling for 5 seconds, please hold on ...

JSDom's documentation on GitHub seems to confirm this. It shows jQuery-like selectors, like window.$("a.the-link").text(). So instead of relying on a class, you can use selectors like td, th, or probably even td[align="left"]. Using selectors like that, plus convenient methods like .first and .each to traverse multiple results (such as every row), you should be able to parse the document just fine, although it will of course be a bit more cumbersome than having convenient class names for every kind of cell.

I still don't think I'm a JSDom expert, but reading their project's main page for a couple of minutes already shows all the answers to your questions, and much more.

JSFiddle

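// Runs in a browser, where `document` is the loaded page.
var rawData = [];
var rows = document.getElementsByTagName('tr');
// Start at 1 to skip the header row.
for (var cnt = 1; cnt < rows.length; cnt++) {
    // Every data cell wraps its text in <tt>, so each row becomes [filename, size, lastModified].
    var cells = rows[cnt].getElementsByTagName('tt');
    var row = [];
    for (var count = 0; count < cells.length; count++) {
        row.push(cells[count].textContent.trim());
    }
    rawData.push(row);
}

console.log(rawData);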
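
The rows produced this way are plain strings, so sorting or comparing them by size or date needs a conversion step first. Below is a minimal sketch of that step, assuming every size looks like "<number> kb" as in the sample page. The input array is hypothetical sample data, and `toRecord` is an illustrative helper name, not part of any library.

```javascript
// Hypothetical input: rows shaped like the `rawData` entries above,
// i.e. [filename, size, lastModified] strings taken from the <tt> cells.
const rawData = [
  ['file1.csv', '86.6 kb', 'Fri, 21 Mar 2014 21:00:19 GMT'],
  ['file2.xml', '27.2 kb', 'Fri, 21 Mar 2014 21:00:19 GMT'],
];

// Turn one row of strings into a typed record.
// Assumes sizes always look like "<number> kb", as in the sample page.
function toRecord(cells) {
  const sizeMatch = /^([\d.]+)\s*kb$/i.exec(cells[1]);
  return {
    fileName: cells[0],
    sizeKb: sizeMatch ? parseFloat(sizeMatch[1]) : null, // null when the format differs
    lastModified: new Date(cells[2]),
  };
}

const records = rawData.map(toRecord);
console.log(records);
```

With `lastModified` as a `Date` and `sizeKb` as a number, the records can be sorted or filtered with ordinary array methods.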

Additional way

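var cheerio = require('cheerio'),
    cheerioTableparser = require('cheerio-tableparser');

var html = '';
// Buffer the chunks; 'data' can fire more than once per response.
res.on('data', function(chunk) {
    html += chunk.toString();
});

res.on('end', function() {
    var $ = cheerio.load(html);
    cheerioTableparser($);
    var rows = [];
    // parsetable returns an array of columns; with text mode off, cells keep their HTML.
    var columns = $("table").parsetable(false, false, false);
    columns[0].forEach(function(cellHtml, i) {
        if (i === 0) return; // skip the header row

        // Wrap the cell HTML so it can be queried like any other element.
        var firstColumnCell = $("<div>" + cellHtml + "</div>");
        var fileItem = firstColumnCell.text().trim();
        var linkItem = firstColumnCell.find("a").attr("href");
        var lastModifiedItem = $("<div>" + columns[2][i] + "</div>").text().trim();

        var row = {
            "Filename": fileItem,
            "Link": linkItem,
            "LastModified": lastModifiedItem
        };

        rows.push(row);
        console.log(row);
    });
});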

