
Parsing an XML file from a URL and looping to get all URLs in it using node.js

I'm using the Node module xml2js. My XML file has the following form:

<?xml version="1.0" encoding="UTF-8" ?>
<?xml-stylesheet type="text/xsl"?>
<?xml-stylesheet type="text/css" media="screen" href="some url" ?>
<rss xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" version="2.0">
    <channel>
        <item>
            <pubDate>Fri, 19 Sep 2014 18:00:08 GMT</pubDate>
            <guid isPermaLink="false">http://www.example0.com</guid>
        </item>
        <item>
            <pubDate>Fri, 19 Sep 2014 17:52:25 GMT</pubDate>
            <guid isPermaLink="false">http://www.example1.com</guid>
        </item>
    </channel>
</rss>

I want to get all the URLs under <item><guid isPermaLink="false"> as an array.

I'm trying out the code below, but it only reads a locally stored XML file, whereas my file is at a URL. Also, I'm unable to get the URLs out of the parsed result:

var fs = require('fs'),
    xml2js = require('xml2js');

var parser = new xml2js.Parser();
parser.addListener('end', function(result) {
    console.dir(result);
    console.log('Done.');
});
fs.readFile(__dirname + '/foo.xml', function(err, data) {
    parser.parseString(data);
});

You can use the sax-js module to extract the URLs you need. The module you mentioned, xml2js, uses sax-js internally.

Here is the code (a rough cut):

'use strict';

var sax = require('sax');
var fs = require('fs');

var filePath = __dirname + '/' + 'foo.xml';
var isTextPending = false;

var saxStream = sax.createStream(true);
saxStream.on('error', function (e) {
  console.error(e);
});

// Print the text that follows a matching <guid> opening tag.
saxStream.on('text', function (text) {
  if (isTextPending) {
    console.log(text);
    isTextPending = false;
  }
});

saxStream.on('opentag', function (node) {
  if(node.name === 'guid' && node.attributes.isPermaLink === 'false') {
    isTextPending = true;
  }
});

fs.createReadStream(filePath)
  .pipe(saxStream);

And the output is:

http://www.example0.com
http://www.example1.com
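
Since the question asks for the URLs as an array rather than as printed output, the same handlers can push into an array instead. The following is a minimal sketch and not part of the original answer: collectGuids is a hypothetical helper name, and it assumes it is handed a stream created with sax.createStream(true), or any emitter with the same opentag, text, closetag, end and error events. It accumulates text until the closing tag in case a text node arrives in more than one piece.

```javascript
'use strict';

// Hypothetical helper (not from the original answer): collects the text of
// every <guid isPermaLink="false"> element and calls done(err, urls) once.
// Assumes `saxStream` comes from sax.createStream(true).
function collectGuids(saxStream, done) {
  var urls = [];
  var insideGuid = false;
  var current = '';
  var finished = false;

  function finish(err) {
    if (finished) { return; } // report the outcome only once
    finished = true;
    done(err, err ? undefined : urls);
  }

  saxStream.on('opentag', function (node) {
    if (node.name === 'guid' && node.attributes.isPermaLink === 'false') {
      insideGuid = true;
      current = '';
    }
  });

  saxStream.on('text', function (text) {
    if (insideGuid) { current += text; }
  });

  saxStream.on('closetag', function (name) {
    if (insideGuid && name === 'guid') {
      urls.push(current.trim());
      insideGuid = false;
    }
  });

  saxStream.on('error', finish);
  saxStream.on('end', function () { finish(null); });
}

// Usage (requires the sax module; foo.xml as in the question):
//   var saxStream = require('sax').createStream(true);
//   collectGuids(saxStream, function (err, urls) {
//     if (err) { return console.error(err); }
//     console.log(urls);
//   });
//   require('fs').createReadStream(__dirname + '/foo.xml').pipe(saxStream);
```

The callback style is chosen only to match the rest of the code here; wrapping the helper in a Promise would work equally well.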

UPD:

To fetch the XML from the internet and process it, use the request module and pipe the response into the same saxStream:

var request = require('request');

var href = 'http://SOME_URL.xml';

request(href)
  .pipe(saxStream);
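
The request module has since been deprecated by its maintainers, so the same piping can also be done with Node's built-in http and https modules. This is a rough sketch under stated assumptions rather than a drop-in replacement: pipeUrlInto is a hypothetical helper name, the destination is assumed to be a writable stream such as the saxStream above, and redirects are not followed.

```javascript
'use strict';

var http = require('http');
var https = require('https');

// Hypothetical helper: streams the response body at `href` into `destination`
// (for example a sax stream) and calls done(err) once the body has been read
// or something went wrong. Does not follow redirects.
function pipeUrlInto(href, destination, done) {
  var client = href.indexOf('https:') === 0 ? https : http;
  var finished = false;

  function finish(err) {
    if (finished) { return; } // report the outcome only once
    finished = true;
    done(err);
  }

  client.get(href, function (res) {
    if (res.statusCode !== 200) {
      res.resume(); // discard the body so the socket is released
      finish(new Error('Unexpected status code: ' + res.statusCode));
      return;
    }
    res.pipe(destination);
    res.on('error', finish);
    res.on('end', function () { finish(null); });
  }).on('error', finish);
}

// Usage, with saxStream set up as above and the placeholder URL kept as-is:
//   pipeUrlInto('http://SOME_URL.xml', saxStream, function (err) {
//     if (err) { console.error(err); }
//   });
```

Checking the status code before piping keeps an HTML error page from being fed to the XML parser.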
