Extract a string from HTML with NodeJS

Here is the HTML:

<iframe width="100%" height="166" scrolling="no" frameborder="no" 
src="http://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F11111111&amp;auto_play=false
&amp;show_artwork=true&amp;color=c3000d&amp;show_comments=false&amp;liking=false
&amp;download=false&amp;show_user=false&amp;show_playcount=false"></iframe>

I'm using NodeJS. I'm trying to extract the track ID, in this case the 11111111 that follows tracks%2F. What is the most stable method for doing this?

Should I use a regex or a JS string method such as substring() or match()?

If you know tracks%2F is only going to show up once, you could do:

var your_track_ID = src.split(/tracks%2F/)[1].split(/&amp/)[0];

There are probably better ways, but that should work fine for your purposes.
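As a slightly more defensive take on the same string-method idea, here is a minimal sketch that returns null instead of throwing when the marker is missing. The helper name trackIdAfterMarker and the shortened sample HTML are made up for illustration; they are not part of the original answer.

```javascript
// Hypothetical helper: returns the run of digits right after "tracks%2F",
// or null when the marker (or the digits) cannot be found.
function trackIdAfterMarker(src) {
  var marker = 'tracks%2F';
  var start = src.indexOf(marker);
  if (start === -1) return null;            // marker not present
  var rest = src.slice(start + marker.length);
  var end = rest.search(/\D/);              // index of the first non-digit
  var id = end === -1 ? rest : rest.slice(0, end);
  return id || null;                        // empty string means no digits
}

// Shortened version of the iframe from the question (an assumption).
var html = '<iframe src="http://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F11111111&amp;auto_play=false"></iframe>';

console.log(trackIdAfterMarker(html));          // 11111111
console.log(trackIdAfterMarker('<p>none</p>')); // null
```

Like the one-liner, this only looks at the first occurrence of tracks%2F.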

It's generally a terrible idea to parse HTML with a regular expression, but this might be forgivable. I'd look for the complete URL for safety:

var pattern = /w\.soundcloud\.com.*tracks%2F(\d+)&/
  , trackID = (html.match(pattern) || [])[1]
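To show how that pattern behaves on matching and non-matching input, here is a small sketch that wraps it in a function. The function name findTrackId and both sample strings are assumptions added for illustration.

```javascript
// Same pattern as above, wrapped in a hypothetical helper that returns
// null when the HTML does not contain a SoundCloud widget URL.
function findTrackId(html) {
  var pattern = /w\.soundcloud\.com.*tracks%2F(\d+)&/;
  var match = html.match(pattern);
  return match ? match[1] : null;
}

// Shortened sample inputs (assumptions, not from the original post).
var embed = '<iframe src="http://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F11111111&amp;auto_play=false"></iframe>';
var other = '<iframe src="http://example.com/?url=tracks%2F22222222&amp;x=1"></iframe>';

console.log(findTrackId(embed)); // 11111111
console.log(findTrackId(other)); // null (host does not match)
```

Note that the pattern expects an & right after the digits, and that . does not match newlines, so the host and the track ID have to sit on the same line of the HTML.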

You can find the track with the Node modules url, jsdom, and qs.

Try this:

var jsdom = require('jsdom');
var url = require('url');
var qs = require('qs');

var str = '<iframe width="100%" height="166" scrolling="no" frameborder="no" '
  + 'src="http://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F11111111&amp;auto_play=false'
  + '&amp;show_artwork=true&amp;color=c3000d&amp;show_comments=false&amp;liking=false'
  + '&amp;download=false&amp;show_user=false&amp;show_playcount=false"></iframe>';

jsdom.env({
  html: str,
  scripts: [
    'http://code.jquery.com/jquery-1.5.min.js'
  ],
  done: function(errors, window) {
    var $ = window.$;
    var src = $('iframe').attr('src');
    var aRes = qs.parse(decodeURIComponent(url.parse(src).query)).url.split('/');
    var track_id = aRes[aRes.length-1];

    console.log("track_id =", track_id);
  }
});

The result is:

track_id = 11111111

Update for 2019...

This builds on blueiur's answer and walks through a solution in more detail. JSDOM needs to be installed before you can use it:

npm install jsdom

Now, according to the documentation, you can import the JSDOM class like this:

const jsdom = require('jsdom');
const { JSDOM } = jsdom;

You've already got some HTML you want to parse. I'll use your example and define it as a template literal:

const data = `<iframe width="100%" height="166" scrolling="no" frameborder="no" 
src="http://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F11111111&amp;auto_play=false
&amp;show_artwork=true&amp;color=c3000d&amp;show_comments=false&amp;liking=false
&amp;download=false&amp;show_user=false&amp;show_playcount=false"></iframe>`;

Here's the fun part... parse the HTML in NodeJS:

const { document } = (new JSDOM(data)).window;

What's happening here? You're creating a new JSDOM object from the provided HTML and grabbing the document property of its window property. From this point on, you can use document.getElementsByTagName() and other similar functions just like you would in a browser.

To continue with your specific example, you want to extract the src attribute of the only iframe in the document. There are multiple ways to do that. One is to use getElementsByTagName to pull the first iframe like this:

const src1 = document.getElementsByTagName('iframe')[0].src;

Now that we have the src attribute, we can split it apart and process the url query value. This is where we will use the URL class which comes with NodeJS. According to the documentation, we can get the search parameters by creating a URL object and accessing its searchParams property like this:

const params = (new URL(src1)).searchParams;

Now you've got the query string as a URLSearchParams object, and you can access individual parameters like this:

const scURL = params.get('url');

If you look at the contents of scURL now, you'll find it is the embedded URL which was passed as a query parameter, so we can parse that with another URL object and extract the pathname property like this:

const src2 = (new URL(scURL)).pathname;

We're getting close now, and can split the path apart to get the value you wanted using JavaScript's standard string functions:

const val = src2.split('/')[2];

And print the result:

console.log(val);

... which produces this output:

11111111

To summarize, here is the complete code:

const jsdom = require('jsdom');
const { JSDOM } = jsdom;

const data = `<iframe width="100%" height="166" scrolling="no" frameborder="no" 
src="http://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F11111111&amp;auto_play=false
&amp;show_artwork=true&amp;color=c3000d&amp;show_comments=false&amp;liking=false
&amp;download=false&amp;show_user=false&amp;show_playcount=false"></iframe>`;

const { document } = (new JSDOM(data)).window;

const src1 = document.getElementsByTagName('iframe')[0].src;

const params = (new URL(src1)).searchParams;

const scURL = params.get('url');

const src2 = (new URL(scURL)).pathname;

const val = src2.split('/')[2];

console.log(val);

Feel free to consolidate that and eliminate intermediate values as desired.
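One possible consolidation of the URL-handling steps, as a rough sketch: it assumes you already have the iframe's src value (for example from the DOM, where the HTML parser has already turned &amp;amp; into &amp;) and uses only the URL class built into NodeJS. The helper name trackIdFromSrc and the shortened src string are made up for illustration.

```javascript
// Hypothetical helper: takes the iframe's src value and returns the last
// path segment of the URL embedded in its "url" query parameter.
// new URL() throws a TypeError if either string is not a valid absolute URL.
function trackIdFromSrc(src) {
  const embedded = new URL(src).searchParams.get('url');
  if (!embedded) return null; // no "url" query parameter
  const segments = new URL(embedded).pathname.split('/').filter(Boolean);
  return segments.length ? segments[segments.length - 1] : null;
}

// Shortened src value with entities already decoded (an assumption).
const src = 'http://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F11111111&auto_play=false';

console.log(trackIdFromSrc(src)); // 11111111
```

Taking the last path segment rather than a fixed index keeps the helper working if the embedded path gains a prefix, at the cost of assuming the ID is always last.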

If the track ID is always 8 digits and the HTML doesn't change, you can do this:

var trackId = html.match(/\d{8}/)[0];
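A plain \d{8} also matches the first eight digits of a longer number, so here is a hedged sketch that only accepts a run of exactly eight digits and returns null when there is none. The helper name eightDigitId and the sample strings are assumptions added for illustration.

```javascript
// Hypothetical helper: finds the first run of exactly 8 digits.
// (?:^|\D) requires a non-digit (or the start) before the run,
// (?!\d) requires that no digit follows it.
function eightDigitId(html) {
  var match = html.match(/(?:^|\D)(\d{8})(?!\d)/);
  return match ? match[1] : null;
}

console.log(eightDigitId('api.soundcloud.com%2Ftracks%2F11111111&amp;auto_play=false')); // 11111111
console.log(eightDigitId('tracks%2F123456789&amp;auto_play=false'));                     // null
```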

The Right™ way to do this is to parse the HTML using some XML parser, get the URL that way, and then use a reg-exp to parse the URL.

If for some reason you don't have an infinite amount of time and energy, one of the proposed pure reg-exp solutions would work.
