Extract a string from HTML with NodeJS
Here is the HTML:
<iframe width="100%" height="166" scrolling="no" frameborder="no"
src="http://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F11111111&auto_play=false
&show_artwork=true&color=c3000d&show_comments=false&liking=false
&download=false&show_user=false&show_playcount=false"></iframe>
I'm using NodeJS. I'm trying to extract the trackID, in this case 11111111, which follows tracks%2F. What is the most stable method for doing this?
Should I use a regex or a JS string method such as substring() or match()?
If you know tracks%2F is only going to show up once, you could do:
var your_track_ID = src.split(/tracks%2F/)[1].split(/&/)[0];
There are probably better ways, but that should work fine for your purposes.
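The one-liner above throws a TypeError when tracks%2F is absent, because the second element of the split result is then undefined. A minimal sketch of a guarded variant follows; the helper name trackIdBySplit and the sample strings are hypothetical and not part of the original answer.

```javascript
// Hypothetical helper: split-based extraction that returns null
// instead of throwing when the "tracks%2F" marker is missing.
function trackIdBySplit(src) {
  const parts = src.split('tracks%2F');
  if (parts.length < 2) {
    return null; // marker not present in the string
  }
  // Everything up to the next "&" (or the end of the string) is the ID.
  return parts[1].split('&')[0];
}

const sampleSrc = 'http://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com'
  + '%2Ftracks%2F11111111&auto_play=false';

console.log(trackIdBySplit(sampleSrc));
console.log(trackIdBySplit('http://example.com/'));
```

Like the original, this still assumes the marker appears only once; it only adds a defined result for the missing-marker case.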
It's generally a terribly bad idea to parse HTML with a regular expression, but this might be forgivable. I'd look for the complete URL for safety:
var pattern = /w\.soundcloud\.com.*tracks%2F(\d+)&/
, trackID = (html.match(pattern) || [])[1]
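Note that the pattern above requires an & directly after the digits, so it would miss a track ID that happens to be the last thing in the attribute. A possible variation, offered only as a sketch: it drops the trailing &, keeps the match inside the quoted attribute with [^"]*, and ignores case so %2f also matches. The function name trackIdByRegex and the sample markup are assumptions for illustration.

```javascript
// Hypothetical helper: regex-based extraction that also works when the
// track ID is the last thing inside the src attribute.
function trackIdByRegex(html) {
  // [^"]* stays inside the quoted attribute value (and spans line breaks);
  // the "i" flag accepts both %2F and %2f.
  const pattern = /w\.soundcloud\.com[^"]*tracks%2F(\d+)/i;
  const match = html.match(pattern);
  return match ? match[1] : null;
}

const withParams = '<iframe src="http://w.soundcloud.com/player/?url=http%3A%2F%2F'
  + 'api.soundcloud.com%2Ftracks%2F11111111&auto_play=false"></iframe>';
const idLast = '<iframe src="http://w.soundcloud.com/player/?url=http%3A%2F%2F'
  + 'api.soundcloud.com%2Ftracks%2F11111111"></iframe>';

console.log(trackIdByRegex(withParams));
console.log(trackIdByRegex(idLast));
console.log(trackIdByRegex('<p>no player here</p>'));
```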
You can find the track with the Node modules url, jsdom and qs.
Try this (it uses jsdom.env(), the API of older jsdom releases):
var jsdom = require('jsdom');
var url = require('url');
var qs = require('qs');
var str = '<iframe width="100%" height="166" scrolling="no" frameborder="no" '
+ 'src="http://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F11111111&auto_play=false'
+ '&show_artwork=true&color=c3000d&show_comments=false&liking=false'
+ '&download=false&show_user=false&show_playcount=false"></iframe>';
jsdom.env({
html: str,
scripts: [
'http://code.jquery.com/jquery-1.5.min.js'
],
done: function(errors, window) {
var $ = window.$;
var src = $('iframe').attr('src');
var aRes = qs.parse(decodeURIComponent(url.parse(src).query)).url.split('/');
var track_id = aRes[aRes.length-1];
console.log("track_id =", track_id);
}
});
The result is:
track_id = 11111111
Update for 2019...
This builds off of blueiur's answer and walks through a solution in more detail.
JSDOM needs to be installed before you can use it:
npm install jsdom
Now, according to the documentation, you can instantiate JSDOM like this:
const jsdom = require('jsdom');
const { JSDOM } = jsdom;
You've already got some HTML you want to parse, so I'll use your example and define it as a template literal:
const data = `<iframe width="100%" height="166" scrolling="no" frameborder="no"
src="http://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F11111111&auto_play=false
&show_artwork=true&color=c3000d&show_comments=false&liking=false
&download=false&show_user=false&show_playcount=false"></iframe>`;
Here's the fun part... parse the HTML in NodeJS:
const { document } = (new JSDOM(data)).window;
What's happening here? You're creating a new JSDOM object with the provided HTML and grabbing the document attribute of its window attribute. From this point on, you can use document.getElementsByTagName() and other similar functions just like you would in a browser.
To continue with your specific example, you want to extract the src attribute of the only iframe in the document. There are multiple ways to do that. One example is to use getElementsByTagName to pull the first iframe like this:
const src1 = document.getElementsByTagName('iframe')[0].src;
Now that we have the src attribute, we can split it apart and process the url query value. This is where we will use the URL class which comes with NodeJS. According to the documentation, we can get the search parameters by creating a URL object and accessing the searchParams attribute like this:
const params = (new URL(src1)).searchParams;
Now you've got the query string as a URLSearchParams object and you can access individual terms like this:
const scURL = params.get('url');
If you look at the contents of scURL now, you'll find it is the embedded URL which was passed as a query value, so we can parse that with another URL object and extract the pathname attribute like this:
const src2 = (new URL(scURL)).pathname;
We're getting close now, and can split the path apart to get the value you wanted using JavaScript's standard string functions:
const val = src2.split('/')[2];
And print the result:
console.log(val);
... which produces this output:
11111111
To summarize, here is the complete code:
const jsdom = require('jsdom');
const { JSDOM } = jsdom;
const data = `<iframe width="100%" height="166" scrolling="no" frameborder="no"
src="http://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F11111111&auto_play=false
&show_artwork=true&color=c3000d&show_comments=false&liking=false
&download=false&show_user=false&show_playcount=false"></iframe>`;
const { document } = (new JSDOM(data)).window;
const src1 = document.getElementsByTagName('iframe')[0].src;
const params = (new URL(src1)).searchParams;
const scURL = params.get('url');
const src2 = (new URL(scURL)).pathname;
const val = src2.split('/')[2];
console.log(val);
Feel free to consolidate that and eliminate intermediate values as desired.
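One possible consolidation of the URL-handling steps, as a sketch rather than a definitive implementation: once you have the iframe's src string (from JSDOM or anywhere else), the rest needs only Node's built-in URL class. The function name trackIdFromSrc is hypothetical, and the /tracks/<digits> path shape is assumed from the example in the question.

```javascript
// Hypothetical helper: given the iframe's src string, return the track ID
// or null. Uses only the URL class that ships with Node.
function trackIdFromSrc(src) {
  try {
    // searchParams.get() returns the already-decoded "url" query value.
    const embedded = new URL(src).searchParams.get('url');
    if (!embedded) {
      return null; // no "url" parameter in the player link
    }
    // Assumed path shape, based on the question's example: /tracks/<digits>
    const match = /^\/tracks\/(\d+)\/?$/.exec(new URL(embedded).pathname);
    return match ? match[1] : null;
  } catch (err) {
    return null; // new URL() throws a TypeError on malformed input
  }
}

const playerSrc = 'http://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com'
  + '%2Ftracks%2F11111111&auto_play=false&show_artwork=true';

console.log(trackIdFromSrc(playerSrc));
console.log(trackIdFromSrc('http://w.soundcloud.com/player/'));
console.log(trackIdFromSrc('not a url'));
```

Matching the pathname against a pattern instead of indexing into split('/') is a design choice here: it returns null for unexpected paths rather than silently returning some other path segment.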
If the track ID is always 8 digits and the HTML doesn't change, you can do this:
var trackId = html.match(/\d{8}/)[0];
The Right™ way to do this is to parse the HTML using some XML parser, get the URL that way, and then use a reg-exp to parse the URL.
If for some reason you don't have an infinite amount of time and energy, one of the proposed purely reg-exp solutions would work.