
Scraping a web page based on fonts and font-size

HTML text scraping is doable with various libraries that can be found on the web. I am trying to parse the biggest heading (the title) of a web page, and only that, from various HTML pages.

I am trying to automatically detect the main title of the item on several hundred pages (each can be a product page, an article page, etc.). It would be great if there were a way to base my parsing decision on the font and font size of the text in the web page. Since the main title is almost always the text with the biggest font on the page, this information can give me a lot of insight into where to find the title.

So the question is: is there any way this can be accomplished?

I suppose you could do it like this, but it is a very resource-intensive task because you iterate over every HTML element in the body.

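// Track the text and font size of the largest element seen so far.
var text,
    size = 0;

$("body, body *").each(function() {
    // css("fontSize") returns a computed value such as "24px";
    // parseFloat keeps fractional sizes and yields NaN for odd values.
    var f_size = parseFloat($(this).css("fontSize"));
    // NaN never compares greater, so unparsable sizes are skipped.
    if (size < f_size) {
        text = $(this).text().trim();
        size = f_size;
    }
});

// Text of the element with the largest font size, and that size in px.
console.log(text + " (" + size + "px)");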
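One caveat with the snippet above: `.text()` returns the text of an element together with all of its descendants, so a wrapper that merely declares a large font size can win with the text of the whole page. A minimal sketch of one way around this, without jQuery, is to walk the text nodes and size each one by its parent element. The names `pickLargest` and `collectCandidates` are hypothetical helpers introduced here for illustration, and the sketch assumes it runs in a browser (or a headless browser) where computed styles are available.

```javascript
// Pure helper: given [{ text, fontSize }], return the entry with the largest
// font size. Blank text is ignored; ties go to the earliest entry.
function pickLargest(candidates) {
  let best = null;
  for (const c of candidates) {
    const text = c.text.trim();
    if (text === "" || !Number.isFinite(c.fontSize)) continue;
    if (best === null || c.fontSize > best.fontSize) {
      best = { text, fontSize: c.fontSize };
    }
  }
  return best;
}

// Browser-only collector (assumption: a DOM is present). Emits one candidate
// per text node, sized by the computed font size of its parent element.
function collectCandidates(root) {
  const walker = document.createTreeWalker(root, NodeFilter.SHOW_TEXT);
  const candidates = [];
  let node;
  while ((node = walker.nextNode())) {
    const parent = node.parentElement;
    // Skip text that is never rendered as page content.
    if (!parent || /^(SCRIPT|STYLE|NOSCRIPT)$/.test(parent.tagName)) continue;
    const fontSize = parseFloat(getComputedStyle(parent).fontSize);
    candidates.push({ text: node.nodeValue, fontSize });
  }
  return candidates;
}

// Only run the DOM part when a document actually exists.
if (typeof document !== "undefined") {
  console.log(pickLargest(collectCandidates(document.body)));
}
```

Because each text node is its own candidate, a title split across inline elements (for example, part of it inside an `<em>`) comes back as separate pieces; merging neighbouring nodes that share a font size would be a natural next step. The walk still touches every text node, so it is not cheaper than the jQuery version, only more precise about which text belongs to which size.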
