How do I extract only the bold text from an HTML document?

I need to extract all the bold snippets in the body of an HTML document. I need to do it on the server side using Java, not in the browser.

The text on the page can be bold because of tags, e.g. <b>, <h1>, etc., because of inline CSS styling such as style="font-weight:bold;", or because of external CSS styling using CSS classes.

I am using Jsoup, but I can use any other library as well to get this done.

Thanks for your time!

A plain JavaScript solution: on sufficiently new browsers, you can use getComputedStyle together with the getPropertyValue method to retrieve the computed style of an element. You can traverse the document tree and check all text nodes; text nodes do not have style, so you need to check their parents:

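function consume(string) {
  console.log(string);
}
function traverse(tree) {
  var i, weight;
  // Text nodes (nodeType 3) have no style of their own, so inspect the parent element.
  if (tree.nodeType === 3) {
    weight = getComputedStyle(tree.parentNode).getPropertyValue('font-weight');
    // Browsers may report the computed weight as 'bold' or as the numeric string '700'.
    if (weight === 'bold' || weight === '700') {
      consume(tree.textContent);
    }
  }
  // Recurse into all child nodes.
  for (i = 0; i < tree.childNodes.length; i++) {
    traverse(tree.childNodes[i]);
  }
}
traverse(document.body);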

Replace consume with your own function that processes the bold text.

It seems that the computed value of font-weight is bold even when declared as 700. Some browsers report the numeric string 700 instead, so it is safer to accept both values.

Note that this will only pick up text whose font weight is set specifically to bold (700). Elements with a computed font weight of 600, 800, or 900 will most probably also appear in bold (depending on the availability of typefaces, of course). They could be covered by making an obvious modification to the test.
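One way to make that modification is sketched below. The helper name isBoldWeight and the default threshold of 600 are assumptions chosen for illustration, not part of the original answer; the function only inspects the string that getPropertyValue('font-weight') returns.

```javascript
// Hypothetical helper: decide whether a computed font-weight string counts as bold.
// The default threshold of 600 is an assumption; pass another value to change it.
function isBoldWeight(value, threshold) {
  var min = threshold === undefined ? 600 : threshold;
  var text = String(value).trim().toLowerCase();
  // Some browsers report the keyword instead of a number.
  if (text === 'bold') {
    return true;
  }
  var number = parseFloat(text);
  // Keywords such as 'normal' parse to NaN and are treated as not bold.
  return !isNaN(number) && number >= min;
}
```

In traverse, the comparison against 'bold' could then be replaced by a call such as isBoldWeight(getComputedStyle(tree.parentNode).getPropertyValue('font-weight')).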

You can use getElementsByTagName():

http://www.w3schools.com/jsref/met_doc_getelementsbytagname.asp

querySelectorAll can also be useful:

https://developer.mozilla.org/en-US/docs/DOM/Document.querySelectorAll

Good luck, Daniel

For the tags and inline styles (i.e. styles added directly in the HTML, not contained in an external CSS stylesheet), you could use CSS selectors. For the inline style, the selector would be [style*="font-weight:bold;"].
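The substring selector above only matches that exact spelling, so it would miss variants such as "font-weight: bold" (with a space) or "font-weight:700". As a rough sketch of a more tolerant check, the hypothetical helper below (inlineStyleIsBold is a made-up name) parses the text of a style attribute. It is deliberately naive: it splits on semicolons, treats weights of 600 and above as bold, and ignores !important precedence.

```javascript
// Hypothetical helper: check whether an inline style string declares a bold font weight.
// Simplifications (assumptions): declarations are split on ';', the last font-weight
// declaration wins, and any numeric weight of 600 or more counts as bold.
function inlineStyleIsBold(styleText) {
  var declarations = String(styleText || '').split(';');
  var bold = false;
  for (var i = 0; i < declarations.length; i++) {
    var colon = declarations[i].indexOf(':');
    if (colon === -1) continue; // not a "property: value" pair
    var property = declarations[i].slice(0, colon).trim().toLowerCase();
    if (property !== 'font-weight') continue;
    var value = declarations[i].slice(colon + 1).trim().toLowerCase();
    value = value.replace(/\s*!important$/, '');
    // parseFloat yields NaN for keywords like 'normal', and NaN >= 600 is false.
    bold = value === 'bold' || value === 'bolder' || parseFloat(value) >= 600;
  }
  return bold;
}
```

In a browser it could be applied to each element's style attribute, for example inlineStyleIsBold(element.getAttribute('style')).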

Simply grab the elements by tag name and loop through them:

var elems = document.getElementsByTagName("b");

for (var i = 0; i < elems.length; i++) {
    console.log(elems[i].innerText);
}
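The loop above only covers <b>. A minimal sketch that extends the same idea to several tags at once is shown below, using querySelectorAll. The tag list is an assumption about which elements typical browser default stylesheets render in bold, and collectBoldTagText is a hypothetical name. Like the loop above, it does not see bold text produced by CSS rules.

```javascript
// Tags assumed to be bold under typical browser default stylesheets.
var BOLD_TAGS = ['b', 'strong', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'th'];

// Hypothetical helper: return the text of every element matching BOLD_TAGS.
// `doc` is any object with a querySelectorAll method, e.g. the browser's document.
function collectBoldTagText(doc) {
  var matches = doc.querySelectorAll(BOLD_TAGS.join(', '));
  var texts = [];
  for (var i = 0; i < matches.length; i++) {
    texts.push(matches[i].textContent);
  }
  return texts;
}
```

In a browser this would be called as collectBoldTagText(document). Nested matches, such as a <b> inside an <h1>, are reported once per matching element, so the same text can appear more than once.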
