How can you exclude the pseudo element &shy; when collecting text using textContent?

I collect text from an HTML file using the textContent property. I believe that the pseudo element &shy; is copied as well, since I cannot replace words that contain it. No word that contains &shy; (which is not visible) can be replaced with the actual word. I first tried to strip &shy; using .replace(/&shy;/g, ""), but it still does not work.

Example:

I cannot replace "efter&shy;som" using .replace(/eftersom/g, "???"). As said, the element is not visible after collecting it with .textContent, but it seems to be there.

I tried multiple regular expressions like:

.replace(new RegExp(`(\\W)(${firstWord.replace(/&shy;/gi, "")})(\\W)`, "gi"), "$1???$3")

where firstWord is a variable.

Try this out and see if it works. It logs your page's HTML with every &shy; removed:

console.log(document.body.innerHTML.replace(/\u00AD/g, ''));

This works by searching for the Unicode character U+00AD (the soft hyphen), which is the character that &shy; becomes once the HTML has been parsed.
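Applied to the example from the question, the same idea can be sketched on a plain string. This is only a minimal illustration: stripSoftHyphens is a hypothetical helper name, and the sample sentence stands in for a value that would be read from textContent in a browser.

```javascript
// Hypothetical helper: removes every soft hyphen (U+00AD) from a string.
function stripSoftHyphens(text) {
  return text.replace(/\u00AD/g, "");
}

// Stand-in for a value read from element.textContent in a browser.
const collected = "Jag gick hem efter\u00ADsom det regnade.";

// The invisible character sits inside the word, so the plain pattern misses it.
const naive = collected.replace(/eftersom/g, "???");

// Stripping the soft hyphens first lets the same pattern match.
const cleaned = stripSoftHyphens(collected).replace(/eftersom/g, "???");

console.log(naive === collected); // true
console.log(cleaned); // Jag gick hem ??? det regnade.
```

Stripping before matching keeps the search patterns simple, at the cost of losing the hyphenation hints in the resulting text.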

If the previous answer didn't work, try this one, which also covers the named entity (&shy;) and the decimal version of the soft hyphen (&#173;).

.replace(/(\u00AD|&shy;|&#173;)/gi, "");
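The entity spellings only occur in markup that has not been parsed yet. Text read from textContent already contains the decoded U+00AD character. As a rough sketch of the raw-markup case, stripSoftHyphenEntities below is a hypothetical helper and the sample markup is made up for illustration:

```javascript
// Hypothetical helper for raw markup strings: removes the decoded character
// as well as the named and decimal entity spellings mentioned above.
function stripSoftHyphenEntities(markup) {
  return markup.replace(/\u00AD|&shy;|&#173;/gi, "");
}

// Made-up sample containing all three spellings of the soft hyphen.
const rawMarkup = "<p>efter&shy;som, efter&#173;som, efter\u00ADsom</p>";

console.log(stripSoftHyphenEntities(rawMarkup));
// <p>eftersom, eftersom, eftersom</p>
```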

This has been answered before in this question: Remove &shy; (soft hyphen) entity from element
