How to exclude pseudo-elements when collecting text with textContent?

Question

I collect text from an HTML file using the textContent property. I believe that the pseudo element &shy; is copied as well, since I cannot replace words that contain this element. All words that contain &shy; (which is not visible) cannot be replaced with the actual word. I tried to first remove &shy; using .replace(/&shy;/g, ""), but it still does not work.

Example:

I cannot replace "eftersom" using .replace(/eftersom/g, "???"). As said, the element is not visible after collecting the text with .textContent, but it seems to be there.

I tried multiple regular expressions like:

.replace(new RegExp(`(\\W)(${firstWord.replace(/&shy;/gi, "")})(\\W)`, "gi"), "$1???$3")

where firstWord is a variable.

Answer 1

Try this out and see if it works. This should remove all the &shy;s on your page:

console.log(document.body.innerHTML.replace(/\u00AD/g, ''));

This works by searching for the Unicode character U+00AD (the soft hyphen).
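The same idea can be applied to a string collected with textContent rather than to innerHTML. The following is a minimal sketch, assuming the collected text holds the literal U+00AD character (textContent returns decoded text, so the entity no longer appears as "&shy;"). The helper name stripSoftHyphens and the sample string are hypothetical and only for illustration.

```javascript
// Hypothetical helper: removes every soft hyphen (U+00AD) from a string.
function stripSoftHyphens(text) {
  return text.replace(/\u00AD/g, "");
}

// Stand-in for a value read from element.textContent (made-up sample data).
const collected = "abc efter\u00ADsom def";

// Without cleaning, the invisible character splits the word, so nothing matches.
const untouched = collected.replace(/eftersom/g, "???");

// After cleaning, the plain word can be matched and replaced.
const replaced = stripSoftHyphens(collected).replace(/eftersom/g, "???");

console.log(untouched === collected); // true
console.log(replaced); // "abc ??? def"
```

Cleaning the string once, before any word replacement, keeps the later regular expressions simple.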

Answer 2

If the previous answer didn't work, try this one, which covers both the &shy; entity and the decimal form of the soft hyphen (&#173;).

.replace(/&shy;|&#173;/gi, "");

This has been answered before in this question: Remove &shy; (soft hyphen) entity from element
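The entity forms only occur when working with raw HTML source as a string; text read from the DOM contains the decoded character instead. As a rough sketch under that assumption, one pattern could cover the named, decimal, and hexadecimal references together with the literal character. The function name stripSoftHyphenForms and the sample input are hypothetical.

```javascript
// Hypothetical helper: removes the soft hyphen in any of its common spellings
// from a string of raw HTML source.
function stripSoftHyphenForms(source) {
  // &shy;        named character reference
  // &#173;       decimal reference (leading zeros allowed)
  // &#xAD;       hexadecimal reference (either letter case, leading zeros allowed)
  // \u00AD       the literal soft hyphen character
  return source.replace(/&shy;|&#0*173;|&#[xX]0*[aA][dD];|\u00AD/g, "");
}

// Made-up sample containing each spelling once.
const rawSource = "efter&shy;som efter&#173;som efter&#xAD;som efter\u00ADsom";
const cleanedSource = stripSoftHyphenForms(rawSource);

console.log(cleanedSource); // "eftersom eftersom eftersom eftersom"
```

Note that an empty alternative in the pattern (as in `a||b`) would match the empty string before the later alternatives are tried, so each spelling is listed explicitly here.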


Answer 1
0 2023-01-03 16:14:02

Answer 2
0 2023-01-03 16:20:26
