Split string into words in JavaScript
At the moment I am working on text that is broken into floating columns to display it in a magazine-like way.
I asked in a previous question how to split the text into sentences, and it works like a charm:
sentences = text.replace(/\.\s+/g,'.|').replace(/\?\s/g,'?|').replace(/\!\s/g,'!|').split("|");
Now I want to go a step further and split it into words. But I also have some elements in it that should not be split, such as subheadlines.
An example text would be:
A wonderful serenity has taken possession of my entire soul. <strong>This is a subheadline</strong><br><br>I am alone, and feel the charm of existence in this spot.
My desired result would look like the following:
Array [
"A",
"wonderful",
"serenity",
"has",
"taken",
"possession",
"of",
"my",
"entire",
"soul.",
"<strong>This is a subheadline</strong>",
"<br>",
"<br>",
"I",
"am",
"alone,",
"and",
"feel",
"the",
"charm",
"of",
"existence",
"in",
"this",
"spot."
]
When I split at all whitespace I do get the words, but the "<br>" is not added as a separate array entry. I also don't want to split the subheadline and its markup.
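To make the problem concrete, here is a small sketch of the naive whitespace split applied to the example text above. The variable name `naive` is only for illustration.

```javascript
// Naive whitespace split on the example text from the question.
var text = 'A wonderful serenity has taken possession of my entire soul. ' +
  '<strong>This is a subheadline</strong><br><br>I am alone, and feel the charm of existence in this spot.';

var naive = text.split(/\s+/);

// The subheadline is torn apart and the <br> tags stay glued to their neighbours:
// naive[10] is '<strong>This'
// naive[13] is 'subheadline</strong><br><br>I'
console.log(naive.slice(10, 14));
```

The split has no notion of markup, so any tag that contains a space or touches a word ends up inside the wrong entry.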
The reason I want to do this is that I add sequence after sequence to a p tag, and when its height gets bigger than the surrounding element I remove the last added sequence and create a new floating p tag. When I split the text into sentences, I saw that the break points were too coarse to ensure a good reading flow.
An example of what I am trying to achieve can be seen here
If you need any further information I will be glad to give it to you.
Thanks in advance,
Tobias
EDIT
The string could contain more HTML tags in the future. Is there a way to not touch anything between these tags?
EDIT 2
I created a jsfiddle: http://jsfiddle.net/m9r9q/1/
EDIT 3
Would it be a good idea to remove all HTML tags together with the text they enclose and replace them with placeholders? Then split the string into words and put the untouched HTML tags back when a placeholder is reached? What would be the regex to extract all HTML tags?
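The placeholder idea from EDIT 3 could be sketched roughly as follows. This is only an illustration under several assumptions: the function name `splitKeepingHtml` and the placeholder format are made up for this example, the text is assumed not to contain NUL characters, and the regex only handles elements that are not nested inside an element of the same name. A regex is not a general HTML parser.

```javascript
// Sketch of the placeholder approach: mask HTML chunks, split on whitespace,
// then restore the masked chunks. All names here are hypothetical.
function splitKeepingHtml(text) {
  var saved = [];
  // Either a paired element such as <strong>...</strong> (matched through the
  // \1 backreference) or a single standalone tag such as <br>.
  var tagPattern = /<([a-z][a-z0-9]*)\b[^>]*>[\s\S]*?<\/\1\s*>|<[^>]+>/gi;

  // Replace each HTML chunk with a space-padded placeholder (NUL + index + NUL).
  var masked = text.replace(tagPattern, function (html) {
    saved.push(html);
    return ' \u0000' + (saved.length - 1) + '\u0000 ';
  });

  // Split into words, drop empty entries, and swap the placeholders back.
  return masked.trim().split(/\s+/).filter(Boolean).map(function (token) {
    var m = /^\u0000(\d+)\u0000$/.exec(token);
    return m ? saved[Number(m[1])] : token;
  });
}

var exampleText = 'A wonderful serenity has taken possession of my entire soul. ' +
  '<strong>This is a subheadline</strong><br><br>I am alone, and feel the charm of existence in this spot.';
console.log(splitKeepingHtml(exampleText));
```

Because the placeholders are padded with spaces, each tag becomes its own array entry even when the markup touches a word directly, as in `<br>I`.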
As I stated before in a comment, you shouldn't do this. But if you insist, here's a possible answer:
var text = 'A wonderful serenity has taken possession of my entire soul. <strong>This is a subheadline</strong><br><br>I am alone, and feel the charm of existence in this spot.';
var array = [],
    tagOpened = false,
    stringBuilder = [];
// Each match is either a tag (its name is captured in `tag`) or a word, plus trailing whitespace.
text.replace(/(<([^\s>]*)[^>]*>|\b[^\s<]*)\s*/g, function(all, word, tag) {
    if (tag) {
        var closing = tag[0] == '/';
        if (closing) {
            // Closing tag: emit everything collected since the opening tag as one entry.
            stringBuilder.push(all);
            word = stringBuilder.join('');
            stringBuilder = [];
            tagOpened = false;
        } else {
            // <br> has no closing tag, so it must not start a collected block.
            tagOpened = tag.toLowerCase() != 'br';
        }
    }
    if (tagOpened) {
        stringBuilder.push(all);
    } else {
        array.push(word);
    }
    return '';
});
// Flush whatever is left if a tag was never closed.
if (stringBuilder.length) array.push(stringBuilder.join(''));
It doesn't support nested tags. You can add this functionality by implementing a stack for your opened tags.
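A minimal sketch of the stack idea might look like this. It is not the answer's code: the function name `splitWithNestedTags`, the `VOID_TAGS` list, and the token regex are assumptions made for illustration, and malformed markup is handled only loosely.

```javascript
// Tags that never have a closing tag (hard-coded list, assumed for this sketch).
var VOID_TAGS = ['area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
                 'link', 'meta', 'source', 'track', 'wbr'];

function splitWithNestedTags(text) {
  var result = [];
  var stack = [];   // names of the currently open tags
  var buffer = '';  // markup collected while at least one tag is open
  // A token is a tag, a run of text without '<', or a stray '<'.
  var tokenPattern = /<(\/?)([a-z][a-z0-9]*)\b[^>]*>|[^<]+|</gi;
  var match;

  while ((match = tokenPattern.exec(text)) !== null) {
    var token = match[0];
    var name = match[2] ? match[2].toLowerCase() : null;

    if (name === null) {
      // Plain text: keep it verbatim inside a tag, split it into words outside.
      if (stack.length) buffer += token;
      else result = result.concat(token.split(/\s+/).filter(Boolean));
    } else if (VOID_TAGS.indexOf(name) !== -1) {
      // Void tag: its own entry at top level, part of the block otherwise.
      if (stack.length) buffer += token;
      else result.push(token);
    } else if (match[1] === '/') {
      // Closing tag: pop the stack and emit the block once it is empty.
      buffer += token;
      stack.pop();
      if (!stack.length) { result.push(buffer); buffer = ''; }
    } else {
      // Opening tag: remember it and start or continue collecting.
      stack.push(name);
      buffer += token;
    }
  }
  if (buffer) result.push(buffer); // unclosed tag: keep what was collected
  return result;
}

console.log(splitWithNestedTags('Intro <strong>Bold <em>nested</em> text</strong><br>end'));
```

The stack only tracks nesting depth here; it does not check that each closing tag matches the tag on top of the stack, so badly formed input may be grouped differently than a browser would group it.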
Although I want to try to extract the HTML parts and add them back afterwards, untouched.
Forget about it and about my previous post. I just got the idea that it's much better to use the built-in browser engine to operate on HTML code.
You can just use this:
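var text = 'A wonderful serenity has taken possession of my entire soul. <strong>This is a subheadline</strong><br><br>I am alone, and feel the charm of existence in this spot.';
// Let the browser parse the markup instead of using a regex.
var elem = document.createElement('div');
elem.innerHTML = text;
var array = [];
for (var i = 0, childs = elem.childNodes; i < childs.length; i++) {
    if (childs[i].nodeType === 3 /* document.TEXT_NODE */) {
        // Split text into words; filter(Boolean) drops the empty string
        // that a whitespace-only text node would otherwise produce.
        array = array.concat(childs[i].nodeValue.trim().split(/\s+/).filter(Boolean));
    } else {
        // Elements, including any nested content, are kept whole.
        array.push(childs[i].outerHTML);
    }
}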
It DOES support nested tags this time, and it supports all possible syntax without hard-coded exceptions for non-closable tags :)