
Split string by HTML entities?

My string contains a lot of HTML entities, like this:

&quot;Hello&nbsp;&lt;everybody&gt;&nbsp;there&quot;

And I want to split it on the HTML entities into this:

Hello
everybody
there

Can anybody suggest a way to do this, please? Maybe using a regex?

It looks like you can just split on the regex &[^;]*; . That is, the delimiters are strings that start with & , end with ; , and contain anything but ; in between.

If you can have multiple delimiters in a row, and you don't want the empty strings between them, just use (&[^;]*;)+ (or, in general, a (delim)+ pattern).

If you can have delimiters at the beginning or end of the string, and you don't want the empty strings they cause, then just trim them away before you split.


Example

Here's a snippet to demonstrate the above ideas (see also on ideone.com):

var s = "&quot;Hello&nbsp;&lt;everybody&gt;&nbsp;there&quot;";

print (s.split(/&[^;]*;/));
// ,Hello,,everybody,,there,

print (s.split(/(?:&[^;]*;)+/));
// ,Hello,everybody,there,

print (
   s.replace(/^(?:&[^;]*;)+/, "")
    .replace(/(?:&[^;]*;)+$/, "")
    .split(/(?:&[^;]*;)+/)
);
// Hello,everybody,there
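
As an alternative to trimming before the split, the empty strings can also be filtered out afterwards. The following is only a minimal sketch: the helper name splitOnEntities and the sample variable are made up for illustration and are not part of the original answer.

```javascript
// Hypothetical helper (not from the original answer): split on runs of
// entity-like tokens, then drop any empty strings left in the result.
function splitOnEntities(text) {
  return text
    .split(/(?:&[^;]*;)+/) // consecutive "&...;" tokens act as one delimiter
    .filter(function (part) { return part !== ""; });
}

// Assumed sample input, mirroring the string used in the question.
var sample = "&quot;Hello&nbsp;&lt;everybody&gt;&nbsp;there&quot;";
console.log(splitOnEntities(sample)); // logs the three pieces: Hello, everybody, there
```

Unlike the two replace() calls, this also discards an empty string in the middle of the result, which should not occur here because the (?:...)+ pattern already merges adjacent delimiters.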

var a = str.split(/&[#a-z0-9]+;/i); should do it (the i flag also covers entities with uppercase letters, such as &Eacute;), although you'll end up with empty slots in the array when you have two entities next to each other.
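
To make the "empty slots" concrete, here is a small sketch. It assumes the pattern is written without escaping backslashes and with the i flag, and the input string is invented for the example; it contains two numeric entities in a row followed by one named entity.

```javascript
// Assumed input: "&#60;" and "&#x3E;" are adjacent, "&amp;" stands alone.
var str = "a&#60;&#x3E;b&amp;c";

// One entity per delimiter: adjacent entities leave an empty slot between them.
var a = str.split(/&[#a-z0-9]+;/i);
// a is ["a", "", "b", "c"]

// Wrapping the pattern in (?:...)+ treats a run of entities as one delimiter.
var merged = str.split(/(?:&[#a-z0-9]+;)+/i);
// merged is ["a", "b", "c"]
```

The character class [#a-z0-9] is what lets the same pattern cover named entities as well as decimal and hexadecimal numeric ones.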

split(/&.*?;(?=[^&]|$)/)

and cut off the first and last results:

["", "Hello", "everybody", "there", ""]
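
Cutting off the first and last results unconditionally is only safe when the string really begins and ends with an entity. A slightly more defensive sketch is shown here; the helper name dropEdgeEmpties is hypothetical, and it removes an edge element only when that element is empty.

```javascript
// Hypothetical helper: remove a leading and/or trailing empty string, if present.
function dropEdgeEmpties(parts) {
  var start = parts[0] === "" ? 1 : 0;
  var end = parts.length;
  if (end > start && parts[end - 1] === "") {
    end -= 1;
  }
  return parts.slice(start, end);
}

var pattern = /&.*?;(?=[^&]|$)/;

// Entities at both ends: both empty edge strings are removed.
var wrapped = dropEdgeEmpties(
  "&quot;Hello&nbsp;&lt;everybody&gt;&nbsp;there&quot;".split(pattern)
);
// wrapped is ["Hello", "everybody", "there"]

// No entities at the ends: nothing is cut off.
var bare = dropEdgeEmpties("Hello&amp;there".split(pattern));
// bare is ["Hello", "there"]
```
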
>> "&quot;Hello&nbsp;&lt;everybody&gt;&nbsp;there&quot;".split(/(?:&[^;]+;)+/)
['', 'Hello', 'everybody', 'there', '']

The regex is: /(?:&[^;]+;)+/

It matches an entity as & followed by one or more non-; characters, followed by a ; . It then matches one or more of those in a row as the split delimiter. The (?:expression) non-capturing syntax is used so that the delimiters don't get put into the result array ( split() puts capture groups into the result array if they appear in the pattern).
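
The difference between the two group types can be seen in a short sketch; the input string here is an assumed example, not taken from the answer.

```javascript
// Assumed input with two adjacent entities between the words.
var s = "Hello&nbsp;&lt;everybody";

// Non-capturing group: only the pieces between delimiters are returned.
var nonCapturing = s.split(/(?:&[^;]+;)+/);
// nonCapturing is ["Hello", "everybody"]

// Capturing group: split() also inserts what the group captured.
// With the + quantifier, the group holds its last repetition, "&lt;".
var capturing = s.split(/(&[^;]+;)+/);
// capturing is ["Hello", "&lt;", "everybody"]
```

The capturing form can be useful when the delimiters themselves are needed, but for this question the non-capturing form gives the cleaner array.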


 