
Regular expression to identify separators between words

I am trying to separate words in a text. I need to split on anything that lies between them, so I wrote a regular expression that works almost as it should.

Words are alphabetic strings that can contain dashes (-), but they cannot start or end with a dash. Words cannot contain numerals or any character other than single dashes and [a-zA-Z].
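Read literally, this definition corresponds to a whole-token check along the following lines. This is only a sketch: `isWord` is a hypothetical helper name, and it assumes that "single dashes" means two dashes may never appear next to each other.

```javascript
// Hypothetical helper: checks one token against the word definition above.
// Letters only, optionally joined by single dashes, with no dash at either end.
const wordPattern = /^[a-z]+(?:-[a-z]+)*$/i;

function isWord(token) {
  return wordPattern.test(token);
}

console.log(isWord('abc-def')); // true
console.log(isWord('-abc'));    // false (starts with a dash)
console.log(isWord('abc-'));    // false (ends with a dash)
console.log(isWord('w0rd'));    // false (contains a numeral)
```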

This is what I came up with so far:

/(-[^a-zA-Z])|\w*\d\w*|[^a-zA-Z-]+/ig

This, however, does not work correctly when a word is preceded by a dash, as in this case:

123-word

That should match

123-
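The mismatch can be reproduced with a short JavaScript sketch. The variable names (`separatorRe`, `sample`, `found`) are chosen here for illustration only; the expression itself is the one from the question.

```javascript
// The separator expression from the question, unchanged.
const separatorRe = /(-[^a-zA-Z])|\w*\d\w*|[^a-zA-Z-]+/ig;

const sample = '123-word';

// With the g flag, match() returns every full match in the string.
const found = sample.match(separatorRe);

console.log(found); // [ '123' ]
```

The dash is left out because `-[^a-zA-Z]` only accepts a dash that is followed by a non-letter, and the last alternative explicitly excludes dashes.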

Any help on this would be greatly appreciated, thanks!

Update

Sorry, I was rather vague. I need to match what is between the words, not the words themselves, so that I can split the text into an array later on.

This is what the expression above matches so far: [image]

... and this is what it should match: [image]

Notice the difference in matching on the second text line (123-). Sorry for not being specific enough.

You can use this regex:

/(?<=[^\w-]|^)(?!-)([a-z-]+)(?<!-)(?=[^\w-]|$)/gi

Given the following input:

abc-def word A -notword xyz notword-

The above regex will match the following words:

abc-def
word
A
xyz
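As a quick check, the expression can be run in JavaScript as sketched below. This assumes an engine with lookbehind support (ES2018 or later, for example a recent Node.js); the variable names are illustrative only.

```javascript
// Word-matching expression from the answer above, unchanged.
// Requires lookbehind support, i.e. (?<=...) and (?<!...).
const wordRe = /(?<=[^\w-]|^)(?!-)([a-z-]+)(?<!-)(?=[^\w-]|$)/gi;

const input = 'abc-def word A -notword xyz notword-';

// With the g flag, match() returns the full matches and ignores the capture group.
const words = input.match(wordRe);

console.log(words); // [ 'abc-def', 'word', 'A', 'xyz' ]
```

Here `-notword` is rejected by the `(?!-)` lookahead together with the leading lookbehind, and `notword-` is rejected because the trailing lookahead does not accept a dash.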

Working demo


UPDATE: Based on the edited question, you can use this regex for splitting:

/([^\w-].*?)(?=(?<=[^\w-]|^)(?!-)[a-z-]+(?<!-)(?=[^\w-]|$))/gis

Working demo

If I understood your question correctly, the following should help.

Instead of searching for the valid matches (the words you want), I removed everything that is invalid.

Have a look at this demo. It matches all the invalid tokens, according to my understanding of your question:

"Words are alphabetic strings that can contain dashes (-), they cannot start with dashes or end with dashes. Words cannot contain numerals or any other character besides single dashes and [a-zA-Z]."

This is the code:

var str = 'word word-ed, [word-ing] 123-word w-word, word-. w0rd w14rd 124eword 1234word finished.'
str.replace(/(\b[\d]+-[a-zA-Z]+\b)|(\b[\d]+[a-zA-Z]+)|(\b[a-zA-Z]+[\d]+[a-zA-Z]+)|(\b[a-zA-Z]+-[.,]|([\[\],.]))/g, '').split(/\s+/)

Output

["word", "word-ed", "word-ing", "w-word", "finished"]

Explanation:

Search for invalid matches

str.match(/(\b[\d]+-[a-zA-Z]+\b)|(\b[\d]+[a-zA-Z]+)|(\b[a-zA-Z]+[\d]+[a-zA-Z]+)|(\b[a-zA-Z]+-[.,]|([\[\],.]))/g)
//output
[",", "[", "]", "123-word", ",", "word-.", "w0rd", "w14rd", "124eword", "1234word", "."]

Replace them with an empty string

var temp = str.replace(/(\b[\d]+-[a-zA-Z]+\b)|(\b[\d]+[a-zA-Z]+)|(\b[a-zA-Z]+[\d]+[a-zA-Z]+)|(\b[a-zA-Z]+-[.,]|([\[\],.]))/g, '')
//output
"word word-ed word-ing  w-word      finished"

Split the result on whitespace

temp.split(/\s+/)
//output
["word", "word-ed", "word-ing", "w-word", "finished"]
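The three steps can also be combined into one reusable function, sketched below. `extractWords` and `invalidRe` are hypothetical names introduced here; the `trim()` and `filter(Boolean)` calls are additions that guard against empty entries when the text starts or ends with a removed token, and are not part of the original answer.

```javascript
// Pattern for invalid tokens, taken unchanged from the steps above.
const invalidRe = /(\b[\d]+-[a-zA-Z]+\b)|(\b[\d]+[a-zA-Z]+)|(\b[a-zA-Z]+[\d]+[a-zA-Z]+)|(\b[a-zA-Z]+-[.,]|([\[\],.]))/g;

// Hypothetical helper: remove invalid tokens, then split on whitespace.
function extractWords(text) {
  return text
    .replace(invalidRe, '') // step 1 and 2: find invalid tokens and drop them
    .trim()                 // added guard: no leading or trailing whitespace
    .split(/\s+/)           // step 3: split on runs of whitespace
    .filter(Boolean);       // added guard: drop empty strings (e.g. empty input)
}

const sampleText = 'word word-ed, [word-ing] 123-word w-word, word-. w0rd w14rd 124eword 1234word finished.';
console.log(extractWords(sampleText));
// [ 'word', 'word-ed', 'word-ing', 'w-word', 'finished' ]
```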
