
Splitting a string in different scenarios using regex

I have two scenarios for splitting this string:

"@#$hello?? getting good.<li>hii"

I want it to be split as 'hello','getting','good.<li>hii' (Scenario 1)

'hello','getting','good','li','hi' (Scenario 2)

Any ideas, please?

Something like this should work:

>>> import re
>>> s = "@#$hello?? getting good.<li>hii"
>>> re.split(r"[^\w<>.]+", s) # or re.split(r"[@#$? ]+", s)
['', 'hello', 'getting', 'good.<li>hii']
>>> re.split(r"[^\w]+", s)
['', 'hello', 'getting', 'good', 'li', 'hii']
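
Both calls return a leading empty string because the input starts with delimiter characters. Here is a minimal sketch of one way to drop it, assuming empty pieces are never wanted. The helper name split_nonempty is made up for illustration and is not part of the re module:

```python
import re

def split_nonempty(pattern, text):
    """Split text on pattern and discard empty pieces (hypothetical helper)."""
    return [piece for piece in re.split(pattern, text) if piece]

s = "@#$hello?? getting good.<li>hii"

# Scenario 1: keep '<', '>' and '.' inside tokens
scenario_1 = split_nonempty(r"[^\w<>.]+", s)

# Scenario 2: split on every run of non-word characters
scenario_2 = split_nonempty(r"\W+", s)

print(scenario_1)  # ['hello', 'getting', 'good.<li>hii']
print(scenario_2)  # ['hello', 'getting', 'good', 'li', 'hii']
```

Filtering after the split leaves the patterns unchanged, so the same helper serves both scenarios.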

This might be what you're looking for: \w+ matches one or more word characters (letters, digits, or underscore), as many as possible. Here is a working JavaScript example:

var value = "@#$hello?? getting good.<li>hii";
// "\\w+" in a string literal yields the regex \w+
var matches = value.match(new RegExp("\\w+", "gi"));
console.log(matches);

It works by using \w+, which matches word characters as many times as possible. You could also use [A-Za-z]+ to match only letters and not digits, as shown here:

var value = "@#$hello?? getting good.<li>hii777bloop";
var matches = value.match(new RegExp("[A-Za-z]+", "gi"));
console.log(matches);

It matches the characters listed in the brackets one or more times, as many as possible: in this case the range a-z of lowercase characters and the range A-Z of uppercase characters. Hope this is what you want.
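
For comparison with the Python answers in this thread, the same two patterns can be sketched with re.findall. This is an illustrative translation of the JavaScript above, not part of the original answer, and the variable names are made up:

```python
import re

value = "@#$hello?? getting good.<li>hii777bloop"

# \w+ keeps letters, digits, and underscores together in one token
word_tokens = re.findall(r"\w+", value)

# [A-Za-z]+ keeps ASCII letters only, so the digits act as a separator
letter_tokens = re.findall(r"[A-Za-z]+", value)

print(word_tokens)    # ['hello', 'getting', 'good', 'li', 'hii777bloop']
print(letter_tokens)  # ['hello', 'getting', 'good', 'li', 'hii', 'bloop']
```

The only difference shows up at "hii777bloop": \w+ treats it as one token, while the letter-only class splits it around the digits.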

In case you are looking for a solution without regex: string.punctuation gives you a string of all ASCII punctuation characters. Use it with a list comprehension to achieve your desired result:

>>> import string
>>> my_string = '@#$hello?? getting good.<li>hii'
>>> ''.join([(' ' if s in string.punctuation else s) for s in my_string]).split()
['hello', 'getting', 'good', 'li', 'hii'] # desired output

Explanation: below is a step-by-step description of how it works:

import string # Importing the 'string' module
special_char_string = string.punctuation
# Value of 'special_char_string': '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

my_string = '@#$hello?? getting good.<li>hii'

# Generating list of character in sample string with
# special character replaced with whitespace 
my_list = [(' ' if item in special_char_string else item) for item in my_string]

# Join the list to form string
my_string = ''.join(my_list)

# Split it based on space
my_desired_list = my_string.strip().split()

The value of my_desired_list will be:

['hello', 'getting', 'good', 'li', 'hii']
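
As an alternative to the list comprehension, the same replacement can be sketched with str.maketrans and str.translate. This is a different technique from the one in the answer above, shown only for comparison. The variable name punctuation_to_space is made up for illustration:

```python
import string

my_string = '@#$hello?? getting good.<li>hii'

# Build a translation table mapping every ASCII punctuation character to a space
punctuation_to_space = str.maketrans(
    string.punctuation, ' ' * len(string.punctuation)
)

# Replace punctuation in one pass, then split on whitespace
words = my_string.translate(punctuation_to_space).split()
print(words)  # ['hello', 'getting', 'good', 'li', 'hii']
```

Like the list comprehension, this covers only the ASCII characters in string.punctuation. Other symbols would pass through unchanged.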

For the first scenario, just use a regex to find all tokens that consist of word characters, <, >, and the period:

In [60]: re.findall(r'[\w<>.]+', s)
Out[60]: ['hello', 'getting', 'good.<li>hii']

For the second one, you need to collapse the repeated characters only if the word is not a valid English word. You can do this using the nltk words corpus and re.sub:

In [61]: import nltk

In [62]: english_vocab = set(w.lower() for w in nltk.corpus.words.words())

In [63]: repeat_regexp = re.compile(r'(\w*)(\w)\2(\w*)')

In [64]: [repeat_regexp.sub(r'\1\2\3', word) if word not in english_vocab else word for word in re.findall(r'[^\W]+', s)]
Out[64]: ['hello', 'getting', 'good', 'li', 'hi']
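
If nltk is not available, the idea can be sketched with a small hand-written vocabulary. This is a simplified variant, not the code above: known_words is a hypothetical stand-in for a real word list, and collapse_repeats is a made-up helper that removes one doubled character at a time until the word is recognised or nothing is left to collapse:

```python
import re

# Hypothetical stand-in for a real word list such as the nltk words corpus
known_words = {"hello", "getting", "good", "li"}

# A word character immediately followed by itself
doubled_char = re.compile(r"(\w)\1")

def collapse_repeats(word, vocabulary):
    """Drop doubled characters one pair at a time until the word is known."""
    while word not in vocabulary:
        collapsed = doubled_char.sub(r"\1", word, count=1)
        if collapsed == word:
            # No doubled characters remain, so give up and keep the word
            break
        word = collapsed
    return word

s = "@#$hello?? getting good.<li>hii"
result = [collapse_repeats(w, known_words) for w in re.findall(r"\w+", s)]
print(result)  # ['hello', 'getting', 'good', 'li', 'hi']
```

Checking the vocabulary before each collapse is what keeps "hello", "getting" and "good" intact. The quality of the result depends entirely on the word list used.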
