Split string via regular expression

Suppose I am given a string like:

input = """
abc@gmail.com is a very nice person
xyz@gmail.com sucks
lol@gmail.com is pretty funny."""

I have a regular expression for email addresses: ^[A-z0-9\\+\\.]+\\@[A-z0-9\\+\\.]+\\.[A-z0-9\\+]+$

The goal is to split the string based on the email address regular expression. The output should be:

["is a very nice person", "sucks", "is pretty funny."]

I have been trying to use re.split(EMAIL_REGEX, input) but I haven't been successful. The output I get is a list containing the entire string.

Remove the ^ and $ anchors, as they only match at the beginning and end of the string. Since the email addresses are in the middle of the string, the pattern will never match.
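A minimal sketch of the difference, using a simplified pattern chosen only for illustration (the names `text`, `anchored` and `unanchored` are not from the question, and the pattern is not a complete email validator):

```python
import re

text = """
abc@gmail.com is a very nice person
xyz@gmail.com sucks
lol@gmail.com is pretty funny."""

# Simplified illustrative patterns; only the anchors differ.
anchored = r"^[A-Za-z0-9+.]+@[A-Za-z0-9.]+\.[A-Za-z0-9]+$"
unanchored = r"[A-Za-z0-9+.]+@[A-Za-z0-9.]+\.[A-Za-z0-9]+"

# With ^ and $, the pattern has to span the whole string, so nothing
# matches and re.split returns the input unchanged in a one-element list.
anchored_parts = re.split(anchored, text)

# Without the anchors, every address becomes a split point.
unanchored_parts = re.split(unanchored, text)

print(anchored_parts)
print(unanchored_parts)
```

The second list still contains the leading newline and the whitespace around each piece, so some cleanup is needed afterwards.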

Your regexp has other problems. The account name can contain many characters besides the ones you allow, e.g. _ and -. The domain name can contain - characters, but not +. And you shouldn't use the range A-z to get upper- and lowercase characters, because there are characters between the two alphabetic blocks that you probably don't want to include (see the ASCII table); either use A-Za-z, or use a-z and add flags=re.IGNORECASE.
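To make the range issue concrete, here is a small sketch; the variable names are illustrative only. In ASCII, six punctuation characters sit between 'Z' and 'a', and a class written as A-z accepts all of them:

```python
import re

# The six ASCII characters between 'Z' (0x5A) and 'a' (0x61): [ \ ] ^ _ `
between = "[\\]^_`"

# [A-z] spans everything from 'A' to 'z', so it accepts all six.
loose = re.findall(r"[A-z]", between)

# [A-Za-z] covers only the two alphabetic blocks.
strict = re.findall(r"[A-Za-z]", between)

# Alternative: a lowercase-only class combined with re.IGNORECASE.
ignorecase = re.findall(r"[a-z]+", "AbC", flags=re.IGNORECASE)

print(loose)
print(strict)
print(ignorecase)
```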

The '^' and '$' might be throwing it off. With them, the pattern only matches a string that starts and ends with the email address, i.e. one that contains nothing else.

I have something close to what you want:

>>> EMAIL_REGEX = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
>>> re.split(EMAIL_REGEX, input, flags=re.IGNORECASE)
['\n', ' is a very nice person\n', ' sucks\n', ' is pretty funny.']
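The pieces returned above still carry surrounding whitespace and a leading '\n' entry. One possible way to get from there to the exact list in the question, sketched under the assumption that whitespace around each piece is never significant (`text`, `pieces` and `cleaned` are illustrative names):

```python
import re

text = """
abc@gmail.com is a very nice person
xyz@gmail.com sucks
lol@gmail.com is pretty funny."""

EMAIL_REGEX = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'

pieces = re.split(EMAIL_REGEX, text, flags=re.IGNORECASE)

# Strip whitespace from each piece and drop the ones that end up empty.
cleaned = [piece.strip() for piece in pieces if piece.strip()]

print(cleaned)
```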

You will probably need to loop through the lines and then split each one with your regex. Also, your regex shouldn't have $ at the end.

Try something like:

import re

EMAIL_REGEX = r"\.[a-z]{3} "  # just for the demo; note the trailing space
ends = []
for L in input.split("\n"):
    parts = re.split(EMAIL_REGEX, L)
    if len(parts) > 1:
        ends.append(parts[1])

Output:

['is a very nice person', 'sucks', 'is pretty funny.']

I wouldn't use a regex here; this works as well:

messages = []
for item in input.split('\n'):
    # removes everything before the first space, which is just the email address in this case
    item = ' '.join(item.split(' ')[1:])
    messages.append(item)

Output of messages when using:

input = """
abc@gmail.com is a very nice person
xyz@gmail.com sucks
lol@gmail.com is pretty funny."""

['', 'is a very nice person', 'sucks', 'is pretty funny.']

If you want to remove the first element, just do it like this: messages = messages[1:]
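A variant of the same idea that skips blank lines up front, so no slicing is needed afterwards. This is only a sketch: it assumes every non-empty line begins with the address followed by a single space, and it swaps in str.partition for the split/join pair (`text` and `rest` are illustrative names):

```python
text = """
abc@gmail.com is a very nice person
xyz@gmail.com sucks
lol@gmail.com is pretty funny."""

messages = []
for line in text.splitlines():
    if not line.strip():
        continue  # skip the empty first line
    # partition returns (before, separator, after) around the first space
    _address, _sep, rest = line.partition(' ')
    messages.append(rest)

print(messages)
```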
