
Repeating regex pattern in Python

I have a file with millions of retweets, like this:

RT @Username: Text_of_the_tweet

I just need to extract the username from this string. Since I'm a total zero when it comes to regex, some time ago I was advised here to use

username = re.findall('@([^:]+)', retweet)

This works great for the most part, but sometimes I get lines like this:

RT @ReutersAero: Further pictures from the #MH17 crash site in  in Grabovo, #Ukraine #MH17 - @reuterspictures (GRAPHIC): http://t.co/4rc7Y4…

I only need "ReutersAero" from the string, but since it contains another "@" and ":", the regex matches there as well, and I get this output:

['ReutersAero', 'reuterspictures (GRAPHIC)']

Is there a way to use the regex only for the first instance it finds in the string?

You can use a regex like this:

RT @(\w+):


Match information:

MATCH 1
1.  [4-15]  `ReutersAero`
MATCH 2
1.  [145-156]   `AnotherAero`

You can use this Python code:

import re

# \w+ captures the username between "RT @" and the colon that follows it
p = re.compile(r'RT @(\w+):')
test_str = "RT @ReutersAero: Further pictures from the #MH17 crash site in  in Grabovo, #Ukraine #MH17 - @reuterspictures (GRAPHIC): http://t.co/4rc7Y4…\nRT @AnotherAero: Further pictures from the #MH17 crash site in  in Grabovo, #Ukraine #MH17 - @reuterspictures (GRAPHIC): http://t.co/4rc7Y4…\n"

# findall returns the captured group for every "RT @name:" occurrence
p.findall(test_str)
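Since the question mentions a file with one retweet per line, the same pattern could also be applied line by line. The following is only a sketch under that assumption: the helper name `extract_usernames` and the sample lines are made up for illustration, and it assumes every retweet starts with `RT @`.

```python
import re

# Same pattern as above, compiled once and reused for every line.
RETWEET_RE = re.compile(r'RT @(\w+):')

def extract_usernames(lines):
    """Yield the retweeted username for each line that starts with 'RT @name:'.

    Hypothetical helper, not part of the original answer.
    """
    for line in lines:
        m = RETWEET_RE.match(line)  # match() only tries at the start of the line
        if m:
            yield m.group(1)

# Made-up sample lines; a real run would pass an open file object instead.
sample = [
    "RT @ReutersAero: pictures from the crash site - @reuterspictures (GRAPHIC): link",
    "RT @AnotherAero: some other text",
    "not a retweet",
]
usernames = list(extract_usernames(sample))
print(usernames)  # ['ReutersAero', 'AnotherAero']
```

Because `match()` is anchored at the start of each line, a later "@name:" inside the tweet text is never considered.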

Is there a way to use the regex only for the first instance it finds in the string?

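Do not use findall; use search instead. re.search stops at the first match and returns a match object (or None if nothing matches), so the username is in group 1 of that object.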
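As a minimal sketch of that suggestion, here is the pattern from the question applied to the problematic tweet with `re.search`. The variable names are chosen for illustration only.

```python
import re

retweet = ("RT @ReutersAero: Further pictures from the #MH17 crash site in  in Grabovo, "
           "#Ukraine #MH17 - @reuterspictures (GRAPHIC): http://t.co/4rc7Y4…")

# search() returns only the first match, so the second "@...:" is ignored.
m = re.search(r'@([^:]+)', retweet)
username = m.group(1) if m else None  # None when the line has no "@" at all
print(username)  # ReutersAero
```

The `if m else None` guard matters because `re.search` returns `None` on lines without a match, and calling `.group(1)` on `None` would raise an `AttributeError`.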
