Extracting specific information from a string list using regular expressions
I have a list of several thousand URL strings with different structures, and I am trying to use regex to extract specific information from them. The following example shows the structure of one kind of URL (many other records share this format; only the numbers change across the data):
url_id | url_text
15 | /course/123908/discussion_topics/394785/entries/980389/read
Using the re library in Python, I can find which URLs have this structure:
re.findall(r"/course/\d{6}/discussion_topics/\d{6}/entries/\d{6}/read", text)
However, I also need to extract the '394785' and '980389' values and create a new matrix that may look like this:
url_id | topic_394785 | entry_980389 | {other items will be added as new column}
15 | 1 | 1 | 0 | 0 | 1 | it goes like this
Can someone help me extract this specific information? I know that the 'split' method of 'str' could be an option, but I wonder if there is a better solution.
Thanks!
Do you mean something like this?
import re
text = '/course/123908/discussion_topics/394785/entries/980389/read'
# Named groups (?P<topic>...) and (?P<entry>...) capture the two IDs of interest.
pattern = r"/course/\d{6}/discussion_topics/(?P<topic>\d{6})/entries/(?P<entry>\d{6})/read"
# finditer yields one match object per occurrence of the pattern in text.
for match in re.finditer(pattern, text):
    topic, entry = match.group('topic'), match.group('entry')
    print('Topic ID={}, entry ID={}'.format(topic, entry))
Output
Topic ID=394785, entry ID=980389
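To get from the extracted IDs to the 0/1 matrix described in the question, one possible approach is to collect a set of column labels per url_id and then expand it into indicator values. The following is only a sketch under assumptions: the `rows` list, the `build_indicator_matrix` helper, and the second and third sample URLs are made up for illustration (only url_id 15 comes from the question), and the result is held in plain dictionaries rather than any particular matrix library.

```python
import re

# Hypothetical sample data as (url_id, url_text) pairs.
# Only the first row comes from the question; the others are invented.
rows = [
    (15, "/course/123908/discussion_topics/394785/entries/980389/read"),
    (16, "/course/123908/discussion_topics/394785/entries/111222/read"),
    (17, "/course/123908/some/other/path"),
]

PATTERN = re.compile(
    r"/course/\d{6}/discussion_topics/(?P<topic>\d{6})/entries/(?P<entry>\d{6})/read"
)


def build_indicator_matrix(rows):
    """Return (columns, matrix), where matrix maps url_id -> {column: 0 or 1}."""
    found = {}    # url_id -> set of column labels seen for that URL
    columns = []  # column labels in first-seen order
    for url_id, url_text in rows:
        labels = found.setdefault(url_id, set())
        match = PATTERN.fullmatch(url_text)
        if match is None:
            continue  # different URL structure; this row stays all zeros
        for name in ("topic_" + match.group("topic"),
                     "entry_" + match.group("entry")):
            labels.add(name)
            if name not in columns:
                columns.append(name)
    matrix = {
        url_id: {col: int(col in labels) for col in columns}
        for url_id, labels in found.items()
    }
    return columns, matrix


columns, matrix = build_indicator_matrix(rows)
print("url_id | " + " | ".join(columns))
for url_id, row in matrix.items():
    print("{} | {}".format(url_id, " | ".join(str(row[c]) for c in columns)))
```

Here `fullmatch` is used so that a URL counts only if the whole string has this structure; URLs in other formats would need their own patterns, each contributing further columns in the same way.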