Extracting specific information from a list of strings using regular expressions

Question

I have a list of several thousand URL strings with different structures, and I am trying to use regex to extract specific information from them. The example below shows the structure of one kind of URL (many other records share this format; only the numbers change across the data):

url_id | url_text
15     | /course/123908/discussion_topics/394785/entries/980389/read

Using the re library in Python, I can find which URLs have this structure:

re.findall(r"/course/\d{6}/discussion_topics/\d{6}/entries/\d{6}/read", text)

However, I also need to extract the '394785' and '980389' values and create a new matrix that may look like this:

url_id | topic_394785 | entry_980389 | {other items will be added as new columns}
15     | 1            | 1            | 0 | 0 | 1 | ... (and so on)

Can someone help me extract this specific info? I know that the 'split' method of 'str' could be an option, but I wonder if there is a better solution.

Thanks!

Answer 1

Do you mean something like this?

import re

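# Example URL from the question
text = '/course/123908/discussion_topics/394785/entries/980389/read'
# Named groups (?P<topic>...) and (?P<entry>...) capture the two IDs
pattern = r"/course/\d{6}/discussion_topics/(?P<topic>\d{6})/entries/(?P<entry>\d{6})/read"

# finditer yields one match object per occurrence of the pattern in text
for match in re.finditer(pattern, text):
    topic, entry = match.group('topic'), match.group('entry')
    print('Topic ID={}, entry ID={}'.format(topic, entry))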

Output

Topic ID=394785, entry ID=980389
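
To get from the extracted IDs to the 0/1 matrix sketched in the question, one possible approach is to apply the same pattern to every record and collect indicator columns as you go. The following is only a minimal sketch using the standard library: the function name build_indicator_rows, the (url_id, url_text) input shape, and records 16 and 17 are made up for illustration and do not come from the original data.

```python
import re

# Same pattern as in the answer above, compiled once for reuse.
PATTERN = re.compile(
    r"/course/\d{6}/discussion_topics/(?P<topic>\d{6})/entries/(?P<entry>\d{6})/read"
)


def build_indicator_rows(records):
    """Return (columns, rows); each row maps a column name to 0 or 1.

    `records` is an iterable of (url_id, url_text) pairs. This input shape
    is an assumption made for the sketch.
    """
    present_by_id = {}  # url_id -> set of column names found in that URL
    columns = {}        # dict used as an ordered set of column names
    for url_id, url_text in records:
        match = PATTERN.fullmatch(url_text)
        if match is None:
            continue  # URL has a different structure; skip it here
        names = ["topic_" + match.group("topic"),
                 "entry_" + match.group("entry")]
        present_by_id[url_id] = set(names)
        for name in names:
            columns[name] = None
    column_list = list(columns)
    rows = {
        url_id: {name: int(name in present) for name in column_list}
        for url_id, present in present_by_id.items()
    }
    return column_list, rows


# Record 15 is the example from the question; 16 and 17 are hypothetical.
records = [
    (15, "/course/123908/discussion_topics/394785/entries/980389/read"),
    (16, "/course/123908/discussion_topics/394785/entries/111111/read"),
    (17, "/some/other/structure"),
]

columns, rows = build_indicator_rows(records)
print("url_id", columns)
for url_id, row in rows.items():
    print(url_id, [row[name] for name in columns])
```

Here fullmatch is used instead of finditer because each list item is expected to be exactly one URL; URLs with other structures are skipped and would need their own patterns. If the data already lives in a pandas DataFrame, the same row dictionaries could be loaded into one, but that step is left out of this sketch.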

Answer 1 (score 2, accepted) 2017-01-17 12:43:25
