
Extracting specific information from a string list using regular expressions

I have a list of several thousand URL strings with different structures, and I am trying to use regex to extract specific information from them. Here is an example URL that shows the structure I am interested in (many other records follow this format; only the numbers change across the data):

url_id | url_text
15     | /course/123908/discussion_topics/394785/entries/980389/read

Using the re library in Python, I can find which URLs have this structure:

re.findall(r"/course/\d{6}/discussion_topics/\d{6}/entries/\d{6}/read", text) 

However, I also need to extract the '394785' and '980389' values and create a new matrix that may look like this:

url_id | topic_394785 | entry_980389 | {other items will be added as new column}
15     | 1            | 1            | 0       | 0     | 1    | it goes like this

Can someone help me extract this specific info? I know that the 'split' method of 'str' could be an option, but I wonder if there is a better solution.

Thanks!

Do you mean something like this?

import re

text = '/course/123908/discussion_topics/394785/entries/980389/read'
pattern = r"/course/\d{6}/discussion_topics/(?P<topic>\d{6})/entries/(?P<entry>\d{6})/read"

for match in re.finditer(pattern, text):
    topic, entry  = match.group('topic'), match.group('entry')
    print('Topic ID={}, entry ID={}'.format(topic, entry))

Output

Topic ID=394785, entry ID=980389
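
To get from the extracted IDs to the 0/1 matrix sketched in the question, one option is to collect a `topic_<id>` and an `entry_<id>` column name per matching URL and then fill in the flags. The following is only a minimal sketch under some assumptions: the helper name `build_indicator_rows` and the `records` list of `(url_id, url_text)` pairs are hypothetical, and the second and third sample URLs are made up for illustration. URLs that do not follow this structure are skipped.

```python
import re

# Same pattern as in the answer above, compiled once for reuse.
PATTERN = re.compile(
    r"/course/\d{6}/discussion_topics/(?P<topic>\d{6})/entries/(?P<entry>\d{6})/read"
)

def build_indicator_rows(records):
    """Hypothetical helper: return (columns, rows) with 0/1 flags per url_id."""
    seen = {}      # url_id -> set of column names found for that URL
    columns = {}   # dict used as an ordered set of column names
    for url_id, url_text in records:
        match = PATTERN.fullmatch(url_text)
        if match is None:
            continue  # URL has a different structure; skip it
        names = ["topic_" + match.group("topic"), "entry_" + match.group("entry")]
        seen.setdefault(url_id, set()).update(names)
        for name in names:
            columns.setdefault(name, None)
    column_list = list(columns)
    rows = {
        url_id: [1 if name in names else 0 for name in column_list]
        for url_id, names in seen.items()
    }
    return column_list, rows

# Illustrative data: only the first URL comes from the question.
records = [
    (15, "/course/123908/discussion_topics/394785/entries/980389/read"),
    (16, "/course/123908/discussion_topics/394785/entries/111222/read"),
    (17, "/some/other/structure"),
]

columns, rows = build_indicator_rows(records)
print(columns)
for url_id, flags in rows.items():
    print(url_id, flags)
```

Here `fullmatch` is used instead of `finditer` because each list element is assumed to hold exactly one URL. If the data already lives in a pandas DataFrame, the same idea could probably be expressed with its string and dummy-variable helpers, but that is outside what this sketch shows.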


