
Extracting Data with Python Regular Expressions

I am having trouble coming up with a Python regular expression to extract specific values.

The page I am trying to parse contains a number of productIds, which appear in the following format:

\"productId\":\"111111\"

I need to extract all the values, 111111 in this case.

import re

t = "\"productId\":\"111111\""
# re.match only tries the pattern at the start of the string
m = re.match(r"\W*productId[^:]*:\D*(\d+)", t)
if m:
    print(m.group(1))

This means: match non-word characters ( \W* ), then productId followed by non-colon characters ( [^:]* ) and a : . Then match non-digits ( \D* ) and capture the digits that follow ( (\d+) ).

Output:

111111
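
Because re.match only tries the pattern at the start of the string, the snippet above yields at most one ID. As a rough sketch of how the same idea might be applied to every occurrence, the following uses re.findall with the leading \W* dropped. The page string and the second ID are made up for illustration and are not from the question.

```python
import re

# Hypothetical sample: two productId entries in the same unescaped form as t above.
page = '{"productId":"111111","name":"a"},{"productId":"222222","name":"b"}'

# Same core pattern; findall scans the whole string, so no leading \W* is needed.
pattern = re.compile(r'productId[^:]*:\D*(\d+)')

ids = pattern.findall(page)
print(ids)  # ['111111', '222222']
```

Note that \D* skips any run of non-digits, so this sketch assumes every productId value is numeric.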

Something like this:

In [13]: s=r'\"productId\":\"111111\"'

In [14]: print(s)
\"productId\":\"111111\"

In [15]: import re

In [16]: re.findall(r'\d+', s)
Out[16]: ['111111']
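
One caveat: r'\d+' returns every run of digits, not only product IDs. A small sketch of the difference, assuming a hypothetical price field next to the ID (that field is invented for illustration):

```python
import re

# Hypothetical sample with an unrelated number next to the product ID.
s = r'\"productId\":\"111111\",\"price\":\"42\"'

# Every run of digits, whatever key it belongs to.
all_digits = re.findall(r'\d+', s)

# Only the digits that follow the productId key.
only_ids = re.findall(r'productId\D*(\d+)', s)

print(all_digits)  # ['111111', '42']
print(only_ids)    # ['111111']
```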

The backslashes here might add to the confusion, because they are used as an escape character both by (non-raw) Python strings and by the regexp syntax.

This extracts the product ids from the format you posted:

re_prodId = re.compile(r'\\"productId\\":\\"([^"]+)\\"')

The raw string r'...' does away with one level of backslash escaping; the use of a single quote as the string delimiter does away with the need to escape double quotes; and finally the backslashes are doubled (only once) because of their special meaning in the regexp language.
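
To make the two levels of escaping concrete, here is a small sketch that writes the same pattern once as a raw string and once as an ordinary string, then checks that they are identical. The variable names are illustrative only.

```python
import re

# Text containing literal backslashes, as in the posted format.
text = r'\"productId\":\"111111\"'

# Raw string: each literal backslash is doubled once, for the regexp engine.
raw_pattern = r'\\"productId\\":\\"([^"]+)\\"'

# Ordinary string: every backslash is doubled again, for the Python parser.
plain_pattern = '\\\\"productId\\\\":\\\\"([^"]+)\\\\"'

same = raw_pattern == plain_pattern
found = re.findall(raw_pattern, text)
print(same)   # True
print(found)  # ['111111']
```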

You can use the regexp object's findall() method to find all matches in some text:

re_prodId.findall(text_to_search)

This will return a list of all product ids.
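
Put together as a runnable sketch, assuming a made-up text_to_search with two entries in the escaped format (the second ID is hypothetical):

```python
import re

re_prodId = re.compile(r'\\"productId\\":\\"([^"]+)\\"')

# Hypothetical page fragment; the backslashes are literal characters in the text.
text_to_search = r'{\"productId\":\"111111\"},{\"productId\":\"222222\"}'

product_ids = re_prodId.findall(text_to_search)
print(product_ids)  # ['111111', '222222']
```

With a single capturing group in the pattern, findall() returns the captured group for each match rather than the whole matched text.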

Try this:

 :\\"(\d*)\\"

Give more examples of your data if this doesn't do what you want.
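
As a quick check of what this pattern picks up, here is a sketch run against a made-up string. The qty field is hypothetical and is there only to show that the pattern is not tied to the productId key.

```python
import re

# Hypothetical sample: a second numeric field alongside productId.
text = r'\"productId\":\"111111\",\"qty\":\"3\"'

# Matches any colon followed by an escaped-quoted run of digits.
values = re.findall(r':\\"(\d*)\\"', text)
print(values)  # ['111111', '3']
```

If only product IDs are wanted, the key name would have to be part of the pattern.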

