Extracting certain strings from a substring using Python

Question

I have a large document from which I am trying to extract certain data using Python 3. Text similar to the below is repeated, and I want to extract the "123456789" and "987654321" each time the "pic=" and "originalName=" strings are found.

"this is some text pic=123456789 some more text originalName="987654321.jpg then some more text"

Can anyone assist?

Answer 1

You can try this:

import re
s = 'this is some text pic=123456789 some more text originalName="987654321.jpg then some more text'
# Raw string (r'...') so Python passes the backslashes to the regex engine unchanged
data = re.findall(r'(?<=pic\=)\d+|(?<=originalName\=\")\d+', s)
print(data)

Output:

['123456789', '987654321']

Answer 2

You'll want to use Python's re module, its library for regular expressions. Regular expressions are a useful way to search for patterns in text. In this case, another answer has already provided a working snippet:

import re
s = 'this is some text pic=123456789 some more text originalName="987654321.jpg then some more text'
# Raw string (r'...') so Python passes the backslashes to the regex engine unchanged
data = re.findall(r'(?<=pic\=)\d+|(?<=originalName\=\")\d+', s)

This looks like nonsense at first, so here's a breakdown:

re.findall returns a list of all non-overlapping matches of the given pattern in the given string.

The first parameter to findall is the regular expression pattern, written as a string (here in single quotes). A regular expression can be just a word; re.findall('apple', s) would return every occurrence of the substring "apple" in s. However, there are several characters with special meaning that help describe more general patterns.
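As a small illustration of the literal-word case, here is a sketch using a made-up sentence (the string and the variable names are assumptions, not part of the question):

```python
import re

# Hypothetical sample text, chosen to show substring behaviour.
s = "apple pie, pineapple juice, and an Apple"

# A plain word is a valid pattern. It matches every occurrence of that
# substring, including the one inside "pineapple". Matching is
# case-sensitive by default, so "Apple" is not returned.
literal_matches = re.findall("apple", s)
print(literal_matches)  # ['apple', 'apple']
```

Note that findall returns an empty list, not None, when nothing matches.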

\d matches any single digit 0-9. \d+ matches a sequence of one or more digits.
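The difference between matching one digit and a run of digits can be seen in a short sketch (the input string here is invented for the example):

```python
import re

# Hypothetical input mixing letters and digits.
text = "a1b22"

# \d matches one digit at a time, so "22" yields two separate matches.
single_digits = re.findall(r"\d", text)
# \d+ matches each unbroken run of digits as a single match.
digit_runs = re.findall(r"\d+", text)

print(single_digits)  # ['1', '2', '2']
print(digit_runs)     # ['1', '22']
```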

The | in the middle separates two alternative regular expressions. If either pattern matches, the overall expression matches.
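A minimal sketch of alternation on its own, using an invented string:

```python
import re

# Hypothetical sample text.
s = "cat dog bird cat"

# "cat|bird" matches wherever either alternative matches;
# results come back in the order they appear in the string.
either = re.findall(r"cat|bird", s)
print(either)  # ['cat', 'bird', 'cat']
```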

(?<= ... ) is called a positive lookbehind. It allows a match only at a position that is immediately preceded by the pattern described in the ... , and the preceding text itself is not included in the match.
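To see what the lookbehind changes, this sketch runs the question's sample string through the same digit pattern with and without it (the variable names are only for illustration):

```python
import re

s = 'this is some text pic=123456789 some more text originalName="987654321.jpg then some more text'

# Without a lookbehind, the prefix is part of the match.
with_prefix = re.findall(r"pic=\d+", s)
# With a lookbehind, "pic=" must be present but is left out of the result.
digits_only = re.findall(r"(?<=pic=)\d+", s)

print(with_prefix)  # ['pic=123456789']
print(digits_only)  # ['123456789']
```

One limitation worth knowing: Python's re module requires the pattern inside a lookbehind to match text of a fixed length, so something like (?<=pic=\s*) is rejected.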

= and " are written here as \= and \" so that they are treated as ordinary characters. Neither is actually special in a regular expression, so the backslashes are optional and the pattern works the same without them.
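A quick check of the escaping, sketched under the assumption that the input looks like the question's sample. The second half uses re.escape on an invented price string to show a case where escaping does matter:

```python
import re

s = 'originalName="987654321.jpg'

# Escaped and unescaped forms of = and " behave identically.
escaped = re.findall(r'(?<=originalName\=\")\d+', s)
unescaped = re.findall(r'(?<=originalName=")\d+', s)
print(escaped == unescaped)  # True

# Characters such as $ and . are special, so a literal search for them
# needs escaping. re.escape builds the escaped pattern automatically.
price_pattern = re.escape("cost=$5.00")  # hypothetical literal text
price_matches = re.findall(price_pattern, "cost=$5.00 cost=$5X00")
print(price_matches)  # ['cost=$5.00']
```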

So r'(?<=pic\=)\d+' matches a sequence of one or more digits that is preceded by the string pic= . And r'(?<=originalName\=\")\d+' matches a sequence of digits preceded by the string originalName=" .

The second parameter to findall is just the string to search for these patterns. So re.findall(r'(?<=pic\=)\d+|(?<=originalName\=\")\d+', s) will search s and return all sequences of digits that have pic= before them, and all sequences of digits that have originalName=" before them.
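Since the question mentions that this text repeats throughout a large document, here is a sketch of how the pattern behaves on several records. The two-line document is invented for the example, and the capture-group variant (pair_pattern) is an alternative technique not taken from the answers above; it assumes each pic= and its originalName= sit on the same line.

```python
import re

# Hypothetical document with two records, one per line.
document = (
    'first record pic=123456789 some text originalName="987654321.jpg more text\n'
    'second record pic=111222333 some text originalName="444555666.jpg more text\n'
)

# The lookbehind pattern returns one flat list, in order of appearance.
flat = re.findall(r'(?<=pic=)\d+|(?<=originalName=")\d+', document)
print(flat)
# ['123456789', '987654321', '111222333', '444555666']

# Alternative: capture groups keep each pic/originalName pair together.
# .*? is a lazy wildcard that does not cross line breaks by default.
pair_pattern = re.compile(r'pic=(\d+).*?originalName="(\d+)')
pairs = pair_pattern.findall(document)
print(pairs)
# [('123456789', '987654321'), ('111222333', '444555666')]
```

When a pattern contains more than one capture group, findall returns a list of tuples, which makes it easier to tell which originalName value belongs to which pic value.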

Answer 1: posted 2017-11-07 15:17:34

Answer 2 (accepted): posted 2017-11-07 15:56:32
