Extracting certain strings from a substring using Python

I have a large document from which I am trying to extract certain data using Python 3. Text similar to the sample below is repeated throughout, and I want to extract the "123456789" and "987654321" each time the "pic=" and "originalName=" strings are found.

"this is some text pic=123456789 some more text originalName="987654321.jpg then some more text"

Can anyone assist?

You can try this:

import re

s = 'this is some text pic=123456789 some more text originalName="987654321.jpg then some more text'
# Raw string (r'...') so the backslashes reach the regex engine unchanged.
data = re.findall(r'(?<=pic\=)\d+|(?<=originalName\=\")\d+', s)
print(data)

Output:

['123456789', '987654321']

You'll want to use Python's library for regular expressions (the re module). Regular expressions are a useful way to search for patterns in text. In this case, the other answer has already provided a working snippet:

import re
s = 'this is some text pic=123456789 some more text originalName="987654321.jpg then some more text'
data = re.findall(r'(?<=pic\=)\d+|(?<=originalName\=\")\d+', s)

This looks like nonsense at first, so here's a breakdown:

re.findall returns a list of all non-overlapping matches of the specified pattern in the specified string.

The first parameter to findall is the regular expression pattern, enclosed by single quotes. A regular expression can be just a word; re.findall('apple', s) would return all instances of the word "apple" in s. However, there are several characters with special meaning to help describe more general patterns.
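As a small illustration of a plain-word pattern (the sample sentence here is made up for the example), note that a literal pattern matches the text wherever it occurs, including inside longer words:

```python
import re

# Made-up sample text, not from the original question.
s = 'an apple, a pineapple, and another apple'

# A pattern with no special characters matches that exact text wherever it
# occurs, so the "apple" inside "pineapple" is found as well.
matches = re.findall('apple', s)
print(matches)  # ['apple', 'apple', 'apple']
```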

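\d matches any single digit 0-9 (in Python 3 string patterns it also matches other Unicode decimal digits unless the re.ASCII flag is set). \d+ matches a sequence of one or more digits.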
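The difference between a single digit and a run of digits can be seen in a short sketch; the variable names single and runs are only illustrative:

```python
import re

s = 'pic=123456789'

single = re.findall(r'\d', s)   # each match is exactly one digit
runs = re.findall(r'\d+', s)    # each match is a run of one or more digits

print(single)  # ['1', '2', '3', '4', '5', '6', '7', '8', '9']
print(runs)    # ['123456789']
```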

The | in the middle separates two regular expressions. If either pattern is matched, the overall expression returns a match.
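A minimal sketch of alternation on its own, using made-up words rather than the question's data:

```python
import re

# Made-up sample text to show that either alternative can produce a match.
s = 'cat dog bird cat'

either = re.findall(r'cat|bird', s)
print(either)  # ['cat', 'bird', 'cat']
```

Matches come back in the order they appear in the string, regardless of which side of the | produced them.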

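(?<= ... ) is called a positive lookbehind. It allows a match only at a position where the text immediately before it matches the pattern described in the ... . The lookbehind text itself is required but is not included in the returned match.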
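To see what the lookbehind changes, this sketch compares a pattern that includes the prefix with one that only requires it. The sample string is invented for the comparison:

```python
import re

# Invented sample: one number follows "id=", the other follows "pic=".
s = 'id=42 and pic=123'

with_prefix = re.findall(r'pic=\d+', s)        # the prefix is part of the match
digits_only = re.findall(r'(?<=pic=)\d+', s)   # the prefix is required but not returned

print(with_prefix)  # ['pic=123']
print(digits_only)  # ['123']
```

Python's re module requires the lookbehind pattern to have a fixed width, which literal strings such as pic= satisfy.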

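Neither = nor " has a special meaning in a regular expression, so the backslashes in \= and \" are optional; with or without them, those characters simply match themselves. The " also needs no escaping for Python here, because the pattern is enclosed in single quotes.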

So '(?<=pic\=)\d+' matches a sequence of digits that is immediately preceded by the string pic=. And '(?<=originalName\=\")\d+' matches a sequence of digits immediately preceded by the string originalName=".

The second parameter to findall is just the string to search for these patterns. So re.findall('(?<=pic\=)\d+|(?<=originalName\=\")\d+', s) will search s and return all sequences of digits with pic= before them, and all sequences of digits with originalName=" before them.
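Since the question mentions that the text repeats throughout a large document, here is a rough sketch of how this might look on several records. The document string and the second record's numbers are made up. The pairing variant is a different technique (capturing groups instead of lookbehinds) and assumes each line contains one pic= followed by one originalName=".

```python
import re

# Hypothetical document: the question's sample record plus a made-up second one.
document = (
    'this is some text pic=123456789 some more text originalName="987654321.jpg then some more text\n'
    'this is some text pic=111 some more text originalName="222.jpg then some more text\n'
)

# Same idea as the pattern above, without the optional backslashes before = and ".
flat = re.findall(r'(?<=pic=)\d+|(?<=originalName=")\d+', document)
print(flat)   # ['123456789', '987654321', '111', '222']

# Alternative, assuming pic= precedes originalName=" on the same line:
# two capturing groups make findall return one (pic, name) tuple per record.
pairs = re.findall(r'pic=(\d+).*?originalName="(\d+)', document)
print(pairs)  # [('123456789', '987654321'), ('111', '222')]
```

The flat list is enough when only the numbers matter; the tuple form keeps each pic value together with its originalName value.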
