I have a large document from which I am trying to extract certain data using Python 3. Text similar to the example below is repeated throughout, and I want to extract "123456789" and "987654321" each time the "pic=" and "originalName=" strings appear.
"this is some text pic=123456789 some more text originalName="987654321.jpg then some more text"
Can anyone assist?
You can try this:
import re
s = 'this is some text pic=123456789 some more text originalName="987654321.jpg then some more text'
# Raw string (r'...') so that \d reaches the regex engine unchanged
data = re.findall(r'(?<=pic\=)\d+|(?<=originalName\=\")\d+', s)
print(data)
Output:
['123456789', '987654321']
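Since the question says the text is repeated throughout a large document, it may help to keep each pic value together with the originalName that follows it. The following is only a sketch under assumptions: the two-line sample text and the names text, pair_pattern and pairs are made up for illustration, and it assumes each pic= is followed by its originalName=" on the same line.

```python
import re

# Hypothetical two-record sample; the second line is invented for illustration.
text = (
    'this is some text pic=123456789 some more text originalName="987654321.jpg then some more text\n'
    'this is some text pic=111 some more text originalName="222.jpg then some more text'
)

# Two capture groups: the digits after pic= and the digits after originalName="
# .*? lazily skips the text in between (it does not cross line breaks by default).
pair_pattern = re.compile(r'pic=(\d+).*?originalName="(\d+)')

# With more than one group, findall returns a list of tuples, one per match.
pairs = pair_pattern.findall(text)
print(pairs)  # [('123456789', '987654321'), ('111', '222')]
```

Compared with the lookbehind version, this returns the two numbers as a pair instead of one flat list, which is easier to work with if the values need to stay associated.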
You'll want to use Python's re module, the standard library for regular expressions. Regular expressions are a useful way to search for patterns in text. In this case, another answer has already provided a working snippet:
import re
s = 'this is some text pic=123456789 some more text originalName="987654321.jpg then some more text'
# Raw string (r'...') so that \d reaches the regex engine unchanged
data = re.findall(r'(?<=pic\=)\d+|(?<=originalName\=\")\d+', s)
This looks like nonsense at first, so here's a breakdown:
re.findall returns a list of all non-overlapping matches of the specified pattern in the specified string.
The first parameter to findall is the regular expression pattern, enclosed in single quotes. A regular expression can be just a word; re.findall('apple', s) would return all instances of the word "apple" in s. However, several characters have special meanings that help describe more general patterns.
\d matches any digit 0-9. \d+ matches a sequence of one or more digits.
The | in the middle separates two regular expressions. If either pattern is matched, the overall expression returns a match.
(?<= ... ) is called a positive lookbehind. It matches only at a position that is immediately preceded by the pattern described in the ..., and the preceding text itself is not included in the returned match.
= and " have no special meaning in a regular expression, so the backslashes in \= and \" are not required. They simply make those characters match literally, and the pattern behaves the same without them.
So r'(?<=pic\=)\d+' matches a sequence of digits that is preceded by the string pic=. And r'(?<=originalName\=\")\d+' matches a sequence of digits preceded by the string originalName=".
The second parameter to findall is just the string to search for these patterns. So re.findall(r'(?<=pic\=)\d+|(?<=originalName\=\")\d+', s) will search s and return all sequences of digits with pic= before them, and all sequences of digits with originalName=" before them.
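To see each piece of the breakdown in isolation, here is a small sketch that runs the building blocks one at a time. The sample string and the variable names are made up for illustration; the extra "page 7" is there only to show which digits the lookbehinds leave out.

```python
import re

# Hypothetical sample with an extra number ("page 7") that should NOT be extracted.
sample = 'page 7 pic=123456789 some text originalName="987654321.jpg more text'

# \d+ alone matches every run of digits, wanted or not.
all_digits = re.findall(r'\d+', sample)

# Each lookbehind keeps only digits that directly follow its prefix.
pic_only = re.findall(r'(?<=pic=)\d+', sample)
name_only = re.findall(r'(?<=originalName=")\d+', sample)

# | combines the two patterns; matches are returned in order of appearance.
combined = re.findall(r'(?<=pic=)\d+|(?<=originalName=")\d+', sample)

# The same pattern with \= and \" escaped gives an identical result.
escaped = re.findall(r'(?<=pic\=)\d+|(?<=originalName\=\")\d+', sample)

print(all_digits)  # ['7', '123456789', '987654321']
print(pic_only)    # ['123456789']
print(name_only)   # ['987654321']
print(combined)    # ['123456789', '987654321']
print(escaped == combined)  # True
```

Note that the prefix inside a lookbehind has to have a fixed length in Python's re module, which is the case for both pic= and originalName=" here.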