I'm trying to extract the number of pages from a multi-page table at a URL
HTML=<span style="float:right">Page 1 of 63,917</span>
Need to extract 63917.
I used
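import re
from bs4 import BeautifulSoup as bsoup  # assumed import behind the bsoup alias
# r is the response object for the table page
soup = bsoup(r.text, "html.parser")
pages = re.findall(r"Page 1 of\s(.+)<\/span>", str(soup))
print(pages)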
But the print(pages) returns a whole lot of HTML right till the end of the body
##'63,917</span></div><table class="table table-striped##
Why doesn't my regex work? And how do I extract only the number from the HTML response?
Your regex does not work because the capturing group (.+) is greedy. As written, .+ matches everything after Page 1 of\s up to the last </span> it can reach. By default . does not match newlines, so that is the last </span> on the same line, which can be the last one in the document if the HTML has no line breaks. You need a non-greedy capture, which stops at the first </span> instead. Add a ? after the +, like this:
Page 1 of\s(.+?)<\/span>
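A minimal sketch of the difference, using only the standard re module. The sample string is a shortened, hypothetical stand-in for the real response (only the first span and the table tag come from the question), and the names sample_html, extract_total_pages and total_pages are illustrative. Since the goal is the number 63917 rather than the text "63,917", the helper matches only digits and commas, then strips the commas before converting to int.

```python
import re

# Hypothetical stand-in for r.text: the span from the question plus a later
# </span> on the same line, which is what the greedy pattern runs into.
sample_html = (
    '<span style="float:right">Page 1 of 63,917</span></div>'
    '<table class="table table-striped"><span>more</span>'
)

# Greedy: .+ expands as far as possible, so it stops at the LAST </span>.
greedy = re.findall(r"Page 1 of\s(.+)</span>", sample_html)

# Non-greedy: .+? expands as little as possible, so it stops at the FIRST </span>.
lazy = re.findall(r"Page 1 of\s(.+?)</span>", sample_html)


def extract_total_pages(html):
    """Return the total page count as an int, or None if no pager text is found."""
    # [\d,]+ can only match digits and commas, so it cannot run into the markup.
    match = re.search(r"Page \d+ of\s([\d,]+)", html)
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))


total_pages = extract_total_pages(sample_html)

print(greedy)       # one long string that continues past the first </span>
print(lazy)         # ['63,917']
print(total_pages)  # 63917
```

Matching on a restricted character class such as [\d,]+ is often more robust than .+? here, because it does not depend on where the closing tag sits.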