I want to parse a specific page with some images, but images are not in a fixed tag a, here are some examples:
<meta name="description" content="This is Text."><meta name="Keywords" content="Weather"><meta property="og:type" content="article"><meta property="og:title" content="Cloud"><meta property="og:description" content="Testing"><meta property="og:url" content="https://weathernews.jp/s/topics/201807/300285/"><meta property="og:image" content="https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_sns_img_A.jpg?1532940869"><meta name="twitter:title" content="【天地始粛】音や景色から感じる秋の気配"><meta name="twitter:description" content="28日からは「天地始粛(てんちはじめてさむし)」。 「粛」にはおさまる、弱まる等の意味があり、夏の暑さもようやく落ち着いてくる頃とされています。"><meta name="twitter:image" content="https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_sns_img_A.jpg?1532940869"><link rel="canonical" href="https://weathernews.jp/s/topics/201807/300285/"><link rel="amphtml" href="https://weathernews.jp/s/topics/201807/300285/amp.html"><script async="async" src="https://www.googletagservices.com/tag/js/gpt.js"></script>
<img style="width:100%" id="box_img1" alt="box1" src="https://smtgvs.weathernews.jp/s/topics/img/dummy.png" class="lazy" data-original="https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_box_img1_A.png?1503665797">`
<img style="width:100%" id="box_img2" alt="box2" src="https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_box_img2_A.jpg?1503378518">
I tried to use code as below to get all images, but no any results, what can I do?
soup.find_all(string=re.compile(r"(https://smtgvs.weathernews.jp/s/topics/img/[0-9]+/.+)\?[0-9]+"))
I personally think this is one of the rare cases when applying a regular expression to the complete document without using an HTML parser is the easiest and a good way to go . And, since you are actually just looking for URLs and not matching any HTML tags in the regular expression, points made in this thread are not valid for this case:
In [1]: data = """
...: <meta name="twitter:image" content="https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_sns_img_A.jpg?1532940869">
...: <img style="width:100%" id="box_img1" alt="box1" src="https://smtgvs.weathernews.jp/s/topics/img/dummy.png" class="lazy" data-original="https:
...: //smtgvs.weathernews.jp/s/topics/img/201807/201807300285_box_img1_A.png?1503665797">`
...: <img style="width:100%" id="box_img2" alt="box2" src="https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_box_img2_A.jpg?1503378518
...: ">
...: """
In [2]: import re
In [3]: pattern = re.compile(r"https://smtgvs.weathernews.jp/s/topics/img/[0-9]+/.+\?[0-9]+")
In [4]: pattern.findall(data)
Out[4]:
['https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_sns_img_A.jpg?1532940869',
'https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_box_img1_A.png?1503665797',
'https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_box_img2_A.jpg?1503378518']
If you are though interested in how would you apply a regular expression pattern to multiple attributes in BeautifulSoup
, it may be something along these lines (not pretty, I know):
In [6]: results = soup.find_all(lambda tag: any(pattern.search(attr) for attr in tag.attrs.values()))
In [7]: [next(attr for attr in tag.attrs.values() if pattern.search(attr)) for tag in results]
Out[7]:
[u'https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_sns_img_A.jpg?1532940869',
u'https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_box_img1_A.png?1503665797',
u'https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_box_img2_A.jpg?1503378518']
Here we are basically iterating over all attributes of all elements and checking for a pattern match. Then, once we get all the matching tags we are iterating over the results and get a value of a matching attribute. I really don't like the fact that we apply the regex check twice - when looking for tags and when checking for a desired attribute of a matched tag.
lxml.html
and it's XPath powers allow working with attributes directly, but lxml
supports XPath 1.0 which does not have regular expression support. You can do smth like:
In [10]: from lxml.html import fromstring
In [11]: root = fromstring(data)
In [12]: root.xpath('.//@*[contains(., "smtgvs.weathernews.jp") and contains(., "?")]')
Out[12]:
['https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_sns_img_A.jpg?1532940869',
'https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_box_img1_A.png?1503665797',
'https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_box_img2_A.jpg?1503378518']
which is not 100% what you did and would probably generate false positives, but you can take it further and add more "substring in a string" checks if needed.
Or, you can grab all the attributes of all elements and filter using the regex you already have:
In [14]: [attr for attr in root.xpath("//@*") if pattern.search(attr)]
Out[14]:
['https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_sns_img_A.jpg?1532940869',
'https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_box_img1_A.png?1503665797',
'https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_box_img2_A.jpg?1503378518']
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.