
How to find all strings by bs4?

I want to parse a specific page for some images, but the images are not all in a fixed tag. Here are some examples:

<meta name="description" content="This is Text."><meta name="Keywords" content="Weather"><meta property="og:type" content="article"><meta property="og:title" content="Cloud"><meta property="og:description" content="Testing"><meta property="og:url" content="https://weathernews.jp/s/topics/201807/300285/"><meta property="og:image" content="https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_sns_img_A.jpg?1532940869"><meta name="twitter:title" content="【天地始粛】音や景色から感じる秋の気配"><meta name="twitter:description" content="28日からは「天地始粛(てんちはじめてさむし)」。 「粛」にはおさまる、弱まる等の意味があり、夏の暑さもようやく落ち着いてくる頃とされています。"><meta name="twitter:image" content="https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_sns_img_A.jpg?1532940869"><link rel="canonical" href="https://weathernews.jp/s/topics/201807/300285/"><link rel="amphtml" href="https://weathernews.jp/s/topics/201807/300285/amp.html"><script async="async" src="https://www.googletagservices.com/tag/js/gpt.js"></script>
<img style="width:100%" id="box_img1" alt="box1" src="https://smtgvs.weathernews.jp/s/topics/img/dummy.png" class="lazy" data-original="https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_box_img1_A.png?1503665797">
<img style="width:100%" id="box_img2" alt="box2" src="https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_box_img2_A.jpg?1503378518">

I tried the code below to get all the images, but got no results at all. What can I do?

soup.find_all(string=re.compile(r"(https://smtgvs.weathernews.jp/s/topics/img/[0-9]+/.+)\?[0-9]+"))

I personally think this is one of the rare cases when applying a regular expression to the complete document, without using an HTML parser, is the easiest and a good way to go. Note that `find_all(string=...)` only searches text nodes, not attribute values, which is why your attempt returned nothing. And, since you are actually just looking for URLs and not matching any HTML tags with the regular expression, the points made in this thread are not valid for this case:

In [1]: data = """
   ...: <meta name="twitter:image" content="https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_sns_img_A.jpg?1532940869">
   ...: <img style="width:100%" id="box_img1" alt="box1" src="https://smtgvs.weathernews.jp/s/topics/img/dummy.png" class="lazy" data-original="https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_box_img1_A.png?1503665797">
   ...: <img style="width:100%" id="box_img2" alt="box2" src="https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_box_img2_A.jpg?1503378518">
   ...: """

In [2]: import re

In [3]: pattern = re.compile(r"https://smtgvs\.weathernews\.jp/s/topics/img/[0-9]+/.+\?[0-9]+")

In [4]: pattern.findall(data)
Out[4]: 
['https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_sns_img_A.jpg?1532940869',
 'https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_box_img1_A.png?1503665797',
 'https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_box_img2_A.jpg?1503378518']

If you are, though, interested in how you would apply a regular expression pattern to multiple attributes in BeautifulSoup, it may be something along these lines (not pretty, I know):

In [6]: results = soup.find_all(lambda tag: any(pattern.search(attr) for attr in tag.attrs.values() if isinstance(attr, str)))

In [7]: [next(attr for attr in tag.attrs.values() if isinstance(attr, str) and pattern.search(attr)) for tag in results]
Out[7]: 
['https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_sns_img_A.jpg?1532940869',
 'https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_box_img1_A.png?1503665797',
 'https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_box_img2_A.jpg?1503378518']

Here we are basically iterating over all attributes of all elements and checking for a pattern match (the `isinstance` guard skips multi-valued attributes such as `class`, which BeautifulSoup exposes as lists). Then, once we have all the matching tags, we iterate over the results and extract the value of the matching attribute. I really don't like the fact that we apply the regex check twice: once when looking for tags and again when pulling the desired attribute out of a matched tag.
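One way to avoid the duplicated check is a single pass over every tag's attributes, collecting the matching values directly. This is a sketch under the same assumptions as above; the sample data and pattern are repeated so the snippet is self-contained:

```python
import re

from bs4 import BeautifulSoup

html = """
<meta name="twitter:image" content="https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_sns_img_A.jpg?1532940869">
<img style="width:100%" id="box_img2" alt="box2" src="https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_box_img2_A.jpg?1503378518">
"""

pattern = re.compile(r"https://smtgvs\.weathernews\.jp/s/topics/img/[0-9]+/.+\?[0-9]+")
soup = BeautifulSoup(html, "html.parser")

# Single pass: walk every tag (find_all(True) matches all of them) and keep
# attribute values that match the pattern. The isinstance() check skips
# multi-valued attributes like "class", which bs4 returns as lists.
urls = [
    attr
    for tag in soup.find_all(True)
    for attr in tag.attrs.values()
    if isinstance(attr, str) and pattern.search(attr)
]
print(urls)
```

Each attribute value is now tested exactly once, at the cost of losing the tag objects themselves, which is fine when all you want is the URLs.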


lxml.html and its XPath powers allow working with attributes directly, but lxml supports XPath 1.0, which does not have regular expression support. You can do something like:

In [10]: from lxml.html import fromstring

In [11]: root = fromstring(data)

In [12]: root.xpath('.//@*[contains(., "smtgvs.weathernews.jp") and contains(., "?")]') 
Out[12]: 
['https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_sns_img_A.jpg?1532940869',
 'https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_box_img1_A.png?1503665797',
 'https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_box_img2_A.jpg?1503378518'] 

which is not 100% what you did and would probably generate false positives, but you can take it further and add more "substring in a string" checks if needed.
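As a side note, lxml does ship the EXSLT extension functions, so a regular expression can be used inside the XPath expression itself via `re:test()`. A sketch, with the sample markup repeated so it runs standalone; the namespace URI is the standard EXSLT one:

```python
from lxml.html import fromstring

html = """
<meta name="twitter:image" content="https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_sns_img_A.jpg?1532940869">
<img style="width:100%" id="box_img2" alt="box2" src="https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_box_img2_A.jpg?1503378518">
"""
root = fromstring(html)

# EXSLT's re:test() brings regex matching into lxml's XPath 1.0 engine;
# the "re" prefix must be mapped to the EXSLT regular-expressions namespace.
ns = {"re": "http://exslt.org/regular-expressions"}
urls = root.xpath(
    r'.//@*[re:test(., "https://smtgvs\.weathernews\.jp/s/topics/img/[0-9]+/.+\?[0-9]+")]',
    namespaces=ns,
)
print(urls)
```

This keeps the precision of the original regex while still selecting the attribute values in a single XPath query.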

Or, you can grab all the attributes of all elements and filter them using the regex you already have:

In [14]: [attr for attr in root.xpath("//@*") if pattern.search(attr)]
Out[14]: 
['https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_sns_img_A.jpg?1532940869',
 'https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_box_img1_A.png?1503665797',
 'https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_box_img2_A.jpg?1503378518']
