Scraping a text string from a website with Beautiful Soup

Question

I would like to scrape a webpage and return only the GTM (Google Tag Manager) container ID (in the example below, GTM-5LS3NZ). The code shouldn't look for the exact container ID but rather for the pattern, as I will use it on multiple sites.

So far I can search the head and print the entire piece of text containing GTM, but I don't know how to combine the find and the regex to return just GTM-5LS3NZ (in this example).

import urllib3
import re
from bs4 import BeautifulSoup

http = urllib3.PoolManager()

response = http.request('GET', "https://www.observepoint.com/")
soup = BeautifulSoup(response.data,"html.parser")

GTM = soup.head.findAll(text=re.compile(r'GTM'))
print(GTM)

Note: The GTM ID can have 6 or 7 alphanumeric characters, so I would expect the regex for the container ID to be something like ^GTM-[A-Z0-9], but I don't know how to specify 6 or 7 characters.

Clarification on what I am after: if you run the code above, you get the following.

["(function (w, d, s, l, i) {\n      w[l] = w[l] || [];\n      w[l].push({\n        'gtm.start': new Date().getTime(),\n        event: 'gtm.js'\n      });\n      var f = d.getElementsByTagName(s)[0],\n        j = d.createElement(s),\n        dl = l != 'dataLayer' ? '&l=' + l : '';\n      j.async = true;\n      j.src =\n        'https://www.googletagmanager.com/gtm.js?id=' + i + dl;\n      f.parentNode.insertBefore(j, f);\n    })(window, document, 'script', 'dataLayer', 'GTM-5LS3NZ');"]

Where all I want is GTM-5LS3NZ.

Answer 1

I have worked it out now, thanks to the help in the comments. This is what I was after:

import urllib3
import re
from bs4 import BeautifulSoup

http = urllib3.PoolManager()

response = http.request('GET', "https://www.observepoint.com/")
soup = BeautifulSoup(response.data,"html.parser")

GTM = soup.head.findAll(text=re.compile(r'GTM'))
print(re.search("GTM-[A-Z0-9]{6,7}",str(GTM))[0])
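The `{6,7}` quantifier is what answers the "6 or 7 characters" part of the question. Below is a minimal sketch of it in isolation, run against a shortened copy of the script text from the question rather than the live page. The names `script_text` and `GTM_PATTERN` are illustrative only. It also shows why the `^` anchor from the question would not match here: the ID sits in the middle of the string.

```python
import re

# Shortened sample of the script text shown in the question (not fetched live).
script_text = "})(window, document, 'script', 'dataLayer', 'GTM-5LS3NZ');"

# {6,7} means "six or seven repetitions" of the preceding character class.
GTM_PATTERN = re.compile(r"GTM-[A-Z0-9]{6,7}")

match = GTM_PATTERN.search(script_text)
# Guard against pages without a GTM snippet instead of indexing None.
container_id = match.group(0) if match else None
print(container_id)

# With ^ the pattern must match at the very start of the string,
# so it finds nothing in this text.
anchored = re.search(r"^GTM-[A-Z0-9]{6,7}", script_text)
print(anchored)
```

Checking for `None` before reading the match avoids a `TypeError` on sites that do not use GTM.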

Answer 2

I did something similar a few days ago, and a quick rewrite gives me:

import urllib3
import re
from bs4 import BeautifulSoup

http = urllib3.PoolManager()

response = http.request('GET', "https://www.observepoint.com/")
soup = BeautifulSoup(response.data,"html.parser")

pattern = re.compile(r'GTM-([a-zA-Z0-9]{6,7})')
found = soup.head.find(text=pattern)
if found:
    match = pattern.search(found)
    if match:
        # group(0) is the whole match, including the "GTM-" prefix
        print(match.group(0))

This gives me GTM-5LS3NZ as output.
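A side note on the parentheses in that pattern: they form a capture group, so the match object exposes both the full ID and the part after the prefix. Here is a small sketch of the difference, using only the standard library and a shortened copy of the script text from the question. The variable names are illustrative.

```python
import re

# Shortened sample of the script text shown in the question (not fetched live).
script_text = "})(window, document, 'script', 'dataLayer', 'GTM-5LS3NZ');"

# The parentheses capture only the characters after "GTM-".
pattern = re.compile(r"GTM-([a-zA-Z0-9]{6,7})")

match = pattern.search(script_text)
if match:
    full_id = match.group(0)  # whole match: prefix plus suffix
    suffix = match.group(1)   # captured group: suffix only
    print(full_id, suffix)
```

Which one to print depends on whether the `GTM-` prefix is wanted in the result.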

Answer 3

You could also extract the ID from the appropriate HTML comment:

import requests
from bs4 import BeautifulSoup, Comment

r = requests.get('https://www.observepoint.com/')
soup = BeautifulSoup(r.content, 'lxml')
for comment in soup.findAll(text=lambda text:isinstance(text, Comment)):
    if 'iframe' in comment:
        soup = BeautifulSoup(comment, 'lxml')
        id = soup.select_one('iframe')['src'].split('=')[1]
        print(id)
        break
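Splitting on `=` works as long as `id` is the only query parameter. As a hedged alternative, the standard library can parse the query string itself. The sketch below assumes the iframe `src` looks like the usual GTM noscript URL. The sample URLs, the extra `gtm_auth=abc` parameter, and the helper name `container_id_from_src` are assumptions for illustration, not values taken from the page.

```python
from urllib.parse import urlsplit, parse_qs

# Assumed example of the noscript iframe URL; not fetched from the live site.
iframe_src = "https://www.googletagmanager.com/ns.html?id=GTM-5LS3NZ"

def container_id_from_src(src):
    """Return the value of the `id` query parameter, or None if absent."""
    query = parse_qs(urlsplit(src).query)
    values = query.get("id")
    return values[0] if values else None

print(container_id_from_src(iframe_src))

# A hypothetical URL with a second parameter: split('=')[1] would keep the
# trailing "&gtm_auth", while parse_qs isolates the id value.
with_extra = iframe_src + "&gtm_auth=abc"
print(container_id_from_src(with_extra))
```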

3 answers

Answer 1: 1, accepted, 2019-04-23 14:44:58
Answer 2: 1, 2019-04-23 14:47:54
Answer 3: 0, 2019-04-23 15:57:26
