Scraping a specific GTAG value from a website

I am trying to scrape websites and return their GTM container IDs. I found a solution, but it only works for one specific website.

It works for https://www.observepoint.com/:

import urllib3
import re
from bs4 import BeautifulSoup
http = urllib3.PoolManager()
response = http.request('GET', "https://www.observepoint.com/")
soup = BeautifulSoup(response.data,"html.parser")
GTM = soup.head.findAll(text=re.compile(r'GTM'))
print(re.search("GTM-[A-Z0-9]{6,7}",str(GTM))[0])

But when I try it on another website for example https://www.dccomics.com/characters/superman%26sa%3DU%26ved%3D2ahUKEwi55uyMxfHxAhXMp5UCHTkMBekQFjAzegQIARAB%26usg%3DAOvVaw2PgfF7ZT6S6UeZpFImsXDC%2Cdccomics

it doesn't work (re.search returns None), even though the GTM ID is still present and sits in the same or a similar iframe tag as on the previous website.

GTM value for the website where the script works:

GTM value for the website where the script doesn't work:

import requests
import re

urls = [
    "https://www.observepoint.com/",
    "https://www.dccomics.com/characters/superman%26sa%3DU%26ved%3D2ahUKEwi55uyMxfHxAhXMp5UCHTkMBekQFjAzegQIARAB%26usg%3DAOvVaw2PgfF7ZT6S6UeZpFImsXDC%2Cdccomics",
]


def main(urls):
    # Search the raw HTML of each page instead of only the text nodes in <head>.
    for url in urls:
        r = requests.get(url)
        match = re.findall(r"GTM-[A-Z0-9]{6,7}", r.text)
        if match:
            # A page can repeat the same ID several times, so deduplicate.
            print(set(match))


main(urls)

Output:

{'GTM-5LS3NZ'}
{'GTM-538C4X'}
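
A likely explanation, which I have not verified against the live pages, is that the original script only looks at text nodes inside <head>, while some pages carry the container ID elsewhere, for example in the src attribute of a <noscript> iframe in <body>. Running the regex over the whole HTML string avoids that. Here is a minimal offline sketch of the same idea; the helper name extract_gtm_ids, the sample markup, and the ID GTM-ABC123 are made up for illustration.

```python
import re

# GTM container IDs: "GTM-" followed by uppercase letters and digits.
# The 6-7 character length is taken from the pattern used above.
GTM_PATTERN = re.compile(r"GTM-[A-Z0-9]{6,7}")


def extract_gtm_ids(html):
    """Return the distinct GTM container IDs found anywhere in html, sorted."""
    return sorted(set(GTM_PATTERN.findall(html)))


# Hypothetical page: the ID appears only in a <noscript> iframe inside <body>,
# so a search limited to text nodes in <head> would find nothing.
SAMPLE_HTML = """
<html>
  <head><title>Example</title></head>
  <body>
    <noscript>
      <iframe src="https://www.googletagmanager.com/ns.html?id=GTM-ABC123"></iframe>
    </noscript>
  </body>
</html>
"""

print(extract_gtm_ids(SAMPLE_HTML))
print(extract_gtm_ids("<html><head></head><body></body></html>"))
```

The first call prints ['GTM-ABC123'] and the second prints an empty list, so a page without a container ID gives an empty result rather than an error from indexing None.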


 