Python / Beautiful Soup - meta content not matching source
I'm trying to grab metadata content from a website. Here is the code:
import requests
from bs4 import BeautifulSoup
url = "https://discord.com/invite/midjourney"
result = requests.get(url=url)
soup = BeautifulSoup(result.content, 'html5lib')
target = soup.find("meta", property="og:description")
print(target)
This returns:
<meta content="Discord is the easiest way to communicate over voice, video, and text. Chat, hang out, and stay close with your friends and communities." property="og:description"/>
However, in the page source the content is different and includes the number of members, which is what I'm after:
<meta property="og:description" content="The official server for Midjourney, a text-to-image AI where your imagination is the only limit. | 2,472,611 members" />
Is some script dynamically changing the meta content? Any ideas on how to get past the meta tag to the actual data?
Try sending a browser-like User-Agent header with the request:
import requests
from bs4 import BeautifulSoup
url = "https://discord.com/invite/midjourney"
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0'}
session = requests.Session()
r = session.get(url, timeout=30, headers=headers)  # print(r.status_code)
soup = BeautifulSoup(r.content, 'html.parser')
#1. extract all meta tags from the page, return list of tags
print(soup.select('meta'))
[<meta charset="utf-8"/>,
<meta content="width=device-width, initial-scale=1.0, maximum-scale=3.0" name="viewport"/>,
<meta content="The official server for Midjourney, a text-to-image AI where your imagination is the only limit. | 2,473,729 members" name="description"/>,
<meta content="summary_large_image" name="twitter:card"/>,
<meta content="@discord" name="twitter:site"/>,
<meta content="Join the Midjourney Discord Server!" name="twitter:title"/>,
<meta content="The official server for Midjourney, a text-to-image AI where your imagination is the only limit. | 2,473,729 members" name="twitter:description"/>,
<meta content="Join the Midjourney Discord Server!" property="og:title"/>,
<meta content="https://discord.com/invite/midjourney" property="og:url"/>,
<meta content="The official server for Midjourney, a text-to-image AI where your imagination is the only limit. | 2,473,729 members" property="og:description"/>,
<meta content="Discord" property="og:site_name"/>,
<meta content="https://cdn.discordapp.com/splashes/662267976984297473/4798759e115d2500fef16347d578729a.jpg?size=512" property="og:image"/>,
<meta content="image/jpeg" property="og:image:type"/>,
<meta content="512" property="og:image:width"/>,
<meta content="512" property="og:image:height"/>,
<meta content="https://cdn.discordapp.com/splashes/662267976984297473/4798759e115d2500fef16347d578729a.jpg?size=512" name="twitter:image"/>]
#2. extract all content of the meta tags, return list of text
content_only = [i.get('content') for i in soup.select('meta') if i.get('content')]
print(content_only)
['width=device-width, initial-scale=1.0, maximum-scale=3.0',
'The official server for Midjourney, a text-to-image AI where your imagination is the only limit. | 2,473,729 members',
'summary_large_image',
'@discord',
'Join the Midjourney Discord Server!',
'The official server for Midjourney, a text-to-image AI where your imagination is the only limit. | 2,473,729 members',
'Join the Midjourney Discord Server!',
'https://discord.com/invite/midjourney',
'The official server for Midjourney, a text-to-image AI where your imagination is the only limit. | 2,473,729 members',
'Discord',
'https://cdn.discordapp.com/splashes/662267976984297473/4798759e115d2500fef16347d578729a.jpg?size=512',
'image/jpeg',
'512',
'512',
'https://cdn.discordapp.com/splashes/662267976984297473/4798759e115d2500fef16347d578729a.jpg?size=512']
#3. extract the members data that you need
members_content_only = list(set([i.get('content') for i in soup.select('meta') if i.get('content') and 'members' in i.get('content')]))
print(members_content_only)
['The official server for Midjourney, a text-to-image AI where your imagination is the only limit. | 2,473,729 members']
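Once the description string is in hand, the count itself can be pulled out with a regular expression. This is a minimal sketch, assuming the description keeps the "... | 2,473,729 members" shape shown above; `parse_member_count` is a hypothetical helper name, not part of requests or Beautiful Soup.

```python
import re


def parse_member_count(description):
    """Return the member count from a description string as an int, or None if absent."""
    # Assumption: the count is a comma-grouped number directly before the word "members".
    match = re.search(r"(\d[\d,]*)\s+members\b", description, flags=re.IGNORECASE)
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))


description = (
    "The official server for Midjourney, a text-to-image AI where your "
    "imagination is the only limit. | 2,473,729 members"
)
print(parse_member_count(description))  # 2473729
```

Returning None rather than raising makes it easy to notice when the generic Discord description came back instead of the server-specific one.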
There is indeed JavaScript underneath. I found a different method to extract the count using Selenium and bs4:
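from bs4 import BeautifulSoup as bs
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options as FirefoxOptions
from selenium.webdriver.support.ui import WebDriverWait
options = FirefoxOptions()
options.add_argument("--headless")
driver = webdriver.Firefox(options=options)
url = "https://discord.com/invite/midjourney"
driver.get(url)
# WebDriverWait only blocks when .until() is called; wait for the rendered text to mention "Members"
WebDriverWait(driver, 15).until(lambda d: "Members" in d.find_element(By.TAG_NAME, "body").text)
page = driver.page_source
driver.quit()  # close the browser once the HTML has been captured
html = bs(page, 'html.parser')
# drop <script> and <style> elements so only visible text remains
for script in html(["script", "style"]):
    script.extract()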
text = html.get_text()
lines = (line.strip() for line in text.splitlines())
text = '\n'.join(line for line in lines if line)
final_string = text.replace(",","")
start = final_string.find("Online")+6
end = final_string.find("Members")-1
subs = final_string[start:end]
subs_final = int(subs)
print(subs_final)
Output:
2496142
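The fixed offsets around "Online" and "Members" break as soon as the page wording shifts. A slightly more defensive sketch follows, assuming the rendered text still shows the count immediately before the word "Members". The `sample_text` below is a made-up stand-in for the result of `html.get_text()`, not real page output, and `members_from_text` is a hypothetical helper.

```python
import re


def members_from_text(text):
    """Return the number that precedes the word 'Members' in rendered page text."""
    # Assumption: the count appears as digits (optionally comma-grouped) right before "Members".
    match = re.search(r"(\d[\d,]*)\s*Members", text)
    if match is None:
        raise ValueError("no member count found in page text")
    return int(match.group(1).replace(",", ""))


# Made-up stand-in for the text extracted from the rendered page.
sample_text = "Midjourney\n1,234 Online\n2,496,142 Members\nAccept Invite"
print(members_from_text(sample_text))  # 2496142
```

Raising on a missing match surfaces a layout change immediately instead of silently slicing the wrong substring.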
This was a roundabout way to get what I wanted. Let me know if there are more efficient ways to do this.