Python / Beautiful Soup - meta content not matching source

I'm trying to grab metadata content from a website. Here is the code:

import requests
from bs4 import BeautifulSoup

url = "https://discord.com/invite/midjourney"
result = requests.get(url=url)
soup = BeautifulSoup(result.content, 'html5lib')

target = soup.find("meta", property="og:description")
print(target)

This returns:

<meta content="Discord is the easiest way to communicate over voice, video, and text.  Chat, hang out, and stay close with your friends and communities." property="og:description"/>

However, in the page source the content is different and includes the number of members, which is what I'm after.

<meta property="og:description" content="The official server for Midjourney, a text-to-image AI where your imagination is the only limit. | 2,472,611 members" />

Is some script dynamically changing the meta content? Any ideas on how to get past the meta tag to the actual data?

Try sending a browser-like User-Agent header with the request:

import requests
from bs4 import BeautifulSoup

url = "https://discord.com/invite/midjourney"

# Send a browser-like User-Agent (the request in the question used the requests default)
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0'}
session = requests.Session()
r = session.get(url, timeout=30, headers=headers)     # print(r.status_code)
soup = BeautifulSoup(r.content, 'html.parser')

#1. extract all meta tags from the page, return list of tags
print(soup.select('meta'))

[<meta charset="utf-8"/>,
 <meta content="width=device-width, initial-scale=1.0, maximum-scale=3.0" name="viewport"/>,
 <meta content="The official server for Midjourney, a text-to-image AI where your imagination is the only limit. | 2,473,729 members" name="description"/>,
 <meta content="summary_large_image" name="twitter:card"/>,
 <meta content="@discord" name="twitter:site"/>,
 <meta content="Join the Midjourney Discord Server!" name="twitter:title"/>,
 <meta content="The official server for Midjourney, a text-to-image AI where your imagination is the only limit. | 2,473,729 members" name="twitter:description"/>,
 <meta content="Join the Midjourney Discord Server!" property="og:title"/>,
 <meta content="https://discord.com/invite/midjourney" property="og:url"/>,
 <meta content="The official server for Midjourney, a text-to-image AI where your imagination is the only limit. | 2,473,729 members" property="og:description"/>,
 <meta content="Discord" property="og:site_name"/>,
 <meta content="https://cdn.discordapp.com/splashes/662267976984297473/4798759e115d2500fef16347d578729a.jpg?size=512" property="og:image"/>,
 <meta content="image/jpeg" property="og:image:type"/>,
 <meta content="512" property="og:image:width"/>,
 <meta content="512" property="og:image:height"/>,
 <meta content="https://cdn.discordapp.com/splashes/662267976984297473/4798759e115d2500fef16347d578729a.jpg?size=512" name="twitter:image"/>]

#2. extract all content of the meta tags, return list of text
content_only = [i.get('content') for i in soup.select('meta') if i.get('content')]

print(content_only)

['width=device-width, initial-scale=1.0, maximum-scale=3.0',
 'The official server for Midjourney, a text-to-image AI where your imagination is the only limit. | 2,473,729 members',
 'summary_large_image',
 '@discord',
 'Join the Midjourney Discord Server!',
 'The official server for Midjourney, a text-to-image AI where your imagination is the only limit. | 2,473,729 members',
 'Join the Midjourney Discord Server!',
 'https://discord.com/invite/midjourney',
 'The official server for Midjourney, a text-to-image AI where your imagination is the only limit. | 2,473,729 members',
 'Discord',
 'https://cdn.discordapp.com/splashes/662267976984297473/4798759e115d2500fef16347d578729a.jpg?size=512',
 'image/jpeg',
 '512',
 '512',
 'https://cdn.discordapp.com/splashes/662267976984297473/4798759e115d2500fef16347d578729a.jpg?size=512']

#3. extract the members data that you need
members_content_only = list(set([i.get('content') for i in soup.select('meta') if i.get('content') and 'members' in i.get('content')]))

print(members_content_only)

['The official server for Midjourney, a text-to-image AI where your imagination is the only limit. | 2,473,729 members']
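Since the goal is the member count itself, the matching description string can be reduced to an integer. The following is a minimal sketch using only the standard library; `parse_member_count` is a hypothetical helper name, and it assumes the description keeps the "| 2,473,729 members" shape shown above.

```python
import re

# Hypothetical helper (not part of the original answer). Assumes the count
# appears as comma-grouped digits followed by the word "members".
MEMBERS_RE = re.compile(r"(\d[\d,]*)\s+members\b", re.IGNORECASE)


def parse_member_count(description):
    """Return the member count as an int, or None if no count is present."""
    match = MEMBERS_RE.search(description)
    if match is None:
        return None
    # Drop the thousands separators before converting.
    return int(match.group(1).replace(",", ""))


description = (
    "The official server for Midjourney, a text-to-image AI where your "
    "imagination is the only limit. | 2,473,729 members"
)
print(parse_member_count(description))  # 2473729
```

Returning `None` instead of raising makes it easy to notice when the generic Discord description came back instead of the server-specific one.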


There is indeed JavaScript underneath. I found a different way to extract this using Selenium and bs4.

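from bs4 import BeautifulSoup as bs
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options as FirefoxOptions
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

options = FirefoxOptions()
options.add_argument("--headless")
driver = webdriver.Firefox(options=options)

url = "https://discord.com/invite/midjourney"
driver.get(url)

# WebDriverWait(driver, 15) on its own does not wait; it needs .until() with a condition.
# Here: wait up to 15 seconds for the JS-rendered text "Members" to appear in the body.
WebDriverWait(driver, 15).until(
    EC.text_to_be_present_in_element((By.TAG_NAME, "body"), "Members")
)

page = driver.page_source
driver.quit()  # close the headless browser once the HTML is captured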
html = bs(page, 'html.parser') #print(html)

for script in html(["script", "style"]):
    script.extract()
text = html.get_text() 

lines = (line.strip() for line in text.splitlines())
text = '\n'.join(line for line in lines if line)

final_string = text.replace(",","")
start = final_string.find("Online")+6
end = final_string.find("Members")-1
subs = final_string[start:end]
subs_final = int(subs)
print(subs_final)

Output:

2496142

This was a roundabout way to get what I wanted. Let me know if there are more efficient ways to do this.
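One possibly tidier variant of the string slicing above is a single regular expression over the extracted text. This is only a sketch: `members_from_text` is a hypothetical name, the sample text is made up for illustration, and it assumes the rendered page still shows the count between the words "Online" and "Members", as the slicing implies.

```python
import re

# Assumption: the rendered text contains "... Online <count> Members",
# with optional whitespace or newlines around the count.
COUNT_RE = re.compile(r"Online\s*(\d[\d,]*)\s*Members")


def members_from_text(text):
    """Return the count found between "Online" and "Members", or None."""
    match = COUNT_RE.search(text)
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))


# Made-up sample resembling the stripped page text; not real page output.
sample = "Midjourney\n1,234 Online\n2,496,142 Members\nAccept Invite"
print(members_from_text(sample))  # 2496142
```

Unlike the `find`-based slicing, this returns `None` rather than raising `ValueError` when the page layout changes and the markers are missing.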
