Why doesn't the find_all() function from beautifulsoup4 grab all <h3> tags?

import requests
import pprint as pp
from bs4 import BeautifulSoup as soup
headers = {
    'User-Agent': 'some_name',
    'From': 'some_email'
}
URL = 'https://www.reddit.com/r/wallstreetbets/'
page = requests.get(URL, headers=headers)
page_html = page.content

page_soup = soup(page_html, "html.parser")

print(page_soup.find_all('h3'))


print(page.status_code)
page.close()

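This is my first time using beautifulsoup, and I'm trying to learn how to use it. For some reason, when I try to grab the <h3> tags, it only grabs the first 8 and then stops. I don't understand how to make it grab every tag. I've tried specifying the class, but that didn't fix the problem.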

To get all the links, you can use the old version of Reddit (old.reddit.com).

For example:

import requests
from bs4 import BeautifulSoup as soup


URL = 'https://old.reddit.com/r/wallstreetbets/'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0',}
page_soup = soup(requests.get(URL, headers = headers).content, "html.parser")

for p in page_soup.select('p.title'):
    print(p.get_text(strip=True, separator=' '))

Prints:

What Are Your Moves Tomorrow, June 15, 2020 Daily Discussion ( self.wallstreetbets )
They are getting ready for Monday. Meme ( v.redd.it )
Chill Session incoming this week Meme ( v.redd.it )
Just a bull huntin for some calls Meme ( v.redd.it )
this does not feel bullish Meme ( i.imgur.com )
I'm from the past. Here's what's going to happen. Discussion ( self.wallstreetbets )
Bulls tread lightly we're in for a gong show Discussion ( self.wallstreetbets )
I've been workin' on this meme for a while...It's about Friendship Meme ( v.redd.it )
I've got a great idea to fix my portfolio ( sound on ) OC Meme ( v.redd.it )
Welcome to the Kang Gang OC Meme ( i.redd.it )
DDDD - Retail Investors, Bankruptcies, Dark Pools and Beauty Contests OC DD ( self.wallstreetbets )
We made WSJ lol Discussion ( wsj.com )
The Great Gay Bear Trade Fundamentals ( self.wallstreetbets )
US Important news this week (est) Discussion ( self.wallstreetbets )
How George Floyd Cured COVID (and why we're never locking down again) DD ( self.wallstreetbets )
The Kang Gang Manifesto - A 2-month journey from $120k to $210k Gain ( self.wallstreetbets )
The unofficial wallstreetbets alignment chart Meme ( i.redd.it )
Bigly expirations this Friday, watch out Discussion ( self.wallstreetbets )
Amazon Set to Face Antitrust Charges in European Union Stocks ( nytimes.com )
The Convergence of Retardation and Philanthropy......Autists United, Inc. DD ( self.wallstreetbets )
Ending the Kangaroo Market (Sound On) Meme ( v.redd.it )
Hey Dontsweatit32 - hold my beer and take a ban Options ( i.redd.it )
Hewooo Retards, Carebear here warning you about the incoming Monday's rug pull. DD ( self.wallstreetbets )
DGLY Sympathy Plays Discussion ( self.wallstreetbets )
Is Apple going going to another new All Time High??? Discussion ( self.wallstreetbets )
I'm all in on spce YOLO ( self.wallstreetbets )
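
To see what the `p.title` selector and `get_text(strip=True, separator=' ')` do without touching the network, here is a minimal sketch run against a hand-written snippet. The markup is an assumption that only imitates the shape of an old.reddit.com listing; it is not copied from the real page, and the class names other than `title` are illustrative.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that only imitates an old.reddit.com listing.
SAMPLE_HTML = """
<div class="thing">
  <p class="title"><a class="title" href="/a">First post</a>
    <span class="linkflairlabel">Meme</span>
    <span class="domain">(<a href="/domain/v.redd.it/">v.redd.it</a>)</span></p>
  <p class="tagline">submitted by someone</p>
</div>
<div class="thing">
  <p class="title"><a class="title" href="/b">Second post</a>
    <span class="domain">(<a href="/r/test/">self.test</a>)</span></p>
</div>
"""

sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# select() takes a CSS selector: every <p> whose class list contains "title".
# get_text(strip=True, separator=" ") strips each text fragment, drops the
# empty ones, and joins the rest with a single space.
titles = [
    p.get_text(strip=True, separator=" ")
    for p in sample_soup.select("p.title")
]

for title in titles:
    print(title)
```

Because every nested fragment is joined with the separator, the flair and the domain end up on the same line as the title, which is why the output above reads like `... Meme ( v.redd.it )`.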

EDIT: If you want to use the new version, you can try this example (it needs to parse the JavaScript embedded in the page with the re / json modules):

import re
import json
import requests
from bs4 import BeautifulSoup as soup


URL = 'https://www.reddit.com/r/wallstreetbets/'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0',}
page_soup = soup(requests.get(URL, headers = headers).content, "html.parser")

txt = page_soup.select_one('script#data').contents[0]

data = json.loads(re.search(r'window\.___r = (.*?});', txt).group(1))

# uncomment this to print all data:
# print(json.dumps(data, indent=4))

for v in data['posts']['models'].values():
    print(v['title'])

Prints:

What Are Your Moves Tomorrow, June 15, 2020
They are getting ready for Monday.
Chill Session incoming this week
Just a bull huntin for some calls
this does not feel bullish
I'm from the past. Here's what's going to happen.
Bulls tread lightly we're in for a gong show
I've been workin' on this meme for a while...It's about Friendship
DDDD - Retail Investors, Bankruptcies, Dark Pools and Beauty Contests
I've got a great idea to fix my portfolio ( sound on )
Welcome to the Kang Gang
We made WSJ lol
The Great Gay Bear Trade
US Important news this week (est)
How George Floyd Cured COVID (and why we're never locking down again)
The Kang Gang Manifesto - A 2-month journey from $120k to $210k
The unofficial wallstreetbets alignment chart
Bigly expirations this Friday, watch out
Amazon Set to Face Antitrust Charges in European Union
The Convergence of Retardation and Philanthropy......Autists United, Inc.
Ending the Kangaroo Market (Sound On)
Hey Dontsweatit32 - hold my beer and take a ban
Hewooo Retards, Carebear here warning you about the incoming Monday's rug pull.
We did it again. The second wave is coming soon and I am all in with PUTs in everything!
I'm all in on spce
DGLY Sympathy Plays
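
The regex-plus-JSON step can be tried in isolation on a small string. This is only a sketch: `SCRIPT_TEXT` is a made-up stand-in for the contents of the `script#data` tag, and the real payload is much larger and may no longer have this layout.

```python
import json
import re

# Hypothetical script body; the real page's payload is far larger and its
# layout may have changed since the answer was written.
SCRIPT_TEXT = (
    'window.___r = {"posts": {"models": {'
    '"t3_a": {"title": "First post"}, '
    '"t3_b": {"title": "Second post"}}}}; window.___prefetches = [];'
)

# Lazily capture the text after the assignment, up to the first "};".
match = re.search(r"window\.___r = (.*?});", SCRIPT_TEXT)
if match is None:
    raise ValueError("embedded data not found; the page layout may differ")

# The captured group is plain JSON, so json.loads can parse it directly.
data = json.loads(match.group(1))
titles = [post["title"] for post in data["posts"]["models"].values()]
print(titles)
```

Note that the lazy match stops at the first `};` it meets, so a payload containing that sequence inside a string value would be cut short and `json.loads` would fail. Checking `match` for `None` before calling `.group(1)` also gives a clearer error if the script tag changes.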

I couldn't find an error in your code, but the following did work for me:

import requests
from bs4 import BeautifulSoup

url = "https://www.reddit.com/r/wallstreetbets/"
headers = {"User-Agent": "wswp"}

with requests.Session() as session:
    response = session.get(url, headers=headers)
    content = response.content

soup = BeautifulSoup(content, "html.parser")
titles = soup.find_all("h3")
for h3 in titles:
    print(h3.text)

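What Are Your Moves Tomorrow, June 15, 2020
They are getting ready for Monday.
Chill Session incoming this week
Just a bull huntin for some calls
this does not feel bullish
I'm from the past. Here's what's going to happen.
Bulls tread lightly we're in for a gong show
I've got a great idea to fix my portfolio ( sound on )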

I would only suggest changing the User-Agent, because Reddit blocks User-Agents that send requests repeatedly.
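
One way to follow that advice is to set the header once on a `requests.Session`, so that every request made through it carries the same identifier. The sketch below builds the request without sending it, so the final headers can be inspected offline. The `USER_AGENT` string is a made-up placeholder, and whether Reddit accepts any particular value depends on its current policy.

```python
import requests

URL = "https://www.reddit.com/r/wallstreetbets/"

# Hypothetical identifier; replace it with something that describes your script.
USER_AGENT = "my-learning-scraper/0.1 (contact: you@example.com)"

session = requests.Session()
# Headers set on the session are merged into every request it prepares.
session.headers.update({"User-Agent": USER_AGENT})

# Build the request without sending it, to inspect the final headers offline.
prepared = session.prepare_request(requests.Request("GET", URL))
print(prepared.headers["User-Agent"])

# To actually fetch the page:
# response = session.send(prepared, timeout=10)
# print(response.status_code)
```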

