
Getting links from a web page with BeautifulSoup and scrolling to load more

I'm trying to get links to articles from https://finance.yahoo.com/topic/stock-market-news. I run the following code using Python 3:

import requests
from bs4 import BeautifulSoup

url = "https://finance.yahoo.com/topic/stock-market-news"
r1 = requests.get(url)
page = r1.content
soup = BeautifulSoup(page, 'html5lib')
#print(soup.prettify())
href = soup.find_all('a')
boxes = []
links = []
for ref in href:
    # keep only anchors whose parent element contains a <u> tag
    curr = ref.parent.find('u')
    if curr is not None:
        boxes.append(ref)
        links.append(ref['href'])
print(boxes)
print(links)

While I do manage to get the links, some of them look odd:

/news/stock-market-news-live-july-30-2020-221505732.html
/m/f39537a4-425d-3378-9ef7-e7188a513ca6/stock-index-futures-lower.html
/m/6c87eec2-e5a1-3bc3-916e-4f74b3c508bf/global-stocks-slump-as-u-s-.html
https://finance.yahoo.com/news/q2-gdp-us-economy-coronavirus-pandemic-consumer-171558880.html
https://finance.yahoo.com/video/influencers-andy-serwer-bill-gates-110000273.html
https://finance.yahoo.com/news/jobless-claims-week-ending-july-25-123150219.html

Why is this happening, and how can I access those links?

A related sub-question: the site has many more links than I am finding. I think this is because the site loads more articles as you scroll down. How can I get around that and load a set number of additional articles, for example 10 more?

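The links that start with / are relative to the site root, so prepend the domain when it is missing. Store the href in a variable named link and append it with this line:

links.append(link if link.startswith("https://finance.yahoo.com") else f"https://finance.yahoo.com{link}")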

from bs4 import BeautifulSoup
import requests

url = "https://finance.yahoo.com/topic/stock-market-news"
r1 = requests.get(url)
page = r1.content
soup = BeautifulSoup(page, 'html5lib')
#print(soup.prettify())
href = soup.find_all('a')
boxes = []
links = []
for ref in href:
    # keep only anchors whose parent element contains a <u> tag
    curr = ref.parent.find('u')
    if curr is not None:
        boxes.append(ref)
        link = ref['href']
        # relative hrefs (starting with "/") get the domain prepended
        links.append(link if link.startswith("https://finance.yahoo.com") else f"https://finance.yahoo.com{link}")
print(boxes)
print("___" * 10)
print(links)

Output:

[<a class="Fw(b) Fz(18px) Lh(23px) LineClamp(2,46px) Fz(17px)--sm1024 Lh(19px)--sm1024 LineClamp(2,38px)--sm1024 mega-item-header-link Td(n) C(#0078ff):h C(#000) LineClamp(2,46px) LineClamp(2,38px)--sm1024 not-isInStreamVideoEnabled" data-reactid="11" href="/m/d79af817-5b40-3545-a085-322c5d27628e/dow-futures-slump-as-q2-gdp.html" target="_self"><u class="StretchedBox" data-reactid="12"></u><!-- react-text: 13 -->Dow Futures Slump As Q2 GDP Plunges Most On Record, Weekly Jobless Claims Rise; Trump Raises Election Delay Prospect<!-- /react-text --></a>, <a class="Fw(b) Fz(18px) Lh(23px) LineClamp(2,46px) Fz(17px)--sm1024 Lh(19px)--sm1024 LineClamp(2,38px)--sm1024 mega-item-header-link Td(n) C(#0078ff):h C(#000) LineClamp(2,46px) LineClamp(2,38px)--sm1024 not-isInStreamVideoEnabled" data-reactid="28" href="/m/8f0877fd-0c34-306c-964d-2c9dd2aebd3c/ups-stock-is-jumping-after.html" target="_self"><u class="StretchedBox" data-reactid="29"></u><!-- react-text: 30 -->UPS Stock Is Jumping After the Company Delivered Smashing Earnings<!-- /react-text --></a>, <a class="Fw(b) Fz(18px) Lh(23px) LineClamp(2,46px) Fz(17px)--sm1024 Lh(19px)--sm1024 LineClamp(2,38px)--sm1024 mega-item-header-link Td(n) C(#0078ff):h C(#000) LineClamp(2,46px) LineClamp(2,38px)--sm1024 not-isInStreamVideoEnabled" data-reactid="48" href="/news/futures-sink-data-shows-historic-125417167.html"><u class="StretchedBox" data-reactid="49"></u><!-- react-text: 50 -->Futures sink as data shows historic slump<!-- /react-text --></a>, <a class="Fz(13px) LineClamp(4,96px) C(#0078ff):h Td(n) C($c-fuji-blue-4-b) smartphone_C(#000) smartphone_Fz(19px)" data-reactid="11" href="https://finance.yahoo.com/news/q2-gdp-us-economy-coronavirus-pandemic-consumer-171558880.html"><span class="Fw(600) smartphone_Fw(500)" data-reactid="12">Q2 GDP: US economy contracted by worst-ever 32.9% in Q2, crushed by coronavirus lockdowns</span><u class="StretchedBox Z(1)" data-reactid="13"></u></a>, <a class="Fz(13px) LineClamp(4,96px) 
C(#0078ff):h Td(n) C($c-fuji-blue-4-b) smartphone_C(#000) smartphone_Fz(19px)" data-reactid="26" href="https://finance.yahoo.com/video/influencers-andy-serwer-bill-gates-110000273.html"><span class="Fw(600) smartphone_Fw(500)" data-reactid="27">Influencers with Andy Serwer: Bill Gates</span><u class="StretchedBox Z(1)" data-reactid="28"></u></a>, <a class="Fz(13px) LineClamp(4,96px) C(#0078ff):h Td(n) C($c-fuji-blue-4-b) smartphone_C(#000) smartphone_Fz(19px)" data-reactid="38" href="https://finance.yahoo.com/news/jobless-claims-week-ending-july-25-123150219.html"><span class="Fw(600) smartphone_Fw(500)" data-reactid="39">Jobless claims top 1M again in latest week as coronavirus keeps battering workers</span><u class="StretchedBox Z(1)" data-reactid="40"></u></a>]
______________________________
['https://finance.yahoo.com/m/d79af817-5b40-3545-a085-322c5d27628e/dow-futures-slump-as-q2-gdp.html', 'https://finance.yahoo.com/m/8f0877fd-0c34-306c-964d-2c9dd2aebd3c/ups-stock-is-jumping-after.html', 'https://finance.yahoo.com/news/futures-sink-data-shows-historic-125417167.html', 'https://finance.yahoo.com/news/q2-gdp-us-economy-coronavirus-pandemic-consumer-171558880.html', 'https://finance.yahoo.com/video/influencers-andy-serwer-bill-gates-110000273.html', 'https://finance.yahoo.com/news/jobless-claims-week-ending-july-25-123150219.html']
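
As an alternative to the startswith check, the standard library's urllib.parse.urljoin can resolve each href against the page URL. The sketch below is not part of the original answer. The helper name absolutize and the constant BASE_URL are made up for illustration.

```python
from urllib.parse import urljoin

# Page the hrefs were scraped from; relative hrefs are resolved against it.
BASE_URL = "https://finance.yahoo.com/topic/stock-market-news"


def absolutize(hrefs, base=BASE_URL):
    """Return absolute URLs; hrefs that are already absolute pass through unchanged."""
    return [urljoin(base, href) for href in hrefs]


sample = [
    "/news/stock-market-news-live-july-30-2020-221505732.html",
    "https://finance.yahoo.com/news/jobless-claims-week-ending-july-25-123150219.html",
]
for link in absolutize(sample):
    print(link)
```

One reason to prefer this: a link to another domain would get "https://finance.yahoo.com" glued in front of it by the startswith version, while urljoin leaves any absolute URL as it is.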

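The scrolling sub-question is not covered above. requests only downloads the initial HTML and does not run JavaScript, so anything the page adds on scroll never reaches BeautifulSoup. One possible approach is to drive a real browser and scroll it. The following is a rough sketch, assuming the driver argument is a Selenium WebDriver that has already loaded the page. The function name scroll_for_more, its parameters, and the default selector "a > u" (guessed from the markup in the output above) are all assumptions, and the source does not confirm how many articles each scroll loads.

```python
import time


def scroll_for_more(driver, min_links, selector="a > u", max_scrolls=10, pause=2.0):
    """Scroll to the bottom until `min_links` elements match `selector`.

    `driver` is assumed to be a Selenium WebDriver with the page already loaded.
    Stops after `max_scrolls` scrolls even if the target is not reached.
    Returns the page HTML as it stands after scrolling.
    """
    count_js = "return document.querySelectorAll(arguments[0]).length;"
    for _ in range(max_scrolls):
        if driver.execute_script(count_js, selector) >= min_links:
            break
        # Jump to the bottom of the page to trigger lazy loading, if any.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give newly requested content time to arrive
    return driver.page_source
```

With Selenium installed, one would create a driver (for example webdriver.Chrome()), call driver.get(url), and pass the HTML returned by scroll_for_more(driver, min_links=16) to BeautifulSoup in place of r1.content. The rest of the link extraction stays the same.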
The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please cite the site URL or the original address. For any questions, contact: yoyou2525@163.com.

 