Create a data frame with headings as column names and the contents of the <li> tags as rows, then print this data frame to a text file

Question

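I am trying to get the body data from this website.

I want a data frame (or any other object that makes life easier) as output, with the sub-headings as column names and the body text under each sub-heading as the rows of that column.

My code is as follows: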

from bs4 import BeautifulSoup
import requests
import re
import pandas as pd

url = "https://www.bankersadda.com/17th-september-2021-daily-gk-update/"
page = requests.get(url)
html = page.text
soup = BeautifulSoup(html,'lxml') #"html.parser")
article = soup.find(class_ = "entry-content")

headings = []
lines = []

my_df = pd.DataFrame(index=range(100))
for strong in article.findAll('strong'):
    if strong.parent.name =='p':
        if strong.find(text=re.compile("News")):
            headings.append(strong.text)
            
#headings
k=0
for ul in article.findAll('ul'):
    for li in ul.findAll('li'):
        lines.append(li.text)
    lines= lines + [""]
    my_df[k] = pd.Series(lines)
    k=k+1
        
my_df

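I want to use the headings list as the data frame column names.

Clearly I have not written the right logic. I also explored nextSibling, descendants and other attributes, but I could not work out the correct logic. Can someone help?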

Answer 1

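Once you have a heading, use .find_next() to get the list of news articles that follows it. Store those items in a dictionary, as a list under the heading as the key. Then build one single-column data frame per heading and join them side by side with pd.concat(), using axis=1 and ignore_index=False so the headings are kept as column names.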
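As a small offline illustration of the steps just described, here is a minimal sketch that runs against an inline HTML snippet rather than the live page. The snippet, the item texts, and the names `SAMPLE_HTML`, `columns`, and `sample_df` are assumptions made up for this example; the real page is structured less regularly, which is why the full script needs its loop.

```python
import re

import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page: two headings, each followed by a <ul>.
SAMPLE_HTML = """
<div class="entry-content">
  <p><strong>National News</strong></p>
  <ul><li>Item A</li><li>Item B</li></ul>
  <p><strong>Sports News</strong></p>
  <ul><li>Item C</li></ul>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
article = soup.find(class_="entry-content")

columns = {}
for strong in article.find_all("strong", string=re.compile("News")):
    # find_next('ul') returns the first <ul> that appears after this heading.
    ul = strong.find_next("ul")
    columns[strong.get_text(strip=True)] = [
        li.get_text(strip=True) for li in ul.find_all("li")
    ]

# One single-column frame per heading; axis=1 places them side by side and
# pads the shorter column with NaN.
sample_df = pd.concat(
    [pd.DataFrame({name: rows}) for name, rows in columns.items()],
    axis=1,
)
print(sample_df)
```

Building one frame per heading avoids the length mismatch you would hit when passing lists of different lengths to a single pd.DataFrame() call.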

from bs4 import BeautifulSoup
import requests
import re
import pandas as pd

url = "https://www.bankersadda.com/17th-september-2021-daily-gk-update/"
page = requests.get(url)
html = page.text
soup = BeautifulSoup(html,'lxml') #"html.parser")
article = soup.find(class_ = "entry-content")

headlines = {}
# Each section heading is a <p> whose text contains "News".
news_headlines = article.find_all('p', string=re.compile("News"))
for news_headline in news_headlines:
    end_of_news = False
    sub_title = news_headline.find_next('p')
    headlines[news_headline.text] = []
    while not end_of_news and sub_title is not None:
        # The sub-title paragraph becomes a row, followed by its <li> items.
        headlines[news_headline.text].append(sub_title.text)
        articles = sub_title.find_next('ul')
        if articles is None:
            break
        for li in articles.find_all('li'):
            headlines[news_headline.text].append(li.text)
        sub_title = articles.find_next('p')
        # Stop at the next section heading, an empty paragraph, or the end of the page.
        if sub_title is None or 'News' in sub_title.text or sub_title.text == '':
            end_of_news = True
    
            
df_list = []
for headings, lines in headlines.items():
    temp = pd.DataFrame({headings:lines})
    df_list.append(temp)
    

my_df = pd.concat(df_list, ignore_index=False, axis=1)

Output:

print(my_df)
                                       National News  ...                                    Obituaries News
0  1. Cabinet approves 100% FDI under automatic r...  ...  11. Eminent Kashmiri Writer Aziz Hajini passes...
1  The Union Cabinet, chaired by Prime Minister N...  ...  Noted writer and former secretary of Jammu and...
2  A total of 9 structural and 5 process reforms ...  ...  He has over twenty books in Kashmiri to his cr...
3  Change in the definition of AGR: The definitio...  ...  12. Former India player and Mohun Bagan great ...
4  Rationalised Spectrum Usage Charges: The month...  ...  Former India footballer and Mohun Bagan captai...
5  Four-year Moratorium on dues: Moratorium has b...  ...  Bhabani Roy helped Mohun Bagan win the Rovers ...
6  Foreign Direct Investment (FDI): The governmen...  ...  13. 2 times Olympic Gold Medalist Yuriy Sedykh...
7  Auction calendar fixed: Spectrum auctions will...  ...  Double Olympic hammer throw gold medallist Yur...
8     Important takeaways for all competitive exams:  ...  He set the world record for the hammer throw w...
9      Minister of Communications: Ashwini Vaishnaw.  ...  He won his first gold medal at the 1976 Olympi...

[10 rows x 8 columns]
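The title also asks for the data frame to be printed to a text file, which the script above does not do. A minimal sketch of two options follows, assuming a small stand-in frame named `demo_df` in place of `my_df`; the file names `news.txt` and `news.tsv` and the temporary output directory are hypothetical choices for this example.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for my_df: columns of unequal length padded with NaN.
demo_df = pd.concat(
    [
        pd.DataFrame({"National News": ["Item A", "Item B"]}),
        pd.DataFrame({"Sports News": ["Item C"]}),
    ],
    axis=1,
)

out_dir = Path(tempfile.mkdtemp())

# Option 1: human-readable, column-aligned text; na_rep blanks out the NaN padding.
txt_path = out_dir / "news.txt"
txt_path.write_text(demo_df.to_string(index=False, na_rep=""), encoding="utf-8")

# Option 2: tab-separated text that pandas can load back later.
tsv_path = out_dir / "news.tsv"
demo_df.to_csv(tsv_path, sep="\t", index=False, encoding="utf-8")

reloaded = pd.read_csv(tsv_path, sep="\t")
```

to_string() is probably the better fit if the file is only meant to be read by people, while the tab-separated form is easier to load again. With long article text in the cells, either file will have very wide lines.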

Solution 1
0 Accepted 2021-10-06 13:21:54
