
Create a data frame with headings as column names and <li> tag content as rows, then print this data frame into a text file

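I am trying to scrape the body data from this website.

I would like the output to be a data frame (or any other object that makes life easier), with the sub-headings as column names and the body text under each sub-heading as the rows of that column.

My code is as follows: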

from bs4 import BeautifulSoup
import requests
import re
import pandas as pd

url = "https://www.bankersadda.com/17th-september-2021-daily-gk-update/"
page = requests.get(url)
html = page.text
soup = BeautifulSoup(html,'lxml') #"html.parser")
article = soup.find(class_ = "entry-content")

headings = []
lines = []

my_df = pd.DataFrame(index=range(100))
for strong in article.findAll('strong'):
    if strong.parent.name =='p':
        if strong.find(text=re.compile("News")):
            headings.append(strong.text)
            
#headings
k=0
for ul in article.findAll('ul'):
    for li in ul.findAll('li'):
        lines.append(li.text)
    lines= lines + [""]
    my_df[k] = pd.Series(lines)
    k=k+1
        
my_df

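I want to use the "headings" list as the column names of the data frame.

Clearly I have not written the right logic. I also explored nextSibling, descendants and other attributes, but I could not work out the correct approach. Can someone help?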

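Once you have a heading, use .find_next() to get the list of news articles that follows it. Collect them in a list stored in a dictionary under that heading as the key. Then build one data frame per heading and join them with pd.concat(), using axis=1 and ignore_index=False so the headings stay as the column names.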
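To make those steps concrete without hitting the live site, here is a minimal offline sketch. The HTML snippet is made up to mimic the assumed page layout (a `<p><strong>` section heading, then numbered `<p>` sub-titles each followed by a `<ul>`), and `collect_sections` is a hypothetical helper name, not part of BeautifulSoup or pandas. It uses the built-in `html.parser` so that `lxml` is not required.

```python
import re

import pandas as pd
from bs4 import BeautifulSoup

# Made-up HTML that mimics the assumed layout of the article body.
HTML = """
<div class="entry-content">
  <p><strong>National News</strong></p>
  <p>1. First story</p>
  <ul><li>Point A</li><li>Point B</li></ul>
  <p><strong>Sports News</strong></p>
  <p>2. Second story</p>
  <ul><li>Point C</li></ul>
  <p></p>
</div>
"""


def collect_sections(article):
    """Return {heading text: [sub-title and bullet lines]} (hypothetical helper)."""
    sections = {}
    for heading in article.find_all("p", string=re.compile("News")):
        lines = sections[heading.text] = []
        sub_title = heading.find_next("p")
        while sub_title is not None:
            # Stop at the next section heading or at an empty paragraph.
            if "News" in sub_title.text or sub_title.text == "":
                break
            lines.append(sub_title.text)
            ul = sub_title.find_next("ul")
            if ul is None:
                break
            lines.extend(li.text for li in ul.find_all("li"))
            sub_title = ul.find_next("p")
    return sections


soup = BeautifulSoup(HTML, "html.parser")
sections = collect_sections(soup.find(class_="entry-content"))

# One single-column frame per heading; concat with axis=1 aligns them on the
# row index and pads the shorter columns with NaN.
frames = [pd.DataFrame({heading: lines}) for heading, lines in sections.items()]
df = pd.concat(frames, axis=1)
print(df)
```

Building one frame per heading and concatenating along `axis=1` is what lets the columns have different lengths. Passing the dictionary straight to `pd.DataFrame` raises a `ValueError` when the lists differ in length.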

from bs4 import BeautifulSoup
import requests
import re
import pandas as pd

url = "https://www.bankersadda.com/17th-september-2021-daily-gk-update/"
page = requests.get(url)
html = page.text
soup = BeautifulSoup(html,'lxml') #"html.parser")
article = soup.find(class_ = "entry-content")

# Map each "... News" heading to the list of text lines that follow it.
headlines = {}
news_headlines = article.find_all('p', string=re.compile("News"))
for news_headline in news_headlines:
    lines = headlines[news_headline.text] = []
    sub_title = news_headline.find_next('p')
    # Walk forward: a numbered sub-title <p>, then its <ul> of bullet points.
    while sub_title is not None:
        lines.append(sub_title.text)
        articles = sub_title.find_next('ul')
        if articles is None:
            break
        for li in articles.find_all('li'):
            lines.append(li.text)
        sub_title = articles.find_next('p')
        # Stop at the next section heading or at an empty paragraph.
        if sub_title is None or 'News' in sub_title.text or sub_title.text == '':
            break
    
            
df_list = []
for headings, lines in headlines.items():
    temp = pd.DataFrame({headings:lines})
    df_list.append(temp)
    

my_df = pd.concat(df_list, ignore_index=False, axis=1) 

Output:

print(my_df)
                                       National News  ...                                    Obituaries News
0  1. Cabinet approves 100% FDI under automatic r...  ...  11. Eminent Kashmiri Writer Aziz Hajini passes...
1  The Union Cabinet, chaired by Prime Minister N...  ...  Noted writer and former secretary of Jammu and...
2  A total of 9 structural and 5 process reforms ...  ...  He has over twenty books in Kashmiri to his cr...
3  Change in the definition of AGR: The definitio...  ...  12. Former India player and Mohun Bagan great ...
4  Rationalised Spectrum Usage Charges: The month...  ...  Former India footballer and Mohun Bagan captai...
5  Four-year Moratorium on dues: Moratorium has b...  ...  Bhabani Roy helped Mohun Bagan win the Rovers ...
6  Foreign Direct Investment (FDI): The governmen...  ...  13. 2 times Olympic Gold Medalist Yuriy Sedykh...
7  Auction calendar fixed: Spectrum auctions will...  ...  Double Olympic hammer throw gold medallist Yur...
8     Important takeaways for all competitive exams:  ...  He set the world record for the hammer throw w...
9      Minister of Communications: Ashwini Vaishnaw.  ...  He won his first gold medal at the 1976 Olympi...

[10 rows x 8 columns]
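
The question title also asks for the frame to end up in a text file, which the code above does not do. A minimal sketch of two options, assuming a small stand-in frame in place of `my_df` and hypothetical file names (`news.txt`, `news.tsv`):

```python
import pandas as pd

# Stand-in for my_df: columns of unequal length, padded with NaN by concat.
my_df = pd.concat(
    [
        pd.DataFrame({"National News": ["1. First story", "Point A", "Point B"]}),
        pd.DataFrame({"Sports News": ["2. Second story", "Point C"]}),
    ],
    axis=1,
)

# Option 1: a fixed-width dump of the whole frame, similar to print(my_df).
with open("news.txt", "w", encoding="utf-8") as fh:
    fh.write(my_df.to_string(index=False, na_rep=""))

# Option 2: tab-separated values, easy to load back with pd.read_csv.
my_df.to_csv("news.tsv", sep="\t", index=False, na_rep="", encoding="utf-8")
```

Setting `na_rep=""` keeps the NaN padding of the shorter columns out of the file. The tab-separated variant is probably the better choice if the file will be read by another program later.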

