Extract the text inside tags and the text inside <div> separately using Beautiful Soup

Question

<div class="quote">
    <b>Head 1</b> Text 1
</div>
<div class="quote">
    <b>Head 2</b> Text 2
    <br/> <b>Head 3</b> Text 3
</div>

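I need to extract Head 1, Head 2, Head 3 and Text 1, Text 2, Text 3 separately. I tried the code below, but it only picks up Head 1 and Head 2 (the first b tag of each div), and the quote column ends up holding the entire text of the div, heads included. P.S. The number of b tags nested in a div may vary from one div to another. I need to iterate over every div class='quote' on the page.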

def parser(url):
    page_content=BeautifulSoup(url.content,'html.parser')
    df=pd.DataFrame(columns=['Dialogues','Character'])
    for item in page_content.findAll('div',{'class':'quote'}):
        character= item.find('b').text[:-1]
        quotes=item.text
        df=df.append({'Dialogues':quotes,'Character': character},ignore_index=True)

    return df

Edit: I need the data in two separate columns of a df, in this format:

Character   Quote
Head 1  Text 1
Head 2  Text 2
Head 3  Text 3

Answer 1

Try this approach:

targets = page_content.select('div.quote')
for target in targets:
    for s in target.stripped_strings:
        print(s)

Output:

Head 1
Text 1
Head 2
Text 2
Head 3
Text 3
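
For anyone who wants to reproduce this without fetching a page, here is a minimal self-contained sketch. It assumes the markup from the question is available as a string; the names HTML and quote_strings are made up for illustration and are not part of the original code.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question.
HTML = """
<div class="quote">
    <b>Head 1</b> Text 1
</div>
<div class="quote">
    <b>Head 2</b> Text 2
    <br/> <b>Head 3</b> Text 3
</div>
"""


def quote_strings(html):
    """Collect the whitespace-stripped text fragments of every div.quote, in document order."""
    soup = BeautifulSoup(html, "html.parser")
    fragments = []
    for div in soup.select("div.quote"):
        # stripped_strings skips whitespace-only strings and strips the rest
        fragments.extend(div.stripped_strings)
    return fragments


fragments = quote_strings(HTML)
for s in fragments:
    print(s)
```

Because whitespace-only strings are skipped, the heads and texts come out alternating, which is what the pairing step further down relies on.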

Edit:

To add the results to a DataFrame:

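import pandas as pd

heads = []
tails = []
# page_content is the BeautifulSoup object from the question's parser()
targets = page_content.select('div.quote')
for target in targets:
    # stripped_strings yields head, text, head, text, ... in document order
    mu = list(target.stripped_strings)
    # walk the list in pairs; stopping at len(mu) - 1 avoids an IndexError
    # if a div ends with a head that has no text after it
    for i in range(0, len(mu) - 1, 2):
        heads.append(mu[i])
        tails.append(mu[i + 1])

items = list(zip(heads, tails))
df = pd.DataFrame(items, columns=['Character', 'Quote'])
print(df)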

Output:

    Character   Quote
0   Head 1  Text 1
1   Head 2  Text 2
2   Head 3  Text 3
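
The pairing above assumes the strings strictly alternate between head and text. If a b tag might have no text after it, one alternative (not the method used in this answer) is to walk the siblings that follow each b tag until the next b. This is only a sketch under that assumption; head_text_pairs and HTML are hypothetical names.

```python
from bs4 import BeautifulSoup, Tag

# Sample markup copied from the question.
HTML = """
<div class="quote">
    <b>Head 1</b> Text 1
</div>
<div class="quote">
    <b>Head 2</b> Text 2
    <br/> <b>Head 3</b> Text 3
</div>
"""


def head_text_pairs(html):
    """Pair each <b> inside a div.quote with the plain text that follows it."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for div in soup.select("div.quote"):
        for b in div.find_all("b"):
            parts = []
            for sib in b.next_siblings:
                if isinstance(sib, Tag):
                    if sib.name == "b":
                        break  # the next head starts here
                    continue  # skip other tags such as <br/>
                parts.append(str(sib))
            # collapse newlines and runs of spaces into single spaces
            text = " ".join(" ".join(parts).split())
            rows.append((b.get_text(strip=True), text))
    return rows


rows = head_text_pairs(HTML)
print(rows)
```

The resulting list of tuples can be passed to pd.DataFrame(rows, columns=['Character', 'Quote']) in the same way as items above. A head with nothing after it gets an empty string instead of shifting the later pairs.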

Solution 1
0 2020-03-14 23:03:43
