Help with content extraction using BeautifulSoup

Question

I am trying to extract data from a site whose pages have this format:

<div id=storytextp class=storytextp align=center style='padding:10px;'> 
<div id=storytext class=storytext> 
<div class='a2a_kit a2a_default_style' style='float:right;margin-left:10px;border:none;'> 
..... extra stuff
</div>  **Main Content**
</div>
</div>

Note that Main Content can contain other tags, but I want the whole content as a string.

So this is what I did:

_divTag = data.find( "div" , id = "storytext" )
innerdiv = _divTag.find( "div" ) # find the first div tag
innerdiv.contents[0].replaceWith("") # replace with null

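I expected _divTag to then hold only the main content, but this does not work. Can anyone tell me what mistake I am making and how I should extract the main content?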

Answer 1

Just use _divTag.contents[2].

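Your formatting may have misled you: this text does not belong to the innermost div tag (as innerdiv.text, innerdiv.contents, or innerdiv.findChildren() would have shown you). It is a sibling of the inner div and a direct child of the storytext div.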

If you indent the original XML, things become clearer:

<div id=storytextp class=storytextp align=center style='padding:10px;'> 
  <div id=storytext class=storytext> 
    <div class='a2a_kit a2a_default_style' style='float:right;margin-left:10px;border:none;'> 
      ..... extra stuff
    </div>  **Main Content**
  </div>
</div>

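(P.S. I am not clear what you intended with innerdiv.contents[0].replaceWith(""). Suppressing the attributes? The newline? In any case, the BS philosophy is not to edit the parse tree but simply to ignore the 99.9% you don't care about. See the BS documentation.)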
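Since the question mentions that Main Content can contain other tags, contents[2] alone would only return the first text node. One possible way to get everything as a string without editing the tree is to join all siblings that follow the inner div. This is only a sketch: the markup below is a hypothetical variant of the question's page, it assumes bs4 with "html.parser", and main_html is an illustrative name.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: same layout as in the question, but the main
# content now mixes text and nested tags.
html = (
    "<div id=storytext class=storytext>"
    "<div class='a2a_kit a2a_default_style'>extra stuff</div>"
    "Some <b>bold</b> and <i>italic</i> text."
    "</div>"
)

data = BeautifulSoup(html, "html.parser")
_divTag = data.find("div", id="storytext")
innerdiv = _divTag.find("div")

# Walk every node after the inner div and serialize it.
# str() on a Tag keeps its markup; str() on a text node gives the text.
main_html = "".join(str(node) for node in innerdiv.next_siblings).strip()
print(main_html)  # Some <b>bold</b> and <i>italic</i> text.
```

This leaves the parse tree untouched and does not rely on a fixed index into contents.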

Answer 1: score 2, accepted, 2011-07-14 22:23:32
