Combining multi-row output into a single row with BeautifulSoup findAll across multiple classes/tags

Question

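I'm trying to build a scraper that collects text from a web page. I'm looking at two specific divs with different class names ("product-image" and "product-details"). I loop over them and grab the text from every "a" and "dd" tag inside each div.

Worth noting: this is the first Python program I've ever written...

Here is my code: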

list_of_rows = []
for row in soup.findAll(True, {"class":["product-image", "product-details"]}):
    list_of_cells = []
    for cell in row.findAll(['a', 'dd']):
        text = cell.text.replace('&nbsp;', '')
        list_of_cells.append(text)
    list_of_rows.append(list_of_cells)

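When I print list_of_rows, I get the following output for each pass through the loop:

[price]

[title], [author], [publisher], [etc], [etc], [etc]

[price] comes from the "product-image" div block. [title] and the rest come from the "product-details" div block.

So basically, findAll and the loop I wrote produce a separate row for each div block I'm looping over. The result I want is a single row of output for both blocks, like this:

[price], [title], [author], [publisher], [etc], [etc], [etc]

Is there a way to do this within my current process, or do I need to break it into multiple loops, extract the data separately, and then merge? I've looked through all the Q&As on StackOverflow and other sites, and while I can find examples of findAll loops with multiple classes, I can't find any example of how to reduce the output to a single row.

Here is a snippet of the web page I'm parsing. This snippet repeats x times in the html I'm parsing, where x is the number of products on the page: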

<div class="product-image">
    <a class="thumb" href="/Store/Details/life-on-the-screen/_/R-9780684833484B"><img src="http://images.bookdepot.com/covers/large/isbn978068/9780684833484-l.jpg" alt="" class="cover" />
        <div class="price "><span>$</span>2.25
        </div>
    </a>
</div>

<div class="product-details">
    <dl>
        <dt><div class="nowrap"><span><a href="/Store/Details/life-on-the-screen/_/R-9780684833484B" title="Life On The Screen">Life On The Screen</a></span></div></dt>
        <dd class="type"><div class="nowrap"><span><a href="/Store/Browse/turkle-sherry/_/N-4294697489/Ne-4">Turkle, Sherry</a></span></div></dd>
        <dd class="type"><div class="nowrap"><a href="/Store/Browse/simon-and-schuster/_/N-4294151338/Ne-5">Simon and Schuster</a></div></dd>
        <dd class="type">(Paperback)</dd>
        <dd class="type">Computers &amp; Internet</dd>
        <dd class="type">ISBN: 9780684833484</dd>
        <dd>List $15.00 - Qty: 9</dd>
           </dl>
</div>

Any pointers or help would be much appreciated!

Answer 1

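From your question I came up with two possible results. I'm not sure which one you're looking for, so I'm posting both cases.

Case 1 - extend the list instead of appending to it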

from bs4 import BeautifulSoup
data = """<div class="product-image">
    <a class="thumb" href="/Store/Details/life-on-the-screen/_/R-9780684833484B"><img src="http://images.bookdepot.com/covers/large/isbn978068/9780684833484-l.jpg" alt="" class="cover" />
        <div class="price "><span>$</span>2.25
        </div>
    </a>
</div>

<div class="product-details">
    <dl>
        <dt><div class="nowrap"><span><a href="/Store/Details/life-on-the-screen/_/R-9780684833484B" title="Life On The Screen">Life On The Screen</a></span></div></dt>
        <dd class="type"><div class="nowrap"><span><a href="/Store/Browse/turkle-sherry/_/N-4294697489/Ne-4">Turkle, Sherry</a></span></div></dd>
        <dd class="type"><div class="nowrap"><a href="/Store/Browse/simon-and-schuster/_/N-4294151338/Ne-5">Simon and Schuster</a></div></dd>
        <dd class="type">(Paperback)</dd>
        <dd class="type">Computers &amp; Internet</dd>
        <dd class="type">ISBN: 9780684833484</dd>
        <dd>List $15.00 - Qty: 9</dd>
           </dl>
</div>"""

soup = BeautifulSoup(data,'lxml')

list_of_rows = []
for row in soup.findAll(True, {"class":["product-image", "product-details"]}):
    list_of_cells = []
    for cell in row.findAll(['a', 'dd']):
        text = cell.text.replace('&nbsp;', '')
        list_of_cells.append(text)
    list_of_rows.extend(list_of_cells)
print(list_of_rows)

Output

[u'\n$2.25\n        \n', u'Life On The Screen', u'Turkle, Sherry', u'Turkle, Sherry', u'Simon and Schuster', u'Simon and Schuster', u'(Paperback)', u'Computers & Internet', u'ISBN: 9780684833484', u'List $15.00 - Qty: 9']
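One thing to keep in mind with extend: when the page holds several products, everything ends up in one flat list with no boundary between products. If one row per product is wanted, a possible approach is to collect the two kinds of div separately and pair them up with zip. The following is only a sketch under assumptions: it uses Python 3, the built-in "html.parser" instead of lxml, a made-up two-product page trimmed from the snippet in the question, and a hypothetical helper named cell_texts. It also assumes every "product-image" div is followed by exactly one "product-details" div.

```python
from bs4 import BeautifulSoup

# Hypothetical two-product page, trimmed from the snippet in the question.
html = """
<div class="product-image"><a class="thumb" href="#"><div class="price"><span>$</span>2.25</div></a></div>
<div class="product-details"><dl>
  <dt><a href="#">Life On The Screen</a></dt>
  <dd class="type">(Paperback)</dd>
</dl></div>
<div class="product-image"><a class="thumb" href="#"><div class="price"><span>$</span>3.50</div></a></div>
<div class="product-details"><dl>
  <dt><a href="#">Second Title</a></dt>
  <dd class="type">(Hardcover)</dd>
</dl></div>
"""

soup = BeautifulSoup(html, "html.parser")


def cell_texts(block):
    """Return the stripped text of every <a> and <dd> inside one div block."""
    return [cell.get_text(strip=True) for cell in block.find_all(["a", "dd"])]


# Collect the two kinds of block separately, in document order.
images = soup.find_all("div", class_="product-image")
details = soup.find_all("div", class_="product-details")

# Pair the n-th image block with the n-th details block: one row per product.
rows = [cell_texts(img) + cell_texts(det) for img, det in zip(images, details)]
print(rows)
```

If the two lists can differ in length on the real page, zip silently drops the unpaired blocks, so it may be worth comparing len(images) and len(details) first.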

Case 2 - you need to strip the newline characters from the html text

list_of_rows = []
for row in soup.findAll(True, {"class":["product-image", "product-details"]}):
    list_of_cells = []
    for cell in row.findAll(['a', 'dd']):
        text = cell.text.replace('&nbsp;', '')
        list_of_cells.append(text.strip())
    list_of_rows.append(list_of_cells)
print(list_of_rows)

Output

[[u'$2.25'], [u'Life On The Screen', u'Turkle, Sherry', u'Turkle, Sherry', u'Simon and Schuster', u'Simon and Schuster', u'(Paperback)', u'Computers & Internet', u'ISBN: 9780684833484', u'List $15.00 - Qty: 9']]
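Two details in the output above may be worth handling. 'Turkle, Sherry' and 'Simon and Schuster' each show up twice, because findAll(['a', 'dd']) matches both the dd and the a nested inside it. Also, replace('&nbsp;', '') most likely never matches anything, since BeautifulSoup decodes that entity to the character U+00A0 before you see the text. A possible way to deal with both is sketched below. It assumes Python 3 and the built-in "html.parser"; the html is a shortened, partly made-up variant of the question's snippet (the &nbsp; was added for illustration).

```python
from bs4 import BeautifulSoup

# Shortened variant of the "product-details" block; the &nbsp; is made up.
html = """
<div class="product-details"><dl>
  <dt><a href="#">Life On The Screen</a></dt>
  <dd class="type"><div class="nowrap"><span><a href="#">Turkle, Sherry</a></span></div></dd>
  <dd class="type">Computers &amp; Internet</dd>
  <dd>List&nbsp;$15.00</dd>
</dl></div>
"""

soup = BeautifulSoup(html, "html.parser")

cells = []
for cell in soup.find_all(["a", "dd"]):
    # An <a> nested in a <dd> repeats text already captured from that <dd>.
    if cell.name == "a" and cell.find_parent("dd") is not None:
        continue
    # &nbsp; arrives as U+00A0 in the parsed text, so replace that character.
    cells.append(cell.get_text(strip=True).replace("\xa0", " "))
print(cells)
```

The a inside dt is kept on purpose, because the title has no dd around it.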

Solution 1
0 2017-01-26 20:01:23
