BeautifulSoup and regular expressions - extracting text from tags

Question

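I'm writing a small text-scraping script in Python. It's my first larger project, so I've run into some problems. I'm using urllib2 and BeautifulSoup. I want to scrape the song names from a playlist. I can get either a single song name, or all the song names plus other strings I don't need. I can't manage to get just the song names. My code that gets all the song names along with the unwanted strings: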

import urllib2
from bs4 import BeautifulSoup
import re

response = urllib2.urlopen('http://guardsmanbob.com/media/playlist.php?char=a').read()
soup = BeautifulSoup(response)

for tr in soup.findAll('tr')[0]:
    for td in soup.findAll('a'):
        print td.contents[0]

And the code that gives me a single song:

print soup.findAll('tr')[1].findAll('a')[0].contents[0]

It isn't actually a loop, so I can't get more than one name, but if I try to turn it into a loop, I get the same song name 10 times. That code:

for tr in soup.findAll('tr')[1]:
    for td in soup.findAll('td')[0]:
        print td.contents[0]

I've been stuck on this for a day now and can't get it to work. I don't understand how these things work.

Answer 1

for tr in soup.findAll('tr'):  # 1
    if not tr.find('td'): continue  # 2
    for td in tr.find('td').findAll('a'):  # 3
        print td.contents[0]

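1. You want to iterate over all the tr elements, so use findAll('tr') instead of findAll('tr')[0].
2. Some rows don't contain a td. For those rows tr.find('td') returns None, so we need to skip them to avoid an AttributeError (try removing this line).
3. As in 1, you want all the a elements in the first td. Also, write "for td in tr.find" rather than "for td in soup.find", because you want to search within the tr, not the whole document (soup).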
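
The difference between searching from soup and searching from tr can be seen in a small self-contained sketch. The HTML below is made up for illustration (the real page's markup is not shown in the question), the helper names song_names and all_links_per_row are hypothetical, and the code assumes Python 3 with bs4 and the built-in html.parser rather than the Python 2 / urllib2 setup used above.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the playlist page; the real markup may differ.
HTML = """
<table>
  <tr><th>Song</th><th>Artist</th></tr>
  <tr><td><a href="/s/1">Alpha</a></td><td><a href="/a/1">Band One</a></td></tr>
  <tr><td><a href="/s/2">Avalon</a></td><td><a href="/a/2">Band Two</a></td></tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")


def song_names(soup):
    """Collect link text from the first td of every row that has one."""
    names = []
    for tr in soup.find_all("tr"):
        first_td = tr.find("td")
        if first_td is None:  # header row: only th cells, nothing to read
            continue
        for a in first_td.find_all("a"):  # search inside this row only
            names.append(a.get_text(strip=True))
    return names


def all_links_per_row(soup):
    """Simplified version of the question's inner loop: the inner search
    starts at soup, so every row yields every link in the document."""
    names = []
    for tr in soup.find_all("tr"):
        for a in soup.find_all("a"):
            names.append(a.get_text(strip=True))
    return names


print(song_names(soup))
print(len(all_links_per_row(soup)))
```

In this sketch the first function returns one name per data row, while the second returns every link once for each row, which is the kind of repetition described in the question.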

Answer 2

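You should be more specific in your search, then loop over the table rows: get the specific table by its CSS class, loop over all tr elements except the first by using slicing, and take all the text from the first td: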

table = soup.find('table', class_='data-table')
for row in table.find_all('tr')[1:]:
    print ''.join(row.find('td').stripped_strings)

Instead of slicing off the first row, you can also skip the thead by testing for it:

for row in table.find_all('tr'):
    if row.parent.name == 'thead':
        continue
    print ''.join(row.find('td').stripped_strings)

It would have been even better if the page had used a proper <tbody> tag instead. :-)
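
For comparison, here is a sketch of how the same extraction might look if the table did have explicit <thead> and <tbody> sections. The markup is invented for illustration (only the data-table class comes from the code above), and Python 3 with bs4 and html.parser is assumed.

```python
from bs4 import BeautifulSoup

# Illustrative markup with explicit thead/tbody; not the real page.
HTML = """
<table class="data-table">
  <thead><tr><th>Song</th><th>Artist</th></tr></thead>
  <tbody>
    <tr><td><a href="/s/1">Alpha</a></td><td>Band One</td></tr>
    <tr><td> <a href="/s/2">Avalon</a> </td><td>Band Two</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")
table = soup.find("table", class_="data-table")

# With a real tbody, the header row never enters the loop, so no
# slicing or parent check is needed.
titles = ["".join(row.find("td").stripped_strings)
          for row in table.find("tbody").find_all("tr")]

# The parent-name test from the answer gives the same rows.
body_rows = [row for row in table.find_all("tr")
             if row.parent.name != "thead"]

print(titles)
print(len(body_rows))
```

stripped_strings drops the whitespace-only text nodes around the link, so the joined result contains only the title text.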

Answer 1
1 accepted 2013-01-24 18:28:58

Answer 2
1 2013-01-24 18:39:13
