BeautifulSoup and Regular Expressions - extracting text from tags
I'm writing a small text scraping script in Python. It's my first bigger project, so I have some problems. I'm using urllib2 and BeautifulSoup. I want to scrape the song names from one playlist. I can get either a single song name, or all song names plus other strings that I don't need. I can't manage to get all the song names and nothing else. My code that gets all song names plus other strings that I don't need:
import urllib2
from bs4 import BeautifulSoup
import re

response = urllib2.urlopen('http://guardsmanbob.com/media/playlist.php?char=a').read()
soup = BeautifulSoup(response)

for tr in soup.findAll('tr')[0]:
    for td in soup.findAll('a'):
        print td.contents[0]
And the code that gives me one song:
print soup.findAll('tr')[1].findAll('a')[0].contents[0]
It's not actually a loop, so I can't get more than one name, but when I try to turn it into a loop I get about 10 copies of the same song name. That code:
for tr in soup.findAll('tr')[1]:
    for td in soup.findAll('td')[0]:
        print td.contents[0]
I've been stuck for a day now and I can't get it working. I don't understand how these things work.
for tr in soup.findAll('tr'):  # 1
    if not tr.find('td'): continue  # 2
    for td in tr.find('td').findAll('a'):  # 3
        print td.contents[0]
# 1: findAll('tr') instead of findAll('tr')[0].
# 2: rows that contain no td are skipped.
# 3: "for td in tr.find", not "for td in soup.find", because you want to look inside each tr, not in the whole document (soup).
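To make the difference between searching the row and searching the whole document concrete, here is a minimal sketch of the same loop. It is written in Python 3 syntax and runs against a small inline table instead of the real playlist page, so the markup, the HTML constant and the song_names helper are all made up for illustration, and the actual page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the playlist page (not the real page).
HTML = """
<table>
  <tr><th>Song</th><th>Artist</th></tr>
  <tr><td><a href="/s/1">Alpha</a></td><td><a href="/a/1">Band One</a></td></tr>
  <tr><td><a href="/s/2">Beta</a></td><td><a href="/a/2">Band Two</a></td></tr>
</table>
"""


def song_names(html):
    """Collect the link text from the first cell of every row."""
    soup = BeautifulSoup(html, "html.parser")
    names = []
    for tr in soup.find_all("tr"):          # every row, not just row [0]
        first_td = tr.find("td")            # search inside this row only
        if first_td is None:                # header row: only <th> cells
            continue
        for a in first_td.find_all("a"):    # links in the first cell only
            names.append(a.get_text(strip=True))
    return names


print(song_names(HTML))
```

Because every search starts from tr (or from its first td) rather than from soup, the artist links in the second column never show up, and each row contributes its own song exactly once.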
You should be a little more specific in your search, then just loop over the table rows: grab the specific table by CSS class, loop over the tr elements except the first one using slicing, and grab all text from the first td:
table = soup.find('table', class_='data-table')
for row in table.find_all('tr')[1:]:
    print ''.join(row.find('td').stripped_strings)
As an alternative to slicing off the first row, you can skip the thead by testing for it:
for row in table.find_all('tr'):
    if row.parent.name == 'thead':
        continue
    print ''.join(row.find('td').stripped_strings)
It would have been better all around if the page had used a proper <tbody> tag instead. :-)
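If you want to try the thead test without hitting the site, something like the following sketch should work. It uses Python 3 syntax and a made-up inline page: only the data-table class name comes from the code above, while the PAGE markup and the first_cell_texts helper are assumptions for illustration. It also passes "html.parser" explicitly so the parser does not depend on what else is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical page: a header row inside <thead>, data rows directly in <table>.
PAGE = """
<table class="other"><tr><td>ignore me</td></tr></table>
<table class="data-table">
  <thead><tr><th>Song</th><th>Length</th></tr></thead>
  <tr><td> <a href="/s/1">Alpha</a> </td><td>3:10</td></tr>
  <tr><td> <a href="/s/2">Beta</a> </td><td>4:02</td></tr>
</table>
"""


def first_cell_texts(html):
    """Return the stripped text of the first <td> in each non-header row."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="data-table")  # pick the table by class
    texts = []
    for row in table.find_all("tr"):
        if row.parent.name == "thead":               # skip the header row
            continue
        texts.append("".join(row.find("td").stripped_strings))
    return texts


print(first_cell_texts(PAGE))
```

Selecting the table by class first keeps rows from any other table on the page out of the result, and stripped_strings drops the whitespace around the link text.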