
BeautifulSoup and Regular Expressions - extracting text from tags

I'm writing a small text-scraping script in Python. It's my first bigger project, so I've run into some problems. I'm using urllib2 and BeautifulSoup. I want to scrape the song names from one playlist. I can get either a single song name, or all the song names plus other strings that I don't need, but I can't manage to get only the song names. This is my code that gets all the song names plus the strings I don't need:

import urllib2
from bs4 import BeautifulSoup
import re

response = urllib2.urlopen('http://guardsmanbob.com/media/playlist.php?char=a').read()
soup = BeautifulSoup(response)

for tr in soup.findAll('tr')[0]:
    for td in soup.findAll('a'):
        print td.contents[0]

And this is the code that gives me one song:

print soup.findAll('tr')[1].findAll('a')[0].contents[0]

It isn't actually a loop, so I can't get more than one name, but when I try to turn it into a loop I get the same song name about 10 times. That code:

for tr in soup.findAll('tr')[1]:
    for td in soup.findAll('td')[0]:
        print td.contents[0]

I've been stuck for a day now and I can't get it working. I don't understand how these things work.

for tr in soup.findAll('tr'):  # 1
    if not tr.find('td'): continue  # 2
    for td in tr.find('td').findAll('a'):  # 3
        print td.contents[0]
  1. You want to iterate over all the tr's, hence findAll('tr') instead of findAll('tr')[0].
  2. Some rows don't contain a td, so tr.find('td') returns None for them; we need to skip those rows to avoid an AttributeError (try removing this line to see it).
  3. As in 1, you want all the a's in the first td, but also note that it is "for td in tr.find", not "for td in soup.find", because you want to look inside each tr, not in the whole document (soup).
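
To try the three points without hitting the site, here is a minimal sketch in Python 3 syntax (print as a function, find_all instead of findAll). The HTML string is a made-up stand-in for the playlist page, so the real markup may well differ, and song_names is just a name I picked for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the playlist page; the real markup may differ.
SAMPLE_HTML = """
<table>
  <tr><th>Song</th><th>Download</th></tr>
  <tr><td><a href="/a1">Song One</a></td><td><a href="/d1">mp3</a></td></tr>
  <tr><td><a href="/a2">Song Two</a></td><td><a href="/d2">mp3</a></td></tr>
</table>
"""


def song_names(html):
    """Return the link text found in the first td of every row."""
    soup = BeautifulSoup(html, "html.parser")
    names = []
    for tr in soup.find_all("tr"):            # point 1: every row
        first_td = tr.find("td")
        if first_td is None:                  # point 2: header row has only th cells
            continue
        for a in first_td.find_all("a"):      # point 3: search this row, not soup
            names.append(a.get_text(strip=True))
    return names


print(song_names(SAMPLE_HTML))
```

Searching from soup inside the loop would return all four links on every pass, which is why the original code printed the same names over and over.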

You should be a little more specific in your search, then just loop over the table rows: grab the specific table by its CSS class, loop over the tr elements except the first one using slicing, and grab all the text from the first td:

table = soup.find('table', class_='data-table')
for row in table.find_all('tr')[1:]:
    print ''.join(row.find('td').stripped_strings)
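
If you want to see what that does on a small input, here is a rough sketch in Python 3 syntax. The markup is invented for the example (only the class name data-table is taken from the snippet above), and I join the fragments with a space rather than an empty string so that text from nested tags stays separated; that is a choice of this sketch, not something the page requires:

```python
from bs4 import BeautifulSoup

# Invented markup: two tables, only one of which carries the class we want.
SAMPLE_HTML = """
<table class="nav"><tr><td>menu</td></tr></table>
<table class="data-table">
  <thead><tr><th>Song</th><th>Length</th></tr></thead>
  <tr><td> <a href="/s1">Song One</a> <i>(live)</i> </td><td>3:10</td></tr>
  <tr><td><a href="/s2">Song Two</a></td><td>4:05</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# class_ (with the trailing underscore) filters on the CSS class attribute.
table = soup.find("table", class_="data-table")

titles = []
for row in table.find_all("tr")[1:]:          # [1:] drops the header row
    # stripped_strings yields each text fragment with whitespace trimmed
    # and skips fragments that are whitespace only.
    titles.append(" ".join(row.find("td").stripped_strings))

print(titles)
```

The navigation table is never touched, because every later search starts from table instead of soup.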

As an alternative to slicing off the first row, you can skip the rows inside the thead by testing for it:

for row in table.find_all('tr'):
    if row.parent.name == 'thead':
        continue
    print ''.join(row.find('td').stripped_strings)
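
The parent test pays off when the header is more than one row, where a fixed slice would be wrong. A small sketch in Python 3 syntax, again with invented markup and a helper name (body_rows) that is purely illustrative:

```python
from bs4 import BeautifulSoup

# Invented markup with a two-row header, so slicing with [1:] would not be enough.
SAMPLE_HTML = """
<table class="data-table">
  <thead>
    <tr><th colspan="2">Playlist</th></tr>
    <tr><th>Song</th><th>Length</th></tr>
  </thead>
  <tr><td><a href="/s1">Song One</a></td><td>3:10</td></tr>
  <tr><td><a href="/s2">Song Two</a></td><td>4:05</td></tr>
</table>
"""


def body_rows(table):
    """Yield every tr of the table that is not a direct child of a thead."""
    for row in table.find_all("tr"):
        if row.parent.name == "thead":
            continue
        yield row


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.find("table", class_="data-table")
titles = ["".join(row.find("td").stripped_strings) for row in body_rows(table)]
print(titles)
```

Both header rows are skipped regardless of how many there are, and no row index is hard-coded.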

It would have been better all around if the page had used a proper <tbody> tag instead. :-)
