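How should I go about scraping the text in dd tags between specific dt tags on a page using BeautifulSoup?
I am trying to extract the text from the dd elements that sit between dt tags (the dt tags mark the different dates). I tried a very hacky approach, but it does not work consistently enough: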
timeDiv = mezzrowSource.find_all("dd", class_="orange event-date")
eventDiv = mezzrowSource.find_all("dd", class_="event")
index = 0
for time in timeDiv:
    returnValue[timeDiv[index].text] = eventDiv[index].text.strip()
    if "8" in timeDiv[index+3].text or "4:30" in timeDiv[index+3].text:
        break
    index += 1
Enumerating this way works most of the time, but it sometimes pulls in events from other dates. The source for the relevant part of the page is pasted below. Any ideas?
<dt class="purple">Sun, September 30th, 2018</dt>
<dd class="orange event-date">4:30 PM to 7:00 PM</dd>
<dd class="event"><a href="/events/4094-mezzrow-classical-salon-with-david-oei"
class="event-title">Mezzrow Classical Salon with David Oei</a>
</dd>
<dd class="orange event-date">8:00 PM to 10:30 PM</dd>
<dd class="event"><a href="/events/4144-luke-sellick-ron-blake-adam-birnbaum"
class="event-title">Luke Sellick, Ron Blake & Adam Birnbaum</a>
</dd>
<dd class="orange event-date">11:00 PM to 1:00 AM</dd>
<dd class="event"><a href="/events/4099-ryo-sasaki-friends-after-hours"
class="event-title">Ryo Sasaki & Friends "After-hours"</a>
</dd>
<dt class="purple">Mon, October 1st, 2018</dt>
<dd class="orange event-date">8:00 PM to 10:30 PM</dd>
<dd class="event"><a href="/events/4137-greg-ruggiero-murray-wall-steve-little"
class="event-title">Greg Ruggiero, Murray Wall & Steve Little</a>
</dd>
<dd class="orange event-date">11:00 PM to 1:00 AM</dd>
<dd class="event"><a href="/events/4174-pasquale-grasso-after-hours"
class="event-title">Pasquale Grasso "After-hours"</a>
</dd>
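The expected output is a dictionary that looks like this: {'4:30 PM to 7:00 PM': 'Mezzrow Classical Salon with David Oei', '8:00 PM to 10:30 PM': 'Greg Ruggiero, Murray Wall & Steve Little', '11:00 PM to 1:00 AM': 'Pasquale Grasso "After-hours"'}
If I understand the question correctly, you can use zip():
from bs4 import BeautifulSoup

# html is assumed to hold the page source shown in the question
mezzrowSource = BeautifulSoup(html, 'lxml')
timeDiv = [tag.get_text() for tag in mezzrowSource.find_all("dd", class_="orange event-date")]
eventDiv = [tag.get_text().strip() for tag in mezzrowSource.find_all("dd", class_="event")]
# Pair each time with its title; a repeated time keeps only the last title
print(dict(zip(timeDiv, eventDiv)))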
Output:
{'4:30 PM to 7:00 PM': 'Mezzrow Classical Salon with David Oei', '8:00 PM to 10:30 PM': 'Greg Ruggiero, Murray Wall & Steve Little', '11:00 PM to 1:00 AM': 'Pasquale Grasso "After-hours"'}
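One caveat about the zip() approach is worth spelling out: time slots such as '8:00 PM to 10:30 PM' repeat on every date, and a dict keeps only the last value assigned to a key. The sketch below uses plain lists, hard-coded from the sample HTML in the question as a stand-in for what find_all() returns, to show which titles survive the conversion.

```python
# Times and titles in document order, hard-coded from the question's sample
# HTML as a stand-in for the find_all() results (no parsing happens here).
times = [
    "4:30 PM to 7:00 PM",
    "8:00 PM to 10:30 PM",
    "11:00 PM to 1:00 AM",
    "8:00 PM to 10:30 PM",
    "11:00 PM to 1:00 AM",
]
titles = [
    "Mezzrow Classical Salon with David Oei",
    "Luke Sellick, Ron Blake & Adam Birnbaum",
    'Ryo Sasaki & Friends "After-hours"',
    "Greg Ruggiero, Murray Wall & Steve Little",
    'Pasquale Grasso "After-hours"',
]

pairs = list(zip(times, titles))  # five (time, title) pairs, order preserved
by_time = dict(pairs)             # duplicate times: the later title wins

print(len(pairs), len(by_time))         # 5 3
print(by_time["8:00 PM to 10:30 PM"])   # Greg Ruggiero, Murray Wall & Steve Little
```

So the dict happens to match the expected output for this sample, but only because the later date overwrites the earlier one; events are not actually grouped by date.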
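Update:
The elements you want to pull data from are all siblings; no element wraps each group of data, which makes it harder to group the data the way you need. The only thing working in your favor is that the element with the date comes first, followed by the time and then the title, and that the time and title can repeat. So this approach selects all the elements we want and iterates over them. On the first iteration it stores the date in a string and starts a list of (time, title) tuples. The next time a date is found, the previous date and its list of tuples are added to the dictionary. When the iteration ends, the final date and its list of tuples are added to the dictionary. It is a bit messy, but that is due to the lack of structure in the HTML.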
from bs4 import BeautifulSoup
import requests
import re
import pprint
url = 'https://www.mezzrow.com/'
r = requests.get(url)
soup = BeautifulSoup(r.text , 'lxml')
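# Select every element whose class matches: dt.purple (date),
# dd.orange.event-date (time) and dd.event (title), in document order.
ds = soup.find_all(True, {'class': re.compile('purple|event|orange event-date')})
ret = {}    # date -> list of (time, title) tuples
tmp = []    # events collected for the current date
i = None    # current date text
for d in ds:
    if d.attrs['class'] == ['purple']:
        # A new date starts: store the events gathered for the previous one.
        if i is not None:
            ret[i] = tmp
            tmp = []
        i = d.get_text()
    elif d.attrs['class'] == ['orange', 'event-date']:
        j = d.get_text()    # remember the time until its title arrives
    elif d.attrs['class'] == ['event']:
        tmp.append((j, d.get_text(strip=True)))
# Store the events of the last date on the page.
ret[i] = tmp
pp = pprint.PrettyPrinter(depth=6)
pp.pprint(ret)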
Output:
{'Fri, October 12th, 2018': [('8:00 PM to 10:30 PM',
'Rossano Sportiello, Pasquale Grasso & Frank '
'Tate'),
('11:00 PM to 2:00 AM',
'Ben Paterson "After-hours"')],
'Fri, October 5th, 2018': [('8:00 PM to 10:30 PM',
'Vanessa Rubin, Brandon McCune, Kenny Davis & '
'Winard Harper'),
('11:00 PM to 2:00 AM',
'Joe Davidian "After-hours"')],
'Mon, October 1st, 2018': [('8:00 PM to 10:30 PM',
'Greg Ruggiero, Murray Wall & Steve Little'),
('11:00 PM to 1:00 AM',
'Pasquale Grasso "After-hours"')],
....
Then select the date you need from the dict object.
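As a minimal sketch of that last step, assuming `ret` has the shape printed above (a date string mapped to a list of (time, title) tuples): the helper name `events_for` is hypothetical, and the sample data is copied from the HTML in the question rather than fetched from the live site.

```python
# Stand-in for the `ret` dict built above; entries are copied from the
# sample HTML in the question, not scraped.
ret = {
    "Sun, September 30th, 2018": [
        ("4:30 PM to 7:00 PM", "Mezzrow Classical Salon with David Oei"),
        ("8:00 PM to 10:30 PM", "Luke Sellick, Ron Blake & Adam Birnbaum"),
        ("11:00 PM to 1:00 AM", 'Ryo Sasaki & Friends "After-hours"'),
    ],
    "Mon, October 1st, 2018": [
        ("8:00 PM to 10:30 PM", "Greg Ruggiero, Murray Wall & Steve Little"),
        ("11:00 PM to 1:00 AM", 'Pasquale Grasso "After-hours"'),
    ],
}


def events_for(schedule, date):
    """Return {time: title} for one date, or an empty dict if the date is absent."""
    # (hypothetical helper) dict() turns the list of 2-tuples into a mapping.
    return dict(schedule.get(date, []))


print(events_for(ret, "Mon, October 1st, 2018"))
```

Because each date has its own list, a time slot that repeats on another date no longer overwrites anything.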
You can visit this page to get a brand-new HTML scrape package (Java) that I wrote. Java is better than Python; if you disagree, that is up to you!
Download it here: http://developer.torello.directory/JavaHTML/index.html
import Torello.HTML.*;
import Torello.Java.*;
import java.util.*;
import java.util.regex.*;
import java.io.*;
public class ScrapeDD
{
    public static void main(String[] argv) throws IOException
    {
        Pattern P = Tags.getPattern("dd", "class");
        String ddData = FileRW.loadFileToString("DDData.html");
        Vector<HTMLNode> page = HTMLPage.getPageTokens(ddData, false);
        int ddPos = -1;
        while (true)
        {
            ddPos = TagNodeFind.first(page, ddPos + 1, -1, TC.OpeningTags, "dd");
            if (ddPos == -1) break;
            Vector<HTMLNode> ddPair = TagNodeGet.firstInclusive(page, ddPos, -1, "dd");
            System.out.println("DD.class = " + Tags.getInnerTagValue((TagNode) page.elementAt(ddPos), P));
            for (HTMLNode n : ddPair)
                if (n instanceof TextNode) if (n.str.trim().length() > 0)
                    System.out.println(Escape.replaceAll(n.str));
        }
    }
}
Produces this output:
DD.class = orange event-date
4:30 PM to 7:00 PM
DD.class = event
Mezzrow Classical Salon with David Oei
DD.class = orange event-date
8:00 PM to 10:30 PM
DD.class = event
Luke Sellick, Ron Blake & Adam Birnbaum
DD.class = orange event-date
11:00 PM to 1:00 AM
DD.class = event
Ryo Sasaki & Friends "After-hours"
DD.class = orange event-date
8:00 PM to 10:30 PM
DD.class = event
Greg Ruggiero, Murray Wall & Steve Little
DD.class = orange event-date
11:00 PM to 1:00 AM
DD.class = event
Pasquale Grasso "After-hours"