How to find the next instance of text in HTML using Beautiful Soup?

I'm writing a program that looks up the National Food Holiday for the current day using this site: https://foodimentary.com/today-in-national-food-holidays/may-holidays/

So far I've been able to consistently get the tag with the current date, but I'm having trouble using that as a base reference to get the associated Food Day. Here's what I have so far:

from datetime import date

import requests
from bs4 import BeautifulSoup

today = date.today()
month = today.strftime('%b')  # Abbreviated month name, e.g. 'May'
day = today.strftime('%d')  # Zero-padded day of the month, e.g. '29'
date_slug = f'{month.lower()}-{day}'  # e.g. 'may-29'; named so it does not shadow datetime.date

# Get HTML from home page
url = 'https://foodimentary.com/today-in-national-food-holidays/todayinfoodhistorycalenderfoodnjanuary/'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')  # Parse HTML with Beautiful Soup

# Get the current month URL
months = soup.find('ul', id='menu-months', class_='menu')  # Isolate the months menu
monthUrl = months.find('a', href=True, string=month)['href']  # Link whose text equals the month name

# Get HTML from month page, parse
r = requests.get(monthUrl)
soup = BeautifulSoup(r.text, 'html.parser')

# Find the first link whose URL contains the formatted date
holidayTag = soup.select_one(f'a[href*="{date_slug}"]')
print(holidayTag)

# TODO: Get the name of the food day based on holidayTag

Using my browser's developer console, the most consistent pattern I can find for correlating the date with the name of the food holiday is that the holiday is always the next instance of text after the date tag. Here's an example piece of HTML:



<div style="text-align:center;">
   <strong><a title="May&nbsp;29" href="https://foodimentaryguy.wordpress.com/2011/05/29/may-29/">May 29</a></strong><br>
   <span style="color:#000000;"><a style="color:#000000;" href="https://foodimentary.com/2017/02/12/february-12th-is-national-biscotti-day/">National Biscuit Day</a></span>
   <div style="text-align:center;"><strong><a title="May&nbsp;28" href="https://foodimentaryguy.wordpress.com/2011/05/28/may-28/">May 28</a></strong><br>
      <span style="color:#000000;"><a style="color:#000000;" href="https://foodimentary.com/2016/05/28/may-28-is-national-brisket-day/">National Brisket Day</a></span>
   </div>
</div>

My question is: how can I use Beautiful Soup to get the name of the holiday from the date tag?

The page text is very unstructured (most probably written by hand rather than machine generated). I recommend extracting the plain text and using the re module for the main parsing:

import re

import requests
from bs4 import BeautifulSoup

url = 'https://foodimentary.com/today-in-national-food-holidays/may-holidays/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
txt = soup.select_one('section[role="main"]').text

out = {}
for day, names in re.findall(r'^([A-Z][^\n]+\d\s*)$(.*?)\n\n', txt, flags=re.DOTALL|re.M):
    out[day.strip()] = [name.replace('\xa0', ' ') for name in names.strip().split('\n')]

# pretty print on screen:
from pprint import pprint
pprint(out)

Prints:

{'May 1': ['National Chocolate Parfait Day'],
 'May 10': ['National Liver and Onions Day'],
 'May 11': ['National “Eat What You Want” Day'],
 'May 12': ['National Nutty Fudge Day'],
 'May 13': ['National Apple Pie Day',
            'National Fruit Cocktail Day',
            'National Hummus Day'],
 'May 14': ['National Brioche Day', 'National Buttermilk Biscuit Day'],
 'May 15': ['National Chocolate Chip Day'],
 'May 16': ['National Barbecue Day'],
 'May 17': ['National Cherry Cobbler Day'],
 'May 18': ['National Cheese Souffle Day', 'I love Reese’s Day'],
 'May 19': ['National Devil’s Food Cake Day'],
 'May 2': ['National Chocolate Truffle Day'],
 'May 20': ['National Quiche Lorraine Day', 'National Pick Strawberries Day'],
 'May 21': ['National Strawberries and Cream Day'],
 'May 22': ['National Vanilla Pudding Day'],
 'May 23': ['National Taffy Day'],
 'May 24': ['National Escargot Day'],
 'May 25': ['National Brown-Bag-It Day', 'National Wine Day'],
 'May 26': ['National Blueberry Cheesecake Day', 'National Cherry Dessert Day'],
 'May 27': ['National Italian Beef Day', 'National Grape Popsicle Day'],
 'May 28': ['National Brisket Day'],
 'May 29': ['National Biscuit Day'],
 'May 3': ['National Raspberry Popover Day',
           'National Raspberry Tart Day',
           'National Chocolate Custard Day'],
 'May 30': ['National Mint Julep Day'],
 'May 31': ['National Macaroon Day'],
 'May 4': ['National Candied Orange Peel Day',
           'National Homebrew Day',
           'National Hoagie Day'],
 'May 5': ['National Enchilada Day – Happy Cinco de Mayo!'],
 'May 6': ['National Crepe Suzette Day'],
 'May 7': ['National Roast Leg of Lamb Day'],
 'May 8': ['National Coconut Cream Pie Day'],
 'May 9': ['National Shrimp Day', 'National Foodies Day*']}
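If you would rather start from the date tag, as the question does, Beautiful Soup's find_next method may be enough. The following is only a minimal sketch run against the sample HTML from the question, not against the live page. The helper name holiday_after is hypothetical, and the "/may-29/" URL shape (slashes on both sides of the date) is an assumption taken from that sample.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question (two date/holiday pairs).
html = """
<div style="text-align:center;">
   <strong><a title="May&nbsp;29" href="https://foodimentaryguy.wordpress.com/2011/05/29/may-29/">May 29</a></strong><br>
   <span style="color:#000000;"><a style="color:#000000;" href="https://foodimentary.com/2017/02/12/february-12th-is-national-biscotti-day/">National Biscuit Day</a></span>
   <div style="text-align:center;"><strong><a title="May&nbsp;28" href="https://foodimentaryguy.wordpress.com/2011/05/28/may-28/">May 28</a></strong><br>
      <span style="color:#000000;"><a style="color:#000000;" href="https://foodimentary.com/2016/05/28/may-28-is-national-brisket-day/">National Brisket Day</a></span>
   </div>
</div>
"""


def holiday_after(soup, date_slug):
    """Hypothetical helper: text of the first link that follows the date link."""
    # Assumption: date links end in "/<slug>/", as in the sample above.
    date_tag = soup.select_one(f'a[href*="/{date_slug}/"]')
    if date_tag is None:
        return None
    # find_next walks forward in document order, regardless of nesting.
    next_link = date_tag.find_next('a')
    return next_link.get_text(strip=True) if next_link else None


soup = BeautifulSoup(html, 'html.parser')
print(holiday_after(soup, 'may-29'))  # National Biscuit Day
print(holiday_after(soup, 'may-28'))  # National Brisket Day
```

Note that this returns only the first link after the date. Several days in the output above list more than one holiday, so on the real page you would probably need find_all_next('a') and stop when you reach the next date link.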
