Beautiful Soup automatically converting string to time format?
I'm trying to scrape a div which has 'time' information from a website (using Beautiful Soup + Selenium):
from bs4 import BeautifulSoup as bs
from selenium import webdriver

# Note: this uses the Selenium 3 API (chrome_options=, find_element_by_*)
options = webdriver.ChromeOptions()
options.add_argument('--no-sandbox')
options.add_argument('--window-size=1420,1080')
options.add_argument('--headless')
options.add_argument('--disable-dev-shm-usage')
options.add_argument('--disable-gpu')
options.add_argument("--disable-notifications")
options.add_experimental_option('useAutomationExtension', False)
options.binary_location = '/usr/bin/google-chrome-stable'
chrome_driver_binary = "/usr/bin/chromedriver"
driver = webdriver.Chrome(chrome_driver_binary, chrome_options=options)
# Set base URL (San Francisco)
base_url = 'https://www.bandsintown.com/?place_id=ChIJIQBpAG2ahYAR_6128GcTUEo&page='
events = []
eventContainerBucket = []
for i in range(1, 35):
    # cycle through pages in range
    pageURL = base_url + str(i)
    driver.get(pageURL)
    print(pageURL)
    # get event links
    event_list = driver.find_elements_by_css_selector('div[class^=_3buUBPWBhUz9KBQqgXm-gf] a[class^=_3UX9sLQPbNUbfbaigy35li]')
    # collect the href attribute of each event in event_list
    events.extend(event.get_attribute("href") for event in event_list)
# iterate through all events and open them.
item = {}
allEvents = []
for event in events:
    # open the event page before reading its elements
    driver.get(event)
    soup = bs(driver.find_element_by_css_selector('[class^=Y_sOCKLIZzxDZWauPTJlk]').get_attribute('outerHTML'), 'html.parser')
    soup2 = bs(driver.find_element_by_css_selector('[class^=_2j34xcqD4slSOyTCMbA1dY]').get_attribute('outerHTML'), 'html.parser')
    # Get time
    time = soup.select_one('img + div + div').text
    print(time)
This keeps converting the time to UTC when I don't want it to. I just want to pull the raw text for each time, e.g. 9:00 PM. I've tried parsing the raw string right away, so it just grabs the string:
time = soup.select_one('img + div + div').text
' '.join(time.split(' ')[0:2])
#time.replace('UTC','')
print(time)
But it's still printing out with UTC, e.g. 2:00 AM UTC.
Is there a way to pull just the raw string, before it's automatically converted to a time? I don't want to deal with time zones, and I don't think I need to for this project. I just want the raw string.
I am not sure why you are using Beautiful Soup select. Can you just get the text of the element using Selenium?
for event in events:
    # using locator from your example below, although it did not work for me
    element = driver.find_element_by_css_selector('[class^=Y_sOCKLIZzxDZWauPTJlk]')
    # Get time
    time = element.text
    print(time)
Output:
6:00 PM PDT
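If the trailing time zone label is still unwanted, it can be dropped with plain string handling. Note that Python strings are immutable: str.join and str.replace return a new string, so the result has to be assigned. The attempt in the question discards the joined value, which would explain why print(time) still showed the UTC suffix. Below is a minimal sketch, assuming the text always has the form "6:00 PM PDT"; the helper name strip_timezone is hypothetical and not part of Selenium or Beautiful Soup.

```python
def strip_timezone(raw_time: str) -> str:
    """Keep only the clock time and the AM/PM marker.

    Assumes input shaped like '6:00 PM PDT'; anything after the
    second whitespace-separated token is dropped.
    """
    parts = raw_time.split()
    # join() returns a new string, so return (or assign) the result
    return ' '.join(parts[:2])


time_text = '6:00 PM PDT'
time_text = strip_timezone(time_text)  # reassign; the original string is unchanged
print(time_text)
```

In the scraping loop, the same call could be applied to element.text before printing or storing the value.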
Not sure this is what you are looking for, but hopefully this is helpful. Good luck!