Beautiful Soup automatically converting string to time format?

I'm trying to scrape a div that contains 'time' information from a website (using BeautifulSoup + Selenium):

from bs4 import BeautifulSoup as bs
from selenium import webdriver

# Configure headless Chrome
options = webdriver.ChromeOptions()
options.add_argument('--no-sandbox')
options.add_argument('--window-size=1420,1080')
options.add_argument('--headless')
options.add_argument('--disable-dev-shm-usage')
options.add_argument('--disable-gpu')
options.add_argument("--disable-notifications")
options.add_experimental_option('useAutomationExtension', False)
options.binary_location = '/usr/bin/google-chrome-stable'
chrome_driver_binary = "/usr/bin/chromedriver"
driver = webdriver.Chrome(chrome_driver_binary, chrome_options=options)

#Set base url (San Francisco)
base_url = 'https://www.bandsintown.com/?place_id=ChIJIQBpAG2ahYAR_6128GcTUEo&page='


events = []
eventContainerBucket = []

for i in range(1, 35):
    # cycle through pages in range
    pageURL = base_url + str(i)
    driver.get(pageURL)
    print(pageURL)

    # get event links
    event_list = driver.find_elements_by_css_selector('div[class^=_3buUBPWBhUz9KBQqgXm-gf] a[class^=_3UX9sLQPbNUbfbaigy35li]')
    # collect href attribute of events in event_list
    events.extend(event.get_attribute("href") for event in event_list)


# iterate through all events and open them.
item = {}
allEvents = []
for event in events:
    driver.get(event)

    soup = bs(driver.find_element_by_css_selector('[class^=Y_sOCKLIZzxDZWauPTJlk]').get_attribute('outerHTML'), 'html.parser')
    soup2 = bs(driver.find_element_by_css_selector('[class^=_2j34xcqD4slSOyTCMbA1dY]').get_attribute('outerHTML'), 'html.parser')

    # Get time
    time = soup.select_one('img + div + div').text
    print(time)

This keeps converting the time to UTC when I don't want it to. I just want to pull the raw text for each time, e.g. 9:00 PM. I've tried parsing the raw string right away, so that it just grabs the string:

time = soup.select_one('img + div + div').text
' '.join(time.split(' ')[0:2])
#time.replace('UTC','')

print(time)

But it's still printing out with UTC, e.g. 2:00 AM UTC.

Is there a way to pull just the raw string, before it's automatically converted to a time? I don't want to deal with time zones, and I don't think I need to for this project. I just want the raw string.
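One detail in the second snippet of the question may explain why the "UTC" suffix survives: Python strings are immutable, so `' '.join(time.split(' ')[0:2])` returns a new string and leaves `time` untouched unless the result is assigned. The following is a minimal sketch of that behaviour, assuming the scraped text looks like the `2:00 AM UTC` example from the question; `keep_time_only` is a hypothetical helper name, not part of the original code.

```python
def keep_time_only(text):
    """Return the first two space-separated tokens, e.g. '2:00 AM'."""
    return ' '.join(text.split(' ')[0:2])


raw = "2:00 AM UTC"      # example value taken from the question

keep_time_only(raw)      # result is discarded, so raw is unchanged
print(raw)               # 2:00 AM UTC

raw = keep_time_only(raw)  # assign the result to keep it
print(raw)               # 2:00 AM
```

Note that this only trims the label. If the page itself renders the time in the browser's time zone, the hour would still be the UTC hour.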

I am not sure why you are using Beautiful Soup's select. Can you just get the text of the element using Selenium?

for event in events:
    # using the locator from your example above, although it did not work for me
    element = driver.find_element_by_css_selector('[class^=Y_sOCKLIZzxDZWauPTJlk]')

    # Get time
    time = element.text
    print(time)

Output:

6:00 PM PDT
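If only the clock time is wanted from text like the output above, a regular expression is one way to drop the trailing zone abbreviation. This is a sketch under the assumption that the element text contains a 12-hour time such as `6:00 PM`; `TIME_PATTERN` and `extract_clock_time` are hypothetical names introduced for illustration.

```python
import re

# Matches a 12-hour clock time such as "6:00 PM" or "11:30am".
TIME_PATTERN = re.compile(r'\b\d{1,2}:\d{2}\s?[AP]M\b', re.IGNORECASE)


def extract_clock_time(text):
    """Return the first 12-hour time found in text, or None if there is none."""
    match = TIME_PATTERN.search(text)
    return match.group(0) if match else None


print(extract_clock_time("6:00 PM PDT"))   # 6:00 PM
print(extract_clock_time("2:00 AM UTC"))   # 2:00 AM
print(extract_clock_time("TBA"))           # None
```

Returning `None` when nothing matches lets the caller decide how to handle events that list no time.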

Not sure this is what you are looking for, but hopefully this is helpful. Good luck!
