
get text in div after specific character [xpath]

I am using XPath to scrape a website. I've been able to access most of the information I need except for the date. The date is text in a div, formatted as shown below.

October 13, 2018 / 1:31 AM / Updated 5 hours ago

I just want to get the date, not the time or anything else. However, with my current code, I am getting the entire text in the div. My code is below.

item['datePublished'] = response.xpath("//div[contains(@class, 'ArticleHeader_date') and substring-before(., '/')]/text()").extract()

As hinted, there are ways to do this in XPath 2.0+. However, this is better done in the host language (Python, in this case).
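For instance, the closest Python counterpart to XPath's substring-before is str.partition. The following is only a minimal sketch, assuming the div's text has already been extracted as a single string; the names date_before_slash and raw_text are illustrative and do not come from the question.

```python
def date_before_slash(raw_text):
    """Return the part of raw_text before the first '/', without surrounding whitespace."""
    # partition() splits on the first '/' only and never raises;
    # if there is no '/', the whole string ends up in `head`.
    head, _sep, _tail = raw_text.partition("/")
    return head.strip()


# Sample value taken from the question
raw_text = "October 13, 2018 / 1:31 AM / Updated 5 hours ago"
print(date_before_slash(raw_text))  # October 13, 2018
```

This relies on the date always being the first '/'-separated field, which holds for the sample shown but may not hold for every page.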

One way is to extract the date with a regex after the value has been retrieved:

\w+\ \d\d?,\ \d{4}

Code sample:

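import re

# Month name, one- or two-digit day, comma, four-digit year
regex = r"\w+\ \d\d?,\ \d{4}"
test_str = "October 13, 2018 / 1:31 AM / Updated 5 hours ago"

# search() returns the first match anywhere in the string, or None
matches = re.search(regex, test_str)
if matches:
    print(matches.group())  # October 13, 2018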
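If the date is needed as a real date object rather than a string, the regex match can be handed to datetime.strptime. This is a rough sketch rather than part of the original answer: the helper name parse_published_date is hypothetical, and it assumes English month names (the default C locale), since %B is locale-dependent.

```python
import re
from datetime import datetime

# Same pattern as above, compiled once for reuse
DATE_RE = re.compile(r"\w+ \d\d?, \d{4}")


def parse_published_date(raw_text):
    """Return a datetime.date parsed from raw_text, or None if no valid date is found."""
    match = DATE_RE.search(raw_text)
    if match is None:
        return None
    try:
        # %B = full month name, %d = day of month, %Y = four-digit year
        return datetime.strptime(match.group(), "%B %d, %Y").date()
    except ValueError:
        # The regex matched something that is not a real date, e.g. "Foo 13, 2018"
        return None


print(parse_published_date("October 13, 2018 / 1:31 AM / Updated 5 hours ago"))
```

In a scraper, the string passed in would be the text pulled from the div, and a None result signals that the page layout differed from what was expected.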
