I am using XPath to scrape a website. I've been able to access most of the information I need, except for the date. The date is text in a div, formatted as shown below.
October 13, 2018 / 1:31 AM / Updated 5 hours ago
I just want to get the date, not the time or anything else. However, with my current code, I am getting the entire text in the div. My code is below.
item['datePublished'] = response.xpath("//div[contains(@class, 'ArticleHeader_date') and substring-before(., '/')]/text()").extract()
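As hinted, there are ways to do this in XPath 2.0+. However, this is better done in the host language (Python here). Note that in your expression substring-before(., '/') sits inside a predicate, so it only filters which nodes are selected; it does not change the text that /text() returns.
One way is to extract the date using a regex after the value has been retrieved, e.g.: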
\w+\ \d\d?,\ \d{4}
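import re
# Month name, one- or two-digit day, comma, four-digit year
regex = r"\w+\ \d\d?,\ \d{4}"
test_str = "October 13, 2018 / 1:31 AM / Updated 5 hours ago"
# search() scans the whole string, so leading whitespace is not a problem
matches = re.search(regex, test_str)
if matches:
    print(matches.group())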
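To tie this back to the spider code in the question, here is a minimal sketch of applying the same regex to the list of strings that .extract() returns. The helper name extract_date and the sample scraped list are hypothetical, made up for illustration; the sketch also assumes the month name is in English so that strptime can parse it.

```python
import re
from datetime import datetime

# Same pattern as above: month name, day, comma, four-digit year.
DATE_RE = re.compile(r"\w+ \d\d?, \d{4}")


def extract_date(fragments):
    """Return the first date found in a list of text fragments, or None."""
    for fragment in fragments:
        match = DATE_RE.search(fragment)
        if match:
            return match.group()
    return None


# Hypothetical stand-in for what .extract() might return for the date div.
scraped = ["\n  October 13, 2018 / 1:31 AM / Updated 5 hours ago\n"]

date_text = extract_date(scraped)
print(date_text)

# Optional: convert the matched string into a date object.
if date_text:
    published = datetime.strptime(date_text, "%B %d, %Y").date()
    print(published.isoformat())
```

In the spider, this would presumably be used as item['datePublished'] = extract_date(response.xpath("//div[contains(@class, 'ArticleHeader_date')]/text()").extract()). Scrapy selectors also provide a .re_first() method that applies a regex directly to the selected text, which may be shorter if no further processing is needed.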