get text in div after specific character [xpath]

Question

I am using XPath to scrape a website. I've been able to access most of the information I need except for the date. The date is text in a div, formatted as shown below.

October 13, 2018 / 1:31 AM / Updated 5 hours ago

I just want to get the date, not the time or anything else. However, with my current code, I am getting the entire text in the div. My code is below.

item['datePublished'] = response.xpath("//div[contains(@class, 'ArticleHeader_date') and substring-before(., '/')]/text()").extract()

Answer 1

As hinted, there are ways to do this in XPath 2.0+. However, this should be done in the host language.

One way is to extract the date with a regex after the value has been retrieved, e.g.:

\w+\ \d\d?,\ \d{4}

Code sample:

import re

# Month name, one- or two-digit day, comma, four-digit year
regex = r"\w+\ \d\d?,\ \d{4}"
test_str = "October 13, 2018 / 1:31 AM / Updated 5 hours ago"
match = re.search(regex, test_str)
if match:
    print(match.group())  # October 13, 2018
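
If the date is needed as a real date value rather than a string, the same regex can be wrapped in a small helper. The following is only a sketch: the name extract_date is hypothetical, and it assumes the month is always a full English month name, as in the example text from the question.

```python
import re
from datetime import datetime

# Same pattern as in the answer, compiled once for reuse.
DATE_RE = re.compile(r"\w+\ \d\d?,\ \d{4}")


def extract_date(raw):
    """Return the first 'Month D, YYYY' date in raw as a datetime.date, or None.

    Hypothetical helper for illustration; assumes full English month names.
    """
    match = DATE_RE.search(raw)
    if match is None:
        return None
    try:
        # %B matches a full month name such as "October" (locale-dependent).
        return datetime.strptime(match.group(), "%B %d, %Y").date()
    except ValueError:
        # The regex matched something that is not a valid date.
        return None


header = "October 13, 2018 / 1:31 AM / Updated 5 hours ago"
print(extract_date(header))
```

In the spider, the string passed to the helper would presumably be the text of the div selected without the substring-before predicate, for example the result of calling extract_first() on "//div[contains(@class, 'ArticleHeader_date')]/text()".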
