I am using Python's bs4 (BeautifulSoup) to extract the date and time from this HTML code
[<time class="published-date relative-date" data-published-date="2020-07-21T18:49:14Z" datetime="2020-07-21T18:49:14Z"></time>, <time class="published-date relative-date" data-published-date="2020-07-21T18:48:26Z" datetime="2020-07-21T18:48:26Z"></time>, <time class="published-date relative-date" data-published-date="2020-07-21T18:47:00Z" datetime="2020-07-21T18:47:00Z"></time>, <time class="published-date relative-date" data-published-date="2020-07-21T18:43:21Z" datetime="2020-07-21T18:43:21Z"></time>]
and was wondering how I can get rid of the other text aside from the date and time. For example, I would like to take '2020-07-21T18:49:14Z' and have it displayed as '2020-07-21', '18:49:14Z'.
Here is my code so far:
date_and_time=soup.find_all('time', attrs={'class':'published-date relative-date'})
You can use
soup.find(id=<ID OF TIME>)
Then you will get only that one time element. If you use find_all, you will get every element that matches the attributes.
You can also just split the text you have right now:
date_and_time = '2020-07-21T18:49:14Z'
print(date_and_time.split('T'))
['2020-07-21', '18:49:14Z']
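To apply the same split to every element returned by find_all, a minimal sketch could look like the following. The list `timestamps` is a hypothetical stand-in for the values you would collect with `t.get('datetime')` from each tag, and the helper name `split_timestamp` is an assumption for illustration, not part of the original answer.

```python
# Hypothetical stand-in for: [t.get('datetime') for t in date_and_time]
timestamps = [
    '2020-07-21T18:49:14Z',
    '2020-07-21T18:48:26Z',
]


def split_timestamp(value):
    """Split a string such as '2020-07-21T18:49:14Z' at the first 'T'."""
    date_part, time_part = value.split('T', 1)
    return date_part, time_part


# One (date, time) tuple per timestamp
pairs = [split_timestamp(ts) for ts in timestamps]
print(pairs)
```

Note that this keeps the trailing 'Z' on the time part, as in the question's desired output; it does no validation of the string.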
This script will create a pandas DataFrame with time and date columns:
import pandas as pd
from bs4 import BeautifulSoup
html_string = '''
<time class="published-date relative-date" data-published-date="2020-07-21T18:49:14Z" datetime="2020-07-21T18:49:14Z"></time>
'''
soup = BeautifulSoup(html_string, 'html.parser')
all_data = []
for t in soup.select('time.published-date.relative-date'):
    all_data.append(t.get('data-published-date'))
df = pd.DataFrame(all_data)
df[0] = pd.to_datetime(df[0])
df['date'] = df[0].dt.date
df['time'] = df[0].dt.time
print(df)
Prints:
                          0        date      time
0 2020-07-21 18:49:14+00:00  2020-07-21  18:49:14
You can use dateutil to parse the raw date-time string. Install it with pip using the command pip install python-dateutil
from bs4 import BeautifulSoup
from dateutil import parser
text = '<time class="published-date relative-date" data-published-date="2020-07-21T18:49:14Z" datetime="2020-07-21T18:49:14Z"></time>'
# Name the parser explicitly to avoid bs4's "no parser specified" warning
soup = BeautifulSoup(text, 'html.parser')
for t in soup.find_all('time', attrs={'class':'published-date relative-date'}):
    date_time_str = t.get('datetime')
    # parser.parse returns a timezone-aware datetime for the trailing 'Z'
    date_time = parser.parse(date_time_str)
    print(date_time.date())
    print(date_time.time())
Outputs:
2020-07-21
18:49:14
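If installing dateutil is not an option, a similar result may be possible with only the standard library. This is a sketch of a swapped-in technique (datetime.strptime), not what the answer above uses, and it assumes every timestamp has exactly the form shown in the question, including the trailing 'Z'. The name `parse_utc_timestamp` is hypothetical.

```python
from datetime import datetime, timezone


def parse_utc_timestamp(value):
    """Parse a string like '2020-07-21T18:49:14Z' into a UTC datetime.

    Assumes the fixed format shown in the question; any other layout
    raises ValueError.
    """
    naive = datetime.strptime(value, '%Y-%m-%dT%H:%M:%SZ')
    # The literal 'Z' means UTC, so attach that timezone explicitly
    return naive.replace(tzinfo=timezone.utc)


date_time = parse_utc_timestamp('2020-07-21T18:49:14Z')
print(date_time.date())
print(date_time.time())
```

Unlike dateutil, strptime does not guess at the format, so inputs with fractional seconds or a numeric UTC offset would need a different format string.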