The following is an example of the HTML code I want to parse:
<td style="padding-left:5px;" title="col1 : val1
col2 : val2">
There are several such rows. I am using Beautiful Soup to parse the HTML and select the 'td' elements as follows:
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
tbody = soup.select_one('tbody')
tds = tbody.find_all("td")
In this example, how can I extract col1=val1 and col2=val2 into a DataFrame? It is hard for me because the values are inside the title attribute. The result I want looks like this:
col1    col2
==============
val1    val2
val1-2  val2-2
...
I tried this:
tds.attrs['title']
but it does not work.
Please give me a hint.
Based on the data you have provided, I have used re to separate out the title value, put the pieces in a dict, and converted it to a DataFrame. I have added an extra <td> to simulate getting data from all the <td> elements.
import re
import pandas as pd
from bs4 import BeautifulSoup
s = '''<td style="padding-left:5px;" title="col1 : val1 col2 : val2">Data1</td>
<td style="padding-left:5px;" title="col1 : val3 col2 : val4">Data2</td>'''
soup = BeautifulSoup(s, 'lxml')
tds = soup.find_all('td')
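# Collect the values column by column
d = {'col1': [], 'col2': []}
for i in tds:
    # Read the title attribute of each <td> individually
    title = i['title'].strip()
    # Capture the value after "col1 :" and the value after "col2 :"
    f = re.findall(r'col1\s:\s(.*)\scol2\s:\s(.*?)$', title)[0]
    d['col1'].append(f[0])
    d['col2'].append(f[1])
df = pd.DataFrame(d)
print(df)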
   col1  col2
0  val1  val2
1  val3  val4
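As for why tds.attrs['title'] fails: find_all returns a list-like ResultSet, not a single tag, so the attribute has to be read from each element in turn. The sketch below is an alternative that does not use a regex. It assumes, as in the HTML from the question, that each title holds one "key : value" pair per line. The helper name parse_title and the sample markup are made up for illustration and are not part of the original post.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Sample markup modeled on the question (an assumption, not real page data):
# each title attribute holds one "key : value" pair per line.
html = '''<table><tbody>
<tr><td style="padding-left:5px;" title="col1 : val1
col2 : val2">Data1</td></tr>
<tr><td style="padding-left:5px;" title="col1 : val1-2
col2 : val2-2">Data2</td></tr>
</tbody></table>'''


def parse_title(title):
    # Hypothetical helper: turn the lines of a title into a {key: value} dict.
    row = {}
    for line in title.splitlines():
        key, sep, value = line.partition(':')
        if sep:  # skip lines that contain no "key : value" pair
            row[key.strip()] = value.strip()
    return row


soup = BeautifulSoup(html, 'html.parser')
# title=True keeps only the <td> tags that actually carry a title attribute
rows = [parse_title(td['title']) for td in soup.find_all('td', title=True)]
df = pd.DataFrame(rows)
print(df)
```

Because the column names come from the title text itself, this version should also cope with more than two keys per cell. If the pairs on the real page are separated by something other than a line break, the splitting step would need to change.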