Python Beautiful Soup: select text in attrs

The following is an example of the HTML code I want to parse:

<td style="padding-left:5px;" title="col1 : val1
 col2 : val2">

There are several rows. I am using Beautiful Soup to parse the HTML by selecting the 'td' elements as follows:

html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')

# `table` is the <table> element selected earlier (that step is not shown here)
tbody = table.select_one('tbody')
tds = tbody.find_all("td")

In this example, how can I extract col1=val1, col2=val2 into a DataFrame? It is hard for me because the values are in the attributes.

col1   col2
==========
val1   val2
val1-2 val2-2
.
.
.

I tried this:

tds.attrs['title']

but my code is not working.

Please give me a hint.

Based on the data you have provided, I used re to split the title value into its parts, collected them in a dict, and converted that to a DataFrame.

I have added an extra <td> to simulate getting data from all the <td> elements.

import re
import pandas as pd
from bs4 import BeautifulSoup

# Two sample cells; the second one is added to simulate several rows.
s = '''<td style="padding-left:5px;" title="col1 : val1 col2 : val2">Data1</td>
 <td style="padding-left:5px;" title="col1 : val3 col2 : val4">Data2</td>'''

soup = BeautifulSoup(s, 'lxml')
tds = soup.find_all('td')
d = {'col1': [], 'col2': []}
for td in tds:
    # find_all returns a list of tags, so read the attribute from each tag.
    title = td['title'].strip()
    # \s+ also tolerates the line break between the pairs shown in the question.
    col1, col2 = re.findall(r'col1\s*:\s*(.*?)\s+col2\s*:\s*(.*)$', title)[0]
    d['col1'].append(col1)
    d['col2'].append(col2)

df = pd.DataFrame(d)
print(df)
   col1  col2
0  val1  val2
1  val3  val4
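
A possible variation, offered as a sketch and not as the answerer's method: the attempt tds.attrs['title'] fails because find_all returns a list-like ResultSet, and only the individual tags inside it carry attributes. The version below reads title from each tag and parses every "name : value" pair without hard-coding the column names. The sample markup, the PAIR pattern, and the parse_title helper are made up for illustration, and the pattern assumes that names and values contain no whitespace.

```python
import re

import pandas as pd
from bs4 import BeautifulSoup

# Sample markup modelled on the question; the second cell is invented for illustration.
html = '''<table><tbody><tr>
<td style="padding-left:5px;" title="col1 : val1
 col2 : val2">Data1</td>
<td style="padding-left:5px;" title="col1 : val1-2
 col2 : val2-2">Data2</td>
</tr></tbody></table>'''

# Matches "name : value" pairs; assumes neither part contains whitespace.
PAIR = re.compile(r'([^\s:]+)\s*:\s*(\S+)')


def parse_title(title):
    # Hypothetical helper: maps each name in the title text to its value.
    return dict(PAIR.findall(title))


soup = BeautifulSoup(html, 'html.parser')

# title=True keeps only the cells that actually have a title attribute.
rows = [parse_title(td['title']) for td in soup.find_all('td', title=True)]

# One dict per cell becomes one DataFrame row; the keys become the columns.
df = pd.DataFrame(rows)
print(df)
```

Building a list of dicts keeps the column names out of the code, so a title with a third pair would simply produce a third column.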


 