Beautiful Soup - Extract data contained only in td tags (without div, id, class…)
I'm new to Beautiful Soup. I have data like the table below, which contains 3 sets of user data in this case.
I want to get all the information for each USER_ID and save it to a database.
<table align="center" border="0" style="width:550px">
<tbody>
<tr>
<td colspan="2">USER_ID 11111</td>
</tr>
<tr>
<td colspan="2">string_a</td>
</tr>
<tr>
<td colspan="2"><strong>content: aaa</strong></td>
</tr>
<tr>
<td colspan="2"><strong>date:</strong>2020-05-01 00:00:00 To 2020-05-03 23:59:59</td>
</tr>
<tr>
<td colspan="2"><strong>URL:https://aaa.com</strong></td>
</tr>
<tr>
<td colspan="2"> </td>
</tr>
<tr>
<td colspan="2"> </td>
</tr>
<tr>
<td colspan="2">USER_ID 22222</td>
</tr>
<tr>
<td colspan="2">string_b</td>
</tr>
<tr>
<td colspan="2"><strong>content: bbb</strong></td>
</tr>
<tr>
<td colspan="2"><strong>date:</strong>2020-05-01 00:00:00 To 2020-05-03 23:59:59</td>
</tr>
<tr>
<td colspan="2"><strong>URL:https://aaa.com</strong></td>
</tr>
<tr>
<td colspan="2"> </td>
</tr>
<tr>
<td colspan="2"> </td>
</tr>
<tr>
<td colspan="2">USER_ID 33333</td>
</tr>
<tr>
<td colspan="2">string_c</td>
</tr>
<tr>
<td colspan="2"><strong>content: ccc</strong></td>
</tr>
<tr>
<td colspan="2"><strong>date:</strong>2020-05-01 00:00:00 To 2020-05-03 23:59:59</td>
</tr>
<tr>
<td colspan="2"><strong>PID:</strong><strong>ABCDE</strong></td>
</tr>
<tr>
<td colspan="2"><strong>URL:https://ccc.com</strong></td>
</tr>
<tr>
<td colspan="2"> </td>
</tr>
<tr>
<td colspan="2"> </td>
</tr>
</tbody>
</table>
My problem is:
All the data is inside td tags only; there are no div names and no parent tag grouping each user, so I can't separate it into 3 sets of data.
I have tried the following code. It finds all the USER_ID cells, but I don't know how to get the other data for each USER_ID.
soup = BeautifulSoup(content, 'html.parser')
p = soup.find_all('td', text=re.compile("^USER_ID"))
for item in p:
    title = item.find_next_siblings('td') # <--- return empty
    ...
I'm using:
python 3.6
django 2.0.2
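A note on why find_next_siblings('td') comes back empty: each td is the only cell inside its own tr, so it has no td siblings at all. The rows are the siblings. The following is a minimal sketch of that idea, assuming a shortened, hypothetical sample with the same shape as the table above; the names content and records are illustrative only.

```python
import re
from bs4 import BeautifulSoup

# Shortened, hypothetical sample with the same shape as the table in the question
content = """<table><tbody>
<tr><td colspan="2">USER_ID 11111</td></tr>
<tr><td colspan="2">string_a</td></tr>
<tr><td colspan="2"><strong>content: aaa</strong></td></tr>
<tr><td colspan="2"> </td></tr>
<tr><td colspan="2">USER_ID 22222</td></tr>
<tr><td colspan="2">string_b</td></tr>
</tbody></table>"""

soup = BeautifulSoup(content, "html.parser")
records = []
for cell in soup.find_all("td", string=re.compile(r"^USER_ID")):
    record = [cell.get_text(strip=True)]
    # The cell has no <td> siblings, so walk the sibling rows of its parent <tr>
    for row in cell.parent.find_next_siblings("tr"):
        text = row.get_text(strip=True)
        if text.startswith("USER_ID"):
            break  # the next user's block starts here
        if text:  # skip the blank spacer rows
            record.append(text)
    records.append(record)

print(records)
# [['USER_ID 11111', 'string_a', 'content: aaa'], ['USER_ID 22222', 'string_b']]
```
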
from bs4 import BeautifulSoup
import re
from more_itertools import split_when
data = """<table align="center" border="0" style="width:550px">
<tbody>
<tr>
<td colspan="2">USER_ID 11111</td>
</tr>
<tr>
<td colspan="2">string_a</td>
</tr>
<tr>
<td colspan="2"><strong>content: aaa</strong></td>
</tr>
<tr>
<td colspan="2"><strong>date:</strong>2020-05-01 00:00:00 To 2020-05-03 23:59:59</td>
</tr>
<tr>
<td colspan="2"><strong>URL:https://aaa.com</strong></td>
</tr>
<tr>
<td colspan="2"> </td>
</tr>
<tr>
<td colspan="2"> </td>
</tr>
<tr>
<td colspan="2">USER_ID 22222</td>
</tr>
<tr>
<td colspan="2">string_b</td>
</tr>
<tr>
<td colspan="2"><strong>content: bbb</strong></td>
</tr>
<tr>
<td colspan="2"><strong>date:</strong>2020-05-01 00:00:00 To 2020-05-03 23:59:59</td>
</tr>
<tr>
<td colspan="2"><strong>URL:https://aaa.com</strong></td>
</tr>
<tr>
<td colspan="2"> </td>
</tr>
<tr>
<td colspan="2"> </td>
</tr>
<tr>
<td colspan="2">USER_ID 33333</td>
</tr>
<tr>
<td colspan="2">string_c</td>
</tr>
<tr>
<td colspan="2"><strong>content: ccc</strong></td>
</tr>
<tr>
<td colspan="2"><strong>date:</strong>2020-05-01 00:00:00 To 2020-05-03 23:59:59</td>
</tr>
<tr>
<td colspan="2"><strong>PID:</strong><strong>ABCDE</strong></td>
</tr>
<tr>
<td colspan="2"><strong>URL:https://ccc.com</strong></td>
</tr>
<tr>
<td colspan="2"> </td>
</tr>
<tr>
<td colspan="2"> </td>
</tr>
</tbody>
</table>"""
soup = BeautifulSoup(data, 'html.parser')
target = soup.find("table", align="center")
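# Collect the text of every non-empty cell; select() only takes a CSS selector,
# so no text filter is passed here
goal = [item.text for item in target.select("td") if item.text.strip() != '']
# Start a new group whenever the next item begins with "USER"
final = list(split_when(goal, lambda _, y: y.startswith("USER")))
print(final)  # list of lists
for x in final:  # or loop
    print(x)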
Output
[['USER_ID 11111', 'string_a', 'content: aaa', 'date:2020-05-01 00:00:00 To 2020-05-03 23:59:59', 'URL:https://aaa.com'], ['USER_ID 22222', 'string_b', 'content: bbb', 'date:2020-05-01 00:00:00 To 2020-05-03 23:59:59', 'URL:https://aaa.com'], ['USER_ID 33333', 'string_c', 'content: ccc', 'date:2020-05-01 00:00:00 To 2020-05-03 23:59:59', 'PID:ABCDE', 'URL:https://ccc.com']]
And
['USER_ID 11111', 'string_a', 'content: aaa', 'date:2020-05-01 00:00:00 To 2020-05-03 23:59:59', 'URL:https://aaa.com']
['USER_ID 22222', 'string_b', 'content: bbb', 'date:2020-05-01 00:00:00 To 2020-05-03 23:59:59', 'URL:https://aaa.com']
['USER_ID 33333', 'string_c', 'content: ccc', 'date:2020-05-01 00:00:00 To 2020-05-03 23:59:59', 'PID:ABCDE', 'URL:https://ccc.com']
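If installing more_itertools is not an option, the same grouping can be approximated with a plain loop. This is only a sketch: split_on_user and its prefix parameter are hypothetical names, not part of any library, and the sample list is shortened from the output above.

```python
def split_on_user(items, prefix="USER_ID"):
    """Group a flat list of strings, starting a new group at each prefix match."""
    groups = []
    for text in items:
        # Open a new group at every USER_ID line (or for the very first item)
        if text.startswith(prefix) or not groups:
            groups.append([])
        groups[-1].append(text)
    return groups


goal = ['USER_ID 11111', 'string_a', 'content: aaa', 'USER_ID 22222', 'string_b']
print(split_on_user(goal))
# [['USER_ID 11111', 'string_a', 'content: aaa'], ['USER_ID 22222', 'string_b']]
```
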
Try the following code, which uses find_all_next('td') and an if condition to break at the start of the next data set.
import re
from bs4 import BeautifulSoup
html='''<table align="center" border="0" style="width:550px">
<tbody>
<tr>
<td colspan="2">USER_ID 11111</td>
</tr>
<tr>
<td colspan="2">string_a</td>
</tr>
<tr>
<td colspan="2"><strong>content: aaa</strong></td>
</tr>
<tr>
<td colspan="2"><strong>date:</strong>2020-05-01 00:00:00 To 2020-05-03 23:59:59</td>
</tr>
<tr>
<td colspan="2"><strong>URL:https://aaa.com</strong></td>
</tr>
<tr>
<td colspan="2"> </td>
</tr>
<tr>
<td colspan="2"> </td>
</tr>
<tr>
<td colspan="2">USER_ID 22222</td>
</tr>
<tr>
<td colspan="2">string_b</td>
</tr>
<tr>
<td colspan="2"><strong>content: bbb</strong></td>
</tr>
<tr>
<td colspan="2"><strong>date:</strong>2020-05-01 00:00:00 To 2020-05-03 23:59:59</td>
</tr>
<tr>
<td colspan="2"><strong>URL:https://aaa.com</strong></td>
</tr>
<tr>
<td colspan="2"> </td>
</tr>
<tr>
<td colspan="2"> </td>
</tr>
<tr>
<td colspan="2">USER_ID 33333</td>
</tr>
<tr>
<td colspan="2">string_c</td>
</tr>
<tr>
<td colspan="2"><strong>content: ccc</strong></td>
</tr>
<tr>
<td colspan="2"><strong>date:</strong>2020-05-01 00:00:00 To 2020-05-03 23:59:59</td>
</tr>
<tr>
<td colspan="2"><strong>PID:</strong><strong>ABCDE</strong></td>
</tr>
<tr>
<td colspan="2"><strong>URL:https://ccc.com</strong></td>
</tr>
<tr>
<td colspan="2"> </td>
</tr>
<tr>
<td colspan="2"> </td>
</tr>
</tbody>
</table>'''
soup=BeautifulSoup(html,'html.parser')
final_list=[]
for item in soup.find_all('td',text=re.compile("USER_ID")):
    row_list=[]
    row_list.append(item.text.strip())
    siblings=item.find_all_next('td')
    for sibling in siblings:
        if "USER_ID" in sibling.text:
            break
        else:
            if sibling.text.strip()!='':
                row_list.append(sibling.text.strip())
    final_list.append(row_list)
print(final_list)
Output:
[['USER_ID 11111', 'string_a', 'content: aaa', 'date:2020-05-01 00:00:00 To 2020-05-03 23:59:59', 'URL:https://aaa.com'], ['USER_ID 22222', 'string_b', 'content: bbb', 'date:2020-05-01 00:00:00 To 2020-05-03 23:59:59', 'URL:https://aaa.com'], ['USER_ID 33333', 'string_c', 'content: ccc', 'date:2020-05-01 00:00:00 To 2020-05-03 23:59:59', 'PID:ABCDE', 'URL:https://ccc.com']]
If you want to print each list separately, try this.
soup=BeautifulSoup(html,'html.parser')
for item in soup.find_all('td',text=re.compile("USER_ID")):
    row_list=[]
    row_list.append(item.text.strip())
    siblings=item.find_all_next('td')
    for sibling in siblings:
        if "USER_ID" in sibling.text:
            break
        else:
            if sibling.text.strip()!='':
                row_list.append(sibling.text.strip())
    print(row_list)
Output:
['USER_ID 11111', 'string_a', 'content: aaa', 'date:2020-05-01 00:00:00 To 2020-05-03 23:59:59', 'URL:https://aaa.com']
['USER_ID 22222', 'string_b', 'content: bbb', 'date:2020-05-01 00:00:00 To 2020-05-03 23:59:59', 'URL:https://aaa.com']
['USER_ID 33333', 'string_c', 'content: ccc', 'date:2020-05-01 00:00:00 To 2020-05-03 23:59:59', 'PID:ABCDE', 'URL:https://ccc.com']
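Since the goal in the question is to save each user to a database, it may help to turn each list into a dict first. The sketch below is one possible shape, not a definitive schema: to_record and the key names (user_id, extra, and the lower-cased labels) are assumptions, and it assumes every labelled line has the form "label:value".

```python
def to_record(row_list):
    """Turn one per-user list of strings into a dict (key names are assumptions)."""
    # 'USER_ID 11111' -> '11111'
    record = {"user_id": row_list[0].split(None, 1)[1]}
    for text in row_list[1:]:
        # Split at the first colon only, so 'URL:https://...' keeps its value whole
        key, sep, value = text.partition(":")
        if sep:
            record[key.strip().lower()] = value.strip()
        else:
            # Lines without a colon (e.g. 'string_c') carry no label in the source
            record.setdefault("extra", []).append(text)
    return record


row = ['USER_ID 33333', 'string_c', 'content: ccc',
       'date:2020-05-01 00:00:00 To 2020-05-03 23:59:59',
       'PID:ABCDE', 'URL:https://ccc.com']
print(to_record(row))
```

Each resulting dict could then be passed to whatever model or insert statement the project uses.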
You can simply use soup.select('table tr').
Example
from bs4 import BeautifulSoup
html = '<table align="center" border="0" style="width:550px"><tbody>' \
'<tr><td colspan="2">USER_ID 11111</td></tr>' \
'<tr><td colspan="2">string_a</td></tr>' \
'<tr><td colspan="2"><strong>content: aaa</strong></td></tr>' \
'<tr><td colspan="2"><strong>date:</strong>2020-05-01 00:00:00 To 2020-05-03 23:59:59</td></tr>' \
'<tr><td colspan="2"><strong>URL:https://aaa.com</strong></td></tr>' \
'<tr><td colspan="2"> </td></tr>' \
'<tr><td colspan="2"> </td></tr>' \
'<tr><td colspan="2">USER_ID 22222</td></tr>' \
'<tr><td colspan="2">string_b</td></tr>' \
'<tr><td colspan="2"><strong>content: bbb</strong></td></tr>' \
'<tr><td colspan="2"><strong>date:</strong>2020-05-01 00:00:00 To 2020-05-03 23:59:59</td></tr>' \
'<tr><td colspan="2"><strong>URL:https://aaa.com</strong></td></tr>' \
'<tr><td colspan="2"> </td></tr>' \
'<tr><td colspan="2"> </td></tr>' \
'<tr><td colspan="2">USER_ID 33333</td></tr>' \
'<tr><td colspan="2">string_c</td></tr>' \
'<tr><td colspan="2"><strong>content: ccc</strong></td></tr>' \
'<tr><td colspan="2"><strong>date:</strong>2020-05-01 00:00:00 To 2020-05-03 23:59:59</td></tr>' \
'<tr><td colspan="2"><strong>PID:</strong><strong>ABCDE</strong></td></tr>' \
'<tr><td colspan="2"><strong>URL:https://ccc.com</strong></td></tr>' \
'<tr><td colspan="2"> </td></tr>' \
'<tr><td colspan="2"> </td></tr></tbody></table>'
soup = BeautifulSoup(html, features="lxml")
elements = soup.select('table tr')
print(elements)
for element in elements:
    print(element.text)
Prints out
USER_ID 11111
string_a
content: aaa
date:2020-05-01 00:00:00 To 2020-05-03 23:59:59
URL:https://aaa.com
USER_ID 22222
string_b
content: bbb
date:2020-05-01 00:00:00 To 2020-05-03 23:59:59
URL:https://aaa.com
USER_ID 33333
string_c
content: ccc
date:2020-05-01 00:00:00 To 2020-05-03 23:59:59
PID:ABCDE
URL:https://ccc.com
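The rows printed above are still one flat sequence. One way to split them per user, assuming the blank spacer rows reliably separate the blocks, is itertools.groupby on whether a row has any text. This is a sketch on a shortened, hypothetical sample; it uses html.parser so it runs without lxml installed.

```python
from itertools import groupby
from bs4 import BeautifulSoup

# Shortened, hypothetical sample with the same shape as the table above
html = ('<table><tbody>'
        '<tr><td colspan="2">USER_ID 11111</td></tr>'
        '<tr><td colspan="2">string_a</td></tr>'
        '<tr><td colspan="2"> </td></tr>'
        '<tr><td colspan="2"> </td></tr>'
        '<tr><td colspan="2">USER_ID 22222</td></tr>'
        '<tr><td colspan="2">string_b</td></tr>'
        '<tr><td colspan="2"> </td></tr>'
        '</tbody></table>')

soup = BeautifulSoup(html, "html.parser")
texts = [row.get_text(strip=True) for row in soup.select("table tr")]
# groupby() yields runs of consecutive empty / non-empty rows; keep the non-empty runs
users = [list(run) for has_text, run in groupby(texts, key=bool) if has_text]
print(users)
# [['USER_ID 11111', 'string_a'], ['USER_ID 22222', 'string_b']]
```

If a user block could itself contain a blank row, splitting on the USER_ID prefix instead would be the safer choice.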