
Python + BeautifulSoup - Text extraction by searching criteria

A file contains HTML like the following (the labels 'Registration' and 'Flying' are fixed; only the dates vary):

<TR>
<TD class=CAT2 width="10%">Registration</TD>
<TD class=CAT1 width="20%">02 Mar 2006</TD></TR>

<TR>
<TD class=CAT2 width="10%">Flying</TD>
<TD class=CAT1 width="20%">24 Jun 2005</TD></TR>

I want to extract them and output them as:

Registration 02 Mar 2006

Flying 24 Jun 2005

I am using BeautifulSoup's find_next_sibling, but it returns nothing. What went wrong?

from bs4 import BeautifulSoup

url = r"C:\example.html"
page = open(url)
soup = BeautifulSoup(page.read())

aa = soup.find_next_sibling(text='Registration')

print aa

Try this:

soup.find(text="Registration").findNext('td').contents[0]
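To see why this works, here is a minimal, self-contained sketch built on the sample rows from the question. It assumes Beautiful Soup 4.4 or later (where `string=` is the current name for the `text=` argument) and the built-in `html.parser`; the helper name `value_for` is made up for illustration. `find` locates the label's text node, and `find_next('td')` walks forward in document order to the next `<td>` tag, skipping the whitespace in between.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question.
html = """
<TR>
<TD class=CAT2 width="10%">Registration</TD>
<TD class=CAT1 width="20%">02 Mar 2006</TD></TR>

<TR>
<TD class=CAT2 width="10%">Flying</TD>
<TD class=CAT1 width="20%">24 Jun 2005</TD></TR>
"""

soup = BeautifulSoup(html, "html.parser")


def value_for(soup, label):
    """Return the text of the first <td> that follows the given label (hypothetical helper)."""
    node = soup.find(string=label)      # the text node, not the <td> tag
    if node is None:
        return None
    cell = node.find_next("td")         # next <td> in document order
    return cell.get_text(strip=True) if cell is not None else None


for label in ("Registration", "Flying"):
    print(label, value_for(soup, label))
```

Returning `None` for a missing label avoids the `AttributeError` that the one-liner would raise when the text is not found.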

This line of code:

aa = soup.find_next_sibling(text='Registration')

does not return a node from the HTML as you expect. Instead it returns None, because find_next_sibling is called on the soup object itself, which has no siblings. What you want to do instead is find the element with text='Registration', get its parent, and then get the parent's next sibling cell.

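aa = soup.find(text='Registration')
par = aa.parent
# A bare next_sibling would be the newline between the two cells,
# so ask for the next <td> sibling explicitly.
print(par.find_next_sibling('td').string)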
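One detail worth checking with the sample markup: the two cells are separated by a newline, and Beautiful Soup keeps that newline as a text node between them. The sketch below (assuming Beautiful Soup 4.4+ and `html.parser`; the variable names are illustrative) shows the three cases side by side: the call on the soup object, the bare `next_sibling`, and `find_next_sibling('td')`.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

html = """
<TR>
<TD class=CAT2 width="10%">Registration</TD>
<TD class=CAT1 width="20%">02 Mar 2006</TD></TR>
"""

soup = BeautifulSoup(html, "html.parser")

# The soup object is the root of the tree and has no siblings at all.
from_root = soup.find_next_sibling(string="Registration")
print(from_root)

# The label's parent is the first <td>.
label_cell = soup.find(string="Registration").parent

# The immediate sibling is the whitespace between the tags, not the next cell.
raw_sibling = label_cell.next_sibling
print(isinstance(raw_sibling, NavigableString), repr(raw_sibling))

# Restricting the search to <td> tags skips that whitespace.
value_cell = label_cell.find_next_sibling("td")
print(value_cell.string)
```

If the source file had no whitespace between the tags, the bare `next_sibling` would already be the second cell, which is why code relying on it can appear to work on some inputs and fail on others.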

You could also get your output like this:

soup = BeautifulSoup(page.read(), 'html.parser')

row_1 = soup.find('tr')
td = row_1.find('td')
# find_next_sibling skips the whitespace text nodes between tags
string_1 = td.string + ' ' + td.find_next_sibling('td').string  # Registration 02 Mar 2006

row_2 = row_1.find_next_sibling('tr')
td = row_2.find('td')
string_2 = td.string + ' ' + td.find_next_sibling('td').string  # Flying 24 Jun 2005
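If the file holds more rows than these two, the same idea can be written as a loop rather than one variable per row. This is only a sketch under the assumption that every row of interest has the label in its first cell and the date in its second; the `wanted` set and the `lines` list are names chosen here for illustration.

```python
from bs4 import BeautifulSoup

html = """
<TR>
<TD class=CAT2 width="10%">Registration</TD>
<TD class=CAT1 width="20%">02 Mar 2006</TD></TR>

<TR>
<TD class=CAT2 width="10%">Flying</TD>
<TD class=CAT1 width="20%">24 Jun 2005</TD></TR>
"""

soup = BeautifulSoup(html, "html.parser")

wanted = {"Registration", "Flying"}   # the fixed labels from the question
lines = []

for tr in soup.find_all("tr"):
    # find_all returns only <td> tags, so whitespace nodes never show up here.
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if len(cells) >= 2 and cells[0] in wanted:
        lines.append(cells[0] + " " + cells[1])

for line in lines:
    print(line)
```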
