HTML tables with Python Beautiful Soup

I have an HTML table that looks like this:

<table border=0 cellspacing=1 cellpadding=2 class=form>
<tr class=form><td class=formlabel>Heating Coils in Bunker Tanks</td><td class=form>N</td></tr>
<tr class=forma><td class=formlabel>Heating Coils in Cargo Tanks</td><td class=form>U</td></tr>
<tr class=form><td class=formlabel>Manifold Type</td><td class=form>N</td></tr>
<tr class=forma><td class=formlabel>No. Holds</td><td class=form>5</td></tr>
<tr class=form><td class=formlabel>No. Centreline Hatches</td><td class=form>5</td></tr>
<tr class=forma><td class=formlabel>Lifting Gear</td><td class=form>Yes</td></tr>
<tr class=form><td class=formlabel>Gear</td><td class=form>4 Crane (30.5t SWL)</td></tr>
<tr class=forma><td class=formlabel>Alteration</td><td class=form>Unknown</td></tr>
</table>

I am using Beautiful Soup to extract specific data from the response returned to a Scrapy spider:

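from bs4 import BeautifulSoup

# response.text replaces the deprecated response.body_as_unicode()
soup = BeautifulSoup(response.text, 'html.parser')
table = soup.find('table', {'class': 'form'})
# pseudocode: find the manifold type and the number of holds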

How do I go about doing this? Note that the order of the rows might change, but the form labels always remain the same. How do I search using a specific form label?

Edit:

<tr class=forma><td class=formlabel>Fleet Manager (Operator)</td><td class=form><a href="oBasic.asp?LRNumber=9442964&Action=Display&LRCompanyNumber=40916">ESSAR SHIPPING LTD</a></td></tr>

This particular case, where the value cell contains a link, does not work with the next-sibling search. How can I overcome this?

You can find the td element by its text and get the next sibling:

table.find('td', text='Manifold Type').next_sibling.text
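Since the labels are fixed but the row order is not, one option is to read the whole table into a dictionary keyed by label. The following is only a minimal sketch under some assumptions: the helper name `table_to_dict` is hypothetical, the sample rows are copied from the question, and Python's built-in `html.parser` is assumed to be an acceptable parser. Using `find_next_sibling('td')` together with `get_text()` also returns the link text for the row from the edit.

```python
from bs4 import BeautifulSoup

# Sample rows copied from the question, including the row whose value is a link.
HTML = """
<table border=0 cellspacing=1 cellpadding=2 class=form>
<tr class=form><td class=formlabel>Manifold Type</td><td class=form>N</td></tr>
<tr class=forma><td class=formlabel>No. Holds</td><td class=form>5</td></tr>
<tr class=forma><td class=formlabel>Fleet Manager (Operator)</td><td class=form><a href="oBasic.asp?LRNumber=9442964&Action=Display&LRCompanyNumber=40916">ESSAR SHIPPING LTD</a></td></tr>
</table>
"""


def table_to_dict(html):
    """Map each label cell's text to the text of the cell that follows it."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="form")
    values = {}
    for label in table.find_all("td", class_="formlabel"):
        # find_next_sibling skips whitespace nodes and returns the next <td>.
        value = label.find_next_sibling("td")
        if value is not None:
            # get_text() collects text from nested tags such as <a>.
            values[label.get_text(strip=True)] = value.get_text(strip=True)
    return values


specs = table_to_dict(HTML)
print(specs["Manifold Type"])
print(specs["No. Holds"])
print(specs["Fleet Manager (Operator)"])
```

Because the lookup is by label, it does not depend on where a row appears in the table.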

As a side note, why do you need to use BeautifulSoup inside a Scrapy spider? Scrapy itself is quite capable of parsing HTML and locating elements:

response.xpath('//table[@class="form"]//td[.="Manifold Type"]/following-sibling::td/text()')

Demo from the Scrapy shell:

$ scrapy shell index.html
In [1]: response.xpath('//table[@class="form"]//td[.="Manifold Type"]/following-sibling::td/text()').extract()
Out[1]: [u'N']
