
Selecting text data using Beautiful Soup

Okay, I am trying to select text data from the HTML below using Python and Beautiful Soup, but I am having trouble. Basically, there is a title within the <b>, but I want the data outside of that. For instance, the first one is the assessment type, but I only want "Capacity curve". Here is what I have so far:

modelinginfo = soup.find("div", {"id": "genInfo"})  # this is my raw data
rows = modelinginfo.findChildren(['p'])  # this is the data displayed below
for row in rows:
    print(row)
    print('\n')
    cells = row.findChildren('p')
    for cell in cells:
        value = cell.string
        print("The value in this cell is %s" % value)


[<p><b>Assessment Type: </b>Capacity curve</p>,
 <p><b>Name: </b>Borzi et al (2008) - Capacity-Xdir 4Storeys InfilledFrame NonSismicallyDesigned</p>,
 <p><b>Category: </b>Structure specific - Building</p>,
 <p><b>Taxonomy: </b>CR/LFINF+DNO/HEX:4 (GEM)</p>,
 <p><b>Reference: </b>The influence of infill panels on vulnerability curves for RC buildings (Borzi B., Crowley H., Pinho R., 2008) - Proceedings of the 14th World Conference on Earthquake Engineering, Beijing, China</p>,
 <p><b>Web Link: </b><a href="http://www.iitk.ac.in/nicee/wcee/article/14_09-01-0111.PDF" style="color:blue" target="_blank"> http://www.iitk.ac.in/nicee/wcee/article/14_09-01-0111.PDF</a></p>,
 <p><b>Methodology: </b>Analytical</p>,
 <p><b>General Comments: </b>Sample Data: A 4-storey building designed according to the 1992 Italian design code (DM, 1992), considering gravity loads only, and the Decreto Ministeriale 1996 (DM, 1996) when considering seismic action (the seismically designed building has been designed assuming a lateral force equal to 10% of the seismic weight, c=10%, and with a triangular distribution shape).

 The Y axis in the capacity curve represent the collapse multiplier: Base shear resistance over seismic weight.</p>,
 <p><b>Geographical Applicability: </b> Italy</p>]

1.) You can iterate over the children of each p and print everything except the b tag:

for cell in cells:
    for element in cell.children:
        if element.name != 'b':
            print("The value in this cell is %s" % element)
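A self-contained sketch of the same idea, in case it helps to run it end to end. The HTML sample is trimmed from the output shown in the question, and the `html.parser` backend and the `values` list are assumptions made for illustration. Here the loop runs over the p elements directly:

```python
from bs4 import BeautifulSoup

# Sample markup trimmed from the question's output (assumed structure).
html = """
<div id="genInfo">
<p><b>Assessment Type: </b>Capacity curve</p>
<p><b>Methodology: </b>Analytical</p>
<p><b>Geographical Applicability: </b> Italy</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
modelinginfo = soup.find("div", {"id": "genInfo"})

values = []
for cell in modelinginfo.find_all("p"):
    parts = []
    for element in cell.children:
        # Skip the <b> label; bare strings have element.name == None.
        if element.name == "b":
            continue
        # Keep bare strings as-is and take the text of any other tag.
        parts.append(element.get_text() if element.name else str(element))
    values.append("".join(parts).strip())

for value in values:
    print("The value in this cell is %s" % value)
```

Joining the parts and stripping them avoids printing stray whitespace, such as the leading space before "Italy".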

2.) You can use the extract() method to remove the b tag you do not need:

for cell in cells:
    if cell.b:
        # remove "b" tag
        cell.b.extract()
    print("The value in this cell is %s" % cell)
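A runnable sketch of the extract() approach, under the same assumptions as the question's markup. The `info` dictionary and the `html.parser` backend are illustrative choices, not part of the original answer. Since extract() returns the removed tag, the label can be kept as a key, and get_text() gives plain text instead of the surrounding p markup:

```python
from bs4 import BeautifulSoup

# Sample markup trimmed from the question's output (assumed structure).
html = """
<div id="genInfo">
<p><b>Assessment Type: </b>Capacity curve</p>
<p><b>Web Link: </b><a href="http://www.iitk.ac.in/nicee/wcee/article/14_09-01-0111.PDF" style="color:blue" target="_blank"> http://www.iitk.ac.in/nicee/wcee/article/14_09-01-0111.PDF</a></p>
<p><b>Methodology: </b>Analytical</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

info = {}
for cell in soup.find("div", {"id": "genInfo"}).find_all("p"):
    if cell.b:
        # extract() removes the <b> tag from the tree and returns it.
        label = cell.b.extract().get_text(strip=True).rstrip(":")
        # What remains in the <p> is the value; strip surrounding whitespace.
        info[label] = cell.get_text(strip=True)

for label, value in info.items():
    print("%s -> %s" % (label, value))
```

Note that extract() modifies the parsed tree in place, so the b tags are gone from `soup` afterwards. If the original markup is still needed, parse it again or use the iteration approach instead.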
