How to access the next element in an HTML file using Beautiful Soup
<ns1:AffectedAreas>
<ns1:Area>
<ns1:AreaId>10YDK-1--------W</ns1:AreaId>
<ns1:AreaName>DK1</ns1:AreaName>
</ns1:Area>
</ns1:AffectedAreas>
I've been trying my best to access the ns1:AreaId value (10YDK-1--------W) under ns1:AffectedAreas by using B = soup.find('ns1:area') and then B.next_element, but all I get is an empty string.
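(For reference, a minimal sketch of why this happens, assuming the html.parser backend: next_element is the whitespace text node sitting between <ns1:Area> and <ns1:AreaId>, which prints as a blank string, while find_next skips ahead to the next matching tag.)

```python
from bs4 import BeautifulSoup

data = """<ns1:AffectedAreas>
<ns1:Area>
<ns1:AreaId>10YDK-1--------W</ns1:AreaId>
<ns1:AreaName>DK1</ns1:AreaName>
</ns1:Area>
</ns1:AffectedAreas>"""

soup = BeautifulSoup(data, "html.parser")
B = soup.find("ns1:area")

# next_element is the "\n" text node right after the opening tag,
# which is why it looks like an empty string when printed:
print(repr(B.next_element))

# find_next walks past text nodes to the next matching tag instead:
print(B.find_next("ns1:areaid").text)
```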
You can iterate over the children of soup.find('ns1:area') to find the ns1:areaid tag and then get its text:
for i in soup.find('ns1:area').children:
if i.name == "ns1:areaid":
b = i.text
print(b)
Starting from ns1:AffectedAreas, it would look like:
for i in soup.find_all('ns1:AffectedAreas'.lower()):
for child in i.children:
if child.name == "ns1:area":
for y in child.children:
if y.name == "ns1:areaid":
print(y.text)
Or search for the tag ns1:AreaId in lower case and get its text. This way you can get the text values of all ns1:AreaId tags:
soup.find_all("ns1:AreaId".lower())[0].text
Both cases will output
"10YDK-1--------W"
Try this method:
import bs4
import re
data = """
<ns1:AffectedAreas>
<ns1:Area>
<ns1:AreaId>10YDK-1--------W</ns1:AreaId>
<ns1:AreaName>DK1</ns1:AreaName>
</ns1:Area>
</ns1:AffectedAreas>
"""
def striphtml(data):
p = re.compile(r'<.*?>')
return p.sub('', data)
bs = bs4.BeautifulSoup(data, "html.parser")
areaid = bs.find_all('ns1:areaid')
print((striphtml(str(areaid))))
Here, the striphtml function strips out everything enclosed in <...> tags, so the output will be:
[10YDK-1--------W]
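As a side note, the regex pass can be skipped entirely: calling .text on each matched tag returns the values directly (a sketch using made-up sample data with two IDs):

```python
from bs4 import BeautifulSoup

data = ("<ns1:AreaId>10YDK-1--------W</ns1:AreaId>"
        "<ns1:AreaId>10YDK-2--------W</ns1:AreaId>")
bs = BeautifulSoup(data, "html.parser")

# .text on each matched tag yields the values with no tag-stripping needed
ids = [tag.text for tag in bs.find_all("ns1:areaid")]
print(ids)
```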
If you have namespaces defined in your HTML/XML document, you can use the xml parser and CSS selectors. For example:
txt = '''<root xmlns:ns1="some namespace">
<ns1:AffectedAreas>
<ns1:Area>
<ns1:AreaId>10YDK-1--------W</ns1:AreaId>
<ns1:AreaName>DK1</ns1:AreaName>
</ns1:Area>
</ns1:AffectedAreas>
</root>'''
from bs4 import BeautifulSoup  # the 'xml' parser also requires lxml

soup = BeautifulSoup(txt, 'xml')
area_id = soup.select_one('ns1|AffectedAreas ns1|AreaId').text
print(area_id)
Prints:
10YDK-1--------W
Another method:
from simplified_scrapy import SimplifiedDoc
html = '''
<ns1:AffectedAreas>
<ns1:Area>
<ns1:AreaId>10YDK-1--------W</ns1:AreaId>
<ns1:AreaName>DK1</ns1:AreaName>
</ns1:Area>
<ns1:Area>
<ns1:AreaId>10YDK-2--------W</ns1:AreaId>
<ns1:AreaName>DK2</ns1:AreaName>
</ns1:Area>
</ns1:AffectedAreas>
'''
doc = SimplifiedDoc(html)
AffectedArea = doc.select('ns1:AffectedAreas')
Areas = AffectedArea.selects('ns1:Area')
AreaIds = Areas.select('ns1:AreaId').html
print (AreaIds)
# or
# print (doc.select('ns1:AffectedAreas').selects('ns1:Area').select('ns1:AreaId').html)
Result:
['10YDK-1--------W', '10YDK-2--------W']
Here are more examples: https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples