How can we extract data from an HTML file using Python?
<p style="font-size: small;" class="apple"><a name="XREF_4567_Figure1_1"></a>Assembly, 1234, 456 & 789</p>
<div align="center"><image alt="apple.jpg" id="image2" source="assets/apple.jpg" />
</div>
In the HTML above, we need to extract "Assembly, 1234, 456 & 789" and "apple.jpg".
My Python code is below:
for line in f:
    if 'div align' in line.lower():
        # get the value after alt="
        myline = line.split("alt=\"")
        # get the value before the closing "
        number = myline[1].split("\"")[0]
        numbers[i].append(number)
# print(count)
# subtract oldcount to find the count of hotspots in the current file
count[i].append(0)
count[i].append(len(numbers[i]) - oldcount)
i = i + 1
# print(i)
You can use BeautifulSoup from the bs4 library for that:
from bs4 import BeautifulSoup
html = '<p style="font-size: small;" class="apple"><a name="XREF_4567_Figure1_1"></a>Assembly, 1234, 456 & 789</p><div align="center"><image alt="apple.jpg" id="image2" source="assets/apple.jpg" /> </div>'
bs = BeautifulSoup(html, 'html.parser')
print(bs.find('p').get_text())
print(bs.find('image').get("alt"))
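If installing bs4 is not an option, roughly the same result can probably be had from the standard library's html.parser module. This is only a sketch, not part of the answer above: the names CaptionImageParser and extract are made up for illustration, and it assumes the text you want always sits inside a <p> element and the file name is in the alt attribute of an <image> (or <img>) tag, as in the question's sample. It collects every match, so it should also cope with files that contain several figures.

```python
from html.parser import HTMLParser


class CaptionImageParser(HTMLParser):
    """Hypothetical helper: collects <p> text and <image>/<img> alt values."""

    def __init__(self):
        super().__init__()
        self.captions = []   # text content of each <p> element
        self.alts = []       # alt attribute of each <image> or <img> tag
        self._in_p = False   # True while we are inside a <p> element
        self._buffer = []    # text chunks seen inside the current <p>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._buffer = []
        elif tag in ("image", "img"):
            alt = dict(attrs).get("alt")
            if alt is not None:
                self.alts.append(alt)

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            # Text may arrive in several chunks, so join them here.
            self.captions.append("".join(self._buffer).strip())
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._buffer.append(data)


def extract(html):
    """Return (captions, alts) found in the given HTML string."""
    parser = CaptionImageParser()
    parser.feed(html)
    parser.close()
    return parser.captions, parser.alts
```

To use it on a file, read the whole file into a string first (for example with open(path).read(), where path is your own file name) and pass that string to extract, rather than processing the file line by line.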