
Extract 2 pieces of information from HTML in Python

I need help figuring out how to extract "Grab" and the number in the data-b attribute. The complete, unmodified webpage contains many <tr> elements, and I need to filter them using the "Need" text just before </a>. I've been trying to do this with Beautiful Soup, though it looks like lxml might work better. I can get either all of the <tr> elements or only the <a>...</a> lines that contain "Need", but not just the <tr> elements whose <a> line contains "Need".

<tr >
     <td>3</td>
     <td><a href="/local/app">Leave</a></td><td><a href="https://www.leave.com/" target="_blank">Useless</a></td>
     <td class="text-right"> <span class="float2" data-a="24608000.0" data-b="518" data-n="818">Garbage</span></td>
     <td class="text-right"> <span class="Float" data-a="3019" data-b="0.0635664" data-n="283">Garbage2</span></td>
     <td class="text-right">7.38%</td>
     <td class="text-right " >Recently</td>
</tr>

<tr >
     <td>4</td>
     <td><a href="/local">Grab</a></td><td><a href="https://grab.com" target="_blank">Need</a></td>
     <td class="text-right"> <span class="bloat2" data="22435000.0" data-b="512" data-n="74491.2">More junk</span></td>
     <td class="text-right"> <span class="bloat" data-a="301.177" data-b="35.848" data-n="0.5848">More junk2</span></td>
     <td class="text-right">Some more</td>
     <td class="text-right " >Recently</td>
</tr>

Thanks for any help!

from bs4 import BeautifulSoup


data = '''<tr>
 <td>3</td>
 <td><a href="/local/app">Leave</a></td><td><a href="https://www.leave.com/" target="_blank">Useless</a></td>
 <td class="text-right"> <span class="float2" data-a="24608000.0" data-b="518" data-n="818">Garbage</span></td>
 <td class="text-right"> <span class="Float" data-a="3019" data-b="0.0635664" data-n="283">Garbage2</span></td>
 <td class="text-right">7.38%</td>
 <td class="text-right " >Recently</td>
</tr>

<tr>
 <td>4</td>
 <td><a href="/local">Grab</a></td><td><a href="https://grab.com" target="_blank">Need</a></td>
 <td class="text-right"> <span class="bloat2" data="22435000.0" data-b="512" data-n="74491.2">More junk</span></td>
 <td class="text-right"> <span class="bloat" data-a="301.177" data-b="35.848" data-n="0.5848">More junk2</span></td>
 <td class="text-right">Some more</td>
 <td class="text-right " >Recently</td>
</tr>
'''

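# Name the parser explicitly so bs4 does not warn and pick one on its own.
soup = BeautifulSoup(data, 'html.parser')
# Text of the first link whose href is exactly "/local" (prints "Grab").
print(soup.find_all('a', {"href": "/local"})[0].text)
# data-b of every span whose class is "bloat" or "bloat2".
for span in soup.find_all('span', {"class": ["bloat", "bloat2"]}):
    print(span['data-b'])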
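
The snippet above relies on the href and class values of the sample row, which may differ on the real page. Since the question asks to filter on the "Need" link text, here is a minimal sketch of that approach. The names `SAMPLE_HTML` and `extract_rows` are made up for illustration, the sample is a trimmed copy of the rows from the question, and it assumes the wanted name ("Grab") is always the first link in the row.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the two sample rows from the question (hypothetical name).
SAMPLE_HTML = '''
<tr>
 <td>3</td>
 <td><a href="/local/app">Leave</a></td><td><a href="https://www.leave.com/" target="_blank">Useless</a></td>
 <td class="text-right"> <span class="float2" data-a="24608000.0" data-b="518" data-n="818">Garbage</span></td>
</tr>
<tr>
 <td>4</td>
 <td><a href="/local">Grab</a></td><td><a href="https://grab.com" target="_blank">Need</a></td>
 <td class="text-right"> <span class="bloat2" data="22435000.0" data-b="512" data-n="74491.2">More junk</span></td>
 <td class="text-right"> <span class="bloat" data-a="301.177" data-b="35.848" data-n="0.5848">More junk2</span></td>
</tr>
'''


def extract_rows(html, marker="Need"):
    """Return (name, [data-b values]) for each <tr> that has a link whose text equals marker."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for row in soup.find_all("tr"):
        links = row.find_all("a")
        # Keep only rows where some link's text is exactly the marker.
        if not any(link.get_text(strip=True) == marker for link in links):
            continue
        # Assumption: the wanted name is the first link in the row.
        name = links[0].get_text(strip=True)
        # Collect data-b from every span in this row that carries the attribute.
        values = [span["data-b"] for span in row.find_all("span", attrs={"data-b": True})]
        results.append((name, values))
    return results


print(extract_rows(SAMPLE_HTML))  # [('Grab', ['512', '35.848'])]
```

Looping over the rows and testing the links inside each one keeps the match and the extraction scoped to the same <tr>, which is the part that was missing when searching for <a> tags alone. If only one of the two data-b values is wanted, index into the returned list.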
