
Python Beautiful Soup: find a string and extract the following string

I am writing a web crawler with Beautiful Soup. I have the following HTML:

<tr class="odd-row">
        <td>xyz</td>
        <td class="numeric">5,00%</td>      
    </tr>
<tr class="even-row">
        <td>abc</td>
        <td class="numeric">50,00%</td>
    </tr>
<tr class="odd-row">
        <td>ghf</td>
        <td class="numeric">2,50%</td>
    </tr>

My goal is to store the number from each class="numeric" cell in a specific variable. I want to select it based on the string in the cell just before it (e.g. "xyz", "abc", ...).

At the moment I am doing the following:

for c in soup.find_all("a", string=re.compile('abc')):
    abc=c.string

But of course it returns the string "abc" and not the number in the tag that follows. So basically my question is how to address the string in the class="numeric" cell, conditional on the string before it.

Thanks for your help!

Once you find the correct td, which I presume is what you meant to search for instead of a, get its next sibling with the class you want:

h = """<tr class="odd-row">
        <td>xyz</td>
        <td class="numeric">5,00%</td>
    </tr>
<tr class="even-row">
        <td>abc</td>
        <td class="numeric">50,00%</td>
    </tr>
<tr class="odd-row">
        <td>ghf</td>
        <td class="numeric">2,50%</td>"""


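from bs4 import BeautifulSoup

# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup(h, "html.parser")

# Find the <td> whose text is exactly "abc", then its next sibling <td class="numeric">.
for td in soup.find_all("td", string="abc"):
    print(td.find_next_sibling("td", class_="numeric"))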

If the numeric td is always the next tag, you can just call find_next_sibling():

for td in soup.find_all("td", string="abc"):
    print(td.find_next_sibling())

For your input, both would give you:

<td class="numeric">50,00%</td>
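To go from the matched tag to a value you can store in a variable, one option is to read the cell text and convert the comma decimal yourself. The following is only a minimal sketch, assuming well-formed rows like the sample and percentages that use a comma as the decimal separator; `parse_percent` and `value_for` are hypothetical helper names, not part of Beautiful Soup:

```python
from bs4 import BeautifulSoup

# Well-formed version of the sample rows (wrapped in <table> for this sketch).
html = """
<table>
<tr class="odd-row"><td>xyz</td><td class="numeric">5,00%</td></tr>
<tr class="even-row"><td>abc</td><td class="numeric">50,00%</td></tr>
<tr class="odd-row"><td>ghf</td><td class="numeric">2,50%</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")


def parse_percent(text):
    """Convert a string such as '50,00%' to a float (assumes comma decimals)."""
    return float(text.strip().rstrip("%").replace(",", "."))


def value_for(soup, label):
    """Return the numeric value following the <td> whose text is `label`, or None."""
    label_td = soup.find("td", string=label)
    if label_td is None:
        return None
    numeric_td = label_td.find_next_sibling("td", class_="numeric")
    if numeric_td is None:
        return None
    return parse_percent(numeric_td.get_text())


abc = value_for(soup, "abc")
print(abc)
```

Returning None for a missing label keeps the caller in charge of deciding what a missing row means, instead of raising deep inside the lookup.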

If I understand your question correctly, and assuming your HTML always follows the structure of your sample, you can do this:

result = {}
table_rows = soup.find_all("tr")
for row in table_rows:
    table_columns = row.find_all("td")
    # First cell is the label, second cell is the numeric value.
    result[table_columns[0].text] = table_columns[1].text
print(result)  # {'xyz': '5,00%', 'abc': '50,00%', 'ghf': '2,50%'}

You end up with a dictionary whose keys are 'xyz', 'abc', etc. and whose values are the strings from the class="numeric" cells.
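If the real page is less regular than the sample, indexing the first two cells directly can raise an IndexError. As a rough sketch of a more defensive variant, the loop below skips rows with fewer than two td cells and strips surrounding whitespace; the header row and the padded " xyz " cell are assumptions added here for illustration, not part of the original markup:

```python
from bs4 import BeautifulSoup

# Sample with an assumed header row and a padded label cell.
html = """
<table>
<tr><th>Name</th></tr>
<tr class="odd-row"><td> xyz </td><td class="numeric">5,00%</td></tr>
<tr class="even-row"><td>abc</td><td class="numeric">50,00%</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

result = {}
for row in soup.find_all("tr"):
    cells = row.find_all("td")
    # Skip header rows or anything else without a label/value pair.
    if len(cells) < 2:
        continue
    label = cells[0].get_text(strip=True)
    value = cells[1].get_text(strip=True)
    result[label] = value

# Look up a single value by its label.
abc = result["abc"]
print(abc)
```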

So, as I understand your question, you want to iterate over the tuples ('xyz', '5,00%'), ('abc', '50,00%'), ('ghf', '2,50%'). Is that correct?

But I don't understand how your code produces any results, since you are searching for <a> tags.

Instead, you should iterate over the <tr> tags and then take the strings inside the <td> tags. Notice the double next_sibling for accessing the second <td>: the first next_sibling references the whitespace between the two tags.

html = """
<tr class="odd-row">
    <td>xyz</td>
    <td class="numeric">5,00%</td>      
</tr>
<tr class="even-row">
    <td>abc</td>
    <td class="numeric">50,00%</td>
</tr>
<tr class="odd-row">
    <td>ghf</td>
    <td class="numeric">2,50%</td>
</tr>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')

for tr in soup.find_all("tr"):
    print((tr.td.string, tr.td.next_sibling.next_sibling.string))
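The whitespace issue mentioned above can be checked directly. This is a small sketch, assuming a single well-formed row wrapped in a table element for illustration; it compares the plain `next_sibling` attribute with `find_next_sibling("td")`, which searches only for matching tags and so does not depend on how the markup is indented:

```python
from bs4 import BeautifulSoup

html = """
<table>
<tr class="odd-row">
    <td>xyz</td>
    <td class="numeric">5,00%</td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
first_td = soup.tr.td

# The immediate sibling is the whitespace text node between the two cells.
whitespace = first_td.next_sibling
print(repr(whitespace))

# find_next_sibling("td") skips text nodes and returns the next <td> tag.
second_td = first_td.find_next_sibling("td")
print((first_td.string, second_td.string))
```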
