Iterate through specific tags in Python

I want to extract text from a website whose markup looks like this:

<a href="#N44">Avalon</a>
<a href="#N36">Avondale</a>
<a href="#N4">Bacon Park Area</a>

How do I select only the 'a' tags whose href starts with "#N"? The page contains several other links as well.

I tried creating a list to iterate through, but when I run the code it selects only one element.

loc= ['#N0', '#N1', '#N2', '#N3', '#N4', '#N5'.....'#N100']

for i in loc:
    name=soup.find('a', attrs={'href':i})    
print(name)

I get

<a href="#N44">Avalon</a>

not

<a href="#N44">Avalon</a>
<a href="#N36">Avondale</a>
<a href="#N4">Bacon Park Area</a>

And how can I get just the text?

Avalon
Avondale
Bacon Park Area

Thanks in advance!

You're iterating over the items but not storing them anywhere, so when the loop finishes, name holds only the last item.
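To see the overwrite in isolation, here is a small sketch that uses plain strings in place of the tags. The `found` list and its values are made up for illustration and are not part of the original code.

```python
# Hypothetical stand-ins for the results of successive soup.find() calls.
found = ['tag for #N4', 'tag for #N36', 'tag for #N44']

for item in found:
    name = item              # each pass overwrites the previous value

last_only = name             # only the final item survives the loop

collected = []
for item in found:
    collected.append(item)   # appending keeps every item

print(last_only)
print(collected)
```

The first loop ends with a single value, while the second ends with all three.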

You can collect them in a list, as below, and read the .text attribute to get just the name from each tag:

names = []

for i in loc:
    tag = soup.find('a', attrs={'href': i})
    if tag is not None:  # find() returns None when no tag has this href
        names.append(tag.text)

Result:

In [15]: names
Out[15]: ['Bacon Park Area', 'Avondale', 'Avalon']
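For a version that runs on its own, the sketch below parses the three links from the question with Beautiful Soup's built-in 'html.parser'. The extra "/about" link and the range 0 to 100 used to build loc are assumptions added for the example. Because find() returns None for any href that is not on the page, the loop skips those entries instead of calling .text on them.

```python
from bs4 import BeautifulSoup

# Markup from the question, plus one unrelated link (made up) to show it is ignored.
html = '''
<a href="#N44">Avalon</a>
<a href="#N36">Avondale</a>
<a href="#N4">Bacon Park Area</a>
<a href="/about">About</a>
'''

soup = BeautifulSoup(html, 'html.parser')

# Build '#N0' ... '#N100' rather than typing the list by hand (assumed range).
loc = ['#N{}'.format(n) for n in range(101)]

names = []
for i in loc:
    tag = soup.find('a', attrs={'href': i})
    if tag is not None:          # skip hrefs that do not occur in the page
        names.append(tag.text)

print(names)  # ['Bacon Park Area', 'Avondale', 'Avalon']
```

The names come out in the order of loc, not in page order, which is why 'Bacon Park Area' (#N4) is first.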

If you want to skip creating the first list, you can do this instead:

import re

names = [tag.text for tag in soup.find_all('a',href=re.compile(r'#N\d+'))] 

In a regular expression, \d matches a digit and + means one or more of the preceding item.
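To check what the pattern accepts, it can be tried on a few href strings directly. The sample values below are made up. Beautiful Soup filters with the pattern's search() method, which matches anywhere in the string, so the sketch also shows fullmatch() for comparison.

```python
import re

pattern = re.compile(r'#N\d+')

# Made-up href values to probe the pattern.
hrefs = ['#N44', '#N4', '#N', '#Notes', '/page#N12']

# search() succeeds if the pattern occurs anywhere in the string.
loose = [h for h in hrefs if pattern.search(h)]

# fullmatch() requires the entire string to fit the pattern.
strict = [h for h in hrefs if pattern.fullmatch(h)]

print(loose)   # ['#N44', '#N4', '/page#N12']
print(strict)  # ['#N44', '#N4']
```

If links such as '/page#N12' should be excluded, anchoring the pattern as r'^#N\d+$' gives the stricter behaviour with find_all as well.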
