How to get specific items that share the same class name and attributes
How can I get specific items that share the same class name and attributes?
I need to get these 3 items:
April 14, 2013
580
Fort Pierce, FL
<dl class="pairsJustified">
<dt>Joined:</dt>
<dd>Apr 14, 2013</dd>
</dl>
<dl class="pairsJustified">
<dt>Messages:</dt>
<dd><a href="search/member?user_id=13302" class="concealed"
rel="nofollow">580</a></dd>
</dl>
<dl class="pairsJustified">
<dt>Location:</dt>
<dd>
<a href="misc/location-info?location=Fort+Pierce%2C+FL" target="_blank"
rel="nofollow noreferrer" itemprop="address" class="concealed">Fort
Pierce, FL</a>
Since they all sit inside <dd> tags, you can use .find_all():
from bs4 import BeautifulSoup
test = '''<dl class="pairsJustified">
<dt>Joined:</dt>
<dd>Apr 14, 2013</dd>
</dl>
<dl class="pairsJustified">
<dt>Messages:</dt>
<dd><a href="search/member?user_id=13302" class="concealed"
rel="nofollow">580</a></dd>
</dl>
<dl class="pairsJustified">
<dt>Location:</dt>
<dd>
<a href="misc/location-info?location=Fort+Pierce%2C+FL" target="_blank"
rel="nofollow noreferrer" itemprop="address" class="concealed">Fort Pierce, FL</a>'''
soup = BeautifulSoup(test, 'html.parser')
data = soup.find_all("dd")
for d in data:
    print(d.text.strip())
Output:
Apr 14, 2013
580
Fort Pierce, FL
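If you need to pick out one specific item rather than print all of them, one option is to key each <dd> value by the label in its sibling <dt>. The following is a minimal sketch, not the original answer's code: the helper name pairs_to_dict is hypothetical, and the sample HTML assumes the truncated snippet from the question is closed with </dd></dl>.

```python
from bs4 import BeautifulSoup

# Sample HTML from the question; the closing </dd></dl> is an assumption,
# since the original snippet is cut off.
html = '''<dl class="pairsJustified">
<dt>Joined:</dt>
<dd>Apr 14, 2013</dd>
</dl>
<dl class="pairsJustified">
<dt>Messages:</dt>
<dd><a href="search/member?user_id=13302" class="concealed"
rel="nofollow">580</a></dd>
</dl>
<dl class="pairsJustified">
<dt>Location:</dt>
<dd>
<a href="misc/location-info?location=Fort+Pierce%2C+FL" target="_blank"
rel="nofollow noreferrer" itemprop="address" class="concealed">Fort Pierce, FL</a>
</dd>
</dl>'''


def pairs_to_dict(soup):
    """Map each <dt> label (without the trailing colon) to its <dd> text."""
    info = {}
    for dl in soup.find_all('dl', class_='pairsJustified'):
        dt = dl.find('dt')
        dd = dl.find('dd')
        if dt is None or dd is None:
            continue  # skip blocks that lack a label or a value
        label = dt.get_text(strip=True).rstrip(':')
        info[label] = dd.get_text(strip=True)
    return info


soup = BeautifulSoup(html, 'html.parser')
info = pairs_to_dict(soup)
print(info['Joined'], info['Messages'], info['Location'])
```

Looking values up by label avoids relying on the order of the <dd> tags, which matters when some users have no Location block.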
This is a good starting point:
In [18]: for a in response.css('.extraUserInfo'):
    ...:     print(a.css('*::text').extract())
    ...:     print('\n\n\n')
    ...:
['\n', '\n', '\n', '\n'] # <--this (and other outputs like this) is because there is an extra `extraUserInfo` class block above the desired info block if the user has a user group picture/avatar below their username
['\n', '\n', 'Joined:', '\n', 'Mar 24, 2013', '\n', '\n', '\n', 'Messages:', '\n', '6,747', '\n', '\n']
['\n', '\n', '\n', '\n']
['\n', '\n', 'Joined:', '\n', 'Mar 24, 2013', '\n', '\n', '\n', 'Messages:', '\n', '6,747', '\n', '\n']
['\n', '\n', 'Joined:', '\n', 'Apr 14, 2013', '\n', '\n', '\n', 'Messages:', '\n', '580', '\n', '\n', '\n', 'Location:', '\n', '\n', 'Fort Pierce, FL', '\n', '\n', '\n']
['\n', '\n', 'Joined:', '\n', 'Oct 20, 2012', '\n', '\n', '\n', 'Messages:', '\n', '2,476', '\n', '\n', '\n', 'Location:', '\n', '\n', 'Philadelphia, PA', '\n', '\n', '\n']
['\n', '\n', 'Joined:', '\n', 'Dec 11, 2012', '\n', '\n', '\n', 'Messages:', '\n', '2,938', '\n', '\n', '\n', 'Location:', '\n', '\n', 'Colorado', '\n', '\n', '\n']
['\n', '\n', 'Joined:', '\n', 'Sep 30, 2016', '\n', '\n', '\n', 'Messages:', '\n', '833', '\n', '\n', '\n', 'Location:', '\n', '\n', 'Indiana', '\n', '\n', '\n']
...
There are many ways to approach this, and a little fiddling will get the data formatted to your liking. The approach above is only a starting point because many of the outputs are lists containing nothing but newline characters. That seems to happen because, when a user has a user-group image (like tesla of arizona), the extraUserInfo class is also used to wrap that block of HTML. There are better ways to group this...
Basically, response.css('.extraUserInfo') collects every block with the class extraUserInfo, which seem to be the blocks holding the user info you're looking for. From there, extract all the underlying text with the ::text pseudo-selector and parse the resulting lists.
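Parsing those lists could look roughly like the sketch below. It is plain Python and does not need Scrapy to run; the function name parse_user_info is hypothetical, and it assumes every label ends with a colon while values do not, which holds for the outputs shown above but may not hold for every page.

```python
def parse_user_info(texts):
    # Drop whitespace-only strings and trim the rest.
    tokens = [t.strip() for t in texts if t.strip()]
    info = {}
    label = None
    for tok in tokens:
        if tok.endswith(':'):
            # Assumption: a token ending in ':' is a label such as 'Joined:'.
            label = tok[:-1]
        elif label is not None:
            # The first non-label token after a label is taken as its value.
            info[label] = tok
            label = None
    return info


# One of the lists printed by the extraUserInfo loop above.
sample = ['\n', '\n', 'Joined:', '\n', 'Apr 14, 2013', '\n', '\n', '\n',
          'Messages:', '\n', '580', '\n', '\n', '\n',
          'Location:', '\n', '\n', 'Fort Pierce, FL', '\n', '\n', '\n']

print(parse_user_info(sample))
# The newline-only lists from avatar blocks come back empty and can be skipped.
print(parse_user_info(['\n', '\n', '\n', '\n']))
```

Filtering out the empty results afterwards handles the extra extraUserInfo blocks without needing a more specific selector.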
There is definitely a better way to approach this: if you look carefully at the HTML structure, you can extract the data in a way that leaves less processing work afterwards. Still, this should get you on the right track. The CSS selector or XPath documentation should be a great help.