Scraping data from the tag names in Python
Hi, I am trying to scrape user data from a website. I need the user IDs, which are available in the tags themselves. I am trying to scrape the UID from the div tag using Python, Selenium, and Beautiful Soup.
Example:
<div id="UID_60CE07D6DF5C02A987ED7B076F4154F3-SRC_328619641" class="memberOverlayLink" onmouseover="ta.trackEventOnPage('Reviews','show_reviewer_info_window','user_name_photo'); ta.call('ta.overlays.Factory.memberOverlayWOffset', event, this, 's3 dg rgba_gry update2012', 0, (new Element(this)).getElement('.avatar')&&(new Element(this)).getElement('.avatar').getStyle('border-radius')=='100%'?-10:0);">
I have looked through all the documentation and several web pages, but I can't find a solution for this. If anyone can tell me whether such a thing is possible, I would be very grateful.
Assuming the id attribute value is always in the format UID_ followed by one or more alphanumeric characters, followed by -SRC_, followed by one or more digits:
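import re
from bs4 import BeautifulSoup

# html is the page source (for example driver.page_source when using Selenium)
soup = BeautifulSoup(html, "html.parser")
# (\w+) captures the UID between "UID_" and "-SRC_"
pattern = re.compile(r"UID_(\w+)-SRC_\d+")
div = soup.find("div", id=pattern)
uid = pattern.match(div["id"]).group(1)
print(uid)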
Here we use BeautifulSoup to search for a div whose id attribute value matches a specific regular expression. The pattern contains a capturing group (\w+) that extracts the UID value.
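To see what the group (\w+) does in isolation, here is a minimal sketch that applies the same pattern directly to the id string from the question, with no HTML parsing involved. The helper name extract_uid is hypothetical, and the sketch assumes the id always follows the UID_...-SRC_... layout.

```python
import re

# Same pattern as in the answer: group 1 is the UID between "UID_" and "-SRC_".
UID_PATTERN = re.compile(r"UID_(\w+)-SRC_\d+")


def extract_uid(id_value):
    """Return the UID part of an id attribute value, or None if it does not match.

    extract_uid is a hypothetical helper name used only for this illustration.
    """
    match = UID_PATTERN.match(id_value)
    return match.group(1) if match else None


# id value taken from the example div in the question
sample_id = "UID_60CE07D6DF5C02A987ED7B076F4154F3-SRC_328619641"
uid = extract_uid(sample_id)
print(uid)
```

Returning None for a non-matching value avoids an AttributeError when a div's id does not follow the expected format.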
You can use the .get method to read an attribute value from a tag. For the div in your question:
soup.find("div", class_="memberOverlayLink").get("id")
Of course, if many tags on the page have an id attribute, first narrow the search to the specific tags with find or find_all, and then call .get on each result.
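As a rough sketch of combining find_all with .get when several matching divs exist, the example below collects every UID on a page. The sample markup is made up for illustration (the second id value is invented), and in a real Selenium session the html string would presumably come from driver.page_source instead.

```python
from bs4 import BeautifulSoup

# Made-up sample markup; the second UID and SRC values are invented for illustration.
html = """
<div id="UID_60CE07D6DF5C02A987ED7B076F4154F3-SRC_328619641" class="memberOverlayLink"></div>
<div id="UID_0123456789ABCDEF0123456789ABCDEF-SRC_123456789" class="memberOverlayLink"></div>
<div id="footer" class="other"></div>
"""

soup = BeautifulSoup(html, "html.parser")

uids = []
for div in soup.find_all("div", class_="memberOverlayLink"):
    id_value = div.get("id")  # returns None if the tag has no id attribute
    if id_value and id_value.startswith("UID_"):
        # Drop the "UID_" prefix, then keep everything before "-SRC_".
        uids.append(id_value[len("UID_"):].split("-SRC_")[0])

print(uids)
```

Filtering on the class first keeps unrelated elements such as the footer div out of the result, and plain string splitting is enough here because the prefix and separator are fixed.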