
Scraping data from the tag names in python

Hi, I am trying to scrape user data from a website. I need the user IDs, which are available in the tags themselves. I am trying to scrape the UID from the div tag using Python, selenium and Beautiful Soup.

Example:

<div id="UID_60CE07D6DF5C02A987ED7B076F4154F3-SRC_328619641" class="memberOverlayLink" onmouseover="ta.trackEventOnPage('Reviews','show_reviewer_info_window','user_name_photo'); ta.call('ta.overlays.Factory.memberOverlayWOffset', event, this, 's3 dg rgba_gry update2012', 0, (new Element(this)).getElement('.avatar')&amp;&amp;(new Element(this)).getElement('.avatar').getStyle('border-radius')=='100%'?-10:0);">

I have looked through all the documentation and several web pages but I can't find a solution for this. If anyone can tell me whether such a thing is possible, I would be very grateful.

Assuming the id attribute value is always in the format UID_ followed by one or more alphanumeric characters, followed by -SRC_, followed by one or more digits:

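import re
from bs4 import BeautifulSoup

# html holds the page source, e.g. driver.page_source from selenium
soup = BeautifulSoup(html, "html.parser")

# The capturing group (\w+) holds the UID
pattern = re.compile(r"UID_(\w+)-SRC_\d+")
div_id = soup.find("div", id=pattern)["id"]

uid = pattern.match(div_id).group(1)
print(uid)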

Here we use BeautifulSoup to search for a div whose id attribute value matches a specific regular expression. The expression contains a capturing group (\w+) that lets us extract the UID value.
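To see the capturing group on its own, here is a minimal sketch that applies the same pattern to plain strings, without any HTML parsing. The helper name extract_uid and the second sample id are made up for illustration; only the first id comes from the question.

```python
import re

# Same format assumption as above: UID_<word characters>-SRC_<digits>
UID_PATTERN = re.compile(r"UID_(\w+)-SRC_\d+")

def extract_uid(id_value):
    """Return the UID part of an id value, or None if the format does not match."""
    match = UID_PATTERN.match(id_value)
    return match.group(1) if match else None

sample_ids = [
    "UID_60CE07D6DF5C02A987ED7B076F4154F3-SRC_328619641",  # from the question
    "review_328619641",  # hypothetical id that does not follow the format
]
for value in sample_ids:
    print(value, "->", extract_uid(value))
```

Returning None for a non-matching id avoids the AttributeError you would get from calling .group(1) on a failed match.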

You can use the .get method to read an attribute value, such as the id, from a tag.

For the div in your question:

soup.find("div", class_="memberOverlayLink").get("id")

Of course, if many tags on the page have an id attribute, you need to select a more specific tag with the find or find_all method before calling .get.
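As a rough sketch of narrowing the search before calling .get, the snippet below collects the id of every div with the memberOverlayLink class from the question. The markup is made up for illustration (only the first id is taken from the question), and it assumes the member divs can be identified by that class.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two member divs, one without an id, one unrelated div
html = """
<div id="UID_60CE07D6DF5C02A987ED7B076F4154F3-SRC_328619641" class="memberOverlayLink"></div>
<div id="UID_ABC123-SRC_42" class="memberOverlayLink"></div>
<div class="memberOverlayLink"></div>
<div id="footer"></div>
"""

soup = BeautifulSoup(html, "html.parser")

ids = []
for div in soup.find_all("div", class_="memberOverlayLink"):
    div_id = div.get("id")  # .get returns None when the attribute is missing
    if div_id is not None:
        ids.append(div_id)

print(ids)
```

Unlike div["id"], which raises a KeyError when the attribute is absent, .get lets you skip tags that have no id.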

