
Regex get id from enclosed brackets

I am fetching data from Microsoft about their licensing plans. Here is the page with the data, for reference:

https://docs.microsoft.com/en-us/azure/active-directory/enterprise-users/licensing-service-plan-reference#feedback

I'm working on the table data to fetch the respective products with their GUIDs. For the first column it was easy, but in the last column the entries are separated only by break tags. Here's my code for it.

import requests
from requests.api import head
from bs4 import BeautifulSoup
import pandas as pd
import json
import re

url = "https://docs.microsoft.com/en-us/azure/active-directory/enterprise-users/licensing-service-plan-reference#feedback"

payload = {}
headers = {}

response = requests.request("GET", url, headers=headers, data=payload)
soup = BeautifulSoup(response.content, 'lxml')
table = soup.find( "table" )
df = pd.read_html(str(table))[0]
df = df.drop(labels=['Service plans included'],
  axis='columns')
json_dict = json.loads(df.to_json(orient='records'))
regex = r"([A-Z ]+ \(.*?\))"
microsoft_processed_data = []
for item in json_dict:
    plan_data = item["Service plans included (friendly names)"]
    matches = re.findall(regex, plan_data)
    plans = {}  # maps id -> plan name; not called `dict`, which would shadow the built-in
    for match in matches:
        dict_key = match.split("(", )[1]
        dict_key = dict_key.replace(")", "")
        dict_value = match.split(" (")[0]
        print(dict_key + " : " + dict_value)
        plans[dict_key] = dict_value
    item["Service plans included (friendly names)"] = plans
    microsoft_processed_data.append(item)

with open('data.json', 'w') as f:
    json.dump(microsoft_processed_data, f, indent = 4)

It worked until they started using brackets in the plan names as well. At that point my regex, which relies on a single group, failed.

Consider this sample row:

EXCHANGE ONLINE (PLAN 1) (9aaf7827-d63c-4b61-89c3-182f06f82e5c)

With my regex, the match runs from the beginning of the text to the end of the first closing bracket.

So my regex picked up the data only up to: EXCHANGE ONLINE (PLAN 1)

But I want to capture everything up to and including the GUID, and then separate the name from it for the dictionary.

Here's my sample expected dictionary:

{
    "EXCHANGE ONLINE (PLAN 1)" : "9aaf7827-d63c-4b61-89c3-182f06f82e5c"
}

Try: regex = r"([A-Z ]+ \(.*?\))\s*\((.*?)\)". Each match comes back as a tuple of the two values you are looking for.

Let's focus on the \s*\((.*?)\) part only:

  • \s* matches any number of whitespace characters, including none.
  • \((.*?)\) then captures whatever sits between the next pair of parentheses.
re.findall(regex, text)
[('EXCHANGE ONLINE (PLAN 1)', '9aaf7827-d63c-4b61-89c3-182f06f82e5c')]
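
To make the breakdown above easier to follow, here is the same pattern written with re.VERBOSE so each piece carries a comment. This is only an annotated sketch of the pattern from this answer. The variable names (pattern, text) are illustrative, and [ ] stands in for the literal space because verbose mode ignores bare whitespace.

```python
import re

# Same pattern as above, spelled out with re.VERBOSE.
# In verbose mode, whitespace outside a character class is ignored,
# so the literal space before the bracket is written as [ ].
pattern = re.compile(
    r"""
    (                 # group 1: the plan name
        [A-Z ]+       #   uppercase letters and spaces
        [ ]\( .*? \)  #   a space, then a bracketed part of the name
    )
    \s*               # optional whitespace between the name and the id
    \( (.*?) \)       # group 2: the text inside the next pair of brackets
    """,
    re.VERBOSE,
)

text = "EXCHANGE ONLINE (PLAN 1) (9aaf7827-d63c-4b61-89c3-182f06f82e5c)"
print(pattern.findall(text))
# [('EXCHANGE ONLINE (PLAN 1)', '9aaf7827-d63c-4b61-89c3-182f06f82e5c')]
```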

You can pass the result directly to dict if you want to create a dictionary out of it:

dict(re.findall(regex, text))
{'EXCHANGE ONLINE (PLAN 1)': '9aaf7827-d63c-4b61-89c3-182f06f82e5c'}
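
One caveat: the pattern above expects a bracketed part inside every plan name, so a name without brackets would not be picked up correctly. A possible alternative, sketched here under assumptions, is to anchor on the shape of the GUID itself (8-4-4-4-12 hex digits) and treat everything before it as the name. The helper parse_plans, the constant names, the plan name AZURE ACTIVE DIRECTORY, and the all-zero GUID are made up for illustration. It also assumes the entries in a cell follow one another directly or with whitespace in between.

```python
import re

# Assumption: every id has the usual 8-4-4-4-12 hexadecimal GUID shape.
GUID = r"[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}"

# Lazily take everything up to the next "(<guid>)" as the plan name.
PLAN_RE = re.compile(rf"\s*(.+?)\s*\(({GUID})\)")


def parse_plans(cell_text):
    """Return {plan name: guid} for every 'NAME (GUID)' entry in one table cell."""
    return {name: guid for name, guid in PLAN_RE.findall(cell_text)}


# Hypothetical cell: the first entry and its GUID are placeholders,
# the second is the sample row from the question.
sample = (
    "AZURE ACTIVE DIRECTORY (00000000-0000-0000-0000-000000000001) "
    "EXCHANGE ONLINE (PLAN 1) (9aaf7827-d63c-4b61-89c3-182f06f82e5c)"
)

for name, guid in parse_plans(sample).items():
    print(f"{name} : {guid}")
```

Because the match is tied to the GUID rather than to the first closing bracket, brackets inside the name no longer matter. The cost is that an entry whose id is not a well-formed GUID would be skipped or merged into the next name.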

([A-Z,0-9,\-,\(,\),\s]+\s*\(.*?\))

If anybody looks at the Microsoft Office 365 license data set in the future, this regex works well in a crawler for getting all the child licenses of each license type.
