Extract keywords from links

I'm trying to extract the first two numbers from links like these:

https://primer.text.com/sdfg/8406758680-345386743-DSS1-S%20Jasd%12Odsfr%12Iwetds-Osdgf/ 
https://primer.text.com/sdfg/8945879094-849328844-DPE-S%20Jsdfe%12OIert-Isdfu/
https://primer.text.com/sdfg/8493093053-292494834-QW23%23Wsdfg%23Iprf%64Uiojn%32Asdfg-Werts/

The output should look like this:

id1 = ['8406758680', '8945879094','8493093053']
id2 = ['345386743', '849328844', '292494834']

I'm trying to do this using the re module.

Please tell me how to do it.

This is the code snippet I have so far:

def GetUrlClassId(UrlInPut):
    ClassID = ''
    for i in UrlInPut:
        if i.isdigit():
            ClassID+=i
        elif ClassID !='':
            return int(ClassID)
    return ""

def GetUrlInstanceID(UrlInPut):
    InstanceId = ''
    ClassID = 0
    for i in UrlInPut:
        if i.isdigit() and ClassID==1:
            InstanceId+=i
        elif InstanceId !='':
            return int(InstanceId)
        if i == '-':
            ClassID+=1
    return ""

I don't want to use something like this. I would like to use regular expressions.

With a regex, you can match the base URL literally and then capture two groups of digits using \d+ (\d matches a digit 0-9, and + matches one or more of the preceding token). When the pattern contains more than one group, re.findall returns a list of tuples, one tuple of captured groups per match.

import re
l1 = "https://primer.text.com/sdfg/8406758680-345386743-DSS1-S%20Jasd%12Odsfr%12Iwetds-Osdgf/"
l2 = "https://primer.text.com/sdfg/8945879094-849328844-DPE-S%20Jsdfe%12OIert-Isdfu/"
l3 = "https://primer.text.com/sdfg/8493093053-292494834-QW23%23Wsdfg%23Iprf%64Uiojn%32Asdfg-Werts/"

for l in [l1, l2, l3]:
    # Dots are escaped so they match a literal "." rather than any character.
    result = re.findall(r'https://primer\.text\.com/sdfg/(\d+)-(\d+)', l)
    print(result)

Output:

[('8406758680', '345386743')]
[('8945879094', '849328844')]
[('8493093053', '292494834')]

From here, reshaping the result into your desired data structure should be simple enough (for example, with zip).
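One possible way to do that reshaping is sketched below. It collects the (first, second) tuples from re.findall and transposes them with zip. The names links, pairs, id1 and id2 are chosen here for illustration, and the sketch assumes every link matches the pattern (the unpacking fails if pairs is empty).

```python
import re

links = [
    "https://primer.text.com/sdfg/8406758680-345386743-DSS1-S%20Jasd%12Odsfr%12Iwetds-Osdgf/",
    "https://primer.text.com/sdfg/8945879094-849328844-DPE-S%20Jsdfe%12OIert-Isdfu/",
    "https://primer.text.com/sdfg/8493093053-292494834-QW23%23Wsdfg%23Iprf%64Uiojn%32Asdfg-Werts/",
]

# Literal base URL followed by two runs of digits separated by a hyphen.
pattern = re.compile(r'https://primer\.text\.com/sdfg/(\d+)-(\d+)')

# Gather one (first, second) tuple per matching link.
pairs = []
for link in links:
    pairs.extend(pattern.findall(link))

# zip(*pairs) transposes rows into columns: all first IDs, then all second IDs.
id1, id2 = map(list, zip(*pairs))

print(id1)
print(id2)
```

The IDs stay as strings, which matches the desired output in the question; wrap them in int() if numbers are needed.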

The regex pattern is /(\d{10})-(\d{9}). The parentheses define the capture groups for the two runs of digits, and {n} matches exactly n repetitions of the preceding token (see the re module documentation).

# urls separated by a white space
urls = 'https://primer.text.com/sdfg/8406758680-345386743-DSS1-S%20Jasd%12Odsfr%12Iwetds-Osdgf/ https://primer.text.com/sdfg/8945879094-849328844-DPE-S%20Jsdfe%12OIert-Isdfu/ https://primer.text.com/sdfg/8493093053-292494834-QW23%23Wsdfg%23Iprf%64Uiojn%32Asdfg-Werts/'

urls = urls.split() # as list

import re

ids = [re.search(r'/(\d{10})-(\d{9})', url).groups() for url in urls]
print(list(zip(*ids)))

Output:

[('8406758680', '8945879094', '8493093053'), ('345386743', '849328844', '292494834')]
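Note that re.search returns None when a URL does not contain the pattern, so calling .groups() directly would raise an AttributeError on such input. A minimal sketch of a more defensive variant follows; the function name extract_ids and the choice to silently skip non-matching URLs are assumptions made for illustration, not part of the original answer.

```python
import re

# Same fixed-width pattern as above: 10 digits, a hyphen, then 9 digits.
ID_PATTERN = re.compile(r'/(\d{10})-(\d{9})')

def extract_ids(urls):
    """Return two lists (first IDs, second IDs), skipping URLs without a match."""
    id1, id2 = [], []
    for url in urls:
        match = ID_PATTERN.search(url)
        if match is None:
            # No ID pair in this URL; skip it instead of raising.
            continue
        first, second = match.groups()
        id1.append(first)
        id2.append(second)
    return id1, id2

urls = [
    "https://primer.text.com/sdfg/8406758680-345386743-DSS1-S%20Jasd%12Odsfr%12Iwetds-Osdgf/",
    "https://primer.text.com/sdfg/no-ids-here/",  # hypothetical URL without IDs
    "https://primer.text.com/sdfg/8945879094-849328844-DPE-S%20Jsdfe%12OIert-Isdfu/",
]

id1, id2 = extract_ids(urls)
print(id1)
print(id2)
```

If the digit counts can vary, replacing \d{10} and \d{9} with \d+ would be the more permissive choice.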
