Fast way to extract a list of URLs and check validity

I am working on a chat bot. I want it to post matching data from an API whenever a link to a gallery on an imageboard is posted. The gallery link looks like this:

https://example.org/a/1234/a6fb1049/

where 1234 is a positive number (id) and a6fb1049 is a hexadecimal string of fixed length 10 (token). Right now I am only able to process messages that start with a gallery link.

if message_object.content.startswith("https://example.org/a/"):

I am looking for a fast way to process the message string, because this check is invoked every time a message is sent.

if message_object.content.startswith("https://example.org/a/"):
    temp = message_object.content.split("/")

    # Check if link is actually a valid link
    if temp[2] == "example.org" and temp[3] == "a" and 0 < int(temp[4]) and len(temp[5]) == 10:
        gallery_id = temp[4]
        gallery_token = temp[5]

        response = requests.post(url, payload, json_request_headers)

I thought about using urllib.parse.urlparse and posixpath.split to split the string and check the different substrings, but I feel like this is inefficient.

Because I am really not good with regex, this is all I came up with:

searchObj = re.search(r'https://example.org/a/(.*)/(.*)/', message)

This works when the message contains exactly one matching link, but it already fails as soon as there are two links.

I would rather collect all matching links in the message into a list, then iterate over the list and check the page header to see whether each link is valid, and then create an API request to retrieve the data.

The regular expressions for matching URLs on Stack Overflow don't show how to match only such specific cases, so I am sorry if this is a newb question.

I don't understand why you wrote https://example.org/a/(.*)/(.*)/ when you know precisely that "1234 is a positive number (id) and a6fb1049 is a hexadecimal String of fixed length 10" (or perhaps 8). Translating this sentence into a pattern is very easy and needs only simple notions:

re.findall(r'(https://example.org/a/([0-9]+)/([0-9a-f]{10})/)', message)

re.findall is the method to use to get several results (re.search returns only the first result; see the re module manual).

You obtain a list of tuples, where each tuple contains the parts matched by the sub-patterns enclosed in round brackets (capture groups). Feel free to put the brackets wherever you want.
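As a minimal sketch of what this looks like in practice, the snippet below compiles the pattern once (so it is not rebuilt on every incoming message) and unpacks each result. The names GALLERY_RE and find_gallery_links and the sample message are hypothetical, chosen only for illustration.

```python
import re

# Hypothetical name; the pattern is the one shown above, compiled once
# at import time so each message only pays for the scan itself.
GALLERY_RE = re.compile(r'(https://example.org/a/([0-9]+)/([0-9a-f]{10})/)')


def find_gallery_links(message):
    """Return a list of (url, gallery_id, gallery_token) tuples."""
    return GALLERY_RE.findall(message)


# Made-up sample message containing two gallery links.
message = (
    "first https://example.org/a/1234/a6fb1049c3/ "
    "second https://example.org/a/42/00ff00ff00/"
)

for url, gallery_id, gallery_token in find_gallery_links(message):
    print(url, gallery_id, gallery_token)
```

A message without any gallery link simply yields an empty list, so the loop body is skipped.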

If you want to know if there are links that don't match the format you want, you can also use something like this: 如果您想知道是否存在与所需格式不匹配的链接,则也可以使用以下格式:

re.findall(r'(https://example.org/a/(?:([0-9]+)/([0-9a-f]{10})/|\S*))', message)

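Then you only have to test whether group 2 is empty to know if a link has the right format. Note that re.findall returns an empty string for a group that did not take part in the match; with re.search or re.finditer, match.group(2) is None instead.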
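A possible way to separate well-formed links from malformed ones is sketched below with re.finditer, where a group that did not participate really is None. It uses a variant of the pattern in which the alternation sits inside the non-capturing group and the fallback branch is \S*, so a malformed link stops at the next whitespace; both choices are assumptions of this sketch, as are the names LINK_RE and classify_links.

```python
import re

# Assumption: the fallback branch is \S* (not .*), so a malformed link ends
# at the next whitespace instead of swallowing the rest of the line.
LINK_RE = re.compile(
    r'(https://example.org/a/(?:([0-9]+)/([0-9a-f]{10})/|\S*))'
)


def classify_links(message):
    """Split gallery-like links into well-formed and malformed ones."""
    valid, invalid = [], []
    for match in LINK_RE.finditer(message):
        if match.group(2) is None:
            # The id/token branch did not match: keep the raw link text.
            invalid.append(match.group(1))
        else:
            valid.append((match.group(2), match.group(3)))
    return valid, invalid


# Made-up sample message with one good and one bad link.
message = (
    "ok https://example.org/a/12/0123456789/ "
    "bad https://example.org/a/foo/bar/ end"
)
valid, invalid = classify_links(message)
print(valid)
print(invalid)
```

Only the entries in valid would then be passed on to the API request.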
