
How to extract parameters from URL?

url = 'https://www.allrecipes.com/recipes/695/world-cuisine/asian/chinese/'
url2 = 'https://www.allrecipes.com/recipes/94/soups-stews-and-chili/'
new = url.split("/")[-4:]
new2 = url2.split("/")[-2:]
print(new)
print(new2)

Output : ['world-cuisine', 'asian', 'chinese', ''] 
         ['soups-stews-and-chili', '']
  • The output I need is ['world-cuisine', 'asian', 'chinese'] & ['soups-stews-and-chili'].
  • These URLs have different parameters, so I can't go through all the URLs and extract only the main parameters after the number.
  • The '/' at the end of the URL is also required, because in Scrapy a URL w/o the trailing '/' throws a 301 error, but as you can see from the output, that trailing slash leaves an extra '' that I can't omit.
  • How do I get the parameters for these various URLs?

Some other examples of such URLs are:

'https://www.allrecipes.com/recipes/416/seafood/fish/salmon/'

'https://www.allrecipes.com/recipes/205/meat-and-poultry/pork/'

  • How do we write a rule to follow the pagination of such URLs, e.g. 'https://www.allrecipes.com/recipes/695/world-cuisine/asian/chinese/?page=2'? (See the sketch below.)

    Rule(LinkExtractor(allow=(r'recipes/?page=\d+',)), follow=True)

I'm new to scrapy and regular expressions, so any help with this would be much appreciated.
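The Rule above won't match as written: inside a regular expression '?' is a quantifier, so recipes/?page=\d+ means 'recipes', an optional '/', then 'page=...', which never occurs in these URLs; the literal question mark has to be escaped. A minimal sketch of how such a rule could look, assuming a CrawlSpider setup; the spider name, start URL and callback are illustrative:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class RecipesSpider(CrawlSpider):
    name = 'recipes'  # illustrative spider name
    allowed_domains = ['allrecipes.com']
    start_urls = ['https://www.allrecipes.com/recipes/695/world-cuisine/asian/chinese/']

    rules = (
        # escape the '?' so the pattern matches a literal query string
        # such as .../chinese/?page=2 anywhere under /recipes/
        Rule(LinkExtractor(allow=(r'/recipes/.*\?page=\d+',)),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # placeholder callback: just record the paginated URL that was followed
        yield {'url': response.url}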

You can combine the re module + str.split:

import re

urls = [
    "https://www.allrecipes.com/recipes/695/world-cuisine/asian/chinese/",
    "https://www.allrecipes.com/recipes/94/soups-stews-and-chili/",
    "https://www.allrecipes.com/recipes/416/seafood/fish/salmon/",
    "https://www.allrecipes.com/recipes/205/meat-and-poultry/pork/",
]

r = re.compile(r"(?:\d+/)(.*)/")

for url in urls:
    print(r.search(url).group(1).split("/"))

Prints:

['world-cuisine', 'asian', 'chinese']
['soups-stews-and-chili']
['seafood', 'fish', 'salmon']
['meat-and-poultry', 'pork']
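A note on the pattern's design: \d+/ anchors the match at the numeric id, the greedy (.*) then captures everything up to the last /, and because that closing / sits outside the capture group, the trailing empty string from the question never shows up after the split.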

I'm not 100% sure I understood your question correctly, but I think the code below does what you need.

EDIT
Updated the code after the discussion in the comments

urls = [
    'https://www.allrecipes.com/recipes/416/seafood/fish/salmon/',
    'https://www.allrecipes.com/recipes/205/meat-and-poultry/pork/',
    'https://www.allrecipes.com/recipes/695/world-cuisine/asian/chinese/',
    'https://www.allrecipes.com/recipes/94/soups-stews-and-chili/',
    'https://www.allrecipes.com/recipes/qqqq/94/soups-stews-and-chili/x/y/z/q'
]

for url in urls:
    for index, part in enumerate(url.split('/')):
        if part.isnumeric():
            start = index+1
            break
    print(url.split('/')[start:-1])

output

['seafood', 'fish', 'salmon']
['meat-and-poultry', 'pork']
['world-cuisine', 'asian', 'chinese']
['soups-stews-and-chili']
['soups-stews-and-chili', 'x', 'y', 'z']

Old answer

urls = [
    'https://www.allrecipes.com/recipes/416/seafood/fish/salmon/',
    'https://www.allrecipes.com/recipes/205/meat-and-poultry/pork/',
    'https://www.allrecipes.com/recipes/695/world-cuisine/asian/chinese/',
    'https://www.allrecipes.com/recipes/94/soups-stews-and-chili/'
]

for url in urls:
    print(url.split("/")[5:-1])

output

['seafood', 'fish', 'salmon']
['meat-and-poultry', 'pork']
['world-cuisine', 'asian', 'chinese']
['soups-stews-and-chili']

Something like this. The idea is to find the 'int' path element and take all the path elements to its right.

from collections import defaultdict
from typing import Dict, List

urls = ['https://www.allrecipes.com/recipes/416/seafood/fish/salmon/',
        'https://www.allrecipes.com/recipes/205/meat-and-poultry/pork/']


def is_int(param: str) -> bool:
    try:
        int(param)
        return True
    except ValueError:
        return False


data: Dict[str, List[str]] = defaultdict(list)
for url in urls:
    # walk the path segments from right to left and collect them
    # until the numeric id is reached
    elements = url.split('/')
    elements.reverse()
    for element in elements:
        if len(element.strip()) < 1:
            continue
        if not is_int(element):
            data[url].append(element)
        else:
            break
print(data)

output

defaultdict(<class 'list'>, {'https://www.allrecipes.com/recipes/416/seafood/fish/salmon/': ['salmon', 'fish', 'seafood'], 'https://www.allrecipes.com/recipes/205/meat-and-poultry/pork/': ['pork', 'meat-and-poultry']})

When working with urls, try to avoid (or at least postpone) regex and look first at urllib or similar, and/or split().

Just one url (urls[4], the last one in the list below) with the full details:

from urllib.parse import urlparse

urlparse(urls[4])

ParseResult(scheme='https', netloc='www.allrecipes.com', path='/recipes/695/world-cuisine/asian/chinese/', params='', query='page=2', fragment='')

Just looping over the list, with path and split() only:

# a list of urls
urls = ['https://www.allrecipes.com/recipes/695/world-cuisine/asian/chinese/',
        'https://www.allrecipes.com/recipes/94/soups-stews-and-chili/',
        'https://www.allrecipes.com/recipes/416/seafood/fish/salmon/',
        'https://www.allrecipes.com/recipes/205/meat-and-poultry/pork/',
        'https://www.allrecipes.com/recipes/695/world-cuisine/asian/chinese/?page=2']


for url in urls:
    # https://www.allrecipes.com/recipes/695/world-cuisine/asian/chinese/

    l = urlparse(url).path.split('/')
    # ['', 'recipes', '695', 'world-cuisine', 'asian', 'chinese', '']
    
    print(l[3:])
    # ['world-cuisine', 'asian', 'chinese', '']
    
    print('/'.join(l[3:]),'\n')
    # world-cuisine/asian/chinese/

Full output of the above:

['world-cuisine', 'asian', 'chinese', '']
world-cuisine/asian/chinese/ 

['soups-stews-and-chili', '']
soups-stews-and-chili/ 

['seafood', 'fish', 'salmon', '']
seafood/fish/salmon/ 

['meat-and-poultry', 'pork', '']
meat-and-poultry/pork/ 

['world-cuisine', 'asian', 'chinese', '']
world-cuisine/asian/chinese/ 

Another example (this time not just the path):

for parts in urls:
    print(list(urlparse(parts)), '\n')

Output:

['https', 'www.allrecipes.com', '/recipes/695/world-cuisine/asian/chinese/', '', '', ''] 

['https', 'www.allrecipes.com', '/recipes/94/soups-stews-and-chili/', '', '', ''] 

['https', 'www.allrecipes.com', '/recipes/416/seafood/fish/salmon/', '', '', ''] 

['https', 'www.allrecipes.com', '/recipes/205/meat-and-poultry/pork/', '', '', ''] 

['https', 'www.allrecipes.com', '/recipes/695/world-cuisine/asian/chinese/', '', 'page=2', '']
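If the trailing '' also has to go, as the question asks, filtering the empty segments out of the parsed path gives exactly the lists the question wants. A minimal sketch building on urlparse; the slice index 3 assumes the '/recipes/<id>/' prefix seen in the question's URLs:

from urllib.parse import urlparse

urls = ['https://www.allrecipes.com/recipes/695/world-cuisine/asian/chinese/',
        'https://www.allrecipes.com/recipes/94/soups-stews-and-chili/']

for url in urls:
    # keep only the non-empty path segments after '/recipes/<id>/'
    print([part for part in urlparse(url).path.split('/')[3:] if part])

# ['world-cuisine', 'asian', 'chinese']
# ['soups-stews-and-chili']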
