
How to extract parameters from URL?

url = 'https://www.allrecipes.com/recipes/695/world-cuisine/asian/chinese/'
url2 = 'https://www.allrecipes.com/recipes/94/soups-stews-and-chili/'
new = url.split("/")[-4:]
new2 = url2.split("/")[-2:]
print(new)
print(new2)

Output : ['world-cuisine', 'asian', 'chinese', ''] 
         ['soups-stews-and-chili', '']
  • The output I need is ['world-cuisine', 'asian', 'chinese'] and ['soups-stews-and-chili'].
  • The URLs have a different number of parameters, so I cannot find one approach that works for every URL and extracts only the main parameters after the number.
  • The '/' at the end of the URL is necessary, because in Scrapy a URL without the trailing '/' results in a 301 redirect. As the output shows, that trailing slash produces an extra '' which I have not been able to drop.
  • What can I do to get the parameters for all sorts of URLs?

Some other examples of the URLs are:

'https://www.allrecipes.com/recipes/416/seafood/fish/salmon/'

'https://www.allrecipes.com/recipes/205/meat-and-poultry/pork/'

  • How can we write a rule to follow the pagination for URLs such as 'https://www.allrecipes.com/recipes/695/world-cuisine/asian/chinese/?page=2'? This is my attempt:

    Rule(LinkExtractor(allow=(r'recipes/?page=\d+',)), follow=True)

I am new to Scrapy and regex, so I would really appreciate your help with this problem.

You can combine the re module with str.split:

import re

urls = [
    "https://www.allrecipes.com/recipes/695/world-cuisine/asian/chinese/",
    "https://www.allrecipes.com/recipes/94/soups-stews-and-chili/",
    "https://www.allrecipes.com/recipes/416/seafood/fish/salmon/",
    "https://www.allrecipes.com/recipes/205/meat-and-poultry/pork/",
]

r = re.compile(r"(?:\d+/)(.*)/")

for url in urls:
    print(r.search(url).group(1).split("/"))

Prints:

['world-cuisine', 'asian', 'chinese']
['soups-stews-and-chili']
['seafood', 'fish', 'salmon']
['meat-and-poultry', 'pork']
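
Note that the pattern above relies on the trailing slash. For a URL without one, r.search(url) either returns None (so .group(1) raises AttributeError) or silently drops the last segment. A minimal sketch of a more tolerant variant follows; the pattern and the name extract_categories are made up for illustration and assume the wanted parts always follow the first all-digit path segment:

```python
import re

# Hypothetical variant of the pattern above: capture everything after the
# first all-digit path segment, stopping at a query string ('?') or fragment ('#').
PATTERN = re.compile(r"/\d+/([^?#]*)")


def extract_categories(url):
    """Return the path segments after the numeric id, or [] if there is none."""
    match = PATTERN.search(url)
    if match is None:
        return []
    # Filtering out empty strings removes the '' left by a trailing slash.
    return [part for part in match.group(1).split("/") if part]


print(extract_categories("https://www.allrecipes.com/recipes/94/soups-stews-and-chili"))
```

With this variant the trailing slash can stay in the URL that is sent to Scrapy, since the empty string it produces is filtered out afterwards.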

I'm not 100% sure if I correctly understood your question, but I think the following code gets you what you need.

EDIT
Updated code after comment interaction

urls = [
    'https://www.allrecipes.com/recipes/416/seafood/fish/salmon/',
    'https://www.allrecipes.com/recipes/205/meat-and-poultry/pork/',
    'https://www.allrecipes.com/recipes/695/world-cuisine/asian/chinese/',
    'https://www.allrecipes.com/recipes/94/soups-stews-and-chili/',
    'https://www.allrecipes.com/recipes/qqqq/94/soups-stews-and-chili/x/y/z/q'
]

for url in urls:
    for index, part in enumerate(url.split('/')):
        if part.isnumeric():
            start = index+1
            break
    print(url.split('/')[start:-1])

output

['seafood', 'fish', 'salmon']
['meat-and-poultry', 'pork']
['world-cuisine', 'asian', 'chinese']
['soups-stews-and-chili']
['soups-stews-and-chili', 'x', 'y', 'z']
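
Two caveats with the loop above: if a URL has no numeric segment, start is never assigned (or keeps its value from the previous URL), and the [start:-1] slice drops the last segment when the URL has no trailing slash, which is why 'q' is missing from the last line of the output. A possible variant is sketched below; segments_after_id is a name invented here, and cutting the URL at the first '?' is an assumption about how query strings should be handled:

```python
def segments_after_id(url):
    """Return the path segments that follow the first all-digit segment."""
    # Assumption: anything from the first '?' onwards is a query string.
    path = url.split("?", 1)[0]
    # Dropping empty strings handles both 'https://' and a trailing slash.
    parts = [part for part in path.split("/") if part]
    for index, part in enumerate(parts):
        if part.isdigit():
            return parts[index + 1:]
    # No numeric segment found.
    return []


print(segments_after_id("https://www.allrecipes.com/recipes/qqqq/94/soups-stews-and-chili/x/y/z/q"))
```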

old answer

urls = [
    'https://www.allrecipes.com/recipes/416/seafood/fish/salmon/',
    'https://www.allrecipes.com/recipes/205/meat-and-poultry/pork/',
    'https://www.allrecipes.com/recipes/695/world-cuisine/asian/chinese/',
    'https://www.allrecipes.com/recipes/94/soups-stews-and-chili/'
]

for url in urls:
    print(url.split("/")[5:-1])

output

['seafood', 'fish', 'salmon']
['meat-and-poultry', 'pork']
['world-cuisine', 'asian', 'chinese']
['soups-stews-and-chili']

Something like this. The idea is to find the integer path element and collect all the path elements to its right.

from collections import defaultdict
from typing import Dict, List

urls = ['https://www.allrecipes.com/recipes/416/seafood/fish/salmon/',
        'https://www.allrecipes.com/recipes/205/meat-and-poultry/pork/']


def is_int(param: str) -> bool:
    try:
        int(param)
        return True
    except ValueError:
        return False


data: Dict[str, List[str]] = defaultdict(list)
for url in urls:
    # Walk the path elements from right to left and stop at the integer id.
    for element in reversed(url.split('/')):
        if not element.strip():
            continue  # skip the empty string produced by the trailing '/'
        if is_int(element):
            break
        data[url].append(element)
print(data)

output (the elements are collected from right to left, so each list is in reverse order)

defaultdict(<class 'list'>, {'https://www.allrecipes.com/recipes/416/seafood/fish/salmon/': ['salmon', 'fish', 'seafood'], 'https://www.allrecipes.com/recipes/205/meat-and-poultry/pork/': ['pork', 'meat-and-poultry']})

When dealing with a URL, try to avoid (or at least delay) regex and look first at urllib or similar, and/or split().

A single URL in full detail:

from urllib.parse import urlparse

urlparse('https://www.allrecipes.com/recipes/695/world-cuisine/asian/chinese/?page=2')

ParseResult(scheme='https', netloc='www.allrecipes.com', path='/recipes/695/world-cuisine/asian/chinese/', params='', query='page=2', fragment='')

Looping over the list, taking only the path and applying split():

# a list of URLs
urls = ['https://www.allrecipes.com/recipes/695/world-cuisine/asian/chinese/',
        'https://www.allrecipes.com/recipes/94/soups-stews-and-chili/',
        'https://www.allrecipes.com/recipes/416/seafood/fish/salmon/',
        'https://www.allrecipes.com/recipes/205/meat-and-poultry/pork/',
        'https://www.allrecipes.com/recipes/695/world-cuisine/asian/chinese/?page=2']


for url in urls:
    # The comments show the values for the first URL:
    # https://www.allrecipes.com/recipes/695/world-cuisine/asian/chinese/

    parts = urlparse(url).path.split('/')
    # ['', 'recipes', '695', 'world-cuisine', 'asian', 'chinese', '']

    print(parts[3:])
    # ['world-cuisine', 'asian', 'chinese', '']

    print('/'.join(parts[3:]), '\n')
    # world-cuisine/asian/chinese/

Full output of the above:

['world-cuisine', 'asian', 'chinese', '']
world-cuisine/asian/chinese/ 

['soups-stews-and-chili', '']
soups-stews-and-chili/ 

['seafood', 'fish', 'salmon', '']
seafood/fish/salmon/ 

['meat-and-poultry', 'pork', '']
meat-and-poultry/pork/ 

['world-cuisine', 'asian', 'chinese', '']
world-cuisine/asian/chinese/ 
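
The lists above still end with '' because of the trailing slash, and the page number is not extracted. A minimal sketch that handles both with urllib.parse is shown below. split_recipe_url is a hypothetical helper; it assumes the path always looks like /recipes/<id>/<categories>/ and that a missing page parameter means page 1:

```python
from urllib.parse import parse_qs, urlparse


def split_recipe_url(url):
    """Return (categories, page) for a recipe listing URL."""
    parsed = urlparse(url)
    # Drop the empty strings caused by the leading and trailing slashes.
    segments = [segment for segment in parsed.path.split('/') if segment]
    # Assumed layout: segments[0] is 'recipes', segments[1] is the numeric id.
    categories = segments[2:]
    # parse_qs maps each query key to a list of values; default to page 1.
    page = int(parse_qs(parsed.query).get('page', ['1'])[0])
    return categories, page


print(split_recipe_url('https://www.allrecipes.com/recipes/695/world-cuisine/asian/chinese/?page=2'))
```

Because the query string is parsed separately from the path, the same function works for the first listing page and for paginated URLs.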

Another example (not just the path this time):

for parts in urls:
    print(list(urlparse(parts)), '\n')

Output:

['https', 'www.allrecipes.com', '/recipes/695/world-cuisine/asian/chinese/', '', '', ''] 

['https', 'www.allrecipes.com', '/recipes/94/soups-stews-and-chili/', '', '', ''] 

['https', 'www.allrecipes.com', '/recipes/416/seafood/fish/salmon/', '', '', ''] 

['https', 'www.allrecipes.com', '/recipes/205/meat-and-poultry/pork/', '', '', ''] 

['https', 'www.allrecipes.com', '/recipes/695/world-cuisine/asian/chinese/', '', 'page=2', '']
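
As for the pagination rule in the question: in r'recipes/?page=\d+' the unescaped '?' makes the preceding '/' optional instead of matching a literal question mark, so the pattern never matches the paginated URL. The sketch below checks this with plain re, since the allow patterns of LinkExtractor are regular expressions matched against the URL. The replacement pattern is only an assumption about which links should be followed and has not been run against the live site:

```python
import re

# The pattern from the question: "recipes", an optional '/', then "page=<digits>".
original = re.compile(r'recipes/?page=\d+')

# Assumed fix: escape the '?' and allow the id and category segments in between.
fixed = re.compile(r'/recipes/\d+/.+/\?page=\d+$')

url = 'https://www.allrecipes.com/recipes/695/world-cuisine/asian/chinese/?page=2'

print(bool(original.search(url)))  # False
print(bool(fixed.search(url)))     # True

# In the spider the pattern would then be passed the same way as in the question:
# Rule(LinkExtractor(allow=(r'/recipes/\d+/.+/\?page=\d+$',)), follow=True)
```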
