url = 'https://www.allrecipes.com/recipes/695/world-cuisine/asian/chinese/'
url2 = 'https://www.allrecipes.com/recipes/94/soups-stews-and-chili/'
new = url.split("/")[-4:]
new2 = url2.split("/")[-2:]
print(new)
print(new2)
Output:
['world-cuisine', 'asian', 'chinese', '']
['soups-stews-and-chili', '']
Some other examples of the URLs are:
'https://www.allrecipes.com/recipes/416/seafood/fish/salmon/'
'https://www.allrecipes.com/recipes/205/meat-and-poultry/pork/'
How can we write a rule to follow pagination for URLs like 'https://www.allrecipes.com/recipes/695/world-cuisine/asian/chinese/?page=2'? My current attempt:
Rule(LinkExtractor(allow=(r'recipes/?page=\d+',)), follow=True)
I am new to Scrapy and regex, so I would really appreciate your help with this problem.
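One thing worth checking in the attempted Rule above: `?` is a regex metacharacter ("zero or one of the preceding token"), so `recipes/?page=\d+` matches the literal text `recipespage=2` or `recipes/page=2`, never `?page=2`. Escaping it as `\?` fixes that. A sketch of my own (verified only against plain re, not inside a live spider):

```python
import re

# Escape the '?' so it matches the literal query-string separator.
# This pattern is an assumption about the site's URL scheme, based on
# the example URLs in the question.
pattern = re.compile(r"/recipes/\d+/.*\?page=\d+")

paginated = "https://www.allrecipes.com/recipes/695/world-cuisine/asian/chinese/?page=2"
bare = "https://www.allrecipes.com/recipes/695/world-cuisine/asian/chinese/"

print(bool(pattern.search(paginated)))  # True: the paginated URL matches
print(bool(pattern.search(bare)))       # False: the bare category URL does not
```

In a spider this pattern could then be dropped into the Rule, e.g. Rule(LinkExtractor(allow=(r'/recipes/\d+/.*\?page=\d+',)), follow=True) — though whether LinkExtractor preserves the query string depends on its canonicalization settings, so test against the real pages.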
You can combine the re module with str.split:
import re
urls = [
    "https://www.allrecipes.com/recipes/695/world-cuisine/asian/chinese/",
    "https://www.allrecipes.com/recipes/94/soups-stews-and-chili/",
    "https://www.allrecipes.com/recipes/416/seafood/fish/salmon/",
    "https://www.allrecipes.com/recipes/205/meat-and-poultry/pork/",
]

r = re.compile(r"(?:\d+/)(.*)/")

for url in urls:
    print(r.search(url).group(1).split("/"))
Prints:
['world-cuisine', 'asian', 'chinese']
['soups-stews-and-chili']
['seafood', 'fish', 'salmon']
['meat-and-poultry', 'pork']
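As an aside (my own check, not part of the answer above): the same pattern also copes with the paginated URL from the question, because the greedy `(.*)/` stops at the last `/`, which sits just before the query string:

```python
import re

r = re.compile(r"(?:\d+/)(.*)/")
url = "https://www.allrecipes.com/recipes/695/world-cuisine/asian/chinese/?page=2"

# group(1) captures everything between the numeric id and the final slash,
# so the '?page=2' suffix is excluded automatically.
print(r.search(url).group(1).split("/"))
# ['world-cuisine', 'asian', 'chinese']
```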
I'm not 100% sure if I correctly understood your question, but I think the following code gets you what you need.
EDIT: updated code after comment interaction.
urls = [
    'https://www.allrecipes.com/recipes/416/seafood/fish/salmon/',
    'https://www.allrecipes.com/recipes/205/meat-and-poultry/pork/',
    'https://www.allrecipes.com/recipes/695/world-cuisine/asian/chinese/',
    'https://www.allrecipes.com/recipes/94/soups-stews-and-chili/',
    'https://www.allrecipes.com/recipes/qqqq/94/soups-stews-and-chili/x/y/z/q'
]

for url in urls:
    for index, part in enumerate(url.split('/')):
        if part.isnumeric():
            start = index + 1
            break
    print(url.split('/')[start:-1])
Output:
['seafood', 'fish', 'salmon']
['meat-and-poultry', 'pork']
['world-cuisine', 'asian', 'chinese']
['soups-stews-and-chili']
['soups-stews-and-chili', 'x', 'y', 'z']
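The same idea (find the first numeric path segment, keep what follows) can be sketched more compactly with itertools.dropwhile. This is my own variant, not part of the answer above:

```python
from itertools import dropwhile

def category_parts(url: str) -> list:
    # Drop path segments up to and including the first all-digit one,
    # then keep only the non-empty remainder. Returns [] if the URL
    # has no numeric segment at all.
    parts = url.split("/")
    after_id = list(dropwhile(lambda p: not p.isdigit(), parts))[1:]
    return [p for p in after_id if p]

print(category_parts("https://www.allrecipes.com/recipes/416/seafood/fish/salmon/"))
# ['seafood', 'fish', 'salmon']
```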
Old answer:
urls = [
    'https://www.allrecipes.com/recipes/416/seafood/fish/salmon/',
    'https://www.allrecipes.com/recipes/205/meat-and-poultry/pork/',
    'https://www.allrecipes.com/recipes/695/world-cuisine/asian/chinese/',
    'https://www.allrecipes.com/recipes/94/soups-stews-and-chili/'
]

for url in urls:
    print(url.split("/")[5:-1])
Output:
['seafood', 'fish', 'salmon']
['meat-and-poultry', 'pork']
['world-cuisine', 'asian', 'chinese']
['soups-stews-and-chili']
Something like this. The idea is to find the numeric path element and fetch all path elements to its right.
from collections import defaultdict
from typing import Dict, List

urls = ['https://www.allrecipes.com/recipes/416/seafood/fish/salmon/',
        'https://www.allrecipes.com/recipes/205/meat-and-poultry/pork/']

def is_int(param: str) -> bool:
    try:
        int(param)
        return True
    except ValueError:
        return False

data: Dict[str, List[str]] = defaultdict(list)
for url in urls:
    elements = url.split('/')
    elements.reverse()
    # Walk from the right, collecting segments until the numeric id is hit.
    # (A single for loop with break suffices; wrapping it in a while would
    # loop forever if a URL had no numeric segment.)
    for element in elements:
        if len(element.strip()) < 1:
            continue
        if not is_int(element):
            data[url].append(element)
        else:
            break

print(data)
Output:
defaultdict(<class 'list'>, {'https://www.allrecipes.com/recipes/416/seafood/fish/salmon/': ['salmon', 'fish', 'seafood'], 'https://www.allrecipes.com/recipes/205/meat-and-poultry/pork/': ['pork', 'meat-and-poultry']})
When dealing with a URL, try to avoid (or at least delay) regex; look first to urllib or similar, and/or split().
Just one URL with full details:
from urllib.parse import urlparse
urlparse(urls[4])
ParseResult(scheme='https', netloc='www.allrecipes.com', path='/recipes/695/world-cuisine/asian/chinese/', params='', query='page=2', fragment='')
Looping the list for path only, with split():
# a list of urls
urls = ['https://www.allrecipes.com/recipes/695/world-cuisine/asian/chinese/',
        'https://www.allrecipes.com/recipes/94/soups-stews-and-chili/',
        'https://www.allrecipes.com/recipes/416/seafood/fish/salmon/',
        'https://www.allrecipes.com/recipes/205/meat-and-poultry/pork/',
        'https://www.allrecipes.com/recipes/695/world-cuisine/asian/chinese/?page=2']

for url in urls:
    # https://www.allrecipes.com/recipes/695/world-cuisine/asian/chinese/
    l = urlparse(url).path.split('/')
    # ['', 'recipes', '695', 'world-cuisine', 'asian', 'chinese', '']
    print(l[3:])
    # ['world-cuisine', 'asian', 'chinese', '']
    print('/'.join(l[3:]), '\n')
    # world-cuisine/asian/chinese/
Full output of the above:
['world-cuisine', 'asian', 'chinese', '']
world-cuisine/asian/chinese/
['soups-stews-and-chili', '']
soups-stews-and-chili/
['seafood', 'fish', 'salmon', '']
seafood/fish/salmon/
['meat-and-poultry', 'pork', '']
meat-and-poultry/pork/
['world-cuisine', 'asian', 'chinese', '']
world-cuisine/asian/chinese/
Another example (not just the path this time):
for parts in urls:
    print(list(urlparse(parts)), '\n')
Output:
['https', 'www.allrecipes.com', '/recipes/695/world-cuisine/asian/chinese/', '', '', '']
['https', 'www.allrecipes.com', '/recipes/94/soups-stews-and-chili/', '', '', '']
['https', 'www.allrecipes.com', '/recipes/416/seafood/fish/salmon/', '', '', '']
['https', 'www.allrecipes.com', '/recipes/205/meat-and-poultry/pork/', '', '', '']
['https', 'www.allrecipes.com', '/recipes/695/world-cuisine/asian/chinese/', '', 'page=2', '']
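Since the original question is about following ?page=N links, note that urlparse pairs naturally with parse_qs to pull the page number out of the query string. A small sketch of my own:

```python
from urllib.parse import urlparse, parse_qs

url = "https://www.allrecipes.com/recipes/695/world-cuisine/asian/chinese/?page=2"

# parse_qs maps each query parameter to a list of values.
query = parse_qs(urlparse(url).query)
print(query)  # {'page': ['2']}

# Default to page 1 when the parameter is absent.
page = int(query.get("page", ["1"])[0])
print(page)   # 2
```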