Scraping multiple websites at the same time using Scrapy

I am trying to scrape multiple URLs (blogs that list popular dishes), no matter how dynamic each site is. I want the headings, which are mostly contained in h2 and h3 elements. The code runs fine, but the output is wrong: I can't scrape the full headings from https://www.holidify.com/pages/street-food-in-jaipur-1483.html. Look:

'1. Golgappa at ', '2. Pyaaz Kachori at ', '3. Masala Chai at ', '4. Best of Indian Street Food at ', '5. Kaathi Roll at ', '6. Pav Bhaji at ', '7. Omelette at ', '8. Chicken Tikka at ', '9. Lassi at ', '10. Shrikhand at ', '11. Kulfi Faluda at ', '12. Sweets from Laxmi Mishthan Bhandar (LMB)', "13. Fast Food at Aunty's Cafe", '14. Cold Coffee at Gyan Vihar Dairy (GVD)', '  Top Hotels In Jaipur  ', ' Jaipur Packages ', '  Top Places in Jaipur  ', '  Recently Published  '

I also don't know why I am getting this kind of output

['1.\tDal Bhatti Churma', '2.\tGhewar', '3.\tMawa Kachori', '4.\tMirchi Bada', '5.\tGatte Ki Subzi', '6.\tRajasthani Thali', '7.\tLaal Maas', '8.\tKeema Baati', 'Food experiences in Jaipur:', 'Tempted?']

from https://trip101.com/article/8-best-rajasthani-foods-in-jaipur

import scrapy
import re

class TestSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = ['www.lih.travel','www.crazymasalafood.com','www.holidify.com','www.trip101.com','www.theindiantrip.com',]
    #start_urls=['https://www.holidify.com/pages/street-food-in-jaipur-1483.html']
    start_urls = ['https://www.crazymasalafood.com/top-20-dishes-must-eat-jaipur/',
                  'https://trip101.com/article/8-best-rajasthani-foods-in-jaipur',
                  'https://theindiantrip.com/at/best-famous-street-food-in-jaipur-guide',
                  'https://www.lih.travel/famous-foods-in-jaipur/',
                  'https://www.holidify.com/pages/street-food-in-jaipur-1483.html']

    def parse(self, response):

        # yield {
        #     'title':response.css('h2::text').getall()
        # }
        if response.css('h3::text').re(r'\d+\.\s*\w+'):
            print(response.css('h3::text').re(r'\d+\.\s*\w+'))
            print('first case')

        elif response.css('h2::text').re(r'\d+\.\s*\w+'):
            print(response.css('h2::text').getall())
            print('second case')

        elif response.css('h3::text'):
            print(response.css('h3::text').getall())
            print('third case')
        else:
            print('something is wrong')

Any solution or suggestion would be appreciated.
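A likely cause of the truncated headings, offered as a guess since the page markup is not shown here: in Scrapy, `h3::text` selects only the text nodes that are direct children of the h3, so text inside a nested element (a link, for example) is dropped. The descendant form `h3 ::text` (with a space) or an XPath such as `normalize-space(.)` evaluated on each heading returns the whole text. The tabs in the second output come from the page markup itself and can be collapsed the same way. The sketch below illustrates the idea with the standard library only; the `HeadingText` class, the `heading_texts` helper, and the sample markup are hypothetical and are not copied from any of the sites above.

```python
from html.parser import HTMLParser


class HeadingText(HTMLParser):
    """Collect the full text of every <h2>/<h3>, including nested tags."""

    def __init__(self):
        super().__init__()
        self.headings = []   # finished heading strings
        self._parts = None   # text pieces of the heading being read, or None

    def handle_starttag(self, tag, attrs):
        if tag in ("h2", "h3"):
            self._parts = []

    def handle_data(self, data):
        # Text inside child elements (e.g. <a>) also arrives here.
        if self._parts is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag in ("h2", "h3") and self._parts is not None:
            # Join the pieces and collapse tabs, newlines and repeated spaces.
            self.headings.append(" ".join("".join(self._parts).split()))
            self._parts = None


def heading_texts(html):
    parser = HeadingText()
    parser.feed(html)
    parser.close()
    return parser.headings


# Hypothetical markup: part of the first heading sits inside a nested <a>.
SAMPLE = (
    '<h3>1. Golgappa at <a href="#">Chawla\'s and Nand\'s</a></h3>'
    '<h2>2.\tGhewar</h2>'
)
print(heading_texts(SAMPLE))
```

In the spider itself, the equivalent change would probably be to replace `h3::text` with `h3 ::text`, or to loop over `response.css('h3')` and call `.xpath('normalize-space(.)').get()` on each heading.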

This can be done with the newspaper library:

import re
from newspaper import Article

urls = ['https://www.jaipurcityblog.com/9-iconic-famous-dishes-of-jaipur-that-you-have-to-try/',
        'https://www.adequatetravel.com/blog/famous-foods-in-jaipur-you-must-try/',
        'https://www.lih.travel/famous-foods-in-jaipur/',
        'https://www.holidify.com/pages/street-food-in-jaipur-1483.html']

for url in urls:
    site = Article(url)
    site.download()  # fetch the page
    site.parse()     # extract the main article text
    # Keep every fragment that looks like "<number>. <letters>..." up to the end of its line
    headings = re.findall(r'\d+\.\s*[a-zA-Z]+.*', site.text)
    print(headings)

Output:

['1. Dal Baati Churma', '2. Pyaaz Ki Kachori', '3. Gatte ki Sabji', '4. Mawa Kachori', '5. Kalakand', '6. Lassi', '7. Aam ki Launji', '8. Chokhani Kheer', '9. Mirchi   Vada']
['1. Keema Baati', '2. Pyaaz Kachori', '3. Dal Baati Churma', '4. Shrikhand', '5. Ghewar', '6. Mawa Kachori', '7. Mirchi Bada', '8. Gatte Ki Subzi', '9. Rajasthani Thali',     '10. Laal Maas']
['1. Rajasthani Thali (Plate) at Chokhi Dhani Village Resort', '2. Laal Maans at Handi', '3. Lassi at Lassiwala', '4. Anokhi Café for Penne Pasta & Cheese Cake', '5. Daal  Baluchi at Baluchi Restaurant', '6. Pyaz Kachori at Rawat', '7. Chicken Lollipop at Niro’s', '8. Hibiscus Ice Tea at Tapri', '9. Omelet at Sanjay Omelette', '1981. This    special egg eatery of Jaipur also treats some never tried before egg specialties. If you are an egg-fan with a sweet tooth, then this is your place. Slurp the “Egg Rabri”  of Sanjay Omelette and feel the heavenly juice of eggs in your mouth. Appreciate the good taste of egg in never before way with just a visit to “Sanjay Omelette”.', '10.   Paalak Paneer & Missi Roti at Sharma Dhabha']
["1. Golgappa at Chawla's and Nand's", '2. Pyaaz Kachori at Rawat Mishthan Bhandar', '3. Masala Chai at Gulab Ji Chaiwala', '4. Best of Indian Street Food at Masala    Chowk', '5. Kaathi Roll at Al Bake', "6. Pav Bhaji at Pandit's", "7. Omelette at Sanjay's", '8. Chicken Tikka at Sethi Bar-Be-Que', '9. Lassi at Lassiwala', '10. Shrikhand     at Falahaar', '11. Kulfi Faluda at Bapu Bazaar', '12. Sweets from Laxmi Mishthan Bhandar (LMB)', "13. Fast Food at Aunty's Cafe", '14. Cold Coffee at Gyan Vihar Dairy  (GVD)']
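The third list above contains a false positive ('1981. This special egg eatery ...'), because the pattern accepts any number followed by a dot anywhere in the text. A stricter pattern might avoid this. The sketch below assumes that real list headings start a line, use a one- or two-digit number, and are short; those limits, the `HEADING_RE` name, and the `numbered_headings` helper are assumptions for illustration, not part of the original answer.

```python
import re

# Assumptions: headings start a line, use a 1-2 digit number, and the title
# is at most 81 characters long. Tune these limits for other sites.
HEADING_RE = re.compile(r'^\s*(\d{1,2})\.\s*([A-Za-z][^\n]{0,80})$', re.MULTILINE)


def numbered_headings(text):
    """Return 'N. Title' strings for lines that look like numbered headings."""
    return [f"{num}. {' '.join(title.split())}"
            for num, title in HEADING_RE.findall(text)]


# Sample text built from fragments of the output shown above.
SAMPLE_TEXT = """9. Omelet at Sanjay Omelette
1981. This special egg eatery of Jaipur also treats some never tried before egg specialties.
10.   Paalak Paneer & Missi Roti at Sharma Dhabha
"""
print(numbered_headings(SAMPLE_TEXT))
```

The line starting with '1981.' is skipped because the number has more than two digits, and the extra spaces after '10.' are collapsed.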
