
How can I scrape data from multiple URLs and save the data in the same CSV file?

I am using BeautifulSoup to scrape the data. There are multiple URLs, and I have to save the data I scrape from these URLs in the same CSV file. When I scrape the pages separately and save to the same CSV file, only the data from the last URL I scraped ends up in the file. Below is the code I use to scrape the data.

import requests
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from time import sleep
from random import randint

headers = {"User-Agent": "Mozilla/5.0"}  # placeholder; define your own request headers

images = []
pages = np.arange(1, 2, 1)
for page in pages:
    url = "https://www.bkmkitap.com/sanat"
    results = requests.get(url, headers=headers)
    soup = BeautifulSoup(results.content, "html.parser")
    book_div = soup.find_all("div", class_="col col-12 drop-down hover lightBg")
    sleep(randint(2, 10))
    for bookSection in book_div:
        img_url = bookSection.find("img", class_="lazy stImage").get("data-src")
        images.append(img_url)

books = pd.DataFrame({"Image": images})
books.to_csv("bkm_art.csv", index=False, header=True, encoding="utf-8-sig")

Your question isn't very clear. When you run this, I assume a CSV gets created with all the image URLs, and you want to rerun the same script and have the new image URLs appended to the same CSV? If that is the case, then you only need to change the to_csv call to:

books.to_csv("bkm_art.csv", mode='a', index=False, header=False, encoding='utf-8-sig')

Adding mode='a' appends to the file instead of overwriting it (see the pandas to_csv documentation).
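One caveat worth noting: with mode='a' and header=False, the header row is never written, so a freshly created file ends up without column names. A minimal sketch of guarding against that, assuming the same bkm_art.csv file name, writes the header only when the file does not exist yet:

import os
import pandas as pd

out_file = "bkm_art.csv"
# Placeholder frame for illustration; in the real script this is the scraped data.
books = pd.DataFrame({"Image": ["https://example.com/cover.jpg"]})

# Write the header only on the very first run, then append without it.
write_header = not os.path.exists(out_file)
books.to_csv(out_file, mode="a", index=False, header=write_header, encoding="utf-8-sig")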

The main issue in your example is that you never request the second page, so you won't get those results - iterate over all the pages first and then create your CSV.

The second part, appending data to an existing file, was already figured out by @MB.

Note: Try to avoid selecting your elements by class, because classes are more dynamic than id attributes or the HTML structure.

Example

import requests
import pandas as pd
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0"}  # placeholder; define your own request headers

data = []

for page in range(1, 3, 1):
    url = f"https://www.bkmkitap.com/sanat?pg={page}"
    results = requests.get(url, headers=headers)
    soup = BeautifulSoup(results.content, "html.parser")
    
    for bookSection in soup.select('[id*="product-detail"]'):
        data.append({
            'image': bookSection.find("img", class_="lazy stImage").get('data-src')
        })
books = pd.DataFrame(data)

books.to_csv("bkm_art.csv", index=False, header=True, encoding="utf-8-sig")

Output

    image
0   https://cdn.bkmkitap.com/sanat-dunyamiz-190-ey...
1   https://cdn.bkmkitap.com/sanat-dunyamiz-189-te...
2   https://cdn.bkmkitap.com/tiyatro-gazetesi-sayi...
3   https://cdn.bkmkitap.com/mavi-gok-kultur-sanat...
4   https://cdn.bkmkitap.com/sanat-dunyamiz-iki-ay...
... ...
112 https://cdn.bkmkitap.com/hayal-perdesi-iki-ayl...
113 https://cdn.bkmkitap.com/cins-aylik-kultur-der...
114 https://cdn.bkmkitap.com/masa-dergisi-sayi-48-...
115 https://cdn.bkmkitap.com/istanbul-sanat-dergis...
116 https://cdn.bkmkitap.com/masa-dergisi-sayi-49-...
117 rows × 1 columns
import numpy as np

pages = np.arange(1, 2, 1)
for page in pages:
    print(page)

Try it, and you will find you only get 1, because np.arange(1, 2, 1) stops before the end value 2, so the loop only ever requests one page.

To actually cover pages 1 and 2, raise the stop value, for example:

pages = range(1, 3)
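A quick check of both expressions, assuming nothing beyond the standard library and numpy:

import numpy as np

print(list(np.arange(1, 2, 1)))  # [1] - stops before 2, only one page
print(list(range(1, 3)))         # [1, 2] - covers both pages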

You can use the requests module in Python to fetch and scrape the data, and after that use pandas to convert it into a CSV file.

https://www.tutorialspoint.com/requests/requests_web_scraping_using_requests.html

pandas.DataFrame.to_csv() can be used.
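A minimal sketch of that pipeline, reusing the site and img selector from the question (both carried over purely for illustration):

import requests
import pandas as pd
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0"}  # placeholder; define your own request headers

# Fetch one page with requests, parse it, then hand the rows to pandas.
response = requests.get("https://www.bkmkitap.com/sanat", headers=headers)
soup = BeautifulSoup(response.content, "html.parser")
rows = [{"Image": img.get("data-src")} for img in soup.find_all("img", class_="lazy stImage")]
pd.DataFrame(rows).to_csv("bkm_art.csv", index=False, encoding="utf-8-sig")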
