How can I scrape data from multiple URLs and save the data in the same CSV file?
I am using BeautifulSoup to scrape the data. There are multiple URLs, and I have to save the data I scrape from these URLs in the same CSV file. When I scrape each URL separately and save to the same CSV file, only the data from the last URL I scraped ends up in the CSV file. Below is the piece of code that I scrape the data with.
import requests
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from random import randint
from time import sleep

headers = {"User-Agent": "Mozilla/5.0"}  # request headers defined elsewhere in my script

images = []
pages = np.arange(1, 2, 1)
for page in pages:
    url = "https://www.bkmkitap.com/sanat"
    results = requests.get(url, headers=headers)
    soup = BeautifulSoup(results.content, "html.parser")
    book_div = soup.find_all("div", class_="col col-12 drop-down hover lightBg")
    sleep(randint(2, 10))
    for bookSection in book_div:
        img_url = bookSection.find("img", class_="lazy stImage").get('data-src')
        images.append(img_url)

books = pd.DataFrame({
    "Image": images,
})
books.to_csv("bkm_art.csv", index=False, header=True, encoding='utf-8-sig')
Your question isn't very clear. When you run this, I assume a CSV gets created with all the image URLs, and you want to rerun this same script and have other image URLs appended to the same CSV? If that is the case, then you only need to change the to_csv call to:
books.to_csv("bkm_art.csv", mode='a', index=False, header=False, encoding='utf-8-sig')
Adding mode='a' starts appending to the file instead of overwriting it ( doc ).
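One caveat with a fixed header=False: the very first run then produces a CSV with no header row at all. A common pattern (a sketch, reusing the same bkm_art.csv filename) is to write the header only when the file does not exist yet:

```python
import os
import pandas as pd

# Example rows standing in for the scraped image URLs.
books = pd.DataFrame({"Image": ["https://cdn.bkmkitap.com/example.jpg"]})

# Append to the CSV; emit the header row only on the first run,
# when the file does not exist yet.
file_exists = os.path.isfile("bkm_art.csv")
books.to_csv("bkm_art.csv", mode="a", index=False,
             header=not file_exists, encoding="utf-8-sig")
```

Rerunning the script then keeps appending rows under the single header written on the first run.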
The main issue in your example is that you do not get the second page, so you won't get those results - iterate over all of the pages and then create your CSV.
The second issue, appending data to an existing file, is figured out by @MB.
Note: Try to avoid selecting your elements by classes, because they are more dynamic than an id or the HTML structure.
import requests
import pandas as pd
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0"}  # headers as used in the question

data = []
for page in range(1, 3, 1):
    url = f"https://www.bkmkitap.com/sanat?pg={page}"
    results = requests.get(url, headers=headers)
    soup = BeautifulSoup(results.content, "html.parser")
    for bookSection in soup.select('[id*="product-detail"]'):
        data.append({
            'image': bookSection.find("img", class_="lazy stImage").get('data-src')
        })

books = pd.DataFrame(data)
books.to_csv("bkm_art.csv", index=False, header=True, encoding='utf-8-sig')
image
0 https://cdn.bkmkitap.com/sanat-dunyamiz-190-ey...
1 https://cdn.bkmkitap.com/sanat-dunyamiz-189-te...
2 https://cdn.bkmkitap.com/tiyatro-gazetesi-sayi...
3 https://cdn.bkmkitap.com/mavi-gok-kultur-sanat...
4 https://cdn.bkmkitap.com/sanat-dunyamiz-iki-ay...
... ...
112 https://cdn.bkmkitap.com/hayal-perdesi-iki-ayl...
113 https://cdn.bkmkitap.com/cins-aylik-kultur-der...
114 https://cdn.bkmkitap.com/masa-dergisi-sayi-48-...
115 https://cdn.bkmkitap.com/istanbul-sanat-dergis...
116 https://cdn.bkmkitap.com/masa-dergisi-sayi-49-...
117 rows × 1 columns
import numpy as np

pages = np.arange(1, 2, 1)
for page in pages:
    print(page)

Try it, and you will find you just get 1, because the stop value of np.arange is exclusive. You can use the built-in range instead, e.g. pages = range(1, 3), to cover pages 1 and 2.
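A quick check makes the exclusive stop value visible - both np.arange(1, 2, 1) and range(1, 2, 1) yield only page 1, and the stop must be raised to 3 to also reach page 2:

```python
import numpy as np

# The stop value is exclusive for both np.arange and range.
only_first = list(np.arange(1, 2, 1))   # only [1]
same_thing = list(range(1, 2, 1))       # still only [1]
both_pages = list(range(1, 3))          # [1, 2] - pages 1 and 2

print(only_first, same_thing, both_pages)
```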
You can use the requests module of Python to request and scrape the data, and after that you can use pandas to convert it into a CSV file.
https://www.tutorialspoint.com/requests/requests_web_scraping_using_requests.html
pandas.to_csv() can be used.