How to fix "cannot set a row with mismatched columns" error in pandas
I'm creating a web scraper for a project of mine. I'm scraping job postings from Indeed, and I'm able to get all the data that I need. Now I'm having a problem creating a DataFrame to save it to a CSV file.
I have searched for the error and tried many possible solutions, but I keep getting the same error. I'd appreciate any suggestions on the code or the error. Thank you.
ValueError: cannot set a row with mismatched columns
import requests
import bs4
from bs4 import BeautifulSoup
import pandas as pd
import time
max_results_per_city = 30
city_set = ['New+York','Chicago']
columns = ["city", "job_title", "company_name", "location", "summary"]
database = pd.DataFrame(columns = columns)
for city in city_set:
    for start in range(0, max_results_per_city, 10):
        page = requests.get('https://www.indeed.com/jobs?q=computer+science&l=' + str(city) + '&start=' + str(start))
        time.sleep(1)
        soup = BeautifulSoup(page.text, "lxml")
        for div in soup.find_all(name="div", attrs={"class":"row"}):
            num = (len(sample_df) + 1)
            job_post = []
            job_post.append(city)
            for a in div.find_all(name="a", attrs={"data-tn-element":"jobTitle"}):
                job_post.append(a["title"])
            company = div.find_all(name="span", attrs={"class":"company"})
            if len(company) > 0:
                for b in company:
                    job_post.append(b.text.strip())
            else:
                sec_try = div.find_all(name="span", attrs={"class":"result-link-source"})
                for span in sec_try:
                    job_post.append(span.text)
            c = div.findAll('div', attrs={'class': 'location'})
            for span in c:
                job_post.append(span.text)
            d = div.findAll('div', attrs={'class': 'summary'})
            for span in d:
                job_post.append(span.text.strip())
            database.loc[num] = job_post
database.to_csv("test.csv")
This issue is caused by the number of columns not matching the amount of data (for at least one row).
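A minimal reproduction of the mismatch, assuming an empty DataFrame with the same five columns as in the question. The row values are made up for illustration, and the exact wording of the message may vary between pandas versions:

```python
import pandas as pd

columns = ["city", "job_title", "company_name", "location", "summary"]
df = pd.DataFrame(columns=columns)

# Five values for five columns: the row is added without complaint.
df.loc[len(df)] = ["Chicago", "Developer", "Acme", "Chicago, IL", "Writes code"]

# Four values for five columns (the location is missing): pandas rejects the row.
short_row = ["Chicago", "Developer", "Acme", "Writes code"]
try:
    df.loc[len(df)] = short_row
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

In the scraper, the same thing happens whenever one of the find_all calls returns nothing for a posting, because job_post then ends up shorter than the column list.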
I see a number of issues. The big ones that pop out: where is 'sample_df' initialized, and where are you adding data to 'database'?
I'd restructure your code. job_post looks like your row-level list, so I would append it to a table-level list: at the end of each loop iteration, call
table.append(job_post)
instead of sample_df.loc[num] = job_post
Then, after your loop, you can call
pd.DataFrame(table, columns=columns)
A note: make sure you're adding None or "" when your scraper can't find data; otherwise your row length won't match your column length, which is what is causing your error.
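A rough sketch of that restructuring, with a hard-coded list standing in for the scraped postings. The pad_row helper and the sample rows are hypothetical and only there for illustration:

```python
import pandas as pd

columns = ["city", "job_title", "company_name", "location", "summary"]

def pad_row(values, width, fill=None):
    """Return `values` cut or padded with `fill` to exactly `width` items."""
    row = list(values)[:width]
    return row + [fill] * (width - len(row))

# Stand-in for scraped results; the second posting has no location or summary.
scraped = [
    ["Chicago", "Developer", "Acme", "Chicago, IL", "Writes code"],
    ["Chicago", "Analyst", "Initech"],
]

table = []  # table-level list: one entry per row
for job_post in scraped:
    table.append(pad_row(job_post, len(columns)))

# Build the DataFrame once, after the loop.
df = pd.DataFrame(table, columns=columns)
print(df.shape)
```

Padding at the end only keeps the row lengths consistent. If a field in the middle is missing, the later values shift one column to the left, so it is probably safer to add the placeholder at the point where each field is scraped.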
Reproducing your code, it was not extracting location, and the database assignment is indented in the wrong place. So, fix
c = div.findAll(name='span', attrs={'class': 'location'})
Here's a fix that makes it work:
database = []
for city in city_set:
    for start in range(0, max_results_per_city, 10):
        page = requests.get('https://www.indeed.com/jobs?q=computer+science&l=' + str(city) + '&start=' + str(start))
        time.sleep(1)
        soup = BeautifulSoup(page.text, "lxml")
        for div in soup.find_all(name="div", attrs={"class":"row"}):
            #num = (len(sample_df) + 1)
            job_post = []
            job_post.append(city)
            for a in div.find_all(name="a", attrs={"data-tn-element":"jobTitle"}):
                job_post.append(a["title"])
            company = div.find_all(name="span", attrs={"class":"company"})
            if len(company) > 0:
                for b in company:
                    job_post.append(b.text.strip())
            else:
                sec_try = div.find_all(name="span", attrs={"class":"result-link-source"})
                for span in sec_try:
                    job_post.append(span.text)
            c = div.findAll(name='span', attrs={'class': 'location'})
            for span in c:
                job_post.append(span.text)
            d = div.findAll('div', attrs={'class': 'summary'})
            for span in d:
                job_post.append(span.text.strip())
            database.append(job_post)
df00 = pd.DataFrame(database)
df00.shape
df00.columns = columns
df00.to_csv("test.csv", index=False)
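The loop above still appends zero or several values whenever a selector matches zero or several elements, so a row can still come out the wrong length. One possible guard, sketched here under assumptions, is to keep exactly one value per field and build each row as a dict. The first_or_none helper and the cards data are hypothetical stand-ins for the lists that find_all returns:

```python
import pandas as pd

columns = ["city", "job_title", "company_name", "location", "summary"]

def first_or_none(matches):
    """Return the first item of `matches`, or None when nothing was found."""
    return matches[0] if matches else None

# Hypothetical stand-in for per-posting find_all results (already as text).
cards = [
    {"job_title": ["Developer"], "company_name": ["Acme"],
     "location": ["Chicago, IL"], "summary": ["Writes code"]},
    {"job_title": ["Analyst"], "company_name": ["Initech"],
     "location": [], "summary": ["Reads reports"]},
]

rows = []
for card in cards:
    row = {"city": "Chicago"}
    for field in columns[1:]:
        # Exactly one value per field, even when the selector found nothing.
        row[field] = first_or_none(card[field])
    rows.append(row)

df = pd.DataFrame(rows, columns=columns)
print(df)
```

Because every row is keyed by column name, a missing location stays in the location column as an empty value instead of shifting the summary into its place.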