How to fix "cannot set a row with mismatched columns" error in pandas

I'm creating a web scraper for a project of mine, scraping job postings from Indeed. I'm able to get all the data that I need. Now I'm having a problem creating a DataFrame to save it to a CSV file.

I have searched for the error and tried many possible solutions, but I keep getting the same error. I'd appreciate any suggestions on the code or the error. Thank you.

ValueError: cannot set a row with mismatched columns

import requests
import bs4
from bs4 import BeautifulSoup

import pandas as pd
import time


max_results_per_city = 30

city_set = ['New+York','Chicago']
columns = ["city", "job_title", "company_name", "location", "summary"]

database = pd.DataFrame(columns = columns)

for city in city_set:
    for start in range(0, max_results_per_city, 10):
        page = requests.get('https://www.indeed.com/jobs?q=computer+science&l=' + str(city) + '&start=' + str(start))
        time.sleep(1)
        soup = BeautifulSoup(page.text, "lxml")
        for div in soup.find_all(name="div", attrs={"class":"row"}):
            num = (len(sample_df) + 1)
            job_post = []
            job_post.append(city)
            for a in div.find_all(name="a", attrs={"data-tn-element":"jobTitle"}):
                job_post.append(a["title"])
            company = div.find_all(name="span", attrs={"class":"company"})
            if len(company) > 0:
                for b in company:
                    job_post.append(b.text.strip())
            else:
                sec_try = div.find_all(name="span", attrs={"class":"result-link-source"})
                for span in sec_try:
                    job_post.append(span.text)
            
            c = div.findAll('div', attrs={'class': 'location'})
            for span in c:
                 job_post.append(span.text)
            d = div.findAll('div', attrs={'class': 'summary'})
            for span in d:
                job_post.append(span.text.strip())
            database.loc[num] = job_post
            database.to_csv("test.csv")


This issue is caused by the number of columns not matching the amount of data in at least one row.
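To see why, here is a minimal sketch (with made-up column names and values) that reproduces the error: assigning a list to .loc whose length differs from the DataFrame's number of columns raises exactly this ValueError.

import pandas as pd

df = pd.DataFrame(columns=["a", "b", "c"])
df.loc[0] = [1, 2, 3]  # OK: three values for three columns
df.loc[1] = [1, 2]     # ValueError: cannot set a row with mismatched columns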

I see a number of issues. The big ones that pop out: where is sample_df initialized, and where are you adding data to database?

I'd restructure your code: job_post looks like your row-level list, so I would append it to a table-level list. At the end of each loop, call table.append(job_post) instead of sample_df.loc[num] = job_post.

Then, after your loop, you can call DataFrame(table, columns=columns).

A note: make sure you're appending None or "" when your scraper can't find data; otherwise your row length won't match your column length, which is what is causing your error.
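A minimal sketch of that restructuring, reusing soup, city, and columns from the question (table is a placeholder name, not from the original code; div.find returns the first match or None, which keeps every row the same length):

table = []  # table-level list; one job_post row per posting

for div in soup.find_all("div", attrs={"class": "row"}):
    job_post = [city]
    a = div.find("a", attrs={"data-tn-element": "jobTitle"})
    job_post.append(a["title"] if a else None)
    company = div.find("span", attrs={"class": "company"})
    job_post.append(company.text.strip() if company else None)
    location = div.find("span", attrs={"class": "location"})
    job_post.append(location.text if location else None)
    summary = div.find("div", attrs={"class": "summary"})
    job_post.append(summary.text.strip() if summary else None)
    table.append(job_post)  # append the whole row, not DataFrame.loc

database = pd.DataFrame(table, columns=columns)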

Reproducing your code, it was not extracting location, and the database line is indented in the wrong place. So, fix c = div.findAll(name='span', attrs={'class': 'location'}). Here's a fix that makes it work:

database = []

for city in city_set:
    for start in range(0, max_results_per_city, 10):
        page = requests.get('https://www.indeed.com/jobs?q=computer+science&l=' + str(city) + '&start=' + str(start))
        time.sleep(1)
        soup = BeautifulSoup(page.text, "lxml")
        for div in soup.find_all(name="div", attrs={"class":"row"}):
            #num = (len(sample_df) + 1)
            job_post = []
            job_post.append(city)
            for a in div.find_all(name="a", attrs={"data-tn-element":"jobTitle"}):
                job_post.append(a["title"])
            company = div.find_all(name="span", attrs={"class":"company"})
            if len(company) > 0:
                for b in company:
                    job_post.append(b.text.strip())
            else:
                sec_try = div.find_all(name="span", attrs={"class":"result-link-source"})
                for span in sec_try:
                    job_post.append(span.text)

            # location lives in a span, not a div
            c = div.findAll(name='span', attrs={'class': 'location'})
            for span in c:
                job_post.append(span.text)
            d = div.findAll('div', attrs={'class': 'summary'})
            for span in d:
                job_post.append(span.text.strip())
            # append inside the div loop, so every posting becomes a row
            database.append(job_post)

df00 = pd.DataFrame(database)
df00.shape

df00.columns = columns
df00.to_csv("test.csv", index=False)
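One caveat with building the DataFrame this way: pd.DataFrame(database) infers the column count from the longest row, so if any scraped row came up short, the df00.columns = columns assignment will fail with a length mismatch. A minimal sketch that pads short rows first (padded is a new name, not from the answer):

# Pad short rows with None so every row has exactly len(columns) fields
padded = [row + [None] * (len(columns) - len(row)) for row in database]
df00 = pd.DataFrame(padded, columns=columns)
df00.to_csv("test.csv", index=False)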
