
AssertionError when converting a list to dataframe in pandas

I'm trying to store some data I've scraped from an API in a dataframe, then write it out to a .csv. This usually works, but the script sometimes breaks with this error message:

AssertionError: 16 columns passed, passed data had 17 columns

Anyone know what's going on here? Code is below -- it breaks after "pass one".

from psaw import PushshiftAPI
import datetime as dt
import pandas as pd

api = PushshiftAPI()
start_epoch=int(dt.datetime(2018, 6,2).timestamp())
end_epoch=int(dt.datetime(2018, 12, 31).timestamp())

subreddit = input('Which subreddit would you like to scrape? ')

submission_results = list(api.search_submissions(after=start_epoch,
                                                 before=end_epoch,
                                                 subreddit=subreddit,
                                                 filter=['id', 'title', 'subreddit', 'num_comments',
                                                         'score', 'author', 'is_original content',
                                                         'is_self', 'stickied', 'selftext',
                                                         'created_utc', 'locked', 'over_18',
                                                         'permalink', 'upvote_ratio', 'url'],
                                                 limit=None))

print ('pass one')

submission_results_df = pd.DataFrame(submission_results)
print ('pass two')
submission_results_df.fillna('NULL')
print('pass three')
submission_results_df.to_csv('D:/CAMER/%s_Submittisons-%s-%s.csv'.format(start_epoch, end_epoch) %(subreddit, start_epoch, end_epoch))

I believe the most likely explanation is that the submissions returned from the query don't all have the same number of fields, and the way you are constructing the dataframe cannot handle this. I'm going to suggest two options to work around it, then I'll explain in more detail what I think is happening.

Option 1: convert to dicts

You could convert each namedtuple record into a dictionary. This should be safer because pandas then won't assume that every record has the same set of fields in the same order. If some records have an extra field, pandas will create a column for it and fill it with NaN for all the other records.

submission_results_df = pd.DataFrame(result._asdict() for result in submission_results)
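
Every namedtuple instance provides an _asdict() method, and when pandas builds a frame from a sequence of dicts it takes the union of the keys across all records, so an occasional extra field simply becomes an extra column rather than an error.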

Option 2: use the psaw CLI instead

I note that the psaw library you are using has a command-line interface which can save directly to JSON or CSV. Perhaps this would avoid your difficulties if you are in fact only using pandas to convert the data to CSV.


Explanation

I haven't directly reproduced the problem using the data from Reddit, but I can explain what appears to be happening here. submission_results contains a list of namedtuples, created in _wrap_thing. (I previously mis-read the source code and thought these were instances of praw.models.reddit.submission, but that is only the case if you have provided a Reddit API object during construction.)

The error message "AssertionError: 16 columns passed, passed data had 17 columns" appears to come from pandas' _validate_or_indexify_columns and indicates that it expected 16 columns but received data for 17 columns. I'm not 100% clear which code path it took to get here, but I include below an example that triggers the same error using namedtuple.

I think it's not a great idea to pass a list of objects directly into the DataFrame constructor. The constructor can interpret data in a number of different formats, including some that don't seem to be clearly documented. When it gets a list of namedtuples, it uses the first namedtuple to determine the field names and then converts each item into a list to extract the fields. If that is what is happening, then somewhere in your data at least one of the objects has 17 fields instead of 16. I have no idea whether psaw makes any particular guarantee that all objects will have the same number of fields, or even that the fields will appear in the same order when they are the same.


Here is a reproduction of the same error message, using plain namedtuples instead of the psaw results:

from collections import namedtuple
from pandas import DataFrame

RGB = namedtuple('RGB', 'red green blue')
RGBA = namedtuple('RGBA', 'red green blue alpha')

# This works:
d_okay = DataFrame([RGB(1,2,3),RGB(4,5,6)])

# This fails:
d_bad = DataFrame([RGB(1,2,3),RGB(4,5,6),RGBA(7,8,9,0)])
Traceback (most recent call last):
  File "/home/annette/anaconda3/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 497, in _list_to_arrays
    content, columns, dtype=dtype, coerce_float=coerce_float
  File "/home/annette/anaconda3/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 581, in _convert_object_array
    f"{len(columns)} columns passed, passed data had "
AssertionError: 3 columns passed, passed data had 4 columns

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "repro.py", line 11, in <module>
    d_bad = DataFrame([RGB(1,2,3),RGB(4,5,6),RGBA(7,8,9,0)])
  File "/home/annette/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py", line 474, in __init__
    arrays, columns = to_arrays(data, columns, dtype=dtype)
  File "/home/annette/anaconda3/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 461, in to_arrays
    return _list_to_arrays(data, columns, coerce_float=coerce_float, dtype=dtype)
  File "/home/annette/anaconda3/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 500, in _list_to_arrays
    raise ValueError(e) from e
ValueError: 3 columns passed, passed data had 4 columns
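
For comparison, here is a minimal sketch of Option 1 applied to the same mixed records (reusing the illustrative RGB/RGBA namedtuples defined above): converting each record with _asdict() before building the frame avoids the assertion, and pandas fills the missing alpha values with NaN.

# Option 1 applied to the mixed records above: convert each namedtuple
# to a dict so pandas takes the union of keys instead of assuming a
# fixed column count.
d_fixed = DataFrame([t._asdict() for t in [RGB(1, 2, 3), RGB(4, 5, 6), RGBA(7, 8, 9, 0)]])
print(d_fixed.columns.tolist())   # ['red', 'green', 'blue', 'alpha']
print(d_fixed['alpha'].tolist())  # [nan, nan, 0.0]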
