Read CSV into a DataFrame with varying row lengths using Pandas

So I have a CSV that looks a bit like this:

1 | 01-01-2019 | 724
2 | 01-01-2019 | 233 | 436
3 | 01-01-2019 | 345
4 | 01-01-2019 | 803 | 933 | 943 | 923 | 954
5 | 01-01-2019 | 454
...

And when I try to use the following code to generate a DataFrame:

df = pd.read_csv('data.csv', header=0, engine='c', error_bad_lines=False)

It only adds rows with 3 columns to the df (rows 1, 3 and 5 from above).

The rest are considered 'bad lines', giving me the following message:

Skipping line 17467: expected 3 fields, saw 9

How do I create a DataFrame that includes all the data in my CSV, possibly just filling in the empty cells with null? Or do I have to declare the maximum row length before adding to the df?

Thanks!

If using only pandas, read in whole lines first and deal with the separator afterwards.

import pandas as pd

df = pd.read_csv('data.csv', header=None, sep='\n')
# Raw string so the regex escapes are not treated as Python string escapes.
df = df[0].str.split(r'\s\|\s', expand=True)

   0           1    2     3     4     5     6
0  1  01-01-2019  724  None  None  None  None
1  2  01-01-2019  233   436  None  None  None
2  3  01-01-2019  345  None  None  None  None
3  4  01-01-2019  803   933   943   923   954
4  5  01-01-2019  454  None  None  None  None
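Some newer pandas releases may reject `sep='\n'` in `read_csv` with a `ValueError`, so the first line of the snippet above might not run as written. A minimal fallback sketch, assuming the file fits in memory, reads the raw lines with plain Python and then splits them with pandas. The inline `raw` string is only a stand-in for the contents of `data.csv`, and the pattern `\s*\|\s*` is a slightly more tolerant variant of the one above.

```python
import io
import pandas as pd

# Stand-in for the contents of data.csv (assumption: pipe-delimited, padded with spaces).
raw = """1 | 01-01-2019 | 724
2 | 01-01-2019 | 233 | 436
3 | 01-01-2019 | 345
4 | 01-01-2019 | 803 | 933 | 943 | 923 | 954
5 | 01-01-2019 | 454"""

# With a real file this would be: with open('data.csv') as f: ...
with io.StringIO(raw) as f:
    lines = pd.Series(f.read().splitlines())

# Split every line on the pipe and any surrounding whitespace;
# rows with fewer fields are padded with missing values on the right.
df = lines.str.split(r'\s*\|\s*', expand=True)
print(df.shape)
```

All values come back as strings here, so numeric columns would still need an explicit conversion such as `pd.to_numeric`.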

If you know that the data contains N columns, you can tell Pandas in advance how many columns to expect via the names parameter:

import pandas as pd
df = pd.read_csv('data', delimiter='|', names=list(range(7)))
print(df)

yields

   0             1    2      3      4      5      6
0  1   01-01-2019   724    NaN    NaN    NaN    NaN
1  2   01-01-2019   233  436.0    NaN    NaN    NaN
2  3   01-01-2019   345    NaN    NaN    NaN    NaN
3  4   01-01-2019   803  933.0  943.0  923.0  954.0
4  5   01-01-2019   454    NaN    NaN    NaN    NaN
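If N is not known ahead of time, one option is a quick first pass over the file to find the widest row. The sketch below assumes no field contains a quoted `|`, since it simply counts delimiters; `raw` and `n_cols` are illustrative names, and the inline string stands in for the real file.

```python
import io
import pandas as pd

# Stand-in for the contents of the data file.
raw = """1 | 01-01-2019 | 724
2 | 01-01-2019 | 233 | 436
3 | 01-01-2019 | 345
4 | 01-01-2019 | 803 | 933 | 943 | 923 | 954
5 | 01-01-2019 | 454"""

# First pass: the widest row has (number of delimiters + 1) fields.
n_cols = max(line.count('|') + 1 for line in raw.splitlines())

# Second pass: with explicit names, pandas pads shorter rows with NaN.
df = pd.read_csv(io.StringIO(raw), delimiter='|', header=None,
                 names=list(range(n_cols)))
print(n_cols, df.shape)
```

This reads the file twice, which is usually acceptable for files that fit comfortably on disk but may be slow for very large inputs.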

If you have an upper limit, N, on the number of columns, then you can have Pandas read N columns and then use dropna to drop completely empty columns:

import pandas as pd
df = pd.read_csv('data', delimiter='|', names=list(range(20))).dropna(axis='columns', how='all')
print(df)

yields

   0             1    2      3      4      5      6
0  1   01-01-2019   724    NaN    NaN    NaN    NaN
1  2   01-01-2019   233  436.0    NaN    NaN    NaN
2  3   01-01-2019   345    NaN    NaN    NaN    NaN
3  4   01-01-2019   803  933.0  943.0  923.0  954.0
4  5   01-01-2019   454    NaN    NaN    NaN    NaN

Note that this drops any completely empty column, including one in the middle of the data set, not just the padding columns on the right-hand side.
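If an empty column in the middle has to survive, a possible alternative is to trim only the trailing padding. This is a sketch rather than part of the original answer: the sample data, `has_data`, `last` and `trimmed` are made up for illustration, with column 1 deliberately left empty.

```python
import io
import pandas as pd

# Hypothetical sample where the second field is empty on every row.
raw = "1||724\n2||233|436\n3||345"

df = pd.read_csv(io.StringIO(raw), delimiter='|', header=None,
                 names=list(range(20)))

# Boolean Series: True for each column holding at least one value.
has_data = df.notna().any(axis=0)

# Label of the right-most column that still contains data.
last = has_data[has_data].index.max()

# .loc slices by label and includes the end label, so columns 0..last are kept.
trimmed = df.loc[:, :last]
print(list(trimmed.columns))
```

Here `dropna(axis='columns', how='all')` would keep three columns, while the slice keeps four, including the empty one at position 1.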

Reading the file as fixed width should work:

from io import StringIO
import pandas as pd

s = '''1  01-01-2019  724
2  01-01-2019  233  436
3  01-01-2019  345
4  01-01-2019  803  933  943  923  954
5  01-01-2019  454'''


pd.read_fwf(StringIO(s), header=None)

   0           1    2      3      4      5      6
0  1  01-01-2019  724    NaN    NaN    NaN    NaN
1  2  01-01-2019  233  436.0    NaN    NaN    NaN
2  3  01-01-2019  345    NaN    NaN    NaN    NaN
3  4  01-01-2019  803  933.0  943.0  923.0  954.0
4  5  01-01-2019  454    NaN    NaN    NaN    NaN

or with a delimiter param

s = '''1 | 01-01-2019 | 724
2 | 01-01-2019 | 233 | 436
3 | 01-01-2019 | 345
4 | 01-01-2019 | 803 | 933 | 943 | 923 | 954
5 | 01-01-2019 | 454'''


pd.read_fwf(StringIO(s), header=None, delimiter='|')

   0             1    2      3      4      5      6
0  1   01-01-2019   724    NaN    NaN    NaN    NaN
1  2   01-01-2019   233  436.0    NaN    NaN    NaN
2  3   01-01-2019   345    NaN    NaN    NaN    NaN
3  4   01-01-2019   803  933.0  943.0  923.0  954.0
4  5   01-01-2019   454    NaN    NaN    NaN    NaN

Note that for your actual file you will not use StringIO; just replace it with your file path: pd.read_fwf('data.csv', delimiter='|', header=None)

Add extra columns (empty or otherwise) to the top of your CSV file. Pandas takes the first row as the default width, and any shorter row below it is padded with NaN values. Example:

file.csv:

a,b,c,d,e
1,2,3
3
2,3,4

code:

>>> import pandas as pd
>>> pd.read_csv('file.csv')
   a    b    c   d   e
0  1  2.0  3.0 NaN NaN
1  3  NaN  NaN NaN NaN
2  2  3.0  4.0 NaN NaN

Consider using Python's csv module to do the heavy lifting of importing the data and cleaning up its format. You can implement a custom dialect to handle varying CSV flavors.

import csv
import pandas as pd

csv_data = """1 | 01-01-2019 | 724
2 | 01-01-2019 | 233 | 436
3 | 01-01-2019 | 345
4 | 01-01-2019 | 803 | 933 | 943 | 923 | 954
5 | 01-01-2019 | 454"""

with open('test1.csv', 'w') as f:
    f.write(csv_data)

csv.register_dialect('PipeDialect', delimiter='|')
with open('test1.csv') as csvfile:
    data = [row for row in csv.reader(csvfile, 'PipeDialect')]
df = pd.DataFrame(data = data)

This gives you a csv import dialect and the following DataFrame:

    0             1      2      3      4      5     6
0  1    01-01-2019     724   None   None   None  None
1  2    01-01-2019    233     436   None   None  None
2  3    01-01-2019     345   None   None   None  None
3  4    01-01-2019    803    933    943    923    954
4  5    01-01-2019     454   None   None   None  None

Handling the whitespace padding in the input file is left as an exercise.
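One way to approach that exercise, offered as a sketch and not as the original author's solution, is to strip each cell while reading. The inline string again stands in for `test1.csv`, and `rows` is an illustrative name.

```python
import csv
import io
import pandas as pd

# Stand-in for the contents of test1.csv.
csv_data = """1 | 01-01-2019 | 724
2 | 01-01-2019 | 233 | 436
3 | 01-01-2019 | 345
4 | 01-01-2019 | 803 | 933 | 943 | 923 | 954
5 | 01-01-2019 | 454"""

reader = csv.reader(io.StringIO(csv_data), delimiter='|')

# Strip the padding around every cell as each row is read.
rows = [[cell.strip() for cell in row] for row in reader]

# Rows of unequal length are padded with missing values by the constructor.
df = pd.DataFrame(rows)
print(df.shape)
```

The cells are still strings at this point; converting selected columns with `pd.to_numeric` or `pd.to_datetime` would be a separate step.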

import pandas as pd

colnames = [str(i) for i in range(9)]
df = pd.read_table('data.csv', header=None, sep=',', names=colnames)

If the code still reports an error like the one below, change the 9 in colnames to the number x shown in the message:

Skipping line 17467: expected 3 fields, saw x
