Python CSV Reader

I have a CSV from a system that puts a load of rubbish at the top of the file, so the header row is around row 5, or could even be row 14, depending on the gibberish the report puts out.

I used to use:

idx = next(idx for idx, row in enumerate(csvreader) if len(row) > 2)

to skip past the rows that had fewer than two columns. When it hit the column headers, of which there are 12, it would stop, and I could then pass idx as skiprows when reading the CSV file.

The system has had an update, and someone thought it would be a good idea to make the CSV file valid by adding 11 blank commas after the gibberish so that the column count matches the header.

So now I have a CSV like:

sadjfhasdkljfhasd,,,,,,,,,,
dsfasdgasfg,,,,,,,,,,
time,date,code,product 

etc.

I tried:

idx = next(idx for idx, row in enumerate(csvreader) if row in (None, "") > 2)

but I think that's a Pandas thing, and it just fails.

Any ideas on how I can get to my header row?

CODE:

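import csv
from tkinter.filedialog import askopenfilename

import pandas as pd

lmf = askopenfilename(filetypes=(("CSV Files", ".csv"), ("All Files", "*.*")))

# Section gets row number where headers start
with open(lmf, 'r', newline='') as fin:
    csvreader = csv.reader(fin)
    print(csvreader)  # debugging aid
    input('hold')     # debugging aid
    idx = next(idx for idx, row in enumerate(csvreader) if len(row) > 2)

# Reopens file, passing the header row number to skiprows
lmkcsv = pd.read_csv(lmf, skiprows=idx)
# lm is assumed to be a DataFrame defined earlier; DataFrame.append was removed in pandas 2.0
lm = pd.concat([lm, lmkcsv])
print(lm)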

Since your CSV is now a valid file and you just want to filter out the leading rows that do not have a certain number of filled columns, you can do that directly in pandas.

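import pandas as pd

minimum_cols_required = 3
# header=None keeps the first junk line from being used as column names
lmkcsv = pd.read_csv(lmf, header=None)
# dropna returns a new frame; assigning the result of inplace=True would give None
lmkcsv = lmkcsv.dropna(thresh=minimum_cols_required)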
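One thing the snippet above leaves open is that the real header survives as an ordinary data row. The following is a minimal sketch of one way to finish the job, assuming the junk lines carry fewer than three non-empty values. The sample text and the `qty` column are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical sample mimicking the report: two junk lines padded with commas,
# then the real header and two data rows
raw = (
    "sadjfhasdkljfhasd,,,,\n"
    "dsfasdgasfg,,,,\n"
    "time,date,code,product,qty\n"
    "09:00,2020-01-01,A1,widget,3\n"
    "10:30,2020-01-02,B2,gadget,5\n"
)

minimum_cols_required = 3

# header=None stops pandas from treating the first junk line as column names
df = pd.read_csv(io.StringIO(raw), header=None)

# Keep only rows with at least `minimum_cols_required` non-empty values
df = df.dropna(thresh=minimum_cols_required).reset_index(drop=True)

# Promote the first surviving row (the real header) to column names
df.columns = df.iloc[0]
df = df.iloc[1:].reset_index(drop=True)
df.columns.name = None
```

Because the junk, header, and data all share the same columns while parsing, every value comes through as a string here, so numeric columns would still need a conversion such as `pd.to_numeric` afterwards.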

If your CSV data also contains a lot of empty values that get caught by this threshold, then just slightly modify your code:

idx = next(idx for idx, row in enumerate(csvreader) if len(set(row)) > 3)
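To see why this works on the padded file, here is a small self-contained sketch. The sample text is hypothetical and is read from memory with `io.StringIO` instead of a file on disk.

```python
import csv
import io

import pandas as pd

# Hypothetical sample: junk lines padded with commas, then header and data
raw = (
    "sadjfhasdkljfhasd,,,,\n"
    "dsfasdgasfg,,,,\n"
    "time,date,code,product,qty\n"
    "09:00,2020-01-01,A1,widget,3\n"
)

# A padded junk row collapses to {'junk text', ''}, i.e. 2 distinct values,
# while a header of unique names keeps one distinct value per column
reader = csv.reader(io.StringIO(raw))
idx = next(i for i, row in enumerate(reader) if len(set(row)) > 3)

# Skip everything before the detected header row
df = pd.read_csv(io.StringIO(raw), skiprows=idx)
```

With a real file, the same `idx` can be passed to `pd.read_csv(lmf, skiprows=idx)` as in the question.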

I'm not sure in what case a None would be returned (csv.reader normally yields empty fields as ''), so set(row) should do. If your headers happen to contain duplicates as well, do this:

from collections import Counter
# ...
idx = next(idx for idx, row in enumerate(csvreader) if len(row) - Counter(row)[''] > 2)
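The difference matters when header names repeat: a header such as `id,value,value,value,note` has only three distinct values, so the `set(row)` test would skip it and stop on the first data row instead. Below is a sketch of the counting approach wrapped in a helper. The name `find_header_row` and its `min_filled` parameter are hypothetical, and a default is passed to `next()` so a file with no matching row returns None rather than raising StopIteration.

```python
import csv
import io
from collections import Counter


def find_header_row(lines, min_filled=3):
    """Return the index of the first row with at least `min_filled`
    non-empty cells, or None if no row qualifies (hypothetical helper)."""
    reader = csv.reader(lines)
    return next(
        (i for i, row in enumerate(reader)
         if len(row) - Counter(row)[""] >= min_filled),
        None,  # default, so next() does not raise StopIteration
    )


# Hypothetical sample with duplicated header names
raw = "junk,,,,\nmore junk,,,,\nid,value,value,value,note\n1,2,3,4,x\n"
header_idx = find_header_row(io.StringIO(raw))

# A file made only of junk yields None instead of an exception
no_header = find_header_row(io.StringIO("junk,,,,\n"))
```

`min_filled=3` mirrors the `> 2` comparison above; pick whatever threshold separates your junk lines from the real header.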

How about erasing the starting lines with some logic, such as checking how many ',' characters exist or looking for a particular word? Something like:

# Rewrite the file in place, dropping every line that contains the junk marker
with open("target.txt", "r+") as f:
    lines = f.readlines()
    f.seek(0)
    for line in lines:
        if "sadjfhasdkljfhasd" not in line:
            f.write(line)
    f.truncate()

After that, read the file normally.
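If you would rather not overwrite the original report, the same idea can be applied in memory. This is only a sketch: `strip_preamble` and `min_filled` are made-up names, the sample text is hypothetical, and it assumes every record sits on a single line (no quoted line breaks in the junk).

```python
import csv
import io

import pandas as pd


def strip_preamble(text, min_filled=3):
    """Return `text` without the leading lines that have fewer than
    `min_filled` non-empty fields (hypothetical helper)."""
    lines = text.splitlines(keepends=True)
    for i, row in enumerate(csv.reader(lines)):
        if sum(1 for cell in row if cell.strip()) >= min_filled:
            return "".join(lines[i:])
    return ""  # no line looked like a header


# Hypothetical sample: junk lines padded with commas, then header and data
raw = (
    "sadjfhasdkljfhasd,,,,\n"
    "dsfasdgasfg,,,,\n"
    "time,date,code,product,qty\n"
    "09:00,2020-01-01,A1,widget,3\n"
)

cleaned = strip_preamble(raw)
df = pd.read_csv(io.StringIO(cleaned))
```

For a file on disk, read its contents with `open(path, newline="").read()`, pass the text through the helper, and hand the result to `pd.read_csv` via `io.StringIO`; the source file stays untouched.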
