
May I use either tab or comma as delimiter when reading from pandas csv?

I have CSV files. Some are comma delimited, and some are tab delimited.

df = pd.read_csv(data_file, sep='\t')

Is there a way to specify either tab or comma as the delimiter when using pd.read_csv()? Or is there a way to automatically detect whether the file is tab or comma delimited? If I know that, I can pass a different sep parameter when reading the file.

Recently I had a similar problem. I ended up using a different method, but I explored using this class.

You can use the csv module from the standard library, specifically its Sniffer class.

From the documentation:

"Sniffs" the format of a CSV file (i.e. delimiter, quotechar). Returns a Dialect object.

You can take the returned Dialect object and pass dialect.delimiter to the sep argument of pd.read_csv.

'text_a.csv'

cola|colb|col
A|B|C
E|F|G
A|B|C
E|F|G

'text_b.csv'

cola\tcolb\tcol
A\tB\tC
E\tF\tG
A\tB\tC
E\tF\tG
A\tB\tC

from csv import Sniffer
from itertools import islice

sniffer = Sniffer()

def detect_delim(file, num_rows, sniffer):
    # Sniff the first num_rows lines as one sample; sniffing line by line
    # would keep only the result for the last line read.
    with open(file, 'r') as f:
        sample = ''.join(islice(f, num_rows))
    return sniffer.sniff(sample).delimiter

detect_delim(file='text_a.csv', num_rows=5, sniffer=sniffer)
'|'
detect_delim(file='text_b.csv', num_rows=5, sniffer=sniffer)
'\t'
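Since the question only involves commas and tabs, the Sniffer can be limited to those two candidates through the delimiters argument of sniff(). The following is a minimal sketch, not a definitive implementation: the helper name sniff_delimiter, the 4096-character sample size, and the fallback default are all assumptions made for illustration.

```python
import csv

def sniff_delimiter(path, candidates=",\t", sample_chars=4096, default=","):
    """Return the delimiter csv.Sniffer picks among `candidates`.

    Hypothetical helper: falls back to `default` when the Sniffer
    cannot decide (it raises csv.Error in that case).
    """
    with open(path, "r") as f:
        sample = f.read(sample_chars)
    try:
        return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter
    except csv.Error:
        return default
```

The result can then be handed to pandas, for example pd.read_csv(data_file, sep=sniff_delimiter(data_file)). Restricting the candidates keeps the Sniffer from choosing some other character that happens to occur regularly in the data.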

I'd just read the first row and see which separator gives you more columns:

import pandas as pd
tab = pd.read_csv(data_file, nrows=1, sep='\t').shape[1]
com = pd.read_csv(data_file, nrows=1, sep=',').shape[1]
if tab > com:
    df = pd.read_csv(data_file, sep='\t')
else:
    df = pd.read_csv(data_file, sep=',')

Is this useful? You can use the Python regex parser with read_csv and specify different delimiters.
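As a rough sketch of the regex approach: read_csv treats a multi-character sep as a regular expression when the Python parsing engine is used, so a character class can match either delimiter. The function name read_comma_or_tab and the in-memory sample data are assumptions for illustration only.

```python
import io

import pandas as pd

def read_comma_or_tab(source):
    # A regex separator requires the slower Python engine.
    # The character class matches a comma or a tab.
    return pd.read_csv(source, sep=r"[,\t]", engine="python")

# Hypothetical in-memory samples standing in for the two kinds of file.
comma_df = read_comma_or_tab(io.StringIO("a,b,c\n1,2,3\n"))
tab_df = read_comma_or_tab(io.StringIO("a\tb\tc\n1\t2\t3\n"))
print(comma_df.equals(tab_df))
```

One caveat: this splits on every comma and every tab, so a tab-delimited file whose fields contain commas (or the reverse) will be parsed into the wrong columns, and regex separators tend not to respect quoted fields. Detecting the delimiter first is safer for such data.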

Ask the user to specify how the file is formatted if you don't expect to be able to determine it from the file contents.

E.g. a flag of some sort, such as --tab-delimited-file=true, and then you flip the separator based on their input.
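One possible way to wire up such a flag is with argparse from the standard library. This is only a sketch: the positional data_file argument and the function names are assumptions, and the flag is written as a plain switch (--tab-delimited-file) rather than the =true form mentioned above.

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(description="Load a delimited file.")
    parser.add_argument("data_file", help="path to the file to read")
    parser.add_argument(
        "--tab-delimited-file",
        action="store_true",
        help="treat the input as tab delimited instead of comma delimited",
    )
    return parser

def separator_from_args(args):
    # argparse exposes --tab-delimited-file as args.tab_delimited_file
    return "\t" if args.tab_delimited_file else ","

# Example invocation with an explicit argument list (hypothetical file name).
args = build_parser().parse_args(["data.tsv", "--tab-delimited-file"])
sep = separator_from_args(args)
print(repr(sep))
```

The chosen separator would then be passed along as pd.read_csv(args.data_file, sep=sep).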
