Can I use either tab or comma as the delimiter when reading a CSV with pandas?
I have CSV files. Some are comma-delimited, and some are tab-delimited.
df = pd.read_csv(data_file, sep='\t')
Is there a way to specify either tab or comma as the delimiter when using pd.read_csv()? Or is there a way to automatically detect whether the file is tab- or comma-delimited? If I knew that, I could pass a different sep argument when reading the file.
Recently I had a similar problem. I ended up using a different method, but I explored using this class.
You can use the csv module from the standard library, specifically its Sniffer class. From the documentation:
"Sniffs" the format of a CSV file (i.e. delimiter, quotechar). Returns a Dialect object.
You can take the returned dialect object and pass dialect.delimiter to the sep argument of pd.read_csv.
'text_a.csv'
cola|colb|col
A|B|C
E|F|G
A|B|C
E|F|G
'text_b.csv'
cola\tcolb\tcol
A\tB\tC
E\tF\tG
A\tB\tC
E\tF\tG
A\tB\tC
from csv import Sniffer
from itertools import islice

sniffer = Sniffer()

def detect_delim(file, num_rows, sniffer):
    # Sniff the first num_rows lines as one sample and return the Dialect object.
    with open(file, 'r') as f:
        sample = ''.join(islice(f, num_rows))
    return sniffer.sniff(sample)

detect_delim(file='text_a.csv', num_rows=5, sniffer=sniffer).delimiter
# '|'
detect_delim(file='text_b.csv', num_rows=5, sniffer=sniffer).delimiter
# '\t'
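To connect the Sniffer idea to pandas, here is a minimal sketch that sniffs a sample and hands the detected delimiter to pd.read_csv. The helper name read_with_sniffed_delimiter, the candidates default, and the in-memory sample strings are assumptions made for illustration; with a real file you would read the first few lines from disk instead.

```python
import csv
import io

import pandas as pd


def read_with_sniffed_delimiter(text, candidates=",\t|;"):
    """Guess the delimiter with csv.Sniffer, then parse the text with pandas.

    Hypothetical helper: `text` is the file content as a string, and
    `candidates` limits which characters the sniffer may pick.
    """
    dialect = csv.Sniffer().sniff(text, delimiters=candidates)
    return pd.read_csv(io.StringIO(text), sep=dialect.delimiter)


# Small made-up samples standing in for a comma file and a tab file.
comma_text = "cola,colb,colc\nA,B,C\nE,F,G\n"
tab_text = "cola\tcolb\tcolc\nA\tB\tC\nE\tF\tG\n"

df_comma = read_with_sniffed_delimiter(comma_text)
df_tab = read_with_sniffed_delimiter(tab_text)
print(df_comma.shape, df_tab.shape)
```

The pandas documentation also describes a related shortcut: with sep=None and engine="python", read_csv detects the separator itself using csv.Sniffer.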
I'd just read the first row and see which separator gives you more columns:
import pandas as pd
tab = pd.read_csv(data_file, nrows=1, sep='\t').shape[1]
com = pd.read_csv(data_file, nrows=1, sep=',').shape[1]
if tab > com:
    df = pd.read_csv(data_file, sep='\t')
else:
    df = pd.read_csv(data_file, sep=',')
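The same column-count comparison can be wrapped in a small function so it extends to more than two separators. This is only a sketch: pick_separator and its candidates parameter are hypothetical names, and it works on in-memory text rather than a file path so that it runs on its own.

```python
import io

import pandas as pd


def pick_separator(text, candidates=(",", "\t")):
    """Return the candidate that splits the header row into the most columns.

    Hypothetical helper; on a tie the earlier candidate wins, so the comma
    stays the default as in the if/else version.
    """
    def n_columns(sep):
        # nrows=0 parses only the header, which is enough to count columns.
        return pd.read_csv(io.StringIO(text), sep=sep, nrows=0).shape[1]

    return max(candidates, key=n_columns)


# Made-up sample content for illustration.
comma_text = "cola,colb,colc\nA,B,C\n"
tab_text = "cola\tcolb\tcolc\nA\tB\tC\n"

sep_for_comma = pick_separator(comma_text)
sep_for_tab = pick_separator(tab_text)
print(repr(sep_for_comma), repr(sep_for_tab))
```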
Is this useful? You can use the Python regex parser with read_csv and specify different delimiters.
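As a rough illustration of the regex approach, a character class can match either delimiter. The sample strings below are made up, and the pattern assumes that no field contains a comma or tab of its own; the pandas documentation warns that regex separators tend to ignore quoted data.

```python
import io

import pandas as pd

# Made-up sample content: one comma-delimited, one tab-delimited.
comma_text = "cola,colb,colc\nA,B,C\n"
tab_text = "cola\tcolb\tcolc\nA\tB\tC\n"

# A regex separator is handled by the Python parsing engine, so request it explicitly.
either = r"[,\t]"
df_comma = pd.read_csv(io.StringIO(comma_text), sep=either, engine="python")
df_tab = pd.read_csv(io.StringIO(tab_text), sep=either, engine="python")
print(list(df_comma.columns), list(df_tab.columns))
```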
Ask the user to specify how the file is formatted if you don't expect to be able to determine it from the file contents. For example, accept a flag such as --tab-delimited-file=true and then switch the separator based on their input.
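One possible way to wire up such a flag is with argparse. The flag name --tab-delimited and the helper separator_from_args are hypothetical, and this sketch uses a plain on/off switch rather than the =true form mentioned above.

```python
import argparse


def build_parser():
    """Build a parser with a hypothetical --tab-delimited switch."""
    parser = argparse.ArgumentParser(description="Load a delimited text file.")
    parser.add_argument("data_file", help="path to the input file")
    parser.add_argument(
        "--tab-delimited",
        action="store_true",
        help="treat the input as tab-delimited instead of comma-delimited",
    )
    return parser


def separator_from_args(argv):
    """Return the separator to pass as sep= based on the user's flag."""
    args = build_parser().parse_args(argv)
    return "\t" if args.tab_delimited else ","


print(repr(separator_from_args(["data.tsv", "--tab-delimited"])))
print(repr(separator_from_args(["data.csv"])))
```

The returned value can then be passed straight to pd.read_csv(data_file, sep=...).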