
Complex delimited columns in Pandas read_csv

I'm trying to read some log files using Pandas, where the columns are delimited by whitespace and some columns consist of single-quoted strings that contain whitespace (e.g. 'string '). I am having a hard time reading these files with read_csv. For example (using some dummy data):

import pandas as pd
from io import StringIO

data = StringIO("""\
  1   2   'asdf    ' 3
  4   5   'asdfg   ' 4  
""")

columns = ['a','b','c','d']
df = pd.read_csv(data, delim_whitespace=True, names=columns)

For the first row, this results in the columns 1, 2, 'asdf, ', 3, whereas I would prefer 1, 2, asdf, 3. The behavior makes total sense, but I can't find a way to make read_csv parse such files "correctly" (that is, the way I want).

Is this possible at all?

You have to pass the quotechar argument to read_csv when parsing:

df = pd.read_csv(filename, quotechar = "'", delim_whitespace=True, names=columns)

This will, however, leave extra whitespace in column c. You can strip it with:

df.c = df.c.str.strip()
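
Putting the two steps together, a self-contained sketch might look like the following. It reuses the dummy data from the question instead of a file, and it assumes sep=r"\s+" as a stand-in for delim_whitespace=True, which recent pandas releases deprecate; that substitution is not part of the original answer.

```python
from io import StringIO

import pandas as pd

# Dummy data from the question: whitespace-delimited, with a
# single-quoted column that itself contains spaces.
data = StringIO("""\
  1   2   'asdf    ' 3
  4   5   'asdfg   ' 4
""")

columns = ["a", "b", "c", "d"]

# quotechar="'" makes the parser treat '...' as a single field.
# sep=r"\s+" is an assumption here, used in place of delim_whitespace=True.
df = pd.read_csv(data, sep=r"\s+", quotechar="'", names=columns)

# The quoted field keeps its inner padding, so strip it afterwards.
df["c"] = df["c"].str.strip()

print(df)
```

Assigning through df["c"] rather than df.c is a stylistic choice; both work for an existing column, but the bracket form also works for column names that are not valid Python identifiers.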
