
Read CSV file using Pandas: complex separator

I have a CSV file that I want to read using Python pandas. The header and lines look like the following:

 A           ^B^C^D^E  ^F          ^G           ^H^I^J^K^L^M^N

Clearly the separator is ^, but sometimes there are odd spaces around it. How can I read this file cleanly?

I am using the following command to read the CSV file:

df = pd.read_csv('input.csv', sep='^')

Use the regex \s*\^, which matches zero or more whitespace characters followed by a literal ^. You have to specify the python engine here to avoid a warning about regex support:

In [152]:

import io
import pandas as pd

t = """A           ^B^C^D^E  ^F          ^G           ^H^I^J^K^L^M^N"""
# Raw string so the backslashes reach the regex engine unchanged
df = pd.read_csv(io.StringIO(t), sep=r'\s*\^', engine='python')
df.columns

Out[152]:
Index(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N'], dtype='object')

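Can't you supply a regular expression as the separator? The sep parameter is documented as a string, so pass the pattern itself rather than a compiled pattern object:

sep = r'[\^\s]+'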

Your separator can be a regular expression, so try something like this:

df = pd.read_csv('input.csv', sep="[ ^]+", engine='python')

The regular expression treats any run of consecutive spaces or carets (^) as a single separator. Note that this also merges adjacent carets, so an empty field between two carets is dropped rather than read as a missing value.
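To see how the two regex separators suggested so far differ, here is a minimal sketch on a made-up sample (the data and the variable names are assumptions, not taken from the question's file). The second data row has an empty field between two carets.

```python
import io
import pandas as pd

# Hypothetical sample: the last row has an empty field for column B.
sample = "A   ^B^C  ^D\n1  ^2^3 ^4\n5 ^^7  ^8\n"

# Optional whitespace followed by exactly one caret: empty fields survive.
strict = pd.read_csv(io.StringIO(sample), sep=r"\s*\^", engine="python")

# Any run of spaces and carets is one separator: empty fields collapse,
# so later values shift left and the row is padded with NaN at the end.
loose = pd.read_csv(io.StringIO(sample), sep="[ ^]+", engine="python")

print(strict)
print(loose)
```

If the file can contain empty fields, the first pattern is probably the safer choice; if it cannot, both should give the same result.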

Read the file as you have done and then strip the extra whitespace from each column that holds strings:

df = (pd.read_csv('input.csv', sep="^")
      # x is a whole column (Series), so test its dtype rather than isinstance(x, str)
      .apply(lambda x: x.str.strip() if pd.api.types.is_string_dtype(x) else x))
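Reading with a plain ^ separator also leaves the padding in the column names, which stripping the values does not touch. A small sketch of stripping both, using an invented in-memory sample in place of input.csv:

```python
import io
import pandas as pd

# Hypothetical sample with padding around headers and values.
sample = "A   ^B ^C\nfoo  ^ 1^bar \nbaz^2 ^ qux\n"

df = pd.read_csv(io.StringIO(sample), sep="^")

# Strip the column labels first, then every string-typed column.
df = df.rename(columns=str.strip)
df = df.apply(
    lambda col: col.str.strip() if pd.api.types.is_string_dtype(col) else col
)

print(df.columns.tolist())
print(df)
```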

If the only whitespace in your file is the extra whitespace between columns (i.e. no column contains raw text with spaces), an easy fix is to simply remove all the spaces in the file. An example command to do that would be:

<input.csv tr -d '[:blank:]' > new_input.txt
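The same idea can be kept inside Python if you would rather not write a second file. This is only a sketch: read_without_blanks is a hypothetical helper name, and it assumes, as above, that no field legitimately contains spaces or tabs.

```python
import io
import pandas as pd

def read_without_blanks(source, sep="^"):
    """Delete every space and tab from the text, then parse it with a plain separator."""
    text = source.read()
    cleaned = text.translate(str.maketrans("", "", " \t"))
    return pd.read_csv(io.StringIO(cleaned), sep=sep)

# In-memory stand-in for the real file; with a file on disk you would pass
# an open file handle instead, e.g. open('input.csv').
sample = io.StringIO("A   ^B^C  ^D\n1  ^2^3 ^4\n")
df = read_without_blanks(sample)
print(df)
```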
