Read CSV file using Pandas: complex separator

I have a CSV file that I want to read using Python pandas. The header and the lines look like the following:

 A           ^B^C^D^E  ^F          ^G           ^H^I^J^K^L^M^N

The separator is clearly ^, but some fields are followed by stray spaces. How can I read this file cleanly?

I am using the following command to read the csv file:

df = pd.read_csv('input.csv', sep='^')

Use the regex \s*\^, which matches zero or more whitespace characters followed by a literal ^. You have to specify the Python engine here to avoid a warning about regex support:

In [152]:

import io
import pandas as pd
t = """A           ^B^C^D^E  ^F          ^G           ^H^I^J^K^L^M^N"""
df = pd.read_csv(io.StringIO(t), sep=r'\s*\^', engine='python')
df.columns
Out[152]:
Index(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N'], dtype='object')
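
To see the same separator applied to data rows as well, here is a minimal sketch. The two data rows are invented for illustration (the question only shows the header), and the column list is shortened to four.

```python
import io
import pandas as pd

# Hypothetical sample: a shortened header plus two made-up data rows.
sample = (
    "A           ^B^C  ^D\n"
    "1   ^2^3    ^4\n"
    "x y ^5^6^7\n"
)

# \s*\^ absorbs any padding in front of each caret, so "x y " becomes "x y".
df = pd.read_csv(io.StringIO(sample), sep=r"\s*\^", engine="python")

print(df.columns.tolist())
print(df)
```

This pattern only absorbs whitespace before each caret. If padding could also follow a caret, a pattern such as \s*\^\s* would be one way to absorb both sides.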

Can't you supply a regular expression as the separator?

sep = re.compile(r'[\^\s]+')

Your separator can be a regular expression, so try something like this:

df = pd.read_csv('input.csv', sep="[ ^]+", engine='python')

The regular expression treats any run of consecutive spaces or carets (^) as a single separator.
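
One consequence of matching a whole run is worth checking before relying on this pattern: adjacent carets (an empty field) and spaces inside a value are both swallowed. The sketch below uses the standard re module on two invented rows to compare it with the \s*\^ pattern from the earlier answer; the row contents are assumptions, not data from the question.

```python
import re

empty_field = "a  ^^c"      # hypothetical row whose middle field is empty
spaced_value = "x y ^z"     # hypothetical row whose first value contains a space

# "[ ^]+" merges the whole run "  ^^" into one separator and splits on the inner space.
collapsed = re.split(r"[ ^]+", empty_field)
split_value = re.split(r"[ ^]+", spaced_value)

# r"\s*\^" treats each caret as its own separator and leaves inner spaces alone.
kept = re.split(r"\s*\^", empty_field)
kept_value = re.split(r"\s*\^", spaced_value)

print(collapsed, split_value)
print(kept, kept_value)
```

If the file never has empty fields or spaces inside values, both patterns should give the same columns.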

Read the file as you have done, then strip the extra whitespace from every string column:

df = (pd.read_csv('input.csv', sep="^")
      .apply(lambda x: x.str.strip() if x.dtype == object else x))
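
A minimal sketch of this read-then-strip approach, run on an invented in-memory sample rather than input.csv. The helper name strip_text and the sample rows are hypothetical. The sketch also strips the column names, since the padding in the header line would otherwise stay in them.

```python
import io
import pandas as pd

# Hypothetical sample with padding after the header names and the text values.
sample = "A   ^B  ^C\nfoo  ^1^bar \nbaz^2^qux\n"

df = pd.read_csv(io.StringIO(sample), sep="^")

# The header cells carry the same padding, so clean the column names too.
df.columns = df.columns.str.strip()


def strip_text(col):
    # Hypothetical helper: strip only text columns and leave numeric ones untouched.
    is_text = pd.api.types.is_object_dtype(col) or pd.api.types.is_string_dtype(col)
    return col.str.strip() if is_text else col


df = df.apply(strip_text)
print(df)
```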

If the only whitespace in your file is the padding between columns (i.e. no column contains text with spaces), an easy fix is to remove all the spaces from the file. An example command to do that:

<input.csv tr -d '[:blank:]' > new_input.txt
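
Where tr is not available, roughly the same cleanup can be sketched in Python before parsing. This is an illustration on an invented in-memory string; for a real file you would read its text first. Like the command above, it assumes no value legitimately contains a space or tab.

```python
import io
import pandas as pd

# Hypothetical raw content with spaces and a tab used as padding.
raw = "A   ^B\t^C\n1 ^2^  3\n"

# Delete every space and tab, which is what the [:blank:] class covers.
cleaned = raw.translate(str.maketrans("", "", " \t"))

df = pd.read_csv(io.StringIO(cleaned), sep="^")
print(df)
```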

