Pandas combines 2 columns when importing from csv

I'm using Python 3.3.5 and pandas 0.16.2. When I read a file from CSV, pandas combines 2 columns whenever a null character (hex 00) is at the end of the data in the first column.

So the data is 4 columns like this:

"LANE_1<NUL>","17","21.8","68.3"

where <NUL> is a null character, or hex 00. Pandas takes the first two comma-delimited items and puts them into one column, resulting in

LANE_1',17' | 21.8 | 68.3

making 3 columns instead of the 4 it should be

LANE_1 | 17 | 21.8 | 68.3

It is like somehow pandas isn't recognizing the first comma. Is there any way to fix this without having to go and modify all of the .csv files to remove the null characters? Excel seems to open the file just fine separating the first 2 columns.

If the NUL is not an integral part of your data but an artifact/noise, I would prefer to clean it up. Otherwise you may have trouble later on when working with the data.
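As a rough sketch of that cleanup, the NUL bytes can be stripped in memory, so the .csv files on disk stay untouched. The helper name `strip_nul_bytes` and the UTF-8 default are assumptions made for illustration, and the standard-library `csv` module stands in for the parser here:

```python
import csv
import io


def strip_nul_bytes(raw, encoding="utf-8"):
    """Decode raw CSV bytes and return a text buffer with every NUL removed."""
    text = raw.replace(b"\x00", b"").decode(encoding)
    # newline="" leaves line endings untouched, as the csv module recommends.
    return io.StringIO(text, newline="")


# One line shaped like the question's data, with a NUL before the closing quote.
sample = b'"LANE_1\x00","17","21.8","68.3"\n'

cleaned = strip_nul_bytes(sample)
rows = list(csv.reader(cleaned))
print(rows)
```

For a real file, the bytes would come from `open(path, "rb").read()`, and a fresh buffer from `strip_nul_bytes` could presumably be handed to `pd.read_csv` in place of a path, since `read_csv` accepts file-like objects.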

If you know that the null will only show up at the separator, you can just use a regex separator:

In [43]: s
Out[43]: 'a\x00,b,c\nd\x00,e,f'

In [44]: print s
a,b,c
d,e,f

In [45]: pd.read_csv(StringIO.StringIO(s))
Out[45]: 
   a,b  c
0  d,e  f

In [46]: pd.read_csv(StringIO.StringIO(s), sep="\x00?,", engine="python")
Out[46]: 
   a  b  c
0  d  e  f
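To see what that separator pattern matches, here is a small sketch that applies the same regex with the standard `re` module. It only mimics the splitting step, not the rest of the pandas parser, and the names `SEPARATOR` and `rows` are illustrative:

```python
import re

# An optional NUL followed by a comma: the same pattern passed as `sep` above.
SEPARATOR = re.compile(r"\x00?,")

s = "a\x00,b,c\nd\x00,e,f"

# Split each line on the pattern; a NUL directly before a comma is
# consumed as part of the separator.
rows = [SEPARATOR.split(line) for line in s.split("\n")]
print(rows)
```

Because the NUL is swallowed by the separator match, it never reaches the field values. This only helps when the NUL sits immediately before the comma.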

EDIT:

As you pointed out, it gets a little weird with the quoting. The other answer suggesting data cleanup actually might be better, but you can sort of get around it with some less pretty tricks:

In [109]: s = '"a\x00","b","c"\n"d\x00","e","f"'

In [110]: pd.read_csv(StringIO.StringIO(s), sep='\x00?,', engine="python")
Out[110]: 
   "a"  "b"  "c"
0  "d"  "e"  "f"

In [111]: pd.read_csv(StringIO.StringIO(s), sep='\x00?,',
converters={c: lambda x: x.strip('\x00"') for c in xrange(3)}, engine="python")
Out[111]: 
  "a" "b" "c"
0    d   e   f

In [112]: df = pd.read_csv(StringIO.StringIO(s), sep='\x00?,',
converters={c: lambda x: x.strip('\x00"') for c in xrange(3)}, engine="python")

In [113]: df.columns = [c.strip('\x00"') for c in df.columns]

In [114]: df
Out[114]: 
   a  b  c
0  d  e  f
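The session above is written for Python 2. Under Python 3, `xrange` becomes `range` and `StringIO.StringIO` becomes `io.StringIO`. The sketch below shows, with the standard library only, why the quoted case needs the extra stripping step. The helper name `strip_nul_and_quotes` is made up for illustration and does the same job as the lambda used above:

```python
import re

SEPARATOR = re.compile(r"\x00?,")


def strip_nul_and_quotes(field):
    """Remove NUL and double-quote characters from both ends of a field."""
    return field.strip('\x00"')


line = '"d\x00","e","f"'

# The NUL sits inside the quotes, not next to the comma, so the
# separator pattern leaves it (and the quotes) in the field.
raw_fields = SEPARATOR.split(line)
clean_fields = [strip_nul_and_quotes(f) for f in raw_fields]
print(raw_fields)
print(clean_fields)

# Python 3 spelling of the converters mapping used in the session above.
converters = {c: strip_nul_and_quotes for c in range(3)}
```

Note that `str.strip` only touches the ends of each field, so it would also remove a legitimate leading or trailing quote character. That is one more reason the data-cleanup approach may be the safer choice.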

When you call pandas.read_csv(), you could pass the index_col=False argument to just get a standard integer index, i.e.:

df = pandas.read_csv(pathname, index_col=False)

If the names of the columns are actually important, you could create the DataFrame as you do now and then rename the columns with the correct list of labels. That command would be:

df.columns = list_of_column_labels
