How to separate columns with special characters in Pandas, Python
My data file contains some characters that cannot be typed on a keyboard, so I cannot set them as the separator. Is there any way to read this data properly?
My data looks different in the .txt file, but when I paste it here it looks like this:
PW 100 2000 2000 C 0 0 0 0.00 0.00 0
PW 100 2000 2000 C 0 0 0 0.00 0.00 0
PW 100 2000 2000 C 0 0 0 0.00 0.00 0
PW 100 2000 2000 C 0 0 0 0.00 0.00 0
PW 100 2000 2000 C 0 0 0 0.00 0.00 0
PW 100 2000 2000 C 0 0 0 0.00 0.00 0
PW 100 2000 2000 C 0 0 0 0.00 0.00 0
PW 100 2000 2000 C 0 0 0 0.00 0.00 0
PW 100 2000 2000 C 0 0 0 0.00 0.00 0
PW 100 2000 2000 C 0 0 0 0.00 0.00 0
PW 100 2000 2000 C 0 0 0 0.00 0.00 0
PW 100 2000 2000 C 0 0 0 0.00 0.00 0
PW 100 2000 2000 C 0 0 0 0.00 0.00 0
PW 100 2000 2000 C 0 0 0 0.00 0.00 0
PW 100 2000 2000 C 0 0 0 0.00 0.00 0
PW 100 2000 2000 C 0 0 0 0.00 0.00 0
PW 100 2000 2000 C 0 0 0 0.00 0.00 0
PW 100 2000 2000 C 0 0 0 0.00 0.00 0
PW 100 2000 2000 C 0 0 0 0.00 0.00 0
PW 100 2000 2000 C 0 0 0 0.00 0.00 0
PW 100 2000 2000 C 0 0 0 0.00 0.00 0
PW 100 2000 2000 C 0 0 0 0.00 0.00 0
PW 100 2000 2000 C 0 0 0 0.00 0.00 0
PW 100 2000 2000 C 0 0 0 0.00 0.00 0
But I have also attached the original data here: data.
To read the data, I simply tried this:
import pandas as pd
pd.read_table('data.txt', sep=r'\s+')
Is there a better way to do that?
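Before picking a separator, it can help to see which invisible characters are actually in the data. The following is a minimal sketch, not the asker's file: `sample` is a hypothetical line with `\x01`-`\x03` control characters mixed in, and `hidden` is a name introduced only for illustration.

```python
import re

# Hypothetical sample line: the real file's contents are an assumption here.
sample = "PW\x01 100\x02 2000 2000\x03 C 0 0 0 0.00 0.00 0\n"

# repr() prints control characters as escape sequences instead of hiding them.
print(repr(sample))

# Collect every character that is neither printable ASCII nor whitespace.
hidden = sorted(set(re.findall(r'[^\x20-\x7e\s]', sample)))
print([hex(ord(ch)) for ch in hidden])
```

For a real file, the same check can be run on the first few lines returned by `open(...)`.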
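You have to strip the invisible characters from your file before parsing it:
import io
import re

import pandas as pd

# Read the whole file and drop the control characters \x01-\x03.
with open('pwd_data.txt') as fp:
    buffer = io.StringIO(re.sub(r'[\x01-\x03]', '', fp.read()))

# Parse the cleaned text as whitespace-separated columns with no header row.
df = pd.read_table(buffer, header=None, sep=r'\s+')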
Output:
>>> df
     0    1     2     3  4  5  6  7    8    9  10
0   PW  100  2000  2000  C  0  0  0  0.0  0.0   0
1   PW  100  2000  2000  C  0  0  0  0.0  0.0   0
2   PW  100  2000  2000  C  0  0  0  0.0  0.0   0
...
32  PW  100  2000  2000  C  0  0  0  0.0  0.0   0
33  PW  100  2000  2000  C  0  0  0  0.0  0.0   0
34  PW  100  2000  2000  C  0  0  0  0.0  0.0   0
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35 entries, 0 to 34
Data columns (total 11 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   0       35 non-null     object
 1   1       35 non-null     int64
 2   2       35 non-null     int64
 3   3       35 non-null     int64
 4   4       35 non-null     object
 5   5       35 non-null     int64
 6   6       35 non-null     int64
 7   7       35 non-null     int64
 8   8       35 non-null     float64
 9   9       35 non-null     float64
 10  10      35 non-null     int64
dtypes: float64(2), int64(7), object(2)
memory usage: 3.1+ KB
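If the exact set of stray characters is not known in advance, a broader variant of the same idea is to drop every ASCII control character except tab, newline and carriage return. This is a sketch under assumptions: `raw` is a hypothetical two-line sample standing in for the file contents, and `drop_table` is a name introduced here for illustration.

```python
import io

import pandas as pd

# Hypothetical sample standing in for the file contents (an assumption).
raw = (
    "PW\x01 100 2000\x02 2000 C 0 0 0 0.00 0.00 0\x03\n"
    "PW\x01 100 2000\x02 2000 C 0 0 0 0.00 0.00 0\x03\n"
)

# Map every ASCII control character except tab, newline and carriage
# return to None, so that str.translate() deletes it.
keep = {'\t', '\n', '\r'}
drop_table = {code: None for code in range(32) if chr(code) not in keep}
drop_table[0x7f] = None  # DEL is a control character as well

cleaned = raw.translate(drop_table)

# Parse the cleaned text as whitespace-separated columns with no header row.
df = pd.read_csv(io.StringIO(cleaned), header=None, sep=r'\s+')
print(df.shape)
```

Compared with a fixed `[\x01-\x03]` pattern, this removes more than strictly needed for this file, which is only safe if none of the control characters carry meaning in the data.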
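Try changing your sep to r'[\s\x00-\x19]+', so that whitespace and control characters are both treated as separators. The + has to sit outside the character class: inside the brackets it matches a literal plus sign, and every single separator character would then split on its own. A regex separator other than \s+ also requires the Python parsing engine:
pd.read_table('data.txt', header=None, sep=r'[\s\x00-\x19]+', engine='python')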
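A self-contained sketch of the regex-separator approach, with the quantifier placed outside the character class so that a run of whitespace and control characters counts as one separator. The `raw` sample is hypothetical, and it assumes the control characters only occur between fields.

```python
import io

import pandas as pd

# Hypothetical sample: control characters appear between fields, sometimes
# next to ordinary spaces (the real file's layout is an assumption).
raw = (
    "PW\x01100\x02 2000 2000 C 0 0 0 0.00 0.00 0\n"
    "PW\x01100\x02 2000 2000 C 0 0 0 0.00 0.00 0\n"
)

# One or more whitespace or control characters form a single separator.
# Regex separators other than \s+ need the Python parsing engine.
df = pd.read_csv(
    io.StringIO(raw),
    header=None,
    sep=r'[\s\x00-\x19]+',
    engine='python',
)
print(df.shape)
```

If a control character sits at the very start or end of a line, this approach may produce an extra empty column, in which case stripping the characters before parsing is the more robust choice.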