How to separate columns with special characters in Pandas, Python


My data file contains some characters that cannot be typed on a keyboard, so I cannot set them as the separator. Is there any way to read this data properly?

My data looks different in the .txt file, but when I paste it here it looks like this:

PW  100  2000  2000  C   0  0  0   0.00   0.00    0
PW  100  2000  2000  C   0  0  0   0.00   0.00    0
PW  100  2000  2000  C   0  0  0   0.00   0.00    0
PW  100  2000  2000  C   0  0  0   0.00   0.00    0
PW  100  2000  2000  C   0  0  0   0.00   0.00    0
PW  100  2000  2000  C   0  0  0   0.00   0.00    0
PW  100  2000  2000  C   0  0  0   0.00   0.00    0
PW  100  2000  2000  C   0  0  0   0.00   0.00    0
PW  100  2000  2000  C   0  0  0   0.00   0.00    0
PW  100  2000  2000  C   0  0  0   0.00   0.00    0
PW  100  2000  2000  C   0  0  0   0.00   0.00    0
PW  100  2000  2000  C   0  0  0   0.00   0.00    0
PW  100  2000  2000  C   0  0  0   0.00   0.00    0
PW  100  2000  2000  C   0  0  0   0.00   0.00    0
PW  100  2000  2000  C   0  0  0   0.00   0.00    0
PW  100  2000  2000  C   0  0  0   0.00   0.00    0
PW  100  2000  2000  C   0  0  0   0.00   0.00    0
PW  100  2000  2000  C   0  0  0   0.00   0.00    0
PW  100  2000  2000  C   0  0  0   0.00   0.00    0
PW  100  2000  2000  C   0  0  0   0.00   0.00    0
PW  100  2000  2000  C   0  0  0   0.00   0.00    0
PW  100  2000  2000  C   0  0  0   0.00   0.00    0
PW  100  2000  2000  C   0  0  0   0.00   0.00    0
PW  100  2000  2000  C   0  0  0   0.00   0.00    0

I have also attached the original data here: data

To read the data, I simply tried this:

import pandas as pd
pd.read_table('data.txt', sep=r'\s+')

Is there a better way to do that?

You have to strip the invisible characters from your file before parsing it:

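import io
import re

import pandas as pd

with open('pwd_data.txt') as fp:
    # Remove the control characters \x01-\x03 before parsing.
    buffer = io.StringIO(re.sub(r'[\x01-\x03]', '', fp.read()))

# Fields are separated by runs of whitespace; the file has no header row.
df = pd.read_table(buffer, header=None, sep=r'\s+')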

Output:

>>> df
    0    1     2     3  4   5   6   7    8    9   10
0   PW  100  2000  2000  C   0   0   0  0.0  0.0   0
1   PW  100  2000  2000  C   0   0   0  0.0  0.0   0
2   PW  100  2000  2000  C   0   0   0  0.0  0.0   0
32  PW  100  2000  2000  C   0   0   0  0.0  0.0   0
33  PW  100  2000  2000  C   0   0   0  0.0  0.0   0
34  PW  100  2000  2000  C   0   0   0  0.0  0.0   0

>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35 entries, 0 to 34
Data columns (total 11 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       35 non-null     object 
 1   1       35 non-null     int64  
 2   2       35 non-null     int64  
 3   3       35 non-null     int64  
 4   4       35 non-null     object 
 5   5       35 non-null     int64  
 6   6       35 non-null     int64  
 7   7       35 non-null     int64  
 8   8       35 non-null     float64
 9   9       35 non-null     float64
 10  10      35 non-null     int64  
dtypes: float64(2), int64(7), object(2)
memory usage: 3.1+ KB
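
Before choosing which characters to strip, it can help to see which invisible characters the file actually contains. The following is only a minimal sketch: the helper name count_control_chars and the sample line are hypothetical, and the real file may contain different characters than the ones assumed here.

```python
from collections import Counter
import unicodedata


def count_control_chars(text):
    """Count control characters in text, ignoring ordinary whitespace."""
    ordinary = {'\n', '\r', '\t'}
    return Counter(
        ch for ch in text
        if unicodedata.category(ch) == 'Cc' and ch not in ordinary
    )


# Hypothetical sample line; in practice pass the contents of the real file,
# e.g. open('pwd_data.txt').read().
sample = 'PW\x01 100\x02 2000\x03 2000\x01 C\n'
counts = count_control_chars(sample)

for ch, n in sorted(counts.items()):
    print(f'{ch!r}: {n}')
```

The characters reported this way are the ones to put into the re.sub character class.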

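Alternatively, try changing your sep so that control characters count as separators too. The + has to follow the character class so that a run of such characters is treated as one separator:

pd.read_table('data.txt', sep=r'[\s\x00-\x19]+')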
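
A regex separator like this can be tried out on its own with re.split before handing it to pandas. The sketch below uses a hypothetical line (the position of the control characters in the real file is an assumption) and compares a pattern with the + inside the character class against one with the + after it.

```python
import re

# Hypothetical line: fields separated by spaces plus stray control characters.
line = 'PW\x01  100\x02  2000  C'

# Inside the class, '+' is a literal plus sign, so every single character
# is its own separator and empty fields appear between neighbours.
per_char = re.split(r'[\s+\x00-\x19]', line)

# After the class, '+' turns each run of characters into one separator.
per_run = re.split(r'[\s\x00-\x19]+', line)

print(per_char)
print(per_run)
```

With the first pattern the split contains empty strings, which would show up as extra empty columns; with the second it yields only the four fields.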
