
How to separate columns with special characters in Pandas, Python


My data file contains some characters that cannot be typed on a keyboard, so I cannot set them as the separator. Is there any way to read this data properly?

My data looks different in the .txt file, but when I paste it here it looks like this:

PW  100  2000  2000  C   0  0  0   0.00   0.00    0
PW  100  2000  2000  C   0  0  0   0.00   0.00    0
PW  100  2000  2000  C   0  0  0   0.00   0.00    0
PW  100  2000  2000  C   0  0  0   0.00   0.00    0
PW  100  2000  2000  C   0  0  0   0.00   0.00    0
PW  100  2000  2000  C   0  0  0   0.00   0.00    0
PW  100  2000  2000  C   0  0  0   0.00   0.00    0
PW  100  2000  2000  C   0  0  0   0.00   0.00    0
PW  100  2000  2000  C   0  0  0   0.00   0.00    0
PW  100  2000  2000  C   0  0  0   0.00   0.00    0
PW  100  2000  2000  C   0  0  0   0.00   0.00    0
PW  100  2000  2000  C   0  0  0   0.00   0.00    0
PW  100  2000  2000  C   0  0  0   0.00   0.00    0
PW  100  2000  2000  C   0  0  0   0.00   0.00    0
PW  100  2000  2000  C   0  0  0   0.00   0.00    0
PW  100  2000  2000  C   0  0  0   0.00   0.00    0
PW  100  2000  2000  C   0  0  0   0.00   0.00    0
PW  100  2000  2000  C   0  0  0   0.00   0.00    0
PW  100  2000  2000  C   0  0  0   0.00   0.00    0
PW  100  2000  2000  C   0  0  0   0.00   0.00    0
PW  100  2000  2000  C   0  0  0   0.00   0.00    0
PW  100  2000  2000  C   0  0  0   0.00   0.00    0
PW  100  2000  2000  C   0  0  0   0.00   0.00    0
PW  100  2000  2000  C   0  0  0   0.00   0.00    0

I have also attached the original data file.

To read the data, I simply tried this:

import pandas as pd
pd.read_table('data.txt', sep=r'\s+')

Is there a better way to do that?

You have to strip the invisible characters from your file before parsing it:

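import io
import re

import pandas as pd

with open('pwd_data.txt') as fp:
    # Remove the control characters \x01-\x03 before parsing
    buffer = io.StringIO(re.sub(r'[\x01-\x03]', '', fp.read()))

# The data has no header row; columns are separated by runs of whitespace
df = pd.read_table(buffer, header=None, sep=r'\s+')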


>>> df
    0    1     2     3  4   5   6   7    8    9   10
0   PW  100  2000  2000  C   0   0   0  0.0  0.0   0
1   PW  100  2000  2000  C   0   0   0  0.0  0.0   0
2   PW  100  2000  2000  C   0   0   0  0.0  0.0   0
32  PW  100  2000  2000  C   0   0   0  0.0  0.0   0
33  PW  100  2000  2000  C   0   0   0  0.0  0.0   0
34  PW  100  2000  2000  C   0   0   0  0.0  0.0   0

>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35 entries, 0 to 34
Data columns (total 11 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       35 non-null     object 
 1   1       35 non-null     int64  
 2   2       35 non-null     int64  
 3   3       35 non-null     int64  
 4   4       35 non-null     object 
 5   5       35 non-null     int64  
 6   6       35 non-null     int64  
 7   7       35 non-null     int64  
 8   8       35 non-null     float64
 9   9       35 non-null     float64
 10  10      35 non-null     int64  
dtypes: float64(2), int64(7), object(2)
memory usage: 3.1+ KB
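
If you are not sure which invisible characters the file actually contains, it may help to count them first. The following is only a minimal sketch: the helper name count_control_chars and the sample line are made up for illustration, and the real file may contain different characters than \x01 and \x02.

```python
import collections
import unicodedata


def count_control_chars(text):
    """Count control characters (Unicode category Cc) that are not whitespace."""
    return collections.Counter(
        ch for ch in text
        if unicodedata.category(ch) == "Cc" and not ch.isspace()
    )


# Hypothetical sample line; in practice pass the result of fp.read() instead
sample = "PW\x01100\x022000 2000\tC\n"
for ch, n in sorted(count_control_chars(sample).items()):
    print(f"{ch!r}: {n}")
```

The characters reported here are the ones to put into the character class passed to re.sub.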

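Try changing your sep to r'[\s\x00-\x19]+' so that the control characters are treated as separators too. The + belongs outside the character class; inside the brackets it would only match a literal plus sign, and every single space would start a new column.

pd.read_table('data.txt', sep=r'[\s\x00-\x19]+', engine='python')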
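
A regex separator like this can be checked on a single line with re.split before handing it to pandas. This is a rough sketch, assuming the quantifier is placed after the character class so that runs of separators collapse into one; the sample line and the helper name split_fields are hypothetical.

```python
import re

# Whitespace or control characters \x00-\x19, one or more in a row
SEP = re.compile(r'[\s\x00-\x19]+')


def split_fields(line):
    """Split a line on the separator pattern and drop empty edge fields."""
    return [field for field in SEP.split(line) if field]


# Hypothetical line mixing spaces and control characters between fields
sample = "PW\x01 100\x02\x032000  2000 C\n"
print(split_fields(sample))
```

If the fields come out as expected here, the same pattern should behave the same way as sep in read_table.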
