
Separate data files with negative sign and white-space delimiter

I am trying to treat both white space ' ' and '-' as column delimiters. My files have the bug of not consistently being separated by a space, for example:

8.55500000  42.93079187 -99.98428964  -0.59917942  20.86164814   8.37369433   0.56431509
8.55600000  42.94500503-100.05470144  -0.55062999  20.86380446   8.38865674   0.56429834
8.55700000  42.99565203-100.11651750  -0.54444340  20.87003752   8.39975047   0.55109542
8.55800000  42.99873154-100.07383720  -0.54648262  20.85777962   8.41246904   0.55645774
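To make the problem concrete, here is a minimal sketch that splits the second sample row on whitespace alone. The variable names are illustrative only.

```python
# Second sample row: the third value is fused to the second by its minus sign.
line = ("8.55600000  42.94500503-100.05470144  -0.55062999  "
        "20.86380446   8.38865674   0.56429834")

fields = line.split()   # plain whitespace split

print(len(fields))      # one field fewer than the 7 columns expected
print(fields[1])        # the two fused values come back as a single token
```

A whitespace-only delimiter therefore yields rows of different lengths depending on whether a negative value happens to fill its column.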

This is a more involved use of sep, so here is the explanation. A separator cannot be kept as part of the column in only some cases, so this time the regular expression matches the value itself, and the capture group keeps that "separator" as the column. The pattern is an optional - sign followed by consecutive digits and decimal points, with any leading whitespace left outside the group. This approach solves the issue, but because the text between matches is empty it creates multiple all-NaN columns (which are dropped). If the file is large in terms of columns and rows, this could lead to memory problems.
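The mechanism can be seen in isolation with re.split, which also returns captured groups. This is only a sketch of the idea, on the assumption that the Python parsing engine splits lines in a comparable way; the variable names are illustrative.

```python
import re

pattern = r'\s*(-?[0-9\.]+)'   # same pattern as the sep argument
line = "8.55600000  42.94500503-100.05470144  -0.55062999"

parts = re.split(pattern, line)

# Captured numbers sit at the odd positions; the even positions hold the
# (empty) text found between consecutive matches.
numbers = parts[1::2]
gaps = parts[0::2]

print(numbers)
print(gaps)
```

The empty strings at the even positions are what end up as the NaN columns, which is consistent with the surviving columns being labelled 1, 3, 5 and so on.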

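from io import StringIO

import pandas as pd

S = '''
8.500000  42.93079187 -99.98428964  -0.59917942  20.86164814   8.37369433   0.56431509
8.55600000  42.94500503-100.05470144  -0.55062999  20.86380446   8.38865674   0.56429834
8.55700000  42.99565203-100.11651750  -0.54444340  20.87003752   8.39975047   0.55109542
8.55800000  42.99873154-100.07383720  -0.54648262  20.85777962   8.41246904   0.55645774'''

# The capture group turns each number into a column; the empty text between
# matches becomes all-NaN columns, which dropna(axis=1) removes.
# A raw string keeps the backslashes in the regex from being treated as escapes.
df = pd.read_csv(StringIO(S),
                 sep=r'\s*(-?[0-9\.]+)',
                 engine='python', header=None).dropna(axis=1)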

df.head()
#   1       3           5           7           9           11          13
# 0 8.500   42.930792   -99.984290  -0.599179   20.861648   8.373694    0.564315
# 1 8.556   42.945005   -100.054701 -0.550630   20.863804   8.388657    0.564298
# 2 8.557   42.995652   -100.116518 -0.544443   20.870038   8.399750    0.551095
# 3 8.558   42.998732   -100.073837 -0.546483   20.857780   8.412469    0.556458
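If the extra NaN columns are a concern, one possible alternative (not part of the answer above) is to clean the text first by inserting a space before any minus sign that directly follows a digit. A minimal sketch, assuming the data contains no other digit-then-hyphen sequences such as dates; fused_minus and the other names are made up for illustration.

```python
import re

raw = (
    "8.55500000  42.93079187 -99.98428964  -0.59917942\n"
    "8.55600000  42.94500503-100.05470144  -0.55062999\n"
)

# A minus sign immediately preceded by a digit marks two fused values.
# The lookbehind leaves exponents such as 1e-5 alone, since 'e' is not a digit.
fused_minus = re.compile(r"(?<=\d)-")

cleaned = fused_minus.sub(" -", raw)

# After cleaning, a plain whitespace split gives the same field count per row.
rows = [line.split() for line in cleaned.splitlines()]
print(rows)
```

The cleaned text could then be wrapped in StringIO and read with an ordinary whitespace separator, for example pd.read_csv(StringIO(cleaned), sep=r'\s+', header=None). For a very large file the same substitution could be applied line by line instead of on the whole text at once.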

If all of your file data is in that simple format, then this approach can efficiently produce row data that pandas can use to build your DataFrame.

>>> import re
>>> import pandas as pd
>>> 
>>> float_expr = re.compile(r"-?\d*\.?\d+")
>>> 
>>> def gen_file_data(f):
...     for line in f:
...         line_data = float_expr.findall(line)
...         yield (float(v) for v in line_data)
...         
>>> df = pd.DataFrame.from_records(gen_file_data(open('filedata.txt', 'r')))
>>> 
>>> df
       0          1           2         3          4         5         6
0  8.555  42.930792  -99.984290 -0.599179  20.861648  8.373694  0.564315
1  8.556  42.945005 -100.054701 -0.550630  20.863804  8.388657  0.564298
2  8.557  42.995652 -100.116518 -0.544443  20.870038  8.399750  0.551095
3  8.558  42.998732 -100.073837 -0.546483  20.857780  8.412469  0.556458
>>> 
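The same idea can be tried without a file on disk. The sketch below feeds two sample rows through an in-memory buffer; gen_rows and float_expr_sci are hypothetical names. The second pattern is an assumption about how exponent notation might be handled, which the original expression does not cover.

```python
import re
from io import StringIO

float_expr = re.compile(r"-?\d*\.?\d+")

def gen_rows(f):
    """Yield one list of floats per non-empty line."""
    for line in f:
        values = float_expr.findall(line)
        if values:                      # skip blank lines
            yield [float(v) for v in values]

sample = StringIO(
    "8.55500000  42.93079187 -99.98428964  -0.59917942\n"
    "8.55600000  42.94500503-100.05470144  -0.55062999\n"
)
rows = list(gen_rows(sample))
print(rows)

# The original pattern splits a value such as 1.5e-3 into two numbers.
# Hypothetical extension that also accepts an optional exponent:
float_expr_sci = re.compile(r"-?\d*\.?\d+(?:[eE][-+]?\d+)?")
print(float_expr.findall("1.5e-3"))
print(float_expr_sci.findall("42.5-1.0e+2"))
```

Yielding lists rather than generator expressions makes each row easy to inspect, and the resulting rows can be passed to pd.DataFrame.from_records in the same way as above.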

If the file has a header row:

>>> def gen_file_data(f):
...     yield next(f).split()  # Header row?
...     for line in f:
...         line_data = float_expr.findall(line)
...         yield (float(v) for v in line_data)
...         
>>> g = gen_file_data(open("filedata.txt", 'r'))
>>> 
>>> df = pd.DataFrame.from_records(g, columns=next(g))
>>> df
     foo        bar         baz       qux       quux      quuz     corge
0  8.555  42.930792  -99.984290 -0.599179  20.861648  8.373694  0.564315
1  8.556  42.945005 -100.054701 -0.550630  20.863804  8.388657  0.564298
2  8.557  42.995652 -100.116518 -0.544443  20.870038  8.399750  0.551095
3  8.558  42.998732 -100.073837 -0.546483  20.857780  8.412469  0.556458
>>>

The generator assumes that the header row is composed of headers that are contiguous characters separated by whitespace. If it's some other pattern, the first line of the generator can be updated to handle it.
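As an illustration of such an update, here is a sketch in which the header splitting is passed in as a function. The comma-separated header, the split_header parameter and split_comma_header are all assumptions made up for this example.

```python
import re
from io import StringIO

float_expr = re.compile(r"-?\d*\.?\d+")

def gen_file_data(f, split_header=str.split):
    """Yield the header names first, then one list of floats per line."""
    yield split_header(next(f))
    for line in f:
        yield [float(v) for v in float_expr.findall(line)]

def split_comma_header(line):
    # Hypothetical header format: names separated by commas, possibly
    # containing spaces.
    return [name.strip() for name in line.split(",")]

sample = StringIO(
    "time, x pos, y pos\n"
    "8.55600000  42.94500503-100.05470144\n"
)

g = gen_file_data(sample, split_header=split_comma_header)
columns = next(g)       # header names
rows = list(g)          # remaining data rows
print(columns)
print(rows)
```

With a real file, g and columns would be used as before: pd.DataFrame.from_records(g, columns=columns).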

 