I am trying to treat both white space ' ' and '-' as column delimiters. My files have the bug of not being consistently separated by a space, for example:
8.55500000 42.93079187 -99.98428964 -0.59917942 20.86164814 8.37369433 0.56431509
8.55600000 42.94500503-100.05470144 -0.55062999 20.86380446 8.38865674 0.56429834
8.55700000 42.99565203-100.11651750 -0.54444340 20.87003752 8.39975047 0.55109542
8.55800000 42.99873154-100.07383720 -0.54648262 20.85777962 8.41246904 0.55645774
This needs a more complex use of sep, so here is the explanation. You cannot keep the separator as part of the column in only some cases, so this time the pattern captures the value itself as the separator. The value is defined as an optional - sign followed by consecutive digits and dots. Because every number is consumed as a separator, the fields left between separators are empty. This approach solves the issue, but it creates multiple NaN columns (which are dropped afterwards). If the file is large in terms of columns and rows, this could lead to memory problems.
from io import StringIO
import pandas as pd
S = '''
8.500000 42.93079187 -99.98428964 -0.59917942 20.86164814 8.37369433 0.56431509
8.55600000 42.94500503-100.05470144 -0.55062999 20.86380446 8.38865674 0.56429834
8.55700000 42.99565203-100.11651750 -0.54444340 20.87003752 8.39975047 0.55109542
8.55800000 42.99873154-100.07383720 -0.54648262 20.85777962 8.41246904 0.55645774'''
df = pd.read_csv(StringIO(S),
                 sep=r'\s*(-?[0-9.]+)',
                 engine='python', header=None).dropna(axis=1)
df.head()
#        1          3            5          7          9        11        13
# 0  8.500  42.930792   -99.984290  -0.599179  20.861648  8.373694  0.564315
# 1  8.556  42.945005  -100.054701  -0.550630  20.863804  8.388657  0.564298
# 2  8.557  42.995652  -100.116518  -0.544443  20.870038  8.399750  0.551095
# 3  8.558  42.998732  -100.073837  -0.546483  20.857780  8.412469  0.556458
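To see where the empty columns come from, it can help to apply an equivalent pattern with Python's own re.split, which keeps captured groups in its result. This is only an illustrative sketch: it assumes the python engine splits each line in a comparable way, and the names pattern, line, numbers and gaps are made up for the example.

```python
import re

# Equivalent to the separator pattern passed to read_csv above.
pattern = re.compile(r'\s*(-?[0-9.]+)')

# Hypothetical sample line with two values glued together by the minus sign.
line = '8.556 42.945-100.05 -0.55'

# re.split keeps the captured group, so each number shows up as its own item,
# surrounded by the (empty) text that was left between two "separators".
parts = pattern.split(line)
print(parts)

numbers = parts[1::2]  # captured values sit at the odd positions
gaps = parts[0::2]     # everything in between is an empty string
print(numbers)
print(gaps)
```

The odd positions line up with the column labels 1, 3, 5, ... in the output above, and the empty even positions are the columns that dropna(axis=1) removes.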
If all your file data is in that simple format, then this approach can efficiently produce row data that pandas can use to build your dataframes.
>>> import re
>>> import pandas as pd
>>>
>>> float_expr = re.compile(r"-?\d*\.?\d+")
>>>
>>> def gen_file_data(f):
...     for line in f:
...         line_data = float_expr.findall(line)
...         yield (float(v) for v in line_data)
...
>>> df = pd.DataFrame.from_records(gen_file_data(open('filedata.txt', 'r')))
>>>
>>> df
       0          1            2          3          4         5         6
0  8.555  42.930792   -99.984290  -0.599179  20.861648  8.373694  0.564315
1  8.556  42.945005  -100.054701  -0.550630  20.863804  8.388657  0.564298
2  8.557  42.995652  -100.116518  -0.544443  20.870038  8.399750  0.551095
3  8.558  42.998732  -100.073837  -0.546483  20.857780  8.412469  0.556458
>>>
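As a quick check of what the float pattern extracts, here is a small sketch that runs it on one of the glued lines from the question. The variable names are only for illustration, and the second part shows a limitation that applies if your data ever contains scientific notation, which the sample file does not.

```python
import re

float_expr = re.compile(r"-?\d*\.?\d+")

# A line from the question where two values are glued together by the minus sign.
line = "8.55600000 42.94500503-100.05470144 -0.55062999"
tokens = float_expr.findall(line)
values = [float(t) for t in tokens]
print(tokens)
print(values)

# Caveat: the pattern knows nothing about exponents, so a value written in
# scientific notation is cut into two separate tokens.
exp_tokens = float_expr.findall("1.5e-05")
print(exp_tokens)
```

So the pattern is a good fit for plain fixed-point columns like the ones shown, but it would need to be extended before using it on files that contain values such as 1.5e-05.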
If the file has a header row:
>>> def gen_file_data(f):
...     yield next(f).split()  # first line is the header row
...     for line in f:
...         line_data = float_expr.findall(line)
...         yield (float(v) for v in line_data)
...
>>> g = gen_file_data(open("filedata.txt", 'r'))
>>>
>>> df = pd.DataFrame.from_records(g, columns=next(g))
>>> df
     foo        bar          baz        qux       quux      quuz     corge
0  8.555  42.930792   -99.984290  -0.599179  20.861648  8.373694  0.564315
1  8.556  42.945005  -100.054701  -0.550630  20.863804  8.388657  0.564298
2  8.557  42.995652  -100.116518  -0.544443  20.870038  8.399750  0.551095
3  8.558  42.998732  -100.073837  -0.546483  20.857780  8.412469  0.556458
>>>
The generator assumes that the header row is composed of headers that are contiguous characters separated by whitespace. If it follows some other pattern, the first line of the generator can be updated to handle it.
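For instance, if the header happened to be comma-separated, one way to handle it would be to pass the header-splitting step in as a function. This is just a sketch: the split_header parameter, the sample text and the comma-separated header are assumptions made for the example, not something taken from the original file.

```python
import re
from io import StringIO

float_expr = re.compile(r"-?\d*\.?\d+")

def gen_file_data(f, split_header=str.split):
    # split_header is a hypothetical hook: any callable that turns the
    # first line into a list of column names.
    yield split_header(next(f))
    for line in f:
        yield [float(v) for v in float_expr.findall(line)]

# Made-up sample with a comma-separated header and one glued data line.
sample = StringIO("foo, bar, baz\n8.556 42.94500503-100.05470144\n")

def split_on_commas(header_line):
    return [name.strip() for name in header_line.split(",")]

rows = gen_file_data(sample, split_header=split_on_commas)
header = next(rows)
data = list(rows)
print(header)
print(data)
```

The header and the remaining rows can then be handed to pd.DataFrame.from_records(rows, columns=header) in the same way as in the session above.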