
Separate datafiles: negative sign and white-space delimiter

I am trying to treat both white space ' ' and '-' as column delimiters. My files have the bug of not being consistently separated by a space, for example:

8.55500000  42.93079187 -99.98428964  -0.59917942  20.86164814   8.37369433   0.56431509
8.55600000  42.94500503-100.05470144  -0.55062999  20.86380446   8.38865674   0.56429834
8.55700000  42.99565203-100.11651750  -0.54444340  20.87003752   8.39975047   0.55109542
8.55800000  42.99873154-100.07383720  -0.54648262  20.85777962   8.41246904   0.55645774

This is a more complex use of sep, so here is the explanation. You cannot keep the separator as part of the column in only some cases, so this time the code actually keeps the separator as the column. The separator is defined as an optional - sign followed by consecutive digits and decimal points, and the capturing group makes the parser keep each match as a field. This approach solves the issue, but it also creates an empty (NaN) column between the real columns, which are then dropped. If the file is large in terms of columns and rows, this could lead to memory problems.

from io import StringIO

import pandas as pd

S = '''
8.500000  42.93079187 -99.98428964  -0.59917942  20.86164814   8.37369433   0.56431509
8.55600000  42.94500503-100.05470144  -0.55062999  20.86380446   8.38865674   0.56429834
8.55700000  42.99565203-100.11651750  -0.54444340  20.87003752   8.39975047   0.55109542
8.55800000  42.99873154-100.07383720  -0.54648262  20.85777962   8.41246904   0.55645774'''

# The capturing group keeps each matched number as a field; a raw string
# avoids invalid-escape warnings for \s and \.
df = pd.read_csv(StringIO(S),
                 sep=r'\s*(-?[0-9\.]+)',
                 engine='python', header=None).dropna(axis=1)

df.head()
#   1       3           5           7           9           11          13
# 0 8.500   42.930792   -99.984290  -0.599179   20.861648   8.373694    0.564315
# 1 8.556   42.945005   -100.054701 -0.550630   20.863804   8.388657    0.564298
# 2 8.557   42.995652   -100.116518 -0.544443   20.870038   8.399750    0.551095
# 3 8.558   42.998732   -100.073837 -0.546483   20.857780   8.412469    0.556458
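
To see why only the odd-numbered columns hold data, the same pattern can be tried directly with re.split. This is only a sketch of what the python engine appears to do with a regex sep, not its actual implementation; the variable names are made up for illustration.

```python
import re

# Same pattern as the sep argument above.
pattern = re.compile(r'\s*(-?[0-9\.]+)')

# One of the sample lines, including the fused "42.94500503-100.05470144".
line = "8.55600000  42.94500503-100.05470144  -0.55062999"

# Because of the capturing group, re.split keeps every match and puts
# the (empty) text between matches in the even positions.
fields = pattern.split(line)
print(fields)
# ['', '8.55600000', '', '42.94500503', '', '-100.05470144', '', '-0.55062999', '']

# The real values sit at the odd positions.
numbers = [float(f) for f in fields[1::2]]
print(numbers)
```

The empty strings at the even positions are what end up as the NaN columns that dropna(axis=1) removes.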

If all your file data is in that simple format, then this approach can efficiently produce row data that pandas can use to build your dataframes. It extracts the numbers from each line with a regular expression instead of splitting on a delimiter, so no empty columns are created.

>>> import re
>>> import pandas as pd
>>> 
>>> float_expr = re.compile(r"-?\d*\.?\d+")
>>> 
>>> def gen_file_data(f):
...     for line in f:
...         line_data = float_expr.findall(line)
...         yield (float(v) for v in line_data)
...         
>>> df = pd.DataFrame.from_records(gen_file_data(open('filedata.txt', 'r')))
>>> 
>>> df
       0          1           2         3          4         5         6
0  8.555  42.930792  -99.984290 -0.599179  20.861648  8.373694  0.564315
1  8.556  42.945005 -100.054701 -0.550630  20.863804  8.388657  0.564298
2  8.557  42.995652 -100.116518 -0.544443  20.870038  8.399750  0.551095
3  8.558  42.998732 -100.073837 -0.546483  20.857780  8.412469  0.556458
>>> 
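
A regex like this will not complain if a line yields too few or too many numbers, so a variant that checks the field count may be worth having. The following is a minimal sketch under that assumption; parse_rows and expected_fields are hypothetical names, and an in-memory sample stands in for filedata.txt.

```python
import io
import re

float_expr = re.compile(r"-?\d*\.?\d+")

def parse_rows(f, expected_fields=None):
    """Yield one tuple of floats per non-blank line of the open text file f."""
    for lineno, line in enumerate(f, start=1):
        if not line.strip():
            continue  # skip blank lines
        values = tuple(float(v) for v in float_expr.findall(line))
        # Optional sanity check: fail loudly on a malformed line.
        if expected_fields is not None and len(values) != expected_fields:
            raise ValueError(
                f"line {lineno}: expected {expected_fields} fields, got {len(values)}"
            )
        yield values

# In-memory stand-in for the data file (first four columns only).
sample = io.StringIO(
    "8.55500000  42.93079187 -99.98428964  -0.59917942\n"
    "8.55600000  42.94500503-100.05470144  -0.55062999\n"
)
rows = list(parse_rows(sample, expected_fields=4))
print(rows)
```

The rows can be passed to pd.DataFrame.from_records in the same way as the generator above. Note that this pattern does not handle scientific notation such as 1e-5, which it would split into two numbers.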

Header row?

>>> def gen_file_data(f):
...     yield next(f).split()  # Header row?
...     for line in f:
...         line_data = float_expr.findall(line)
...         yield (float(v) for v in line_data)
...         
>>> g = gen_file_data(open("filedata.txt", 'r'))
>>> 
>>> df = pd.DataFrame.from_records(g, columns=next(g))
>>> df
     foo        bar         baz       qux       quux      quuz     corge
0  8.555  42.930792  -99.984290 -0.599179  20.861648  8.373694  0.564315
1  8.556  42.945005 -100.054701 -0.550630  20.863804  8.388657  0.564298
2  8.557  42.995652 -100.116518 -0.544443  20.870038  8.399750  0.551095
3  8.558  42.998732 -100.073837 -0.546483  20.857780  8.412469  0.556458
>>>

The generator assumes that the header row is composed of headers that are contiguous characters separated by whitespace. If it's some other pattern, the first line of the generator can be updated to handle it.
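
For instance, if the header names were separated by commas (an assumption for illustration only, the question does not show a header), the first line of the generator might be changed along these lines. The header_sep parameter is hypothetical.

```python
import io
import re

float_expr = re.compile(r"-?\d*\.?\d+")

def gen_file_data(f, header_sep=","):
    # Header row: split on the assumed separator and trim surrounding spaces.
    yield [name.strip() for name in next(f).split(header_sep)]
    # Data rows: pull out every number, even when a minus sign is fused
    # to the previous field.
    for line in f:
        yield tuple(float(v) for v in float_expr.findall(line))

# In-memory stand-in for a file with a comma-separated header.
sample = io.StringIO(
    "foo, bar, baz\n"
    "8.55600000  42.94500503-100.05470144\n"
)
g = gen_file_data(sample)
columns = next(g)   # first item yielded is the header
rows = list(g)      # remaining items are the data rows
print(columns, rows)
```

As before, columns and the remaining rows can be handed to pd.DataFrame.from_records(g, columns=columns).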
