Separate data files on negative sign and white-space delimiter
I am trying to treat both white space ' ' and '-' as column delimiters. My files have the bug of not consistently being separated by a space, for example:
8.55500000 42.93079187 -99.98428964 -0.59917942 20.86164814 8.37369433 0.56431509
8.55600000 42.94500503-100.05470144 -0.55062999 20.86380446 8.38865674 0.56429834
8.55700000 42.99565203-100.11651750 -0.54444340 20.87003752 8.39975047 0.55109542
8.55800000 42.99873154-100.07383720 -0.54648262 20.85777962 8.41246904 0.55645774
This is a more complex use of sep, so here is the explanation. You cannot keep the separator as part of the column in only some cases, so this time the code actually keeps the separator as the column. The separator is defined as an optional - sign followed by consecutive digits and decimal points. This approach solves the issue, but it creates multiple nan columns (which are dropped). If the file is large in terms of columns and rows, this could lead to memory problems.
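To see where the empty columns come from, here is a minimal sketch that applies the same pattern with Python's re.split, on the assumption that this approximates how the python engine splits each line. Because the pattern contains a capturing group, re.split returns the captured numbers and also the text lying between matches, which is empty here.

```python
import re

# Same pattern as the sep argument; the capturing group makes
# re.split return the matched numbers along with the gaps between them.
pattern = re.compile(r'\s*(-?[0-9\.]+)')

line = '8.556 42.945-100.05'  # second and third values are fused
parts = pattern.split(line)
print(parts)

# Every other element is an empty string; only the odd positions hold numbers.
numbers = [float(p) for p in parts[1::2]]
print(numbers)
```

The empty strings at the even positions are most likely what ends up as the nan columns, which would explain why the surviving columns are labelled 1, 3, 5 and so on.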
from io import StringIO
import pandas as pd
S = '''
8.500000 42.93079187 -99.98428964 -0.59917942 20.86164814 8.37369433 0.56431509
8.55600000 42.94500503-100.05470144 -0.55062999 20.86380446 8.38865674 0.56429834
8.55700000 42.99565203-100.11651750 -0.54444340 20.87003752 8.39975047 0.55109542
8.55800000 42.99873154-100.07383720 -0.54648262 20.85777962 8.41246904 0.55645774'''
df = pd.read_csv(StringIO(S),
                 sep=r'\s*(-?[0-9\.]+)',
                 engine='python', header=None).dropna(axis=1)
df.head()
# 1 3 5 7 9 11 13
# 0 8.500 42.930792 -99.984290 -0.599179 20.861648 8.373694 0.564315
# 1 8.556 42.945005 -100.054701 -0.550630 20.863804 8.388657 0.564298
# 2 8.557 42.995652 -100.116518 -0.544443 20.870038 8.399750 0.551095
# 3 8.558 42.998732 -100.073837 -0.546483 20.857780 8.412469 0.556458
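If the extra nan columns are a concern, one alternative (a sketch, not part of the answer above) is to repair the text first: insert a space before any minus sign that directly follows a digit, then parse on plain whitespace. The helper name split_fused_minus is hypothetical, and the lookbehind assumes the file holds only plain decimals, or exponents written like 1e-5 where the minus follows a letter and is left alone.

```python
import re

def split_fused_minus(text):
    """Insert a space before any '-' that directly follows a digit."""
    return re.sub(r'(?<=\d)-', ' -', text)

raw = ('8.55500000 42.93079187 -99.98428964\n'
       '8.55600000 42.94500503-100.05470144\n')
fixed = split_fused_minus(raw)

# With the fused values separated, a plain whitespace split is enough.
rows = [[float(v) for v in line.split()] for line in fixed.splitlines()]
print(rows)
```

The repaired text could then be passed to pandas, for example pd.read_csv(StringIO(fixed), sep=r'\s+', header=None), which should give one column per value without a dropna step.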
If all your file data is in that simple format, then this approach can efficiently produce row data that pandas can use to build your dataframes.
>>> import re
>>> import pandas as pd
>>>
>>> float_expr = re.compile(r"-?\d*\.?\d+")
>>>
>>> def gen_file_data(f):
...     for line in f:
...         line_data = float_expr.findall(line)
...         yield (float(v) for v in line_data)
...
>>> df = pd.DataFrame.from_records(gen_file_data(open('filedata.txt', 'r')))
>>>
>>> df
0 1 2 3 4 5 6
0 8.555 42.930792 -99.984290 -0.599179 20.861648 8.373694 0.564315
1 8.556 42.945005 -100.054701 -0.550630 20.863804 8.388657 0.564298
2 8.557 42.995652 -100.116518 -0.544443 20.870038 8.399750 0.551095
3 8.558 42.998732 -100.073837 -0.546483 20.857780 8.412469 0.556458
>>>
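The same idea can be tried without a file on disk. The sketch below feeds two shortened sample lines from the question through the pattern using io.StringIO. The names sample and rows are illustrative only, and each row is built as a list rather than a generator so it can be printed directly.

```python
import io
import re

float_expr = re.compile(r"-?\d*\.?\d+")

def gen_file_data(f):
    # Yield one list of floats per line; findall ignores the delimiters
    # entirely, so fused values like '42.9-100.0' come out as two numbers.
    for line in f:
        yield [float(v) for v in float_expr.findall(line)]

sample = io.StringIO(
    "8.55500000 42.93079187 -99.98428964 -0.59917942\n"
    "8.55600000 42.94500503-100.05470144 -0.55062999\n"
)
rows = list(gen_file_data(sample))
print(rows)
```

Note that this pattern would split a value written as 1e-5 into 1 and -5, so it only suits plain decimals.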
Header row?
>>> def gen_file_data(f):
...     yield next(f).split()  # Header row?
...     for line in f:
...         line_data = float_expr.findall(line)
...         yield (float(v) for v in line_data)
...
>>> g = gen_file_data(open("filedata.txt", 'r'))
>>>
>>> df = pd.DataFrame.from_records(g, columns=next(g))
>>> df
foo bar baz qux quux quuz corge
0 8.555 42.930792 -99.984290 -0.599179 20.861648 8.373694 0.564315
1 8.556 42.945005 -100.054701 -0.550630 20.863804 8.388657 0.564298
2 8.557 42.995652 -100.116518 -0.544443 20.870038 8.399750 0.551095
3 8.558 42.998732 -100.073837 -0.546483 20.857780 8.412469 0.556458
>>>
The generator assumes that the header row is composed of headers that are contiguous characters separated by whitespace. If it's some other pattern, the first line of the generator can be updated to handle it.
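As one illustration of such an update, the sketch below assumes a comma-separated header line. The header_sep parameter and the sample column names are hypothetical, not something from the original data.

```python
import io
import re

float_expr = re.compile(r"-?\d*\.?\d+")

def gen_file_data(f, header_sep=","):
    # First line: split the header on the assumed separator and trim spaces.
    yield [name.strip() for name in next(f).split(header_sep)]
    # Remaining lines: pull out the numbers regardless of delimiter.
    for line in f:
        yield [float(v) for v in float_expr.findall(line)]

sample = io.StringIO(
    "foo, bar, baz\n"
    "8.55600000 42.94500503-100.05470144\n"
)
g = gen_file_data(sample)
columns = next(g)   # header names come out first
rows = list(g)      # then one list of floats per data line
print(columns, rows)
```

As in the answer, columns and the remaining rows could then be handed to pd.DataFrame.from_records.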