numpy: converting data into a numpy array
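I have data extracted from a database, with fields separated by "|", and I am trying to load it into a numpy array to do some filtering, for example saving to a file only the rows that contain LOGOUT in column 3. I started by loading the example.txt file with: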
import numpy as np
data = np.genfromtxt('example.txt',
                     skip_header=1,
                     skip_footer=1,
                     names=True,
                     dtype=None,
                     delimiter='|',
                     encoding='utf-8',
                     filling_values=None)
But I get the error:
ValueError: Some errors were detected !
Line #3 (got 14 columns instead of 13)
Line #4 (got 14 columns instead of 13)
Line #5 (got 14 columns instead of 13)
The data in the txt file is:
|ID|TIMESTAMP|EVENT_DATE|GROUP|EVENT|CHANNEL|WERT|WERTY|WERTY|SESSION_ID|IP|WERT|DATA|
|5818221|2021-03-15T18:18:20+01:00|2021-03-15|LOGIN|SESSION-EXPIRE||qweqwewqewqewqe|qweqewqewqwqeqw|STANDARD|lAkpligg11Ds9nJGFRPdeD|qwewqeq||weqeqewqewe
|5818222|2021-03-15T18:18:20+01:00|2021-03-15|LOGOUT|SESSION-EXPIRE||qweqwewqewqewqe|qweqewqewqwqeqw|STANDARD|lAkpligg11Ds9nJGFRPdeD||qweqe|weqeqewqewe
|5818222|2021-03-15T18:18:20+01:00|2021-03-15|LOGOUT|SESSION||qweqwewqewqewqe|qweqewqewqwqeqw|STANDARD|lAkpligg11Ds9nJGFRPdeD|qweqe||weqeqewqewe
|5818221|2021-03-15T18:18:20+01:00|2021-03-15|LOGOUT|SESSION-EXPIRE||qweqwewqewqewqe|qweqewqewqwqeqw|STANDARD|lAkpligg11Ds9nJGFRPdeD|||weqeqewqewe
|5818221|2021-03-15T18:18:20+01:00|2021-03-15|LOGIN|SESSION-EXPIRE||qweqwewqewqewqe|qweqewqewqwqeqw|STANDARD|lAkpligg11Ds9nJGFRPdeD|||weqeqewqewe
|5818221|2021-03-15T18:18:20+01:00|2021-03-15|LOGIN|SESSION-EXPIRE||qweqwewqewqewqe|qweqewqewqwqeqw|STANDARD|lAkpligg11Ds9nJGFRPdeD|qweqwe|wqewqe|weqeqewqewe
No line contains more than 13 elements. What am I doing wrong?
If your data is in example.txt, you can do the following:
with open('example.txt') as fp:
    lines = fp.read().splitlines()
# split each line on '|', drop the empty first field, then drop the header row
data = [x.split('|')[1:] for x in lines][1:]
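The indexing discards the header row and the empty first column. You get a two-dimensional list of lists holding the data from the file. If you need it as a Numpy array, call np.array(data).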
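To finish the task from the question with this approach, a boolean mask can select the LOGOUT rows before saving them. The following is a minimal sketch, assuming the GROUP value sits at index 3 of each split row (as in the sample data). The shortened sample lines and the output file name logout_only.txt are made up for illustration.

```python
import numpy as np

# Hypothetical, shortened lines in the same layout as example.txt.
lines = [
    "|ID|TIMESTAMP|EVENT_DATE|GROUP|EVENT|",
    "|1|2021-03-15T18:18:20+01:00|2021-03-15|LOGIN|SESSION-EXPIRE",
    "|2|2021-03-15T18:18:20+01:00|2021-03-15|LOGOUT|SESSION-EXPIRE",
    "|3|2021-03-15T18:18:20+01:00|2021-03-15|LOGOUT|SESSION",
]

# Drop the header row and the empty field produced by the leading '|'.
rows = [line.split('|')[1:] for line in lines][1:]
data = np.array(rows)  # 2D array of strings

# Boolean mask: True where the GROUP field (index 3) equals 'LOGOUT'.
mask = data[:, 3] == 'LOGOUT'
logout_rows = data[mask]

# Write the matching rows back out, '|'-separated.
np.savetxt('logout_only.txt', logout_rows, fmt='%s', delimiter='|')
```

Every value stays a string here, so numeric columns would need an explicit conversion (for example data[:, 0].astype(int)) before any arithmetic.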
First of all, the problem is reported only for lines 3, 4 and 5 because of skip_header and skip_footer. Without skip_footer:
import numpy as np
data = np.genfromtxt('example.txt',
                     skip_header=1,
                     names=True,
                     dtype=None,
                     delimiter='|',
                     encoding='utf-8',
                     filling_values=None)
Error:
Line #3 (got 14 columns instead of 13)
Line #4 (got 14 columns instead of 13)
Line #5 (got 14 columns instead of 13)
Line #6 (got 14 columns instead of 13)
Line #7 (got 14 columns instead of 13)
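The interaction between skip_header and names=True can be seen on a tiny in-memory sample. genfromtxt reads the field names from the first line after the skipped ones, so skip_header=1 throws away the real header and turns the first data row into names. The two-column sample below is hypothetical and already free of leading delimiters.

```python
import numpy as np

# Hypothetical sample; genfromtxt also accepts a list of strings instead of a file name.
lines = [
    "USER|GROUP",
    "alice|LOGIN",
    "bob|LOGOUT",
    "carol|LOGOUT",
]

# skip_header=1 discards "USER|GROUP"; names=True then reads the names
# from the next line, which is the first data row.
wrong = np.genfromtxt(lines, skip_header=1, names=True,
                      dtype=None, delimiter='|', encoding='utf-8')
print(wrong.dtype.names)  # names taken from the "alice|LOGIN" row
print(len(wrong))         # one data row has been consumed as names

# With the default skip_header=0 the real header supplies the names.
right = np.genfromtxt(lines, names=True,
                      dtype=None, delimiter='|', encoding='utf-8')
print(right.dtype.names)
print(len(right))
```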
So, first of all, skip_header should be 0 (the default, so it can simply be left out). The call becomes:
data = np.genfromtxt('example.txt',
                     names=True,
                     dtype=None,
                     delimiter='|',
                     encoding='utf-8',
                     filling_values=None)
Result:
array([(False, 5818221, '2021-03-15T18:18:20+01:00', '2021-03-15', 'LOGIN', 'SESSION-EXPIRE', False, 'qweqwewqewqewqe', 'qweqewqewqwqeqw', 'STANDARD', 'lAkpligg11Ds9nJGFRPdeD', 'qwewqeq', '', 'weqeqewqewe'),
(False, 5818222, '2021-03-15T18:18:20+01:00', '2021-03-15', 'LOGOUT', 'SESSION-EXPIRE', False, 'qweqwewqewqewqe', 'qweqewqewqwqeqw', 'STANDARD', 'lAkpligg11Ds9nJGFRPdeD', '', 'qweqe', 'weqeqewqewe'),
(False, 5818222, '2021-03-15T18:18:20+01:00', '2021-03-15', 'LOGOUT', 'SESSION', False, 'qweqwewqewqewqe', 'qweqewqewqwqeqw', 'STANDARD', 'lAkpligg11Ds9nJGFRPdeD', 'qweqe', '', 'weqeqewqewe'),
(False, 5818221, '2021-03-15T18:18:20+01:00', '2021-03-15', 'LOGOUT', 'SESSION-EXPIRE', False, 'qweqwewqewqewqe', 'qweqewqewqwqeqw', 'STANDARD', 'lAkpligg11Ds9nJGFRPdeD', '', '', 'weqeqewqewe'),
(False, 5818221, '2021-03-15T18:18:20+01:00', '2021-03-15', 'LOGIN', 'SESSION-EXPIRE', False, 'qweqwewqewqewqe', 'qweqewqewqwqeqw', 'STANDARD', 'lAkpligg11Ds9nJGFRPdeD', '', '', 'weqeqewqewe'),
(False, 5818221, '2021-03-15T18:18:20+01:00', '2021-03-15', 'LOGIN', 'SESSION-EXPIRE', False, 'qweqwewqewqewqe', 'qweqewqewqwqeqw', 'STANDARD', 'lAkpligg11Ds9nJGFRPdeD', 'qweqwe', 'wqewqe', 'weqeqewqewe')],
dtype=[('ID', '?'), ('TIMESTAMP', '<i4'), ('EVENT_DATE', '<U25'), ('GROUP', '<U10'), ('EVENT', '<U6'), ('CHANNEL', '<U14'), ('WERT', '?'), ('WERTY', '<U15'), ('WERTY_1', '<U15'), ('SESSION_ID', '<U8'), ('IP', '<U22'), ('WERT_1', '<U7'), ('DATA', '<U6'), ('f0', '<U11')])
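The first column is False and the dtypes are wrong because the first line of the txt file contains more delimiters than the other lines, so the field names end up shifted one position relative to the values (ID holds the empty leading field, TIMESTAMP holds the numeric ID, and the last column is auto-named f0):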
>>> line0 = "|ID|TIMESTAMP|EVENT_DATE|GROUP|EVENT|CHANNEL|WERT|WERTY|WERTY|SESSION_ID|IP|WERT|DATA|"
>>> line1 = "|5818221|2021-03-15T18:18:20+01:00|2021-03-15|LOGIN|SESSION-EXPIRE||qweqwewqewqewqe|qweqewqewqwqeqw|STANDARD|lAkpligg11Ds9nJGFRPdeD|qwewqeq||weqeqewqewe"
>>> delimiter = '|'
>>> line0.count(delimiter)
14
>>> line1.count(delimiter)
13
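The same check can be run over every line at once. This is a small sketch with a hypothetical helper, field_counts, applied to shortened made-up lines in the same layout as the file; the number of fields is the delimiter count plus one.

```python
def field_counts(lines, delimiter='|'):
    """Return the number of fields on each line (delimiter count + 1)."""
    return [line.rstrip('\r\n').count(delimiter) + 1 for line in lines]

# Hypothetical, shortened lines in the same layout as the file.
lines = [
    "|ID|GROUP|EVENT|",         # header: leading and trailing '|'
    "|1|LOGIN|SESSION-EXPIRE",  # data: leading '|' only
    "|2|LOGOUT|SESSION",
]

counts = field_counts(lines)
print(counts)

# Report every line whose field count differs from the header's.
for number, count in enumerate(counts, start=1):
    if count != counts[0]:
        print(f"line {number}: {count} fields, header has {counts[0]}")
```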
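Solution: one delimiter separates two fields, so 13 fields need only 12 delimiters. Remove the leading "|" from every line and the trailing "|" from the header, so that the txt file becomes: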
ID|TIMESTAMP|EVENT_DATE|GROUP|EVENT|CHANNEL|WERT|WERTY|WERTY|SESSION_ID|IP|WERT|DATA
5818221|2021-03-15T18:18:20+01:00|2021-03-15|LOGIN|SESSION-EXPIRE||qweqwewqewqewqe|qweqewqewqwqeqw|STANDARD|lAkpligg11Ds9nJGFRPdeD|qwewqeq||weqeqewqewe
5818222|2021-03-15T18:18:20+01:00|2021-03-15|LOGOUT|SESSION-EXPIRE||qweqwewqewqewqe|qweqewqewqwqeqw|STANDARD|lAkpligg11Ds9nJGFRPdeD||qweqe|weqeqewqewe
5818222|2021-03-15T18:18:20+01:00|2021-03-15|LOGOUT|SESSION||qweqwewqewqewqe|qweqewqewqwqeqw|STANDARD|lAkpligg11Ds9nJGFRPdeD|qweqe||weqeqewqewe
5818221|2021-03-15T18:18:20+01:00|2021-03-15|LOGOUT|SESSION-EXPIRE||qweqwewqewqewqe|qweqewqewqwqeqw|STANDARD|lAkpligg11Ds9nJGFRPdeD|||weqeqewqewe
5818221|2021-03-15T18:18:20+01:00|2021-03-15|LOGIN|SESSION-EXPIRE||qweqwewqewqewqe|qweqewqewqwqeqw|STANDARD|lAkpligg11Ds9nJGFRPdeD|||weqeqewqewe
5818221|2021-03-15T18:18:20+01:00|2021-03-15|LOGIN|SESSION-EXPIRE||qweqwewqewqewqe|qweqewqewqwqeqw|STANDARD|lAkpligg11Ds9nJGFRPdeD|qweqwe|wqewqe|weqeqewqewe
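The file does not have to be edited by hand. As a rough sketch, the cleanup could be done in code with a hypothetical helper, clean_lines, which assumes that every line starts with the delimiter and that only the header ends with one. The sample lines are shortened and made up.

```python
def clean_lines(lines, delimiter='|'):
    """Drop the leading delimiter from every line and the trailing one from the header."""
    cleaned = []
    for index, line in enumerate(lines):
        line = line.rstrip('\r\n')
        if line.startswith(delimiter):
            line = line[len(delimiter):]
        # Only the header ends with a delimiter; on a data row a trailing
        # delimiter marks an empty last field, so it is left alone.
        if index == 0 and line.endswith(delimiter):
            line = line[:-len(delimiter)]
        cleaned.append(line)
    return cleaned

# Hypothetical, shortened lines in the original layout.
raw = [
    "|ID|GROUP|EVENT|",
    "|1|LOGIN|SESSION-EXPIRE",
    "|2|LOGOUT|",  # empty last field: the trailing '|' must be kept
]

cleaned = clean_lines(raw)
for line in cleaned:
    print(line)
```

To produce a cleaned file, the lines of example.txt could be read with splitlines(), passed through clean_lines, and written back out joined by newlines.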
Code (the cleaned file is read here as d2.txt):
data = np.genfromtxt('d2.txt',names=True,dtype=None,delimiter='|',encoding='utf-8',filling_values=None,skip_header=0)
Result:
array([(5818221, '2021-03-15T18:18:20+01:00', '2021-03-15', 'LOGIN', 'SESSION-EXPIRE', False, 'qweqwewqewqewqe', 'qweqewqewqwqeqw', 'STANDARD', 'lAkpligg11Ds9nJGFRPdeD', 'qwewqeq', '', 'weqeqewqewe'),
(5818222, '2021-03-15T18:18:20+01:00', '2021-03-15', 'LOGOUT', 'SESSION-EXPIRE', False, 'qweqwewqewqewqe', 'qweqewqewqwqeqw', 'STANDARD', 'lAkpligg11Ds9nJGFRPdeD', '', 'qweqe', 'weqeqewqewe'),
(5818222, '2021-03-15T18:18:20+01:00', '2021-03-15', 'LOGOUT', 'SESSION', False, 'qweqwewqewqewqe', 'qweqewqewqwqeqw', 'STANDARD', 'lAkpligg11Ds9nJGFRPdeD', 'qweqe', '', 'weqeqewqewe'),
(5818221, '2021-03-15T18:18:20+01:00', '2021-03-15', 'LOGOUT', 'SESSION-EXPIRE', False, 'qweqwewqewqewqe', 'qweqewqewqwqeqw', 'STANDARD', 'lAkpligg11Ds9nJGFRPdeD', '', '', 'weqeqewqewe'),
(5818221, '2021-03-15T18:18:20+01:00', '2021-03-15', 'LOGIN', 'SESSION-EXPIRE', False, 'qweqwewqewqewqe', 'qweqewqewqwqeqw', 'STANDARD', 'lAkpligg11Ds9nJGFRPdeD', '', '', 'weqeqewqewe'),
(5818221, '2021-03-15T18:18:20+01:00', '2021-03-15', 'LOGIN', 'SESSION-EXPIRE', False, 'qweqwewqewqewqe', 'qweqewqewqwqeqw', 'STANDARD', 'lAkpligg11Ds9nJGFRPdeD', 'qweqwe', 'wqewqe', 'weqeqewqewe')],
dtype=[('ID', '<i4'), ('TIMESTAMP', '<U25'), ('EVENT_DATE', '<U10'), ('GROUP', '<U6'), ('EVENT', '<U14'), ('CHANNEL', '?'), ('WERT', '<U15'), ('WERTY', '<U15'), ('WERTY_1', '<U8'), ('SESSION_ID', '<U22'), ('IP', '<U7'), ('WERT_1', '<U6'), ('DATA', '<U11')])
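With the field names in place, the filtering asked about in the question can be done by name. The following is a minimal sketch on a shortened, made-up sample in the cleaned layout; the output file name logout_only.txt is an assumption, and the rows are written with plain Python rather than a NumPy helper to keep the formatting explicit.

```python
import numpy as np

# Hypothetical, shortened sample in the cleaned layout (no leading or trailing '|').
lines = [
    "ID|TIMESTAMP|GROUP|EVENT",
    "5818221|2021-03-15T18:18:20+01:00|LOGIN|SESSION-EXPIRE",
    "5818222|2021-03-15T18:18:20+01:00|LOGOUT|SESSION-EXPIRE",
    "5818222|2021-03-15T18:18:20+01:00|LOGOUT|SESSION",
]

data = np.genfromtxt(lines, names=True, dtype=None,
                     delimiter='|', encoding='utf-8')

# Fields of a structured array are addressed by name; the comparison
# yields a boolean mask with one entry per row.
logout = data[data['GROUP'] == 'LOGOUT']

# Write the header and the selected rows back out, '|'-separated.
with open('logout_only.txt', 'w', encoding='utf-8') as out:
    out.write('|'.join(data.dtype.names) + '\n')
    for row in logout:
        out.write('|'.join(str(value) for value in row.item()) + '\n')
```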