简体   繁体   English

numpy 将数据转换为 numpy 数组

[英]numpy converting data into numpy array

I have an extract from the database, the data is delimiter by "|"我有一个从数据库中提取的数据,数据由“|”分隔I trying to load this to numpy array to perform some filtering.我试图将其加载到 numpy 数组以执行一些过滤。 For example save into file only lines which contains in 3th column LOGOUT.例如,仅将包含在第 3 列 LOGOUT 中的行保存到文件中。 I started from load example.txt file using:我从加载 example.txt 文件开始使用:

import numpy as np


data = np.genfromtxt('example.txt',
                 skip_header=1,
                 skip_footer=1,
                 names=True,
                 dtype=None,
                 delimiter='|',
                 encoding='utf-8',
                 filling_values=None)

but I get the error:但我得到了错误:

ValueError: Some errors were detected !
Line #3 (got 14 columns instead of 13)
Line #4 (got 14 columns instead of 13)
Line #5 (got 14 columns instead of 13)

data in txt file is: txt文件中的数据为:

|ID|TIMESTAMP|EVENT_DATE|GROUP|EVENT|CHANNEL|WERT|WERTY|WERTY|SESSION_ID|IP|WERT|DATA|
|5818221|2021-03-15T18:18:20+01:00|2021-03-15|LOGIN|SESSION-EXPIRE||qweqwewqewqewqe|qweqewqewqwqeqw|STANDARD|lAkpligg11Ds9nJGFRPdeD|qwewqeq||weqeqewqewe
|5818222|2021-03-15T18:18:20+01:00|2021-03-15|LOGOUT|SESSION-EXPIRE||qweqwewqewqewqe|qweqewqewqwqeqw|STANDARD|lAkpligg11Ds9nJGFRPdeD||qweqe|weqeqewqewe
|5818222|2021-03-15T18:18:20+01:00|2021-03-15|LOGOUT|SESSION||qweqwewqewqewqe|qweqewqewqwqeqw|STANDARD|lAkpligg11Ds9nJGFRPdeD|qweqe||weqeqewqewe
|5818221|2021-03-15T18:18:20+01:00|2021-03-15|LOGOUT|SESSION-EXPIRE||qweqwewqewqewqe|qweqewqewqwqeqw|STANDARD|lAkpligg11Ds9nJGFRPdeD|||weqeqewqewe
|5818221|2021-03-15T18:18:20+01:00|2021-03-15|LOGIN|SESSION-EXPIRE||qweqwewqewqewqe|qweqewqewqwqeqw|STANDARD|lAkpligg11Ds9nJGFRPdeD|||weqeqewqewe
|5818221|2021-03-15T18:18:20+01:00|2021-03-15|LOGIN|SESSION-EXPIRE||qweqwewqewqewqe|qweqewqewqwqeqw|STANDARD|lAkpligg11Ds9nJGFRPdeD|qweqwe|wqewqe|weqeqewqewe

each line contains no more than 13 elements.. what I'm i doing wrong?每行包含不超过 13 个元素.. 我做错了什么?

If your data is in example.txt , you can do:如果您的数据在example.txt中,您可以执行以下操作:

with open('example.txt') as fp:
    lines = fp.read().splitlines()
data = [x.split('|')[1:] for x in lines][1:]

where the indexing is used to discard the header and empty column.其中索引用于丢弃 header 和空列。 You will get a two-dimensional array containing the data in the file.您将获得一个包含文件中数据的二维数组。 If you need it as a Numpy array, do np.array(data) .如果您需要它作为 Numpy 数组,请执行np.array(data)

First, the reason why the problem is shown only on line 3,4,5 is because of the skip_header , skip_footer首先,问题仅在第 3,4,5 行显示的原因是由于skip_header , skip_footer

whithout the skip_footer :没有skip_footer

import numpy as np


data = np.genfromtxt('example.txt',
                 skip_header=1,
                 names=True,
                 dtype=None,
                 delimiter='|',
                 encoding='utf-8',
                 filling_values=None)

Error:错误:

    Line #3 (got 14 columns instead of 13)
    Line #4 (got 14 columns instead of 13)
    Line #5 (got 14 columns instead of 13)
    Line #6 (got 14 columns instead of 13)
    Line #7 (got 14 columns instead of 13)

So first, the skip_header value should be 0. Result:所以首先, skip_header的值应该是 0。结果:

data = np.genfromtxt('example.txt',
                 names=True,
                 dtype=None,
                 delimiter='|',
                 encoding='utf-8',
                 filling_values=None)

Result:结果:

array([(False, 5818221, '2021-03-15T18:18:20+01:00', '2021-03-15', 'LOGIN', 'SESSION-EXPIRE', False, 'qweqwewqewqewqe', 'qweqewqewqwqeqw', 'STANDARD', 'lAkpligg11Ds9nJGFRPdeD', 'qwewqeq', '', 'weqeqewqewe'),
       (False, 5818222, '2021-03-15T18:18:20+01:00', '2021-03-15', 'LOGOUT', 'SESSION-EXPIRE', False, 'qweqwewqewqewqe', 'qweqewqewqwqeqw', 'STANDARD', 'lAkpligg11Ds9nJGFRPdeD', '', 'qweqe', 'weqeqewqewe'),
       (False, 5818222, '2021-03-15T18:18:20+01:00', '2021-03-15', 'LOGOUT', 'SESSION', False, 'qweqwewqewqewqe', 'qweqewqewqwqeqw', 'STANDARD', 'lAkpligg11Ds9nJGFRPdeD', 'qweqe', '', 'weqeqewqewe'),
       (False, 5818221, '2021-03-15T18:18:20+01:00', '2021-03-15', 'LOGOUT', 'SESSION-EXPIRE', False, 'qweqwewqewqewqe', 'qweqewqewqwqeqw', 'STANDARD', 'lAkpligg11Ds9nJGFRPdeD', '', '', 'weqeqewqewe'),
       (False, 5818221, '2021-03-15T18:18:20+01:00', '2021-03-15', 'LOGIN', 'SESSION-EXPIRE', False, 'qweqwewqewqewqe', 'qweqewqewqwqeqw', 'STANDARD', 'lAkpligg11Ds9nJGFRPdeD', '', '', 'weqeqewqewe'),
       (False, 5818221, '2021-03-15T18:18:20+01:00', '2021-03-15', 'LOGIN', 'SESSION-EXPIRE', False, 'qweqwewqewqewqe', 'qweqewqewqwqeqw', 'STANDARD', 'lAkpligg11Ds9nJGFRPdeD', 'qweqwe', 'wqewqe', 'weqeqewqewe')],
      dtype=[('ID', '?'), ('TIMESTAMP', '<i4'), ('EVENT_DATE', '<U25'), ('GROUP', '<U10'), ('EVENT', '<U6'), ('CHANNEL', '<U14'), ('WERT', '?'), ('WERTY', '<U15'), ('WERTY_1', '<U15'), ('SESSION_ID', '<U8'), ('IP', '<U22'), ('WERT_1', '<U7'), ('DATA', '<U6'), ('f0', '<U11')])

The reason why first column values are False and dtype are wrong is because the first line of the txt file contain more delimiter than other lines第一列值为Falsedtype错误的原因是因为 txt 文件的第一行包含的分隔符比其他行多

>>>line0= "|ID|TIMESTAMP|EVENT_DATE|GROUP|EVENT|CHANNEL|WERT|WERTY|WERTY|SESSION_ID|IP|WERT|DATA|"
>>>line1 = 
"|5818221|2021-03-15T18:18:20+01:00|2021-03-15|LOGIN|SESSION-EXPIRE||qweqwewqewqewqe|qweqewqewqwqeqw|STANDARD|lAkpligg11Ds9nJGFRPdeD|qwewqeq||weqeqewqewe"
>>>delimiter  = '|'
>>>line0.count(delimiter)
14
>>>line1.count(delimiter)
13

Solution: For 1 delimiter we have 2 informations, here we have 13 informations so we need only 12 delimiters, finally: txt file:解决方案:对于 1 个分隔符,我们有 2 个信息,这里我们有 13 个信息,所以我们只需要 12 个分隔符,最后:txt 文件:

ID|TIMESTAMP|EVENT_DATE|GROUP|EVENT|CHANNEL|WERT|WERTY|WERTY|SESSION_ID|IP|WERT|DATA
5818221|2021-03-15T18:18:20+01:00|2021-03-15|LOGIN|SESSION-EXPIRE||qweqwewqewqewqe|qweqewqewqwqeqw|STANDARD|lAkpligg11Ds9nJGFRPdeD|qwewqeq||weqeqewqewe
5818222|2021-03-15T18:18:20+01:00|2021-03-15|LOGOUT|SESSION-EXPIRE||qweqwewqewqewqe|qweqewqewqwqeqw|STANDARD|lAkpligg11Ds9nJGFRPdeD||qweqe|weqeqewqewe
5818222|2021-03-15T18:18:20+01:00|2021-03-15|LOGOUT|SESSION||qweqwewqewqewqe|qweqewqewqwqeqw|STANDARD|lAkpligg11Ds9nJGFRPdeD|qweqe||weqeqewqewe
5818221|2021-03-15T18:18:20+01:00|2021-03-15|LOGOUT|SESSION-EXPIRE||qweqwewqewqewqe|qweqewqewqwqeqw|STANDARD|lAkpligg11Ds9nJGFRPdeD|||weqeqewqewe
5818221|2021-03-15T18:18:20+01:00|2021-03-15|LOGIN|SESSION-EXPIRE||qweqwewqewqewqe|qweqewqewqwqeqw|STANDARD|lAkpligg11Ds9nJGFRPdeD|||weqeqewqewe
5818221|2021-03-15T18:18:20+01:00|2021-03-15|LOGIN|SESSION-EXPIRE||qweqwewqewqewqe|qweqewqewqwqeqw|STANDARD|lAkpligg11Ds9nJGFRPdeD|qweqwe|wqewqe|weqeqewqewe

code:代码:

data = np.genfromtxt('d2.txt',names=True,dtype=None,delimiter='|',encoding='utf-8',filling_values=None,skip_header=0)

Result:结果:

array([(5818221, '2021-03-15T18:18:20+01:00', '2021-03-15', 'LOGIN', 'SESSION-EXPIRE', False, 'qweqwewqewqewqe', 'qweqewqewqwqeqw', 'STANDARD', 'lAkpligg11Ds9nJGFRPdeD', 'qwewqeq', '', 'weqeqewqewe'),
       (5818222, '2021-03-15T18:18:20+01:00', '2021-03-15', 'LOGOUT', 'SESSION-EXPIRE', False, 'qweqwewqewqewqe', 'qweqewqewqwqeqw', 'STANDARD', 'lAkpligg11Ds9nJGFRPdeD', '', 'qweqe', 'weqeqewqewe'),
       (5818222, '2021-03-15T18:18:20+01:00', '2021-03-15', 'LOGOUT', 'SESSION', False, 'qweqwewqewqewqe', 'qweqewqewqwqeqw', 'STANDARD', 'lAkpligg11Ds9nJGFRPdeD', 'qweqe', '', 'weqeqewqewe'),
       (5818221, '2021-03-15T18:18:20+01:00', '2021-03-15', 'LOGOUT', 'SESSION-EXPIRE', False, 'qweqwewqewqewqe', 'qweqewqewqwqeqw', 'STANDARD', 'lAkpligg11Ds9nJGFRPdeD', '', '', 'weqeqewqewe'),
       (5818221, '2021-03-15T18:18:20+01:00', '2021-03-15', 'LOGIN', 'SESSION-EXPIRE', False, 'qweqwewqewqewqe', 'qweqewqewqwqeqw', 'STANDARD', 'lAkpligg11Ds9nJGFRPdeD', '', '', 'weqeqewqewe'),
       (5818221, '2021-03-15T18:18:20+01:00', '2021-03-15', 'LOGIN', 'SESSION-EXPIRE', False, 'qweqwewqewqewqe', 'qweqewqewqwqeqw', 'STANDARD', 'lAkpligg11Ds9nJGFRPdeD', 'qweqwe', 'wqewqe', 'weqeqewqewe')],
      dtype=[('ID', '<i4'), ('TIMESTAMP', '<U25'), ('EVENT_DATE', '<U10'), ('GROUP', '<U6'), ('EVENT', '<U14'), ('CHANNEL', '?'), ('WERT', '<U15'), ('WERTY', '<U15'), ('WERTY_1', '<U8'), ('SESSION_ID', '<U22'), ('IP', '<U7'), ('WERT_1', '<U6'), ('DATA', '<U11')])

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM