How do I read binary files in Python using NumPy?

Question

I know how to read binary files in Python using NumPy's np.fromfile() function. The issue I'm facing is that when I do so, the resulting array contains exceedingly large numbers, on the order of 10^100, along with scattered nan and inf values.

I need to apply machine learning algorithms to this dataset, and I cannot work with the data in this form. I cannot normalise the dataset because of the nan values.

I've tried np.nan_to_num(), but that doesn't seem to work. After doing so, my min and max values are 3e-38 and 3e+38 respectively, so I still could not normalize it.

Is there any way to scale this data down? If not, how should I deal with this?

Thank you.

EDIT:

Some context. I'm working on a malware classification problem. My dataset consists of live malware binaries. They are files of the type .exe, .apk, etc. My idea is to store these binaries as a NumPy array, convert them to a grayscale image, and then perform pattern analysis on it.

Answer 1

If you want to make an image out of a binary file, you need to read it in as integers, not floats. Currently, the most common format for images is unsigned 8-bit integers.
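To illustrate why the dtype matters, here is a minimal sketch that writes the same bytes to a temporary file and reads them back twice. The file name sample.bin and the payload are made up for the demonstration, and the little-endian float32 dtype ('<f4') is only an assumption about how the file might have been read in the question, which does not say.

```python
import os
import tempfile

import numpy as np

# Hypothetical sample data: every byte value 0..255, repeated four times.
payload = bytes(range(256)) * 4

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.bin")  # made-up file name
    with open(path, "wb") as f:
        f.write(payload)

    # One array element per byte: values are always within 0..255.
    as_bytes = np.fromfile(path, dtype=np.uint8)
    # Every 4 bytes reinterpreted as one little-endian float32.
    as_floats = np.fromfile(path, dtype="<f4")

print(as_bytes.min(), as_bytes.max())
print(np.isnan(as_floats).any())

# Largest finite magnitude produced by the float interpretation.
finite = as_floats[np.isfinite(as_floats)]
print(np.abs(finite).max())
```

Arbitrary bytes reinterpreted as floats can land on any bit pattern, including the ones reserved for nan, which is consistent with what the question describes.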

As an example, let's make an image out of the first 10,000 bytes of /bin/bash:

>>> import numpy as np
>>> import cv2
>>> xbash = np.fromfile('/bin/bash', dtype='uint8')
>>> xbash.shape
(1086744,)
>>> cv2.imwrite('bash1.png', xbash[:10000].reshape(100,100))

In the above, we used the OpenCV library to write the integers to a PNG file. Any of several other imaging libraries could have been used.
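The example above slices exactly 10,000 bytes so that the reshape to 100x100 works. For files of arbitrary length, one possible approach (a sketch, not part of the original answer) is to zero-pad the last row. The helper name bytes_to_image and the default width of 256 are hypothetical choices.

```python
import numpy as np


def bytes_to_image(data: bytes, width: int = 256) -> np.ndarray:
    """Lay raw bytes out as a 2-D uint8 array, zero-padding the last row.

    Hypothetical helper: the name and default width are illustrative only.
    """
    arr = np.frombuffer(data, dtype=np.uint8)
    height = -(-arr.size // width)  # ceiling division
    padded = np.zeros(height * width, dtype=np.uint8)
    padded[: arr.size] = arr
    return padded.reshape(height, width)


# Ten bytes at width 4 need three rows; the last row is padded with zeros.
img = bytes_to_image(bytes(range(10)), width=4)
print(img.shape)
print(img)
```

The resulting array can be passed to an image writer such as cv2.imwrite in the same way as the reshaped slice above.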

This is what the first 10,000 bytes of bash "look" like:

Answer 2

EDIT 2

Refer to this answer: https://stackoverflow.com/a/11548224/6633975
It states: NaN can't be stored in an integer array. This is a known limitation of pandas at the moment; I have been waiting for progress to be made with NA values in NumPy (similar to NAs in R), but it will be at least 6 months to a year before NumPy gets these features, it seems:
source: http://pandas.pydata.org/pandas-docs/stable/gotchas.html#support-for-integer-na

Numpy integer nan
Accepted answer states: NaN can't be stored in an integer array. A nan is a special value for float arrays only. There are talks about introducing a special bit that would allow non-float arrays to store what in practice would correspond to a nan, but so far (2012/10), it's only talks. In the meantime, you may want to consider the numpy.ma package: instead of picking an invalid integer like -99999, you could use the special numpy.ma.masked value to represent an invalid value.

>>> import numpy as np
>>> a = np.ma.array([1,2,3,4,5], dtype=int)
>>> a[1] = np.ma.masked   # mark the second element as invalid
>>> a
masked_array(data = [1 -- 3 4 5],
             mask = [False  True False False False],
       fill_value = 999999)
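The same package can also help with the normalization problem from the question when the data really is floating point. The following is a rough sketch with made-up sample values: np.ma.masked_invalid masks nan and inf entries, so min and max are computed over the valid entries only. The fill value of 0.0 at the end is an arbitrary choice for illustration.

```python
import numpy as np

# Made-up sample values containing nan and inf.
values = np.array([1.0, np.nan, 3.0, np.inf, 5.0])

# Mask every nan/inf entry so reductions ignore them.
masked = np.ma.masked_invalid(values)

# Min-max scaling computed over the valid entries only.
lo, hi = masked.min(), masked.max()
scaled = (masked - lo) / (hi - lo)
print(scaled)

# Replace the masked entries with a chosen fill value (0.0 is arbitrary).
print(scaled.filled(0.0))
```

Note that this only hides invalid values. If the nan and inf entries come from reading raw bytes as floats, reading the file with an integer dtype avoids them in the first place.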

EDIT 1

To read a binary file:

Read the binary file content like this:
```
with open(fileName, mode='rb') as file:  # b is important -> binary
    fileContent = file.read()
```
After that you can "unpack" the binary data using struct.unpack.
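As a small illustration of struct.unpack, the sketch below packs and then unpacks an 8-byte record. The record layout (two little-endian unsigned 16-bit integers followed by a 32-bit float) is entirely hypothetical. A real file format would dictate its own format string.

```python
import struct

# Hypothetical record layout: uint16, uint16, float32, all little-endian.
RECORD_FORMAT = "<HHf"

# Stand-in for bytes obtained from file.read().
raw = struct.pack(RECORD_FORMAT, 1, 2, 0.5)

record_id, flags, value = struct.unpack(RECORD_FORMAT, raw)
print(record_id, flags, value)
print(struct.calcsize(RECORD_FORMAT))  # size of one record in bytes
```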
If you are using the np.fromfile() function:
numpy.fromfile can read data from both text and binary files. You would first construct a data type that represents your file format using numpy.dtype, and then read this type from the file using numpy.fromfile.
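A minimal sketch of that workflow, assuming a made-up fixed-size record format: the field names id, flags and value and the file name records.bin are hypothetical, and the file is written by the example itself so that it is self-contained.

```python
import os
import tempfile

import numpy as np

# Hypothetical record layout: two little-endian uint16 fields and a float32.
record = np.dtype([("id", "<u2"), ("flags", "<u2"), ("value", "<f4")])

# Made-up records, written out so there is a file to read back.
records = np.array([(1, 0, 0.5), (2, 3, 1.5)], dtype=record)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "records.bin")  # made-up file name
    records.tofile(path)

    # Read the file back as an array of structured records.
    loaded = np.fromfile(path, dtype=record)

print(loaded["id"])
print(loaded["value"])
```

Each field of the structured array can then be accessed by name, as in loaded["id"].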

2 answers

Answer 1
12 accepted 2016-09-29 06:15:20

Answer 2
0 2016-09-29 05:43:18