How to read specific rows/columns of a .CSV file and store them as a numpy matrix?
I have a .CSV file with contents like this:
DATE OPEN HIGH LOW CLOSE PRICE YCLOSE VOL TICKS
13950309 1000000 1000000 1000000 1000000 1000000 1000000 2100000 74
13950326 1050000 1050010 1050000 1050001 1050000 1000000 1648 5
13950329 1030200 1060000 1030200 1044474 1042265 1050001 28469 108
13950330 1040001 1049999 1040001 1042303 1045001 1044474 6518 10
13950331 1049800 1050000 1048600 1048787 1050000 1042303 277 11
13950401 1059973 1059974 1052000 1053807 1055000 1048787 916 17
13950402 1050000 1054498 1043009 1048173 1043009 1053807 2098 29
13950405 1045678 1049989 1040002 1049961 1049979 1048173 28098 14
For example, I don't need the DATE column or the first row (which contains strings). So I would like to read from row 2 up to row 25, and from column 2 up to the last column, and then store the data as a numpy matrix. How can I do this?
EDIT: I tried this code, as suggested in one of the answers:
import pandas as pd
import numpy as np
data = pd.read_csv("C:/Users/m/Desktop/python/IRB3MAIZ9936-a.csv", sep="\s")
del data['DATE']
np.array(data.values)
But I got this result:
C:\Users\m\Desktop\python\read_csv.py:4: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
data = pd.read_csv("C:/Users/m/Desktop/python/IRB3MAIZ9936-a.csv", sep="\s")
Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 3078, in get_loc
return self._engine.get_loc(key)
File "pandas\_libs\index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas\_libs\hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'DATE'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\m\Desktop\python\read_csv.py", line 6, in <module>
del data['DATE']
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\generic.py", line 2743, in __delitem__
self._data.delete(key)
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\internals.py", line 4174, in delete
indexer = self.items.get_loc(item)
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 3080, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas\_libs\index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas\_libs\hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'DATE'
[Finished in 1.7s with exit code 1]
[shell_cmd: python -u "C:\Users\m\Desktop\python\read_csv.py"]
[dir: C:\Users\m\Desktop\python]
[path: C:\ProgramData\Anaconda3;C:\ProgramData\Anaconda3\Library\mingw-w64\bin;C:\ProgramData\Anaconda3\Library\usr\bin;C:\ProgramData\Anaconda3\Library\bin;C:\ProgramData\Anaconda3\Scripts;C:\Program Files (x86)\Common Files\Oracle\Java\javapath;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Windows\System32\OpenSSH\;C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common;C:\mingw64\bin;D:\cmake-3.11.3-win64-x64\cmake-3.11.3-win64-x64\bin;C:\opencv\build\install\x64\mingw\bin;C:\Program Files\nodejs\;C:\Program Files\MATLAB\R2018b\runtime\win64;C:\Program Files\MATLAB\R2018b\bin;C:\Program Files\Git\cmd;C:\Program Files\Microsoft SQL Server\130\Tools\Binn\;C:\Program Files\dotnet\;C:\Users\m\AppData\Local\Microsoft\WindowsApps;C:\Users\m\AppData\Roaming\npm;C:\Users\m\AppData\Local\Programs\Microsoft VS Code\bin]
This should give you an idea of how to solve your problem.
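import pandas as pd
import numpy as np
# sep=r"\s+" splits on any run of whitespace and is handled by the default C parser
data = pd.read_csv("/Users/DHarun/Desktop/STD_MASTER/F_Bildverarbeitung/aim2/iaai/stack/xyz.csv", sep=r"\s+")
del data['DATE']  # drop the DATE column
np.array(data.values)  # remaining columns as a numpy array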
Output:
array([[1000000, 1000000, 1000000, 1000000, 1000000, 1000000, 2100000,
74],
[1050000, 1050010, 1050000, 1050001, 1050000, 1000000, 1648,
5],
[1030200, 1060000, 1030200, 1044474, 1042265, 1050001, 28469,
108],
[1040001, 1049999, 1040001, 1042303, 1045001, 1044474, 6518,
10],
[1049800, 1050000, 1048600, 1048787, 1050000, 1042303, 277,
11],
[1059973, 1059974, 1052000, 1053807, 1055000, 1048787, 916,
17],
[1050000, 1054498, 1043009, 1048173, 1043009, 1053807, 2098,
29],
[1045678, 1049989, 1040002, 1049961, 1049979, 1048173, 28098,
14],
[1050001, 1053000, 1046700, 1049473, 1046700, 1049961, 5498,
33]])
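If `del data['DATE']` raises `KeyError: 'DATE'`, as in the edit to the question, one possible reason (which the traceback alone cannot confirm) is that the header was not parsed into a column named exactly `DATE`. A minimal sketch that avoids relying on the header name is to select columns by position and to limit the number of rows with `nrows`. The in-memory `sample` below is only a stand-in for the real file, and `to_numpy()` assumes pandas 0.24 or later (`.values` does the same job in older versions).

```python
import io
import pandas as pd

# Stand-in for the real file: the first rows of the sample from the question.
sample = io.StringIO(
    "DATE OPEN HIGH LOW CLOSE PRICE YCLOSE VOL TICKS\n"
    "13950309 1000000 1000000 1000000 1000000 1000000 1000000 2100000 74\n"
    "13950326 1050000 1050010 1050000 1050001 1050000 1000000 1648 5\n"
    "13950329 1030200 1060000 1030200 1044474 1042265 1050001 28469 108\n"
)

# The header is row 1 of the file, so nrows=24 keeps file rows 2..25
# (or fewer, if the file is shorter).
df = pd.read_csv(sample, sep=r"\s+", nrows=24)

# Check which column names the parser actually produced.
print(list(df.columns))

# Select by position: every row, from the second column to the last.
matrix = df.iloc[:, 1:].to_numpy()
print(matrix.shape)
```

With a real file, pass its path to `pd.read_csv` instead of `sample`. If the printed column list shows a single long name, the separator does not match the file.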
Just use the csv module to process the file, skipping the first line and the first column. The code can be as simple as:
import csv
import numpy as np

with open('file.csv') as fd:
    next(fd)  # skip initial line
    rd = csv.reader(fd, delimiter=' ', skipinitialspace=True)
    arr = np.array([[int(i) for i in row[1:]] for row in rd])  # skip initial column
print(repr(arr))
This gives, as expected:
array([[1000000, 1000000, 1000000, 1000000, 1000000, 1000000, 2100000,
74],
[1050000, 1050010, 1050000, 1050001, 1050000, 1000000, 1648,
5],
[1030200, 1060000, 1030200, 1044474, 1042265, 1050001, 28469,
108],
[1040001, 1049999, 1040001, 1042303, 1045001, 1044474, 6518,
10],
[1049800, 1050000, 1048600, 1048787, 1050000, 1042303, 277,
11],
[1059973, 1059974, 1052000, 1053807, 1055000, 1048787, 916,
17],
[1050000, 1054498, 1043009, 1048173, 1043009, 1053807, 2098,
29],
[1045678, 1049989, 1040002, 1049961, 1049979, 1048173, 28098,
14]])
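Neither approach above stops at row 25, which the question also asks for. A numpy-only sketch of that part, assuming the file is whitespace-separated and every field (including DATE) is an integer, uses `np.loadtxt` with `skiprows` and `max_rows` and then slices off the first column. `max_rows` requires NumPy 1.16 or later, and the in-memory `sample` is again just a stand-in for the real file.

```python
import io
import numpy as np

# Stand-in for the real file: the first rows of the sample from the question.
sample = io.StringIO(
    "DATE OPEN HIGH LOW CLOSE PRICE YCLOSE VOL TICKS\n"
    "13950309 1000000 1000000 1000000 1000000 1000000 1000000 2100000 74\n"
    "13950326 1050000 1050010 1050000 1050001 1050000 1000000 1648 5\n"
    "13950329 1030200 1060000 1030200 1044474 1042265 1050001 28469 108\n"
)

# skiprows=1 drops the header line; max_rows=24 then reads at most
# 24 data lines, i.e. file rows 2..25. The default delimiter is whitespace.
data = np.loadtxt(sample, dtype=int, skiprows=1, max_rows=24)

# Drop the first (DATE) column, keeping column 2 up to the last column.
matrix = data[:, 1:]
print(matrix.shape)
```

Slicing after loading avoids having to know the number of columns in advance, which `usecols` would require.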