How to read only certain lines from a large file in Python?

Question

I have a large data file with 7000 lines (not very large, though!) which looks like this:

    # data can be obtained from pastebin
    # filename = input.csv
    # lots of comments
    #           wave           flux            err
            0.807172    7.61973e-11    1.18177e-13
            0.807375    7.58666e-11    1.18288e-13
            0.807577    7.62136e-11    1.18504e-13
             0.80778    7.64491e-11    1.19389e-13
            0.807982    7.62858e-11    1.18685e-13
            0.808185    7.63852e-11    1.19324e-13
            0.808387    7.60547e-11    1.18952e-13
             0.80859    7.52287e-11    1.18016e-13
            0.808792    7.53114e-11    1.18979e-13
            0.808995    7.58247e-11    1.20198e-13
    # lots of other lines

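Link to the input data: http://pastebin.com/KCW9phzX

I want to extract the rows whose wavelength lies between 0.807375 and 0.807982 (both inclusive),
so that the output looks like this: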

#filename = output.csv
0.807375    7.58666e-11    1.18288e-13
0.807577    7.62136e-11    1.18504e-13
0.80778    7.64491e-11    1.19389e-13
0.807982    7.62858e-11    1.18685e-13

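Similar links are as follows:

https://stackoverflow.com/questions/8956832/python-out-of-memory-on-large-csv-file-numpy/8964779
Efficient way to extract a few lines of data from a large csv data file in Python
What is the most efficient way to match list items to lines in a large file in Python?
Extract specific lines from a file and create sections of data in Python
How to extract elements from a list in Python?
How to use numpy.genfromtxt when the first column is a string and the remaining columns are numbers?
genfromtxt and numpy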

Answer 1

You can call np.genfromtxt(f, max_rows=chunksize) repeatedly to read the file in chunks. This way you keep the convenience and speed of NumPy arrays while controlling the amount of memory required by adjusting chunksize.

import numpy as np
import warnings
# genfromtxt warns if it encounters an empty file. Silence this warning, since
# the code below handles that case.
warnings.filterwarnings("ignore", message='genfromtxt', category=UserWarning)

# This reads 2 lines at a time
chunksize = 2
with open('data', 'rb') as fin, open('out.csv', 'w+b') as fout:
    while True:
        arr = np.genfromtxt(fin, max_rows=chunksize, usecols=(0,1,2), 
                            delimiter='', dtype=float)
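        # Stop at end of file, when genfromtxt returns no rows. Checking the size
        # avoids stopping early on a chunk that happens to contain only zeros.
        if arr.size == 0: break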
        arr = np.atleast_2d(arr)
        mask = (arr[:, 0] >= 0.807375) & (arr[:, 0] <= 0.807982)
        arr = arr[mask]

        # uncomment this print statement to confirm the file is being read in chunks
        # print('{}\n{}'.format(arr, '-'*80))
        np.savetxt(fout, arr, fmt='%g')

This writes to out.csv:

0.807375 7.58666e-11 1.18288e-13
0.807577 7.62136e-11 1.18504e-13
0.80778 7.64491e-11 1.19389e-13
0.807982 7.62858e-11 1.18685e-13

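For a large data file you would of course want to increase chunksize to some integer much larger than 2. In general, you get the best performance by choosing chunksize as large as possible while the array operations still fit in RAM.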
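As a rough way to turn "fits in RAM" into a number, here is a minimal sketch that estimates chunksize from a memory budget. The helper name rows_per_chunk, the 100 MiB budget and the safety_factor are assumptions made for illustration and are not part of the answer. The estimate covers only the resulting float array plus some headroom for the mask and the filtered copy; the parsing overhead inside genfromtxt is not modelled, so treat the result as an upper bound and tune it by measurement.

```python
import numpy as np

def rows_per_chunk(memory_budget_bytes, n_columns, dtype=float, safety_factor=4):
    """Estimate how many rows fit in a memory budget (hypothetical helper)."""
    # One row of the parsed array costs n_columns * itemsize bytes
    # (8 bytes per value for the default float64).
    bytes_per_row = n_columns * np.dtype(dtype).itemsize
    # safety_factor leaves headroom for temporaries such as the boolean
    # mask and the filtered copy of the chunk.
    return max(1, memory_budget_bytes // (bytes_per_row * safety_factor))

# Assumed budget of 100 MiB for a three-column file like the one in the question.
chunksize = rows_per_chunk(100 * 1024**2, n_columns=3)
print(chunksize)
```

The resulting value can be passed as max_rows in the chunked genfromtxt loop above.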

The code above is meant for large files. For a file with only 7000 lines,

import numpy as np
with open('data', 'rb') as fin, open('out.csv', 'w+b') as fout:
    arr = np.genfromtxt(fin, usecols=(0,1,2), delimiter='', dtype=float)
    mask = (arr[:, 0] >= 0.807375) & (arr[:, 0] <= 0.807982)
    arr = arr[mask]
    np.savetxt(fout, arr, fmt='%g')

is sufficient.
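If NumPy is not needed at all, the same filter can be written with the standard library alone, reading one line at a time. This is only a sketch of an alternative and not the method from the answer: the function name extract_range is made up here, and the early exit assumes the wavelength column is sorted in ascending order, as it appears to be in the sample. Drop the break if that does not hold for your file.

```python
def extract_range(infile, outfile, lower, upper):
    """Copy rows whose first column lies in [lower, upper]; return the row count."""
    kept = 0
    with open(infile) as fin, open(outfile, 'w') as fout:
        for line in fin:
            fields = line.split()
            # Skip blank lines and comment lines starting with '#'.
            if not fields or fields[0].startswith('#'):
                continue
            wave = float(fields[0])
            if wave > upper:
                # Assumption: wavelengths are sorted ascending, so no later
                # row can match and the rest of the file need not be read.
                break
            if wave >= lower:
                fout.write(' '.join(fields[:3]) + '\n')
                kept += 1
    return kept
```

A call such as extract_range('input.csv', 'output.csv', 0.807375, 0.807982) would write the matching rows and return how many were kept.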

Answer 2

Try this:

import pandas as pd 

df         = pd.read_csv('large_data.csv', usecols=(0,1,2), skiprows=57)
df.columns = [ 'wave', 'flux' , 'err']
df         = df[(df['wave'] >=  0.807375) & (df['wave'] <=  0.807982) ]
print(df)

     wave           flux              err
1   0.807375    7.586660e-11    1.182880e-13
2   0.807577    7.621360e-11    1.185040e-13
3   0.807780    7.644910e-11    1.193890e-13
4   0.807982    7.628580e-11    1.186850e-13

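Since your file contains lines with unwanted text, you can use the 'skiprows' flag on import. Also, pandas is built on top of numpy, and read_csv has a chunksize flag as well.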
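To make the chunksize remark concrete, here is a minimal sketch of chunked reading with pandas. It is not the code from this answer: the function name filter_in_chunks is hypothetical, and it assumes the file is whitespace-delimited with '#' comment lines, so it uses sep=r'\s+' and comment='#' instead of skiprows=57. With chunksize set, read_csv returns an iterator of DataFrames rather than one DataFrame.

```python
import pandas as pd

def filter_in_chunks(infile, outfile, lower, upper, chunksize=1000):
    """Stream a whitespace-delimited file and keep rows with wave in [lower, upper]."""
    reader = pd.read_csv(
        infile,
        sep=r'\s+',                     # assumption: columns separated by whitespace
        comment='#',                    # assumption: '#' marks comment lines
        header=None,
        names=['wave', 'flux', 'err'],
        chunksize=chunksize,            # yields DataFrames of at most this many rows
    )
    first = True
    for chunk in reader:
        kept = chunk[(chunk['wave'] >= lower) & (chunk['wave'] <= upper)]
        # Create the file for the first chunk, then append the later ones.
        kept.to_csv(outfile, sep=' ', header=False, index=False,
                    mode='w' if first else 'a')
        first = False
```

For example, filter_in_chunks('input.csv', 'output.csv', 0.807375, 0.807982) processes the file 1000 rows at a time; only one chunk is held in memory at once.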

Answer 3

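After reading the answers from @ubuntu and @Merlin, the following may also be a good solution.

Note: the answer given by @ubuntu works perfectly.

The answer given by @Merlin does not work as written and is incomplete, but it serves as a good template.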
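One likely reason the pandas version fails as written (this is an assumption, since the thread does not spell it out) is that read_csv splits on commas by default, while the data is separated by whitespace. The small check below uses two rows from the question to show the difference; the variable names are made up for illustration.

```python
import io
import pandas as pd

# Two data rows from the question, separated by runs of spaces.
sample = (
    "0.807172    7.61973e-11    1.18177e-13\n"
    "0.807375    7.58666e-11    1.18288e-13\n"
)

# Default separator is a comma, so each line ends up in a single column.
default_sep = pd.read_csv(io.StringIO(sample), header=None)

# A whitespace regex splits the three numeric columns as intended.
whitespace_sep = pd.read_csv(io.StringIO(sample), header=None, sep=r'\s+')

print(default_sep.shape)
print(whitespace_sep.shape)
```

With only one column parsed, selecting columns 0, 1 and 2 cannot succeed, which is why the version below passes a whitespace separator.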

Note: the input file input.csv can be obtained from pastebin:
http://pastebin.com/KCW9phzX

Using numpy:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Author    : Bhishan Poudel
# Date      : May 23, 2016


# Imports
import pandas as pd
import numpy as np


# using numpy
infile = 'input.csv'
outfile = 'output.csv'
lower_value = 0.807375
upper_value = 0.807982

print('{} {} {}'.format('Reading file    :', infile, ''))
print('{} {} {}'.format('Writing to file :', outfile, ''))

with open(infile, 'rb') as fin, open(outfile, 'w+b') as fout:
    arr = np.genfromtxt(fin, usecols=(0,1,2), delimiter='', dtype=float)
    mask = (arr[:, 0] >= lower_value) & (arr[:, 0] <= upper_value )
    arr = arr[mask]
    np.savetxt(fout, arr, fmt='%g')

Using pandas:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Author    : Bhishan Poudel
# Date      : May 23, 2016


# Imports
import pandas as pd
import numpy as np


# extract range
infile = 'input.csv'
outfile = 'output.csv'
lower_value = 0.807375
upper_value = 0.807982


print('{} {} {}'.format('Reading file      :', infile, ''))
print('{} {} {}'.format('Writing to a file : ', outfile, ''))
# Raw string so that '\s' is passed to the regex engine rather than read as a string escape.
df         = pd.read_csv(infile, usecols=(0,1,2), skiprows=57, sep=r'\s+')
df.columns = [ 'col0', 'col1' , 'col2']
df         = df[(df['col0'] >=  lower_value) & (df['col0'] <=  upper_value) ]
df.to_csv(outfile, header=None, index=None, mode='w', sep=' ')

3 solutions

Solution 1
4 accepted 2016-06-04 18:36:24

Solution 2
1 2016-06-04 18:38:34

Solution 3
0 2016-06-09 21:02:03