
Python: How to quickly create a pandas data frame with only specific columns from a big excel sheet?

I have an excel file with only one sheet. The file is ~900 MB and contains thousands of rows and hundreds of columns.

I want to extract only a few columns (say Name, Numbers, and Address) from the excel sheet and do data manipulations.

Since the excel file is huge, the traditional method of creating the data frame with pandas and then extracting the columns takes a lot of time:

ExcelFile = pd.read_excel(fileAddress, sheet_name="Sheet1")

Is there a faster way to extract the columns from the excel file?

You may pass usecols to read_excel to import only specific columns from excel to df. If you use pandas 0.24+, read_excel is able to filter directly on column labels, so just pass usecols a list of column names:

df = pd.read_excel(fileAddress, header=0, sheet_name='Sheet1', 
                                usecols=['Name', 'Numbers', 'Address'])

On pandas < 0.24, usecols doesn't understand excel column labels. You need to know the Excel column letters corresponding to Name, Numbers, Address, or their integer locations.

For example: Name is at B; Numbers at G; Address at AA.

df = pd.read_excel(fileAddress, header=0, sheet_name='Sheet1', usecols='B,G,AA')

If you know their integer locations, you may use them in place of 'B', 'G', 'AA', such as usecols=[1, 6, 26].
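For a self-contained illustration, the same usecols filtering can be sketched with read_csv (the file and data below are invented for the demo; with an .xlsx file you would pass usecols to read_excel in exactly the same way):

```python
import csv
import pandas as pd

# Write a tiny sample file so the sketch runs end to end (made-up data).
with open('people.csv', 'w', newline='') as f:
    w = csv.writer(f)
    w.writerow(['Id', 'Name', 'Numbers', 'Extra', 'Address'])
    w.writerow([1, 'Ann', 7, 'x', '1 Main St'])
    w.writerow([2, 'Bob', 9, 'y', '2 Oak Ave'])

# Only the requested columns are parsed and kept.
df = pd.read_csv('people.csv', usecols=['Name', 'Numbers', 'Address'])
print(list(df.columns))  # ['Name', 'Numbers', 'Address']
```

The unrequested columns (Id, Extra here) are skipped during parsing, which is where the time and memory savings come from.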

Hope this helps.

There are a few ways; try them and take the approach that fits you best.

1. Specify the required columns while loading the data (just like Andy L.'s answer).

df = pd.read_excel(fileAddress, header=0, sheet_name='Sheet1', 
                                usecols=['Name', 'Numbers', 'Address'])

2. Specify dtypes

For every data read operation, pandas does the heavy lifting of inferring the data types by itself. This consumes both memory and time. Also, it requires the whole data to be read at once.

To avoid this, specify your column data types (dtype).

Example:

pd.read_csv('sample.csv', dtype={"user_id": int, "username": object})

Available data types in pandas:

[numpy.generic,
 [[numpy.number,
   [[numpy.integer,
     [[numpy.signedinteger,
       [numpy.int8,
        numpy.int16,
        numpy.int32,
        numpy.int64,
        numpy.int64,
        numpy.timedelta64]],
      [numpy.unsignedinteger,
       [numpy.uint8,
        numpy.uint16,
        numpy.uint32,
        numpy.uint64,
        numpy.uint64]]]],
    [numpy.inexact,
     [[numpy.floating,
       [numpy.float16, numpy.float32, numpy.float64, numpy.float128]],
      [numpy.complexfloating,
       [numpy.complex64, numpy.complex128, numpy.complex256]]]]]],
  [numpy.flexible,
   [[numpy.character, [numpy.bytes_, numpy.str_]],
    [numpy.void, [numpy.record]]]],
  numpy.bool_,
  numpy.datetime64,
  numpy.object_]]

(As you can see, the list is long, so specifying the dtypes speeds up your job.)
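A minimal sketch of the dtype idea, using a made-up users.csv so it runs end to end; with an explicit dtype map, pandas skips type inference for those columns (the same dtype parameter also works with read_excel):

```python
import csv
import pandas as pd

# Build a tiny sample file (hypothetical data, for illustration only).
with open('users.csv', 'w', newline='') as f:
    w = csv.writer(f)
    w.writerow(['user_id', 'username'])
    w.writerows([[1, 'alice'], [2, 'bob']])

# Explicit dtypes: pandas does not have to infer these columns' types.
df = pd.read_csv('users.csv', dtype={'user_id': 'int64', 'username': 'object'})
print(df.dtypes['user_id'])  # int64
```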

3. Use a converter in case you need help with data conversions.

(Almost like 2; an alternative to 2.)

Cases like null or empty values are easy to deal with here. (Disclaimer: I never tried this.)

Example:

import numpy as np
import pandas as pd

def conv(val):
    # treat empty values as 0; fall back to 0 if conversion fails
    if not val:
        return np.float64(0)
    try:
        return np.float64(val)
    except (TypeError, ValueError):
        return np.float64(0)

df = pd.read_csv('sample.csv', converters={'COL_A': conv, 'COL_B': conv})
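As a quick, self-contained check of the converter idea (the file and column names are invented for the demo): empty cells and unparseable values both come back as 0.0.

```python
import numpy as np
import pandas as pd

def conv(val):
    # empty string -> 0; unparseable value -> 0; otherwise float
    if not val:
        return np.float64(0)
    try:
        return np.float64(val)
    except (TypeError, ValueError):
        return np.float64(0)

# Hypothetical sample: one empty cell and one non-numeric cell.
with open('conv_demo.csv', 'w', newline='') as f:
    f.write('COL_A,COL_B\n1.5,\nx,2\n')

df = pd.read_csv('conv_demo.csv', converters={'COL_A': conv, 'COL_B': conv})
print(df['COL_A'].tolist())  # [1.5, 0.0]
print(df['COL_B'].tolist())  # [0.0, 2.0]
```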

4. Reading the data in chunks always helps.

chunksize = 10 ** 6
for chunk in pd.read_csv('sample.csv', chunksize=chunksize):
    process(chunk)

One thing to note is to treat each chunk like a separate data frame. This also helps with reading larger files, like 4 GB or 6 GB.
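Chunking combines naturally with usecols: filter the columns in each chunk and concatenate at the end. A self-contained sketch (the file, column names, and sizes are invented for the demo):

```python
import csv
import pandas as pd

# Build a small sample file so the sketch runs end to end (made-up data).
with open('big.csv', 'w', newline='') as f:
    w = csv.writer(f)
    w.writerow(['Name', 'Age', 'Numbers', 'City', 'Address'])
    for i in range(25):
        w.writerow([f'user{i}', 20 + i, i * 7, 'X', f'{i} Main St'])

parts = []
for chunk in pd.read_csv('big.csv', usecols=['Name', 'Numbers', 'Address'],
                         chunksize=10):
    # each chunk is an independent DataFrame: 10, 10, and 5 rows here
    parts.append(chunk)

df = pd.concat(parts, ignore_index=True)
print(df.shape)  # (25, 3)
```

Only one chunk plus the filtered results need to fit in memory at any time, rather than the whole raw file.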

5. Use the pandas low_memory option.

Use low_memory=False to explicitly tell pandas to load larger files into memory, or in case you are getting a memory warning.

df = pd.read_csv('sample.csv', low_memory=False)

You can copy the columns of interest from file.xlsx to another.xlsx and then read another.xlsx with pandas.

You can look it up here, since pandas provides such specific methods.

But more natively it will work like this:

import csv
import toolz.curried as tc
import pandas as pd

def stream_csv(file_path):
    with open(file_path) as f:
        yield from csv.DictReader(f, delimiter='\t')  # you can use any delimiter

file_path = '../../data.csv'
relevant_data = map(tc.keyfilter(lambda column_name: column_name in ['a', 'b']),
                    stream_csv(file_path))

df = pd.DataFrame(relevant_data)

Note that everything but pandas is a generator function and thus is memory efficient.
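The same streaming column filter can be sketched with just the standard library (no toolz); the file and column names below are invented so the demo runs end to end:

```python
import csv
import pandas as pd

# Tiny sample file so the sketch is self-contained.
with open('data_demo.csv', 'w', newline='') as f:
    w = csv.writer(f)
    w.writerow(['a', 'b', 'c'])
    w.writerows([[1, 2, 3], [4, 5, 6]])

def stream_columns(path, keep):
    # generator: yields one filtered dict per row, so memory use stays flat
    with open(path, newline='') as f:
        for row in csv.DictReader(f):
            yield {k: v for k, v in row.items() if k in keep}

df = pd.DataFrame(stream_columns('data_demo.csv', {'a', 'b'}))
print(list(df.columns))  # ['a', 'b']
```

A dict comprehension inside the generator plays the role of toolz's keyfilter here.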
