
Importing financial data into Python Pandas using read_csv

I have a .csv with the following structure:

date_begin,date_end,name,name_code,active_accounts,transaction_amount,transaction_count
1/1/2008,1/31/2008,Name_1,1001,"123,456","$7,890,123.45","67,890"
2/1/2008,2/29/2008,Name_1,1001,"43,210","$987,654.32","109,876"
3/1/2008,3/31/2008,Name_1,1001,"485,079","$1,265,789,433.98","777,888"
...
12/1/2008,12/31/2008,Name_1,1001,"87,543","$432,098,987","87,987"
1/1/2008,1/31/2008,Name_2,1002,"268,456","$890,123.45","97,890"
2/1/2008,2/29/2008,Name_2,1002,"53,210","$987,654.32","109,876"
...
etc

I am trying to read them into a pandas DataFrame using the following code:

import pandas as pd

data = pd.read_csv('my_awesome_csv.csv', parse_dates=[[0,1]],
                   infer_datetime_format=True)

This works just fine, except that I would like to control the data type of each column. When I run the following in the interpreter, I find that the numbers in quotes are not recognized as numbers, whether they are dollar amounts or plain counts.

In [10]: data.dtypes
Out[10]: 
date_begin_date_end       object
name                      object
name_code                  int64
active_accounts           object  # Problem, I want this to be a number
transaction_amount        object  # Ditto, I want this to be a number (it's a dollar amount)
transaction_count         object  # Still a number!
dtype: object

I have done some snooping around in the Pandas csv documentation but haven't found how to declare numeric types for amounts that are stored in the csv as strings with commas and dollar signs. My ultimate goal is to be able to do arithmetic on the values in these columns.

Any thoughts?

You could use vectorized string methods to parse those columns after the call to read_csv:

import pandas as pd
import decimal
D = decimal.Decimal

data = pd.read_csv('data', parse_dates=[[0,1]], infer_datetime_format=True)

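for col in ('active_accounts', 'transaction_count'):
    # the comma is a literal character, so no regex is needed
    data[col] = data[col].str.replace(',', '', regex=False).astype(int)

# drop everything except digits, sign and decimal point, then build a
# Decimal from each cleaned string (regex=True is required on pandas >= 2.0)
data['transaction_amount'] = (data['transaction_amount']
                              .str.replace(r'[^-+\d.]', '', regex=True)
                              .map(D))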


print(data.dtypes)
# date_begin_date_end    object
# name                   object
# name_code               int64
# active_accounts         int64
# transaction_amount     object
# transaction_count       int64
# dtype: object

print(data)

yields

    date_begin_date_end    name  name_code  active_accounts  \
0    1/1/2008 1/31/2008  Name_1       1001           123456   
1    2/1/2008 2/29/2008  Name_1       1001            43210   
2    3/1/2008 3/31/2008  Name_1       1001           485079   
3  12/1/2008 12/31/2008  Name_1       1001            87543   
4    1/1/2008 1/31/2008  Name_2       1002           268456   
5    2/1/2008 2/29/2008  Name_2       1002            53210   

  transaction_amount  transaction_count  
0         7890123.45              67890  
1          987654.32             109876  
2      1265789433.98             777888  
3          432098987              87987  
4          890123.45              97890  
5          987654.32             109876  
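As a variation on the approach above, read_csv also accepts a thousands argument, which should let the parser handle the plain comma-grouped counts itself so that only the dollar column needs string cleanup. The sketch below is a minimal illustration, not part of the original answer: it assumes the sample rows from the question, uses io.StringIO in place of a real file, and leaves out the date parsing. The names csv_text, cleaned, total_amount and total_accounts are made up for the example.

```python
import io
from decimal import Decimal

import pandas as pd

# Sample rows copied from the question; io.StringIO stands in for the file.
csv_text = '''date_begin,date_end,name,name_code,active_accounts,transaction_amount,transaction_count
1/1/2008,1/31/2008,Name_1,1001,"123,456","$7,890,123.45","67,890"
2/1/2008,2/29/2008,Name_1,1001,"43,210","$987,654.32","109,876"
3/1/2008,3/31/2008,Name_1,1001,"485,079","$1,265,789,433.98","777,888"
'''

# thousands=',' tells the parser to treat commas inside numbers as grouping.
data = pd.read_csv(io.StringIO(csv_text), thousands=',')

# The dollar column is still text: strip everything except digits, sign and
# decimal point, then build a Decimal from each cleaned string.
cleaned = data['transaction_amount'].str.replace(r'[^-+\d.]', '', regex=True)
data['transaction_amount'] = cleaned.map(Decimal)

# Arithmetic now works on all three columns.
total_amount = sum(data['transaction_amount'], Decimal(0))
total_accounts = int(data['active_accounts'].sum())

print(data.dtypes)
print(total_amount)    # 1274667211.75
print(total_accounts)  # 651745
```

Summing with the built-in sum and a Decimal(0) start value keeps the total an exact Decimal instead of passing through floats.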

PS. read_csv does have a converters parameter, which lets you supply a function to parse the problematic columns. These functions are called once for each string, so if you have a lot of rows, that could require a lot of Python function calls. Handling the columns with vectorized string methods, as shown above, should be far quicker.

import pandas as pd
import re
import decimal
D = decimal.Decimal

def make_parser(cls):
    def parse_commas(text):
        return cls(re.sub(r'[^-+\d.]', '', text))
    return parse_commas

to_int = make_parser(int)
to_decimal = make_parser(D)

data = pd.read_csv('data', parse_dates=[[0,1]], infer_datetime_format=True
                   , converters={4: to_int, 5: to_decimal, 6: to_int})

print(data)

yields

    date_begin_date_end    name  name_code  active_accounts  \
0    1/1/2008 1/31/2008  Name_1       1001           123456   
1    2/1/2008 2/29/2008  Name_1       1001            43210   
2    3/1/2008 3/31/2008  Name_1       1001           485079   
3  12/1/2008 12/31/2008  Name_1       1001            87543   
4    1/1/2008 1/31/2008  Name_2       1002           268456   
5    2/1/2008 2/29/2008  Name_2       1002            53210   

  transaction_amount  transaction_count  
0         7890123.45              67890  
1          987654.32             109876  
2      1265789433.98             777888  
3          432098987              87987  
4          890123.45              97890  
5          987654.32             109876  

and the values in the transaction_amount column are decimal.Decimal instances:

In [64]: data.loc[0, 'transaction_amount']
Out[64]: Decimal('7890123.45')
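To check the speed difference between converters and vectorized string methods on your own machine, a rough timing sketch along the following lines may help. Everything here is an assumption made for illustration: the synthetic single-column data, the column name count, and the helper names strip_commas, with_converter and with_vectorized. Actual timings depend on the pandas version and the size of the data, so no numbers are claimed.

```python
import io
import re
import timeit

import pandas as pd

# Hypothetical synthetic data: one column of quoted, comma-grouped integers.
N_ROWS = 20_000
csv_text = "count\n" + "\n".join(f'"{n:,}"' for n in range(1000, 1000 + N_ROWS))


def strip_commas(text):
    # Called once per cell when passed to read_csv via converters.
    return int(re.sub(r"[^-+\d]", "", text))


def with_converter():
    return pd.read_csv(io.StringIO(csv_text), converters={"count": strip_commas})


def with_vectorized():
    df = pd.read_csv(io.StringIO(csv_text))
    df["count"] = df["count"].str.replace(",", "", regex=False).astype("int64")
    return df


# Both routes should produce the same integers.
same = with_converter()["count"].tolist() == with_vectorized()["count"].tolist()
print("same values:", same)

# Time three runs of each approach.
for fn in (with_converter, with_vectorized):
    print(fn.__name__, timeit.timeit(fn, number=3))
```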
