Fastest way to do data type conversion using csv.DictReader in Python
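I'm working with a CSV file in Python that will have about 100,000 rows in real use. Each row has a set of dimensions (as strings) and a single metric (a float).
Since csv.DictReader and csv.reader only return values as strings, I'm currently iterating over all the rows and converting the one numeric value to a float.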
for i in csvDict:
    i[col] = float(i[col])
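Can anyone suggest a better way to do this? I've tried various combinations of map, izip and itertools, and searched extensively for examples of doing this more efficiently, but unfortunately without much success.
In case it matters: I'm doing this on App Engine. I believe what I'm doing may be the cause of this error: "Exceeded soft process size limit with 267.789 MB after servicing 11 requests total". I only get it when the CSV is quite large.
Edit: My goal. I'm parsing this CSV so that I can use it as a data source for the Google Visualizations API. The final data set will be loaded into a gviz DataTable for querying. The types must be specified while this table is being constructed. If anyone knows of a good gviz csv->datatable converter in Python, that would solve my problem too!
Edit 2: My code
I think my problem is related to the way I attempt fixCsvTypes(). Also, data_table.LoadData() expects an iterable object.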
import csv
import datetime
import StringIO  # Python 2 module

import gviz_api


class GvizFromCsv(object):
    """Convert CSV to Gviz ready objects."""

    def __init__(self, csvFile, dateTimeFormat=None):
        self.fileObj = StringIO.StringIO(csvFile)
        # list() keeps every row in memory at once
        self.csvDict = list(csv.DictReader(self.fileObj))
        self.dateTimeFormat = dateTimeFormat
        self.headers = {}
        self.ParseHeaders()
        self.fixCsvTypes()

    def IsNumber(self, st):
        try:
            float(st)
            return True
        except ValueError:
            return False

    def IsDate(self, st):
        try:
            datetime.datetime.strptime(st, self.dateTimeFormat)
            return True  # was missing: valid dates returned None
        except ValueError:
            return False

    def ParseHeaders(self):
        """Attempts to figure out header types for gviz, based on first row"""
        for k, v in self.csvDict[0].items():
            if self.IsNumber(v):
                self.headers[k] = 'number'
            elif self.dateTimeFormat and self.IsDate(v):
                self.headers[k] = 'date'
            else:
                self.headers[k] = 'string'

    def fixCsvTypes(self):
        """Only fixes numbers."""
        update_to_numbers = []
        for k, v in self.headers.items():
            if v == 'number':
                update_to_numbers.append(k)
        for i in self.csvDict:
            for col in update_to_numbers:
                i[col] = float(i[col])

    def CreateDataTable(self):
        """creates a gviz data table"""
        data_table = gviz_api.DataTable(self.headers)
        data_table.LoadData(self.csvDict)
        return data_table
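At first I used regular expressions to parse the CSV file, but since the data in the file is laid out very strictly on each line, we can simply use the split() function: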
import gviz_api

scheme = [('col1','string','SURNAME'),('col2','number','ONE'),('col3','number','TWO')]
data_table = gviz_api.DataTable(scheme)

# --- lines in surnames.csv are : ---
# surname,percent,cumulative percent,rank\n
# SMITH,1.006,1.006,1,\n
# JOHNSON,0.810,1.816,2,\n
# WILLIAMS,0.699,2.515,3,\n

with open('surnames.csv') as f:
    def transf(surname,x,y):
        return (surname,float(x),float(y))

    f.readline()
    # to skip the first line surname,percent,cumulative percent,rank\n

    data_table.LoadData( transf(*line.split(',')[0:3]) for line in f )
    # to populate the data table by iterating in the CSV file
Or, without defining a function:
import gviz_api

scheme = [('col1','string','SURNAME'),('col2','number','ONE'),('col3','number','TWO')]
data_table = gviz_api.DataTable(scheme)

# --- lines in surnames.csv are : ---
# surname,percent,cumulative percent,rank\n
# SMITH,1.006,1.006,1,\n
# JOHNSON,0.810,1.816,2,\n
# WILLIAMS,0.699,2.515,3,\n

with open('surnames.csv') as f:
    f.readline()
    # to skip the first line surname,percent,cumulative percent,rank\n

    data_table.LoadData( [el if n==0 else float(el) for n,el in enumerate(line.split(',')[0:3])] for line in f )
    # to populate the data table by iterating in the CSV file
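For a moment I thought I would have to populate the data table one row at a time, because I was using a regex and needed to get the match groups before converting the numeric strings to floats. With split(), everything is done in a single instruction through LoadData().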
So your code can be shortened. By the way, I don't see why it should keep defining a class. A function seems sufficient to me instead:
def GvizFromCsv(filename):
    """ creates a gviz data table from a CSV file """

    data_table = gviz_api.DataTable([('col1','string','SURNAME'),
                                     ('col2','number','ONE'    ),
                                     ('col3','number','TWO'    ) ])

    # --- with such a table schema , lines in the file must be like that: ---
    # blah, number, number, ...anything else...\n
    # SMITH,1.006,1.006, ...anything else...\n
    # JOHNSON,0.810,1.816, ...anything else...\n
    # WILLIAMS,0.699,2.515, ...anything else...\n

    with open(filename) as f:
        data_table.LoadData( [el if n==0 else float(el) for n,el in enumerate(line.split(',')[0:3])]
                             for line in f )
    return data_table
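Now you have to check whether the way you read the CSV data from the other API can be plugged into this code, while keeping the principle of iterating to fill the data table.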
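As a rough sketch of keeping that iteration principle when the CSV does not come from open(): the generator below assumes the data arrives as an in-memory string (as in the question) and uses csv.reader instead of split(), so quoted fields containing commas survive. It is written for Python 3 (io.StringIO), and the names typed_rows, numeric_columns and width are made up for this illustration.

```python
import csv
import io

def typed_rows(lines, numeric_columns=(1, 2), width=3):
    """Yield one list per CSV record, converting the chosen columns to float."""
    reader = csv.reader(lines)
    next(reader, None)              # skip the header row
    for record in reader:
        if not record:              # ignore blank lines
            continue
        row = record[:width]        # keep only the columns the schema describes
        for index in numeric_columns:
            row[index] = float(row[index])
        yield row

# Hypothetical sample standing in for the uploaded CSV text.
sample = io.StringIO(
    'surname,percent,cumulative percent,rank\n'
    'SMITH,1.006,1.006,1,\n'
    '"O\'NEIL, JR",0.810,1.816,2,\n'
)
rows = list(typed_rows(sample))     # list() only to inspect the result here
```

In real use the generator itself, not a list, would be handed to data_table.LoadData(), so that only one row is converted at a time.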
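First of all, if you only need to visualize this data, no conversion is needed: gviz can handle JSON (which is text-based, as you know) or CSV (which you already have, so there is nothing to parse!). You can put the file on any reasonable web server and let it be fetched through the GET requests that gviz issues, ignoring the parameters.
But suppose you do need processing. It looks like you are not only reading the CSV file but also trying to store it entirely in RAM. This may be impractical: as you add more processing, you will hit the RAM limit sooner and sooner. Process the data one row at a time (or a reasonable number of rows at a time if you apply window filters and the like), and put the processed rows into the datastore rather than into any list. Likewise, when serving data through a GET request, read and process one row, write it to the response, and don't put it into any list.
I see nothing wrong with the conversion technique itself, as long as you make reasonable use of i later in your code and don't hold on to every i.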
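A minimal sketch of that row-at-a-time idea, assuming the response can be treated as a file-like object with a write() method. Here an io.StringIO stands in for it and each row is emitted as one JSON line. The function name stream_rows and the JSON-lines output are choices made for this illustration, not anything App Engine or gviz requires, and the code is written for Python 3.

```python
import csv
import io
import json

def stream_rows(source, sink, numeric_fields):
    """Convert and emit one CSV row at a time; nothing is accumulated."""
    count = 0
    for row in csv.DictReader(source):
        for name in numeric_fields:
            row[name] = float(row[name])     # same conversion as in the question
        sink.write(json.dumps(row) + '\n')   # stand-in for writing to the response
        count += 1
    return count

source = io.StringIO('surname,percent\nSMITH,1.006\nJOHNSON,0.810\n')
sink = io.StringIO()
written = stream_rows(source, sink, ['percent'])
```

Because each row is dropped as soon as it has been written, memory use stays roughly constant regardless of how many rows the CSV has.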
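There are two things: the "data source" and the "data table".
"Data source" is the name for the formatted data delivered by a server acting as a visualization web service for the Google Visualization API: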
This page describes how you can implement a data source to feed data
to visualizations built on the Google Visualization API.
http://code.google.com/intl/fr/apis/visualization/documentation/dev/implementing_data_source.html
The name "data source" includes the notion of a "wire protocol":
In response [to a request], the data source returns properly formatted data
that the visualization can use to render the graphic on the page.
This request-response protocol is known as the Google Visualization API wire protocol,
http://code.google.com/intl/fr/apis/visualization/documentation/dev/implementing_data_source_overview.html
To implement a "data source", there are two possibilities:
• Use one of the data source libraries listed in the Data Sources and Tools Gallery.
All the data source libraries listed on that page implement the wire protocol.
• Write your own data source from scratch,
http://code.google.com/intl/fr/apis/visualization/documentation/dev/implementing_data_source_overview.html
From the following:
• ... Data Sources and Tools Gallery : (....) You therefore need write only the
code needed to make your data available to the library in the form of a data table.
• Write your own data source from scratch, as described in the
Writing your own Data Source
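I understand that, when starting from scratch, we need to implement the wire protocol ourselves and also create the "data table", whereas with a data source library we only need to create the "data table".
There is a page about creating a "data source":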
http://code.google.com/intl/fr/apis/visualization/documentation/dev/gviz_api_lib.html
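I think the example at http://groups.google.com/group/google-visualization-api/browse_thread/thread/9d1d941e0f0b32ed is about creating a "data source", and the answer given there looks doubtful. But this is still not very clear to me.
However, those pages and threads are not what matters for you. If I understand correctly, the question is how to prepare the data that a "data source" serves (what is called the "data table"), not how to build the "data source" itself.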
3.Prepare your data. You'll need to prepare the data to visualize;
this means either specifying the data yourself in code,
or querying a remote site for data.
http://code.google.com/intl/fr/apis/visualization/documentation/using_overview.html#keycomponents
A visualization stores the data that it visualizes as two-dimensional data table with
rows and columns.
Cells are referenced by (row, column) where row is a zero-based row number, and column
is either a zero-based column index or a unique ID that you can specify.
http://code.google.com/intl/fr/apis/visualization/documentation/using_overview.html#preparedata
So preparing the "data table" is the key point.
Here it is:
There are two ways to create/populate your visualization's data table:
•Query a data provider. A data provider is another site that returns
a populated DataTable in response to a request from your code.
Some data providers also accept SQL-like query strings to sort or
filter the data. See Data Queries for more information and an example
of a query.
•Create and populate your own DataTable by hand. You can populate your
DataTable in code on your page. The simplest way to do this is to create
a DataTable object without any data and populate it by calling addRows()
on it. You can also pass a JavaScript literal representation of the data
table into the DataTable constructor, but this is more complex and is
covered on the reference page.
http://code.google.com/intl/fr/apis/visualization/documentation/using_overview.html#preparedata
More information can be found here:
2. Describe your table schema
The table schema is specified by the table_description parameter
passed into the constructor. You cannot change it later.
The schema describes all the columns in the table: the data type of
each column, the ID, and an optional label.
Each column is described by a tuple: (ID [,data_type [,label [,custom_properties]]]).
The table schema is a collection of column descriptor tuples.
Every list member, dictionary key or dictionary value must be either
another collection or a descriptor tuple. You can use any combination
of dictionaries or lists, but every key, value, or member must
eventually evaluate to a descriptor tuple. Here are some examples.
•List of columns: [('a', 'number'), ('b', 'string')]
•Dictionary of lists: {('a', 'number'): [('b', 'number'), ('c', 'string')]}
•Dictionary of dictionaries: {('a', 'number'): {'b': 'number', 'c': 'string'}}
•And so on, with any level of nesting.
3. Populate your data
To add data to the table, build a structure of data elements in the
exact same structure as the table schema. So, for example, if your
schema is a list, the data must be a list:
•schema: [("color", "string"), ("shape", "string")]
•data: [["blue", "square"], ["red", "circle"]]
If the schema is a dictionary, the data must be a dictionary:
•schema: {("rowname", "string"): [("color", "string"), ("shape", "string")] }
•data: {"row1": ["blue", "square"], "row2": ["red", "circle"]}
http://code.google.com/intl/fr/apis/visualization/documentation/dev/gviz_api_lib.html#populatedata
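To make the two shapes quoted above concrete, here is a small pure-Python sketch that builds both a list-shaped and a dictionary-shaped data structure from the same CSV text. The column IDs and the sample rows are invented for the example, and nothing here calls gviz_api; it only prepares data in the shapes the documentation describes.

```python
import csv
import io

# Hypothetical CSV text with a header row and two data rows.
CSV_TEXT = (
    'surname,percent,cumulative percent\n'
    'SMITH,1.006,1.006\n'
    'JOHNSON,0.810,1.816\n'
)

def read_records(text):
    """Return the data rows of the CSV text, header excluded."""
    reader = csv.reader(io.StringIO(text))
    next(reader)                      # skip the header row
    return [record for record in reader if record]

# List schema: the data is a list of rows, one list per row.
list_schema = [('surname', 'string'), ('percent', 'number'), ('cumulative', 'number')]
list_data = [[r[0], float(r[1]), float(r[2])] for r in read_records(CSV_TEXT)]

# Dictionary schema: the data is a dictionary keyed by the first column.
dict_schema = {('surname', 'string'): [('percent', 'number'), ('cumulative', 'number')]}
dict_data = {r[0]: [float(r[1]), float(r[2])] for r in read_records(CSV_TEXT)}
```

Either pair (schema, data) follows the "exact same structure" rule from the quoted documentation: a list of rows for a list schema, a dictionary of rows for a dictionary schema.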
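In the end, I would say that for your problem you must define the "table schema" and process the CSV file so as to obtain a structure of data elements in the exact same structure as the table schema.
The data types of the columns are defined in the definition of the "table schema". If the "data table" must be filled with data of the right types (not strings, I mean), I can help you write the code that extracts the data from the CSV; it is easy to do.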
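One possible way to let the schema drive the conversion, sketched in Python 3 under the assumption that only 'number' and 'string' columns occur and that the schema is a list of column descriptor tuples in the same order as the CSV columns. CONVERTERS and rows_for_schema are names invented for this illustration.

```python
import csv
import io

# Maps a gviz column type to the Python callable that produces it.
# Only the two types used in this thread are covered.
CONVERTERS = {'number': float, 'string': str}

def rows_for_schema(lines, schema):
    """Yield rows typed according to schema, a list of (id, type, ...) tuples."""
    converters = [CONVERTERS[column[1]] for column in schema]
    reader = csv.reader(lines)
    next(reader, None)                # skip the header row
    for record in reader:
        if not record:                # ignore blank lines
            continue
        # zip() stops at the schema width, so extra trailing columns are dropped
        yield [convert(value) for convert, value in zip(converters, record)]

schema = [('col1', 'string', 'SURNAME'), ('col2', 'number', 'ONE'), ('col3', 'number', 'TWO')]
sample = io.StringIO(
    'surname,percent,cumulative percent,rank\n'
    'SMITH,1.006,1.006,1,\n'
    'JOHNSON,0.810,1.816,2,\n'
)
typed = list(rows_for_schema(sample, schema))
```

The same schema list could then be given to the DataTable constructor and the generator to LoadData(), so the types are declared in one place only.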
For now, I hope all of this is right and will help you.