
Python list vs. MySQL Select performance

I have a MySQL table with 15k entries from which I need to select a few items, many times. For example, I might want all entries whose number field is between 1 and 10.

In SQL this would be easy:

SELECT text FROM table WHERE number>=1 AND number<10; 

If I extract the entire table into a Python list:

PyList = [[text1, number1], [text2, number2], ...]

I could then extract the same text values I want by running through the entire list:

result = []
for item in PyList:
    if item[1] >= 1 and item[1] < 10:
        result.append(item[0])

Now, the performance question between the two arises because I have to do this for a sliding window: I want the entries between 1 and 10, then 2 and 11, 3 and 12, ..., up to 14990 and 15000. Which approach is faster for a list this big?

An improvement I'm considering on the Python side is to pre-sort the Python list by number. When the window moves, I could remove the lowest values from result and append all elements satisfying the next condition to get the new result. I would also keep track of an index into PyList so I would know where to start in the next iteration. This would spare me from running through the entire list again.
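The pre-sorted idea above can be sketched with two advancing indices, one for each end of the window. This is a hypothetical sketch (the function name and signature are illustrative), assuming the list is already sorted by its number field:

```python
# Hypothetical sketch of the incremental sliding window described above.
# Assumes sorted_list is pre-sorted by the number field (item[1]).
def sliding_windows(sorted_list, start, stop, width):
    """Yield (low, texts) for each window covering numbers [low, low + width)."""
    lo = hi = 0  # indices bounding the current window in sorted_list
    n = len(sorted_list)
    for low in range(start, stop):
        while lo < n and sorted_list[lo][1] < low:
            lo += 1  # drop items that fell out of the window
        while hi < n and sorted_list[hi][1] < low + width:
            hi += 1  # pull in items entering the window
        yield low, [text for text, number in sorted_list[lo:hi]]
```

Because `lo` and `hi` only ever move forward, the whole sweep over all windows touches each list element a constant number of times instead of rescanning 15k entries per window.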

I don't know how to speed up MySQL for successive Selects that are very similar, and I don't know enough about how it works internally to understand the performance differences between the two approaches.

How would you implement this?

Simply define an index over number in your database; then the database can generate the result sets almost instantly. Plus it can do some calculations on these sets too, if that is your next step.
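For illustration only, here is what that looks like using Python's built-in sqlite3 module as a stand-in for MySQL (the table and index names are hypothetical; the same CREATE INDEX statement works in MySQL):

```python
import sqlite3

# In-memory database standing in for the real MySQL table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (text TEXT, number INTEGER)")
conn.executemany("INSERT INTO items VALUES (?, ?)",
                 [("row%d" % n, n) for n in range(1, 15001)])

# The index lets range queries use a tree lookup instead of a full table scan.
conn.execute("CREATE INDEX idx_items_number ON items (number)")

rows = conn.execute(
    "SELECT text FROM items WHERE number >= ? AND number < ? ORDER BY number",
    (1, 10),
).fetchall()
```

With the index in place, each window query only reads the handful of matching rows rather than all 15k.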

Databases are actually great at such queries; I'd let the database do its job before trying something else.

It's certainly going to be much faster to pull the data into memory than run ~15,000 queries. 将数据拉入内存肯定比运行约15,000个查询要快得多。

My advice is to make sure the SQL query sorts the data by number. If the data is sorted, you can use the very fast lookup methods in the bisect standard library module to find indexes.
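A minimal sketch of that bisect approach, assuming the rows are kept in a list sorted by number, with a parallel list of just the numbers to search in (the helper name is illustrative):

```python
import bisect

def window(rows, numbers, low, high):
    """Return texts whose number falls in [low, high); rows must be sorted by number."""
    # numbers is a parallel list [r[1] for r in rows], in the same sorted order
    i = bisect.bisect_left(numbers, low)
    j = bisect.bisect_left(numbers, high)
    return [text for text, number in rows[i:j]]
```

Each lookup is O(log n) for the two bisections plus the size of the slice, so sliding the window never rescans the whole list.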

Read all the data into Python (from the numbers you mention, it should easily fit in memory), say into a variable pylist, then prepare an auxiliary data structure as follows:

import collections
d = collections.defaultdict(list)
for text, number in pylist:
  d[number].append(text)

Now, to get all texts for numbers between low (included) and high (excluded):

def slidingwindow(d, low, high):
    result = []
    for x in xrange(low, high):
        result.extend(d.get(x, ()))
    return result
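Putting the pieces together, the successive windows from the question can then be produced like this (a self-contained sketch: it repeats the function with range, since xrange exists only in Python 2, and uses hypothetical sample data in place of the real table):

```python
import collections

# Same function as above, with range (xrange under Python 2).
def slidingwindow(d, low, high):
    result = []
    for x in range(low, high):
        result.extend(d.get(x, ()))
    return result

# Hypothetical sample data standing in for the 15k-row table.
pylist = [("t%d" % n, n) for n in range(1, 15001)]
d = collections.defaultdict(list)
for text, number in pylist:
    d[number].append(text)

# First window: numbers 1..9, matching WHERE number>=1 AND number<10.
first = slidingwindow(d, 1, 10)
```

The defaultdict makes each window cost proportional to the window width, not the table size, so sliding it across all 15k entries stays cheap.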

It is difficult to answer without actual performance numbers, but my gut feeling is that it would be better to go for SQL with bind variables (I am not a MySQL expert, but in this case the placeholder syntax should be something like %varname).

The reason is that you would retrieve data only when needed (so the user interface would become responsive much earlier), and you would rely on a system highly optimized for that kind of operation. On the other hand, retrieving one large chunk of data is usually faster than retrieving many smaller ones, so the "full Python" approach could have its edge.

However, unless you have serious performance issues, I would still stick with SQL, because it leads to much simpler code that is easier to read and understand.
