
Minimise search time for Python in a large CSV file

I have a CSV file with about 700 rows and 8 columns; the last column, however, holds a very large block of text (enough for multiple long paragraphs in each cell).

I'd like to implement, in Python, a text-search function that returns every row whose 8th-column text contains a match (meaning it would need to scan through all of that text).

What would be the quickest way to approach this and minimise the search time?

You could dump your CSV file into an SQLite database and use SQLite's full-text search capabilities to do the searching for you.

This example code shows how it could be done. There are a few things to be aware of:

  • it assumes that the CSV file has a header row, and that the header values will make legal column names in SQLite. If that isn't the case, you'll need to quote them (or just use generic names like "col1", "col2", etc.).
  • it searches all columns in the CSV; if that's undesirable, filter out the other columns (and header values) before creating the SQL statements.
  • if you want to be able to match the results back to rows in the CSV file, you'll need to create a column that contains the line number (see the sketch after the example code).
import csv
import sqlite3
import sys


def create_table(conn, headers, name='mytable'):
    cols = ', '.join([x.strip() for x in headers])
    stmt = f"""CREATE VIRTUAL TABLE {name} USING fts5({cols})"""
    with conn:
        conn.execute(stmt)
    return


def populate_table(conn, reader, ncols, name='mytable'):
    placeholders = ', '.join(['?'] * ncols)
    stmt = f"""INSERT INTO {name}
    VALUES ({placeholders})
    """
    with conn:
        conn.executemany(stmt, reader)
    return


def search(conn, term, headers, name='mytable'):
    cols = ', '.join([x.strip() for x in headers])
    stmt = f"""SELECT {cols}
    FROM {name}
    WHERE {name} MATCH ?
    """
    with conn:
        cursor = conn.cursor()
        cursor.execute(stmt, (term,))
        result = cursor.fetchall()
    return result


def main(path, term):
    result = 'NO RESULT SET'
    # Create an in-memory database; connect before the try block so that
    # conn is always defined when the finally clause closes it.
    conn = sqlite3.connect(':memory:')
    try:
        # newline='' lets the csv module correctly handle newlines embedded
        # in quoted fields (the large text blocks may contain them).
        with open(path, 'r', newline='') as f:
            reader = csv.reader(f)
            # Assume headers are in the first row
            headers = next(reader)
            create_table(conn, headers)
            ncols = len(headers)
            populate_table(conn, reader, ncols)
        result = search(conn, term, headers)
    finally:
        conn.close()
    return result


if __name__ == '__main__':
    print(main(*sys.argv[1:]))
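
If the script above were saved as, say, csv_fts.py (a filename chosen here purely for illustration), it could be run from the command line with the CSV path and the search term as arguments; the printed result is the list of matching rows returned by search():

python csv_fts.py mydata.csv "some search term"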
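
For the third point in the list above, here is a minimal sketch of one way to keep the originating CSV line number next to each row. The function name load_with_line_numbers and the lineno column are made up for illustration; UNINDEXED is an FTS5 column option that stores a value without making it searchable.

import csv
import sqlite3


def load_with_line_numbers(conn, path, name='mytable'):
    # Variant of the loading step that also stores each row's CSV line number.
    with open(path, 'r', newline='') as f:
        reader = csv.reader(f)
        # Assume a header row, as in the example above; data starts on line 2.
        headers = next(reader)
        cols = ', '.join(['lineno UNINDEXED'] + [h.strip() for h in headers])
        placeholders = ', '.join(['?'] * (len(headers) + 1))
        with conn:
            conn.execute(f"CREATE VIRTUAL TABLE {name} USING fts5({cols})")
            conn.executemany(
                f"INSERT INTO {name} VALUES ({placeholders})",
                ((str(lineno), *row) for lineno, row in enumerate(reader, start=2)),
            )
    return headers

With a table loaded this way, search() can include lineno in its SELECT list so that each match reports the line in the original file it came from.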
