Minimise search time for Python in a large CSV file
I have a CSV file with about 700 rows and 8 columns. The last column, however, holds a very large block of text (enough for multiple long paragraphs in each row).
I'd like to implement, in Python, a text-search function that returns all the rows whose 8th-column text matches the search term (meaning it would need to go through the whole file).
What would be the quickest way to approach this and minimise search time?
You could dump your CSV file into an SQLite database and use SQLite's full-text search capabilities to do the search for you.
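This example code shows how it could be done. There are a few things to be aware of: it assumes the first row of the CSV holds headers and that those headers are legal SQLite column names, it matches the search term against every column rather than only the 8th, and it builds the database in memory, so the index is rebuilt on every run.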
import csv
import sqlite3
import sys


def create_table(conn, headers, name='mytable'):
    # Create an FTS5 virtual table with one column per CSV header.
    cols = ', '.join([x.strip() for x in headers])
    stmt = f"""CREATE VIRTUAL TABLE {name} USING fts5({cols})"""
    with conn:
        conn.execute(stmt)


def populate_table(conn, reader, ncols, name='mytable'):
    # Insert every remaining CSV row; the reader is consumed lazily.
    placeholders = ', '.join(['?'] * ncols)
    stmt = f"""INSERT INTO {name}
    VALUES ({placeholders})
    """
    with conn:
        conn.executemany(stmt, reader)


def search(conn, term, headers, name='mytable'):
    # "WHERE <table> MATCH ?" runs the full-text query against all columns.
    cols = ', '.join([x.strip() for x in headers])
    stmt = f"""SELECT {cols}
    FROM {name}
    WHERE {name} MATCH ?
    """
    with conn:
        cursor = conn.cursor()
        cursor.execute(stmt, (term,))
        result = cursor.fetchall()
    return result


def main(path, term):
    result = 'NO RESULT SET'
    # Create an in-memory database.
    conn = sqlite3.connect(':memory:')
    try:
        # newline='' is what the csv module expects for files it reads.
        with open(path, 'r', newline='') as f:
            reader = csv.reader(f)
            # Assume headers are in the first row
            headers = next(reader)
            create_table(conn, headers)
            ncols = len(headers)
            populate_table(conn, reader, ncols)
        result = search(conn, term, headers)
    finally:
        conn.close()
    return result


if __name__ == '__main__':
    # Usage: python script.py <path-to-csv> <search-term>
    print(main(*sys.argv[1:]))
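The question asks for matches in the 8th column only, whereas the query above matches against every column. One way to narrow it is an FTS5 column filter, where the query string takes the form `column : term`. The following is only a sketch under some assumptions: `SAMPLE_CSV`, `fts_string`, `build_index` and `search_column` are hypothetical names introduced for illustration, the three-column sample data stands in for the real file, and the SQLite library bundled with Python is assumed to have FTS5 enabled.

```python
import csv
import io
import sqlite3

# Hypothetical sample data standing in for the real CSV file.
SAMPLE_CSV = """id,title,body
1,First,The quick brown fox jumps over the lazy dog
2,Second,A fox is mentioned only in the body here
3,fox,This row has the term in the title column only
"""


def fts_string(text):
    # Wrap text in double quotes, doubling embedded quotes. This form is
    # accepted both for SQL identifiers and for FTS5 query strings.
    return '"' + text.replace('"', '""') + '"'


def build_index(conn, reader, name='mytable'):
    # Read the header row, create the FTS5 table, and load the remaining rows.
    headers = [h.strip() for h in next(reader)]
    cols = ', '.join(fts_string(h) for h in headers)
    placeholders = ', '.join(['?'] * len(headers))
    with conn:
        conn.execute(f'CREATE VIRTUAL TABLE {name} USING fts5({cols})')
        conn.executemany(f'INSERT INTO {name} VALUES ({placeholders})', reader)
    return headers


def search_column(conn, column, term, name='mytable'):
    # Column filter: only text inside `column` is considered for the match.
    query = f'{fts_string(column)} : {fts_string(term)}'
    stmt = f'SELECT * FROM {name} WHERE {name} MATCH ?'
    with conn:
        return conn.execute(stmt, (query,)).fetchall()


def demo():
    conn = sqlite3.connect(':memory:')
    try:
        headers = build_index(conn, csv.reader(io.StringIO(SAMPLE_CSV)))
        # Search the last column only, as the question asks for the 8th.
        return search_column(conn, headers[-1], 'fox')
    finally:
        conn.close()
```

Calling `demo()` returns only the rows whose last column contains the term; the row that has it solely in the `title` column is left out. Quoting the header names also means headers containing spaces no longer break the `CREATE VIRTUAL TABLE` statement.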