
Python: Compare regex pattern in one text file against line in another

This works on smaller text files, but not on larger ones (100,000 lines). How can I optimize it for large text files? For each line in fileA, if the part captured by the regex equals a line in fileB, write the entire fileA line to fileC.

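import re

with open('fileC.txt', 'w') as outfile:
    with open('fileA.txt', 'r') as infile1:
        for line1 in infile1:
            # Capture the third comma-separated field of the fileA line.
            y = re.findall(r'^.+,.+,(.+\.[a-z]+$)', line1)
            if not y:
                continue  # line1 does not match the pattern
            # fileB.txt is reopened and rescanned for every line of fileA.txt.
            with open('fileB.txt', 'r') as infile2:
                for line2 in infile2:
                    if line2.strip() == y[0]:
                        outfile.write(line1)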

The most immediate optimization is to read fileB.txt only once into a string buffer, and then test each matched expression against that buffer. You are currently opening and reading that file once for each line of fileA.txt.
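
As a rough illustration of reading fileB.txt only once, here is a minimal sketch. It swaps the string buffer for a Python set, which is my assumption rather than something stated above, because a set makes each membership test roughly constant time. The function name filter_lines is hypothetical; the regex is the one from the question.

```python
import re

# Same pattern as in the question: capture the third comma-separated field.
PATTERN = re.compile(r'^.+,.+,(.+\.[a-z]+$)')


def filter_lines(lines_a, lines_b):
    """Return the lines of lines_a whose captured field equals a stripped line of lines_b.

    filter_lines is a hypothetical helper; both arguments can be any
    iterable of strings, including open file objects.
    """
    # Build the lookup once instead of rescanning fileB for every fileA line.
    wanted = {line.strip() for line in lines_b}
    kept = []
    for line in lines_a:
        match = PATTERN.match(line)
        if match and match.group(1) in wanted:
            kept.append(line)
    return kept
```

To use it on the files from the question, open fileA.txt and fileB.txt for reading and pass the result to outfile.writelines(filter_lines(infile1, infile2)). One difference from the original loop: a fileA line is written once even if fileB.txt contains the same entry several times.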

It seems that your regex picks up whole lines that match a pattern, i.e. it starts with ^ and ends with $. In this case, a more complete solution would be to load both fileA.txt and fileB.txt into arrays using readlines(), sort those arrays, and then take a single pass through both arrays with two counters, e.g.:

# Details regarding the treatment of duplicate lines are ignored
# for clarity of exposition.
rai = sorted([7, 6, 1, 9, 11, 6])
raj = sorted([4, 6, 11, 7])
i, j = 0, 0
while i < len(rai) and j < len(raj):
    if rai[i] < raj[j]:
        i += 1
    elif rai[i] > raj[j]:
        j += 1
    else:
        # The odd-number test (% 2) stands in for your regex test,
        # since you didn't supply data.
        if rai[i] % 2:
            print(rai[i])
        i, j = i + 1, j + 1
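
The same two-counter pass could be applied to the actual lines along these lines. This is only a sketch under assumptions: the function name merge_match is hypothetical, the regex is taken from the question, and fileA lines are sorted by their captured field rather than as whole lines so that they can be compared with the stripped fileB lines.

```python
import re

# Same pattern as in the question: capture the third comma-separated field.
PATTERN = re.compile(r'^.+,.+,(.+\.[a-z]+$)')


def merge_match(lines_a, lines_b):
    """Sorted-merge comparison; merge_match is a hypothetical helper name."""
    # Pair each fileA line with its captured key; non-matching lines are skipped.
    keyed = []
    for line in lines_a:
        match = PATTERN.match(line)
        if match:
            keyed.append((match.group(1), line))
    keyed.sort(key=lambda pair: pair[0])
    targets = sorted(line.strip() for line in lines_b)

    out = []
    i, j = 0, 0
    while i < len(keyed) and j < len(targets):
        key, line = keyed[i]
        if key < targets[j]:
            i += 1
        elif key > targets[j]:
            j += 1
        else:
            out.append(line)
            # Advance only i, so every fileA line with this key is kept.
            i += 1
    return out
```

Note that the result comes back ordered by the captured field, not in the original fileA order, and that each fileA line is kept at most once regardless of duplicates in fileB.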


