Matching two files against each other and writing the output to a file - Python

Question

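I'm new to Python; this is only the second time I've written code in it. The point of this script is to take a text file containing thousands of lines of file names (the sNotUsed file) and match it against roughly 50 XML files. Each XML file can contain up to several thousand lines and is formatted like most XML. I'm not sure what is wrong with the code so far. It isn't finished, since I haven't added the part that writes the output back to the XML files, but the current last line should print at least once. It doesn't.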

Examples of the two file formats:

Text file:

fileNameWithoutExtension1
fileNameWithoutExtension2
fileNameWithoutExtension3
etc.

XML file:

<blocks> 

<more stuff="name"> 
     <Tag2> 
        <Tag3 name="Tag3">
                 <!--COMMENT-->
                 <fileType>../../dir/fileNameWithoutExtension1</fileType>
                 <fileType>../../dir/fileNameWithoutExtension4</fileType>
</blocks>

My code so far:

import os
import re

sNotUsed=list()
sFile = open("C:\Users\xxx\Desktop\sNotUsed.txt", "r") # open snotused txt file
for lines in sFile:
    sNotUsed.append(lines)
#sNotUsed = sFile.readlines() # read all lines and assign to list
sFile.close() # close file

xmlFiles=list() # list of xmlFiles in directory
usedS=list() # list of S files that do not match against sFile txt

search = "\w/([\w\-]+)"

# getting the list of xmlFiles
filelist=os.listdir('C:\Users\xxx\Desktop\dir')
for files in filelist:
    if files.endswith('.xml'):
        xmlFile = open(files, "r+") # open first file with read + write access
        xmlComp = xmlFile.readlines() # read lines and assign to list
        for lines in xmlComp: # iterate by line in list of lines
            temp = re.findall(search, lines)
            #print temp
            if temp:
                if temp[0] in sNotUsed:
                    print "yes" # debugging. I know there is at least one match for sure, but this is not being printed.

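Clarification: Sorry, I guess my question wasn't very clear. I want the script to go through each XML file line by line and check whether the FILENAME part of the line exactly matches a line in the sNotUsed.txt file. If there is a match, I want to remove that line from the XML. If the line does not match any line in sNotUsed.txt, I want it to be part of the output of the new, modified XML file (which will overwrite the old one). Let me know if this is still unclear.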

Edit: working code

import os
import re
import codecs

sFile = open(r"C:\Users\xxx\Desktop\sNotUsed.txt", "r") # open sNotUsed txt file (raw string so the backslashes are kept literally)
sNotUsed=sFile.readlines() # read all lines and assign to list
sFile.close() # close file

search = re.compile(r"\w/([\w\-]+)")

sNotUsed=[x.strip().replace(',','') for x in sNotUsed]
directory=r'C:\Users\xxx\Desktop\dir'
filelist=os.listdir(directory) # getting the list of xmlFiles
# for each file in the list
for files in filelist:
    if files.endswith('.xml'): # make sure it is an XML file
        xmlFile = codecs.open(os.path.join(directory, files), "r", encoding="UTF-8") # open first file with read
        xmlComp = xmlFile.readlines() # read lines and assign to list
        print xmlComp
        xmlFile.close() # closing the file since the lines have already been read and assigned to a variable
        xmlEdit = codecs.open(os.path.join(directory, files), "w", encoding="UTF-8") # opening the same file again and overwriting all existing lines
        for lines in xmlComp: # iterate by line in list of lines
            #headerInd = re.search(search, lines) # used to get the headers, comments, and ending blocks
            temp = re.findall(search, lines) # finds all strings that match the regular expression compiled above and makes a list for each
            if temp: # if the list is not empty
                if temp[0] not in sNotUsed: # if the first (and only) value in each list is not in the sNotUsed list
                    xmlEdit.write(lines) # write it in the file
            else: # if the list is empty
                xmlEdit.write(lines) # write it (used to preserve the beginning and ending blocks of the XML, as well as comments)

Answer 1

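There is a lot to say, but I'll try to keep it short.

PEP8: Style Guide for Python Code

For local variables, use lowercase names with underscores. Take a look at PEP8: Style Guide for Python Code.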

File objects and the `with` statement

Use the with statement to open files; see File Objects: http://docs.python.org/2/library/stdtypes.html#bltin-file-objects
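As a minimal sketch of that advice (written so that it runs on Python 3; the temporary file and its contents are made up for illustration), the with block closes the file on exit, even if an exception is raised inside it:

```python
import os
import tempfile

# Hypothetical stand-in for sNotUsed.txt, created in a temporary directory.
tmp_dir = tempfile.mkdtemp()
not_used_path = os.path.join(tmp_dir, "sNotUsed.txt")

with open(not_used_path, "w") as out_file:
    out_file.write("fileNameWithoutExtension1\nfileNameWithoutExtension2\n")

# No explicit close() call: the file is closed when the block ends.
with open(not_used_path) as not_used_file:
    lines = not_used_file.readlines()

print(not_used_file.closed)
print(len(lines))
```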

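Escaping Windows file names

Backslashes in Windows file names can cause problems in Python programs. You must either escape them with a double backslash or use a raw string.

For example: if your Windows file name is "dir\notUsed.txt", you should escape it like this: "dir\\notUsed.txt", or use a raw string: r"dir\notUsed.txt". If you don't, the "\n" will be interpreted as a newline character!

Note: if you need to support Unicode file names, you can use a Unicode raw string (Python 2 syntax): ur"dir\notUsed.txt".

See also question 19065115 on Stack Overflow.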
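The difference can be checked directly. This short sketch reuses the example file name from above and compares the unescaped, escaped, and raw spellings:

```python
plain = "dir\notUsed.txt"      # "\n" is turned into a newline character
escaped = "dir\\notUsed.txt"   # doubled backslash
raw = r"dir\notUsed.txt"       # raw string, backslash kept literally

print("\n" in plain)           # the newline really is in the string
print(escaped == raw)          # both spellings give the same path
print(len(plain), len(raw))    # the unescaped version is one character shorter
```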

Store the file names in a set: an optimized collection with no duplicates

not_used_path = ur"dir\sNotUsed.txt"
with open(not_used_path) as not_used_file:
    not_used_set = set([line.strip() for line in not_used_file])
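The strip() call matters: lines read from a file keep their trailing newline, which is one likely reason the membership test in the question never succeeded. A small sketch, using an in-memory io.StringIO object as a hypothetical stand-in for sNotUsed.txt:

```python
import io

# Hypothetical in-memory stand-in for sNotUsed.txt.
not_used_file = io.StringIO(
    u"fileNameWithoutExtension1\nfileNameWithoutExtension2\n")
raw_lines = list(not_used_file)  # each entry still ends with "\n"

not_used_set = set(line.strip() for line in raw_lines)

name = u"fileNameWithoutExtension1"
print(name in raw_lines)     # compared against "fileNameWithoutExtension1\n"
print(name in not_used_set)  # compared against the stripped names
```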

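Compile your regular expression

Compiling a regular expression is more efficient when it is used many times. Again, you should use a raw string to avoid backslash interpretation.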

pattern = re.compile(r"\w/([\w\-]+)")
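To see what this pattern captures, here is a quick sketch that applies it to two lines taken from the sample XML in the question:

```python
import re

pattern = re.compile(r"\w/([\w\-]+)")

file_line = "<fileType>../../dir/fileNameWithoutExtension1</fileType>"
comment_line = "<!--COMMENT-->"

# Only a "/" preceded by a word character starts a match, so "../" and
# the "</" of the closing tag are skipped.
print(pattern.findall(file_line))
print(pattern.findall(comment_line))
```

Only the file name after the last directory separator is captured; lines without such a path give an empty list.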

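Warning: os.listdir() returns a list of file names, not a list of full paths. See the Python documentation for this function.

In your example, you use os.listdir() to read the directory 'C:\Users\xxx\Desktop\dir'. You then try to open each XML file in that directory with open(files, "r+"). This fails unless your current working directory happens to be that directory. The usual approach is to use os.path.join(), like this: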

desktop_dir = r'C:\Users\xxx\Desktop\dir'
for filename in os.listdir(desktop_dir):
    desktop_path = os.path.join(desktop_dir, filename)

If you want to extract the extension of a file name, you can use os.path.splitext().

desktop_dir = r'C:\Users\xxx\Desktop\dir'
for filename in os.listdir(desktop_dir):
    if os.path.splitext(filename)[1].lower() != '.xml':
        continue
    desktop_path = os.path.join(desktop_dir, filename)

You can simplify this with a list comprehension:

desktop_dir = r'C:\Users\xxx\Desktop\dir'
xml_list = [os.path.join(desktop_dir, filename)
            for filename in os.listdir(desktop_dir)
            if os.path.splitext(filename)[1].lower() == '.xml']
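The same filtering can be tried without touching a real Desktop folder. This sketch builds a temporary directory (the file names in it are invented for the example) and shows that os.listdir() yields bare names while the comprehension yields full paths:

```python
import os
import tempfile

# Hypothetical directory standing in for the Desktop folder.
xml_dir = tempfile.mkdtemp()
for name in ("a.xml", "B.XML", "notes.txt"):
    with open(os.path.join(xml_dir, name), "w") as handle:
        handle.write("<blocks/>\n")

names = sorted(os.listdir(xml_dir))  # bare file names only
xml_list = sorted(os.path.join(xml_dir, filename)
                  for filename in os.listdir(xml_dir)
                  if os.path.splitext(filename)[1].lower() == ".xml")

print(names)
print([os.path.basename(path) for path in xml_list])
```

Lowercasing the extension means "B.XML" is picked up as well, which a plain endswith('.xml') test would miss.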

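Parsing XML files

How should you parse an XML file? That's a good question! There are several options:

- use regular expressions: efficient but dangerous;
- use a SAX parser: also efficient, but confusing and hard to maintain;
- use a DOM parser: less efficient, but clearer. Consider the lxml package (see http://lxml.de/).

The regular-expression approach is dangerous because the way you read the file ignores the XML encoding. That is bad, really bad! XML files are usually encoded in UTF-8, so you should decode the UTF-8 byte stream first. A simple way to do that is to open the encoded file with codecs.open().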

for xml_path in xml_list:
    with codecs.open(xml_path, "r", encoding="UTF-8") as xml_file:
        content = xml_file.read()

With this solution, the full XML content is stored in the content variable as a Unicode string. You can then use a Unicode regular expression to parse the content.
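To illustrate why decoding matters, here is a small sketch that writes a made-up XML line containing a non-ASCII character and reads it back both as raw bytes and as decoded text. It uses io.open() rather than codecs.open(), only because io.open() behaves the same on Python 2 and Python 3; the file name and content are assumptions for the example.

```python
import io
import os
import tempfile

# Hypothetical sample file; "\u00e9" is a single character but two UTF-8 bytes.
xml_path = os.path.join(tempfile.mkdtemp(), "sample.xml")
text = u"<fileType>../../dir/caf\u00e9-file</fileType>\n"

# newline="\n" keeps the line ending unchanged on every platform.
with io.open(xml_path, "w", encoding="UTF-8", newline="\n") as xml_file:
    xml_file.write(text)

with io.open(xml_path, "rb") as xml_file:
    raw_bytes = xml_file.read()   # undecoded byte string
with io.open(xml_path, "r", encoding="UTF-8") as xml_file:
    content = xml_file.read()     # decoded Unicode string

print(len(raw_bytes) - len(content))  # bytes and characters do not line up
```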

Finally, you can use a set intersection to find out whether a given XML file contains names in common with the text file.

for xml_path in xml_list:
    with codecs.open(xml_path, "r", encoding="UTF-8") as xml_file:
        content = xml_file.read()
    actual_set = set(pattern.findall(content))
    print(not_used_set & actual_set)
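For the removal step described in the question, the DOM-style option can be sketched with the standard library's xml.etree.ElementTree instead of lxml (a substitution made here so the example has no dependencies). The XML below is a well-formed variant of the question's sample, with the missing closing tags added as an assumption. Note that ElementTree's default parser discards comments, so this is not a drop-in replacement if comments must be preserved.

```python
import xml.etree.ElementTree as ET

not_used_set = {"fileNameWithoutExtension1"}

# Well-formed variant of the sample XML (closing tags added for the example).
xml_text = """<blocks>
  <more stuff="name">
    <Tag2>
      <Tag3 name="Tag3">
        <!--COMMENT-->
        <fileType>../../dir/fileNameWithoutExtension1</fileType>
        <fileType>../../dir/fileNameWithoutExtension4</fileType>
      </Tag3>
    </Tag2>
  </more>
</blocks>"""

root = ET.fromstring(xml_text)

# Elements have no parent pointer, so visit each element and prune its children.
for parent in list(root.iter()):
    for child in list(parent):
        if child.tag == "fileType" and child.text:
            base_name = child.text.strip().rsplit("/", 1)[-1]
            if base_name in not_used_set:
                parent.remove(child)

remaining = [elem.text for elem in root.iter("fileType")]
print(remaining)
```

Writing the result back would then be a matter of serializing the tree, for example with ET.tostring(root), and saving it over the original file.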

Solution 1
5 Accepted 2013-10-28 22:40:30
