Match two files against each other and write the output as a file - Python

Question

I'm new to Python; this is only my second time coding in it. The main point of this script is to take a text file that contains thousands of lines of file names (the sNotUsed file) and match it against about 50 XML files. The XML files may contain up to thousands of lines each and are formatted as most XML files are. I'm not sure what the problem with the code so far is. The code is not complete, as I have not added the part where it writes the output back to an XML file, but the current last line should be printing at least once. It is not, though.

Examples of the two file formats are as follows:

TEXT FILE:

fileNameWithoutExtension1
fileNameWithoutExtension2
fileNameWithoutExtension3
etc.

XML FILE:

<blocks> 

<more stuff="name"> 
     <Tag2> 
        <Tag3 name="Tag3">
                 <!--COMMENT-->
                 <fileType>../../dir/fileNameWithoutExtension1</fileType>
                 <fileType>../../dir/fileNameWithoutExtension4</fileType>
</blocks>

MY CODE SO FAR:

import os
import re

sNotUsed=list()
sFile = open("C:\Users\xxx\Desktop\sNotUsed.txt", "r") # open snotused txt file
for lines in sFile:
    sNotUsed.append(lines)
#sNotUsed = sFile.readlines() # read all lines and assign to list
sFile.close() # close file

xmlFiles=list() # list of xmlFiles in directory
usedS=list() # list of S files that do not match against sFile txt

search = "\w/([\w\-]+)"

# getting the list of xmlFiles
filelist=os.listdir('C:\Users\xxx\Desktop\dir')
for files in filelist:
    if files.endswith('.xml'):
        xmlFile = open(files, "r+") # open first file with read + write access
        xmlComp = xmlFile.readlines() # read lines and assign to list
        for lines in xmlComp: # iterate by line in list of lines
            temp = re.findall(search, lines)
            #print temp
            if temp:
                if temp[0] in sNotUsed:
                    print "yes" # debugging. I know there is at least one match for sure, but this is not being printed.

TO HELP CLEAR THINGS UP: Sorry, I guess my question wasn't very clear. I would like the script to go through each XML file line by line and see whether the FILENAME part of that line exactly matches a line of the sNotUsed.txt file. If there is a match, I want to delete that line from the XML. If the line doesn't match any of the lines in sNotUsed.txt, I would like it to be part of the output of the new, modified XML file (which will overwrite the old one). Please let me know if this is still not clear.

EDITED, WORKING CODE

import os
import re
import codecs

sFile = open(r"C:\Users\xxx\Desktop\sNotUsed.txt", "r") # open sNotUsed txt file (raw string so the backslashes are kept as written)
sNotUsed=sFile.readlines() # read all lines and assign to list
sFile.close() # close file

search = re.compile(r"\w/([\w\-]+)")

sNotUsed=[x.strip().replace(',','') for x in sNotUsed]
directory=r'C:\Users\xxx\Desktop\dir'
filelist=os.listdir(directory) # getting the list of xmlFiles
# for each file in the list
for files in filelist:
    if files.endswith('.xml'): # make sure it is an XML file
        xmlFile = codecs.open(os.path.join(directory, files), "r", encoding="UTF-8") # open first file with read
        xmlComp = xmlFile.readlines() # read lines and assign to list
        print xmlComp
        xmlFile.close() # closing the file since the lines have already been read and assigned to a variable
        xmlEdit = codecs.open(os.path.join(directory, files), "w", encoding="UTF-8") # opening the same file again and overwriting all existing lines
        for lines in xmlComp: # iterate by line in list of lines
            #headerInd = re.search(search, lines) # used to get the headers, comments, and ending blocks
            temp = re.findall(search, lines) # finds all strings that match the regular expression compiled above and makes a list for each
            if temp: # if the list is not empty
                if temp[0] not in sNotUsed: # if the first (and only) value in each list is not in the sNotUsed list
                    xmlEdit.write(lines) # write it in the file
            else: # if the list is empty
                xmlEdit.write(lines) # write it (used to preserve the beginning and ending blocks of the XML, as well as comments)
        xmlEdit.close() # close the rewritten file so the changes are flushed to disk

Answer 1

There are a lot of things to say, but I'll try to stay concise.

PEP8: Style Guide for Python Code

You should use lower case with underscores for local variables. Take a look at PEP8: Style Guide for Python Code.

File objects and `with` statement

Use the with statement to open a file; see File Objects: http://docs.python.org/2/library/stdtypes.html#bltin-file-objects
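As a minimal sketch of the with statement: the file name not_used_demo.txt is a made-up placeholder, and the file is created in a temporary directory only so the snippet can run anywhere. The point is that the file is closed automatically when the block is left, even if an exception is raised inside it.

```python
import os
import tempfile

# Hypothetical demo file, created only so the example is self-contained.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "not_used_demo.txt")

with open(demo_path, "w") as demo_file:
    demo_file.write("fileNameWithoutExtension1\nfileNameWithoutExtension2\n")

# Reading: the file is closed as soon as the with block is left.
with open(demo_path) as demo_file:
    names = [line.strip() for line in demo_file]

print(names)
print(demo_file.closed)
```

No explicit close() call is needed, which removes one easy-to-forget step from the original script.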

Escape Windows filenames

Backslashes in Windows filenames can cause problems in Python programs. You must escape the string using double backslashes or use raw strings.

For example, if your Windows filename is "dir\notUsed.txt", you should escape it like this: "dir\\notUsed.txt", or use a raw string: r"dir\notUsed.txt". If you don't, the "\n" will be interpreted as a newline!

Note: if you need to support Unicode filenames, you can use Unicode raw strings: ur"dir\notUsed.txt" (the ur prefix exists only in Python 2).
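To make the difference concrete, here is a small sketch comparing the three spellings. The path is the same hypothetical one used in the text above, and nothing is opened on disk.

```python
# Hypothetical path, used only to show how backslashes are interpreted.
plain = "dir\notUsed.txt"      # "\n" becomes a newline character
escaped = "dir\\notUsed.txt"   # doubled backslash: one literal backslash
raw = r"dir\notUsed.txt"       # raw string: backslashes kept as written

print(repr(plain))
print(escaped == raw)
print("\n" in plain, "\n" in raw)
```

The escaped and raw forms are the same string; only the plain form silently contains a newline where the path separator should be.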

See also question 19065115 on Stack Overflow.

Store the filenames in a set: it is an optimized collection without duplicates.

not_used_path = ur"dir\sNotUsed.txt"
with open(not_used_path) as not_used_file:
    not_used_set = set([line.strip() for line in not_used_file])
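The strip() call matters as much as the set. One likely reason the original `in sNotUsed` test never matched is that lines read from a file keep their trailing newline. A small sketch with made-up in-memory data (io.StringIO stands in for the text file here):

```python
import io

# Made-up stand-in for sNotUsed.txt; io.StringIO behaves like an open text file.
fake_file = io.StringIO(u"fileNameWithoutExtension1\nfileNameWithoutExtension2\n")

raw_lines = list(fake_file)                        # each entry still ends with "\n"
not_used_set = set(line.strip() for line in raw_lines)

name = "fileNameWithoutExtension1"
print(name in raw_lines)      # False: the list holds "fileNameWithoutExtension1\n"
print(name in not_used_set)   # True: stripped names compare equal
```

Membership tests on a set are also faster than on a list when the text file holds thousands of names.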

Compile your regex

It is more efficient to compile a regex that is used many times. Again, you should use raw strings to avoid backslash interpretation.

pattern = re.compile(r"\w/([\w\-]+)")
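As a quick check of what this pattern captures, the sketch below applies it to a line taken from the XML excerpt in the question and to a closing tag, which should not match.

```python
import re

pattern = re.compile(r"\w/([\w\-]+)")

# Sample lines taken from the XML excerpt in the question.
file_line = "<fileType>../../dir/fileNameWithoutExtension1</fileType>"
closing_line = "</blocks>"

print(pattern.findall(file_line))     # ['fileNameWithoutExtension1']
print(pattern.findall(closing_line))  # []
```

The leading \w requires a word character before the slash, so "../" and "</" are skipped and only the name after "dir/" is captured.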

Warning: the os.listdir() function returns a list of filenames, not a list of full paths. See this function in the Python documentation.

In your example, you read a desktop directory 'C:\Users\xxx\Desktop\dir' with os.listdir(), and then you open each XML file in this directory with open(files, "r+"). This is wrong unless your current working directory happens to be that directory. The classic usage is to use the os.path.join() function like this:

desktop_dir = r'C:\Users\xxx\Desktop\dir'
for filename in os.listdir(desktop_dir):
    desktop_path = os.path.join(desktop_dir, filename)

If you want to extract the filename's extension, you can use the os.path.splitext() function.

desktop_dir = r'C:\Users\xxx\Desktop\dir'
for filename in os.listdir(desktop_dir):
    if os.path.splitext(filename)[1].lower() != '.xml':
        continue
    desktop_path = os.path.join(desktop_dir, filename)

You can simplify this with a list comprehension:

desktop_dir = r'C:\Users\xxx\Desktop\dir'
xml_list = [os.path.join(desktop_dir, filename)
            for filename in os.listdir(desktop_dir)
            if os.path.splitext(filename)[1].lower() == '.xml']

Parse an XML file

How to parse an XML file? This is a great question! There are several possibilities:

- use regex: efficient but dangerous;
- use a SAX parser: efficient too, but confusing and difficult to maintain;
- use a DOM parser: less efficient but clearer. Consider using the lxml package (see http://lxml.de/).

The regex approach is dangerous because, the way you read the file, you ignore the XML encoding. And that is bad! Very bad indeed! XML files are usually encoded in UTF-8. You should first decode the UTF-8 byte stream. A simple way to do that is to use codecs.open() to open an encoded file.

for xml_path in xml_list:
    with codecs.open(xml_path, "r", encoding="UTF-8") as xml_file:
        content = xml_file.read()

With this solution, the full XML content is stored in the content variable as a Unicode string. You can then use a Unicode regex to parse the content.

Finally, you can use a set intersection to find whether a given XML file contains names in common with the text file.

for xml_path in xml_list:
    with codecs.open(xml_path, "r", encoding="UTF-8") as xml_file:
        content = xml_file.read()
    actual_set = set(pattern.findall(content))
    print(not_used_set & actual_set)
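For comparison, the DOM option listed under "Parse an XML file" could be sketched as follows. This is an illustration under assumptions, not the method from the question: it uses the standard library's xml.etree.ElementTree instead of lxml, and the XML text is a made-up, well-formed document loosely modelled on the question's excerpt.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed document loosely modelled on the question's excerpt.
xml_text = """<blocks>
  <Tag2>
    <Tag3 name="Tag3">
      <fileType>../../dir/fileNameWithoutExtension1</fileType>
      <fileType>../../dir/fileNameWithoutExtension4</fileType>
    </Tag3>
  </Tag2>
</blocks>"""

not_used_set = {"fileNameWithoutExtension1"}

root = ET.fromstring(xml_text)

# ElementTree elements have no parent pointer, so visit every element and
# remove matching <fileType> children from it. The list() copies avoid
# modifying the tree while it is being iterated.
for parent in list(root.iter()):
    for child in list(parent):
        if child.tag != "fileType" or not child.text:
            continue
        # Keep only the last path component, e.g. "fileNameWithoutExtension1".
        name = child.text.strip().rsplit("/", 1)[-1]
        if name in not_used_set:
            parent.remove(child)

remaining = [element.text for element in root.iter("fileType")]
print(remaining)
```

Note that ElementTree's default parser discards comments, so this approach would need a different parser setup if the comments in the XML files must be preserved; the line-based regex approach keeps them untouched.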

Solution 1
5 Accepted 2013-10-28 22:40:30
