简体   繁体   English

Python检查gzip压缩的文件是xml还是csv

[英]Python to check if a gzipped file is xml or csv

I have a script to pull in a variety of gzip and bz2 compressed files. 我有一个脚本可以提取各种gzip和bz2压缩文件。 After I pull them in, I am looking to write a script to write the file and add an extension based on the file type contained within. 将它们插入后,我希望编写一个脚本来写入文件并根据其中包含的文件类型添加扩展名。

The file formats I am concern about include xml, csv, and txt files, although I am not really concerned about delineating between csv and txt files (adding the txt extension is alright for both). 我关心的文件格式包括xml,csv和txt文件,尽管我并不真正关心在csv和txt文件之间进行区分(添加txt扩展名对这两种方法都可以)。

I have been using the python-magic library to determine which decompression library to use (bz2 vs gzip) but want to know what the easiest way to determine the file type. 我一直在使用python-magic库来确定要使用的解压缩库(bz2 vs gzip),但想知道确定文件类型的最简单方法是什么。 Using python-magic I got: 使用python-magic我得到了:

>>> ftype = m.from_file("xml_test.xml")
>>> ftype
'ASCII text'
>>> ftype = m.from_file("csv_test.csv")
>>> ftype
'ASCII text'

My current plan is to read in the first line of each file and make the determination based on that. 我当前的计划是读取每个文件的第一行,然后根据该行进行确定。 Is there an easier way? 有更容易的方法吗?

In response to @phihag's answer showing me how poorly I originally worded this question: What I want is something that will check first if a file is valid XML, if not then check if it is valid CSV, and finally if it is not valid CSV but is valid plain text, return that as a response 回应@phihag的答案时,我最初对这个问题的回答是很糟糕的:我想要的是一种将首先检查文件是否为有效XML的东西,如果不是,则首先检查其是否为CSV,最后检查其是否为无效CSV。但为有效的纯文本,请作为响应返回

Note: There was a partial answer here but this solution only describes csv check, not an xml, txt, etc. 注意:这里有部分答案但是此解决方案仅描述了csv检查,而不是xml,txt等。

You cannot reliably distinguish XML and csv, as the following file is both a valid XML as well as a valid CSV document: 您不能可靠地区分XML和csv,因为以下文件既是有效的XML也是有效的CSV文档:

<r>,</r>

Therefore, all you can do is apply a heuristic, for example return xml if the first character is < , and csv otherwise. 因此,您所能做的就是应用启发式方法,例如,如果第一个字符是< ,则返回xml,否则返回csv。

Similarily, all CSV and XML files are also valid plain text files. 同样,所有CSV和XML文件也是有效的纯文本文件。

To check whether a file forms a valid XML or CSV document, you can simply parse it. 要检查文件是否形成有效的XML或CSV文档,您可以简单地对其进行解析。 If you're out for performance, simply skip the construction of an actual document tree, for example with sax or by ignoring the items of csv.reader : 如果您出于性能考虑,只需跳过构建实际文档树的步骤,例如使用sax或忽略csv.reader的各项

import xml.sax,csv
def getType(filename):
  with open(filename, 'rb') as fh:
    try:
      xml.sax.parse(fh, xml.sax.ContentHandler())
      return 'xml'
    except: # SAX' exceptions are not public
      pass
    fh.seek(0)

    try:
      for line in csv.reader(fh):
        pass
      return 'csv'
    except csv.Error:
      pass

    return 'txt'

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM