How can I find all text files, regardless of extension, that contain only commas and digits?

Question

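I have to search for files that may have any extension. What sets these files apart is that they are fewer than five lines long (fewer than 4 \n\r) and that, apart from line breaks, every character is a digit, a space, or a comma. How can I write code that searches for files based on their content?

I am well aware that this will take a long time to run.

My project does not require Java or Python; I only mention them because I am more familiar with them. A PowerShell solution would be welcome.

I am running Windows 7.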

Answer 1

Something like the following should work:

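import os

base = '.'  # directory to search (placeholder; set to your own path)
# Allowed byte values: iterating over a bytes line in Python 3 yields integers
valid_chars = set(b'0123456789, \r\n')
for root, dirs, files in os.walk(base):
    for fname in files:
        fpath = os.path.join(root, fname)
        with open(fpath, 'rb') as f:
            for i, line in enumerate(f):
                # i == 4 means a fifth line; the question wants fewer than five
                if i >= 4 or not all(c in valid_chars for c in line):
                    break
            else:
                # the loop ended without break, so every line passed
                print('found file: ' + fpath)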

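You can use a regular expression instead of not all(c in valid_chars for c in line). The pattern is a bytes literal because the file is opened in binary mode:

            ...
                if i >= 4 or not re.match(br'[\d, \r\n]*$', line):
            ...

If you use a regular expression, call re.compile outside the loop for efficiency.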
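
As a minimal sketch of that advice, the pattern can be compiled once and reused for every line. The names LINE_RE, MAX_LINES, and lines_ok are illustrative and not part of the answer, and the limit of four lines is an assumption based on the question's "fewer than five lines".

```python
import re

# Compiled once, outside any loop. A bytes pattern, because files are read in binary mode.
LINE_RE = re.compile(br'[0-9, \r\n]*\Z')
MAX_LINES = 4  # assumption: "fewer than five lines" means at most four


def lines_ok(lines):
    """Return True if there are at most MAX_LINES lines and each one
    contains only digits, commas, spaces, and line-break characters."""
    for i, line in enumerate(lines):
        if i >= MAX_LINES or not LINE_RE.match(line):
            return False
    return True
```

A file object opened with open(fpath, 'rb') can be passed straight to lines_ok, since iterating over it yields one bytes line at a time.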

Answer 2

import os

expected_chars = set(' ,1234567890\n\r')
nlines = 4  # "fewer than five lines" means at most four
max_file_size = 1000  # ignore files of 1000 bytes or more; this speeds things up


def process_dir(out, dirname, fnames):
    for fname in fnames:
        fpath = os.path.join(dirname, fname)

        if os.path.isfile(fpath):

            statinfo = os.stat(fpath)

            if statinfo.st_size < max_file_size:
                # newline='' keeps \r\n intact; latin-1 decodes any byte without error
                with open(fpath, newline='', encoding='latin-1') as f:
                    # read the first n lines
                    firstn = [f.readline() for _ in range(nlines)]

                    # if there are any more lines left this is not our file
                    if f.readline():
                        continue

                    # if the first n lines contain only spaces, commas, digits and new lines
                    # this is our kind of file; add it to the results.
                    if not set(''.join(firstn)) - expected_chars:
                        out.append(fpath)


out = []
# os.path.walk was removed in Python 3; os.walk supplies the same directory and file lists
for dirname, _dirs, fnames in os.walk("/some/path/"):
    process_dir(out, dirname, fnames)

Answer 3

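You can use grep with the -r and -l options. -r searches recursively through all files in a directory, and -l prints only the names of files whose contents match the regular expression. Because the pattern spans several lines, GNU grep also needs -P (Perl-compatible regular expressions) and -z (records end at a NUL byte, so a file without NUL bytes is read as a single record).

grep -r -l -P -z '\A([0-9, ]+\r?\n){0,3}[0-9, ]+(\r?\n)?\Z' directory

This prints the names of all files that have fewer than five lines made up of digits, spaces, and commas.

\A and \Z anchor the match to the start and end of the file contents. [0-9, ]+ matches a run of digits, spaces, or commas, and \r?\n matches the line break that follows it. That sequence can repeat up to three times, as indicated by {0,3}, and is followed by one more line of data with an optional trailing line break.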
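
grep is not installed by default on Windows, so a whole-file check of the same kind could also be sketched in Python. This is only an illustration under assumptions: the names WHOLE_FILE_RE and content_matches are hypothetical, and the pattern accepts one to four lines, with or without a trailing line break.

```python
import re

# Up to three complete lines, then a final line, then an optional trailing line break.
WHOLE_FILE_RE = re.compile(br'(?:[0-9, ]+\r?\n){0,3}[0-9, ]+(?:\r?\n)?')


def content_matches(data):
    """Return True if the bytes in data form one to four lines made up
    only of digits, spaces, and commas."""
    return WHOLE_FILE_RE.fullmatch(data) is not None
```

The function expects the complete file contents as bytes, for example the result of open(fpath, 'rb').read(). Reading whole files is reasonable only if large files are skipped first by size.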

Answer 4

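In Python (I will only outline the steps so that you can program it yourself, but of course feel free to ask if you get stuck):

Use os.path.walk (os.walk in Python 3, where os.path.walk no longer exists) to find all files; it gives you every file regardless of its extension.
Note that it also gives you directories and so on, so use os.path.isfile to skip them.
For each file:
- Open it (open). Do the following inside a with statement so that you do not have to close the file manually.
- You could count the lines first and then check for commas, but that is probably slower, so:
- Read the file line by line. For each line, do two things:
  - Count the line. If you reach 5, move on to the next file.
  - Check whether it matches the digits-and-commas criterion. I would use a regular expression. If it does not match, move on to the next file.
- If you reach the end of the file, you have succeeded, so you can print the file name or do whatever you want with it.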
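
The steps above could be put together roughly as follows. This is a sketch rather than a definitive implementation: find_matching_files, LINE_RE, and the max_lines parameter are names made up for illustration, os.walk stands in for os.path.walk, and unreadable files are simply skipped.

```python
import os
import re

# Assumption: a valid line holds only digits, commas, spaces, and line breaks.
LINE_RE = re.compile(br'[0-9, \r\n]*\Z')


def find_matching_files(base, max_lines=4):
    """Yield paths under base whose files have at most max_lines lines,
    all of which match LINE_RE."""
    for root, _dirs, files in os.walk(base):
        for fname in files:
            fpath = os.path.join(root, fname)
            if not os.path.isfile(fpath):
                continue  # skip anything that is not a regular file
            try:
                with open(fpath, 'rb') as f:
                    for count, line in enumerate(f, start=1):
                        if count > max_lines or not LINE_RE.match(line):
                            break  # too many lines or a disallowed character
                    else:
                        yield fpath  # reached end of file without a failure
            except OSError:
                continue  # unreadable file: move on to the next one
```

Because the function is a generator, results can be printed as they are found, for example with for path in find_matching_files('C:\\data'): print(path), where the directory is a placeholder. Note that an empty file has zero lines and therefore counts as a match in this sketch.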

Answer 1: 1, 2012-10-23 22:12:11
Answer 2: 1, 2012-10-23 22:26:06
Answer 3: 1, 2012-10-24 04:39:46
Answer 4: 0, 2012-10-23 22:13:45