Match file names and replace them with new names

Question

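I have two .txt files with similar ID tags. I need to take an ID tag from one file, find its match in the other file, and then replace the ID with the name from the first file. I need to do this for more than 1000 tags. The key is to exactly match part of the ID tag name from the first file and replace it.

Each line has one unique ID tag, and there is always an exact match between the two files (for positions [6-16] = "10737.G1C22"); the matches are scattered, so line 1 in File1.txt may match line 504 in File2.txt.
The lines in both files cannot be sorted; their order must be preserved.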

For example:

File1.txt = 
TYPE1_10737.G1C22 ---------
...

File2.txt = 
10737.G1C22 ----------

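I need the name from File1.txt, specifically "10737.G1C22", to find its exact match in File2.txt and replace it with "TYPE1_10737.G1C22".

After the edit it should look like this, with the name in File2.txt changed according to the match in File1.txt: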

 File2.txt = 
 TYPE1_10737.G1C22 ---------
 ...

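I tried a few sed functions but ran into trouble. It is important that, once the exact match is found, only the first 6 characters of the name change and nothing else. There are more than 1000 ID tags that need to be matched and changed.

I was thinking of code that tells it to exactly match positions [6-16] and replace them with positions [0-16] from File1.txt.

Any help is much appreciated. Is this even possible? I am open to other suggestions as well. Thank you.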

Answer 1

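A Bash and `ed` solution

Step 1. Create a File1.txt and a File2.txt that look more or less like yours (1000 lines), to experiment and have some fun. Use this script (in a scratch directory):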

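    #!/bin/bash
    # Generate 1000 unique random keys of the form XXXXX.XXXXX (uppercase).
    declare -A table
    while ((${#table[@]}!=1000)); do
        key=$(mktemp -u XXXXXXXXXX)
        key=${key:0:5}.${key:5}
        table[${key^^}]=1
    done
    # File1.txt gets the prefixed names (via fd 3); File2.txt gets the
    # bare keys in shuffled order.
    {
        for key in "${!table[@]}"; do
            echo "TYPE1_$key some junk here" >&3
            echo "$key some more junk here"
        done | shuf > File2.txt
    } 3> File1.txt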

Step 2. Use ed (the standard editor) to perform the replacements, wrapped in this script:
```
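#!/bin/bash
# Feed ed one substitution command per line of File1.txt, then save and quit.
ed -s File2.txt < <(
    while read -r l _; do
        p=${l:6}          # drop the 6-character prefix, e.g. "TYPE1_"
        p=${p//./\\.}     # escape each period so the regex matches it literally
        echo "%s/^$p/$l/"
    done < File1.txt
    echo wq
)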
```
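This assumes you only have alphanumeric characters, underscores `_` and periods `.`. If you have other characters, adjust the script accordingly (so that they do not clash with the regex).

Step 3. Check and enjoy: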
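For comparison, the same escaping concern can be sketched in Python. This is not part of the `ed` script above: `build_substitution` is a hypothetical helper, and it assumes, as in the question, that the ID is everything after the first 6 characters of the File1.txt name. `re.escape` takes care of any regex metacharacter in the ID, not only the period.

```python
import re

def build_substitution(name, prefix_len=6):
    """Return (compiled pattern, replacement) for one File1.txt name.

    Hypothetical helper: assumes the ID is everything after the first
    `prefix_len` characters of the name, e.g. 'TYPE1_' + '10737.G1C22'.
    """
    tag_id = name[prefix_len:]
    # re.escape backslash-escapes regex metacharacters such as '.' or '+'.
    pattern = re.compile("^" + re.escape(tag_id))
    return pattern, name

pattern, replacement = build_substitution("TYPE1_10737.G1C22")
# A function replacement avoids backslash processing in the new name.
print(pattern.sub(lambda m: replacement, "10737.G1C22 ----------"))
print(pattern.sub(lambda m: replacement, "10737XG1C22 ----------"))
```

The first line is rewritten to start with `TYPE1_10737.G1C22`; the second is left alone because the escaped period no longer matches an arbitrary character.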

 vimdiff <(sort File1.txt) <(sort File2.txt)

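Done.

Note. Since ed is a real editor, the replacement is done in place. File2.txt is genuinely edited.

Hey, wait, I might have overlooked the 16-character requirement... I relied on the fact that your pattern is followed by a space. If my solution falls short on this point, let me know and I will modify it accordingly.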
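One way to sidestep the question of where the ID ends is to compare the whole first space-delimited field rather than a prefix. A minimal Python sketch, assuming (as in the sample data) that the ID starts at the beginning of each File2.txt line and is followed by a single space; `rename_line` and `names_by_id` are hypothetical names:

```python
def rename_line(line, names_by_id):
    """Replace the leading ID of `line` only if the whole field matches."""
    # partition keeps the separator and the rest of the line untouched.
    field, sep, rest = line.partition(" ")
    new_name = names_by_id.get(field)
    if new_name is None:
        return line  # no exact match: leave the line as it is
    return new_name + sep + rest

# Build the lookup from File1.txt names, dropping the 6-character prefix.
names_by_id = {name[6:]: name for name in ["TYPE1_10737.G1C22"]}
print(rename_line("10737.G1C22 ----------", names_by_id))
print(rename_line("10737.G1C222 ----------", names_by_id))
```

Because the lookup is a dictionary keyed on the full ID, a longer ID that merely starts with the same characters (the second call) is not rewritten.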

Answer 2

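A Python-based solution is straightforward. Note, however, that it cannot be done in place: you have to store the result in some new location, e.g. a tempfile.

If your files are not too large, i.e. the mapping can be built in memory, then it is simple, assuming that 1) the name is separated from the id by an underscore, 2) the id is separated from the text by a space, as in the example, 3) each line contains both the id and the name, and 4) there is only one name per id in file1: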

file1 = ('TYPE1_10737.G1C22 ---------', )
file2 = ('10737.G1C22 +++++++++++', )
id_name_gen = (l.split(' ', 1)[0] for l in file1)
id2name_mapping = {line.split('_', 1)[1]: line for line in id_name_gen}

Once you have the mapping, the replacement is easy (if no match is found, the string is left unchanged):

id_rest_gen = (l.split(' ', 1) for l in file2)
file2updated_gen = ('{} {}'.format(id2name_mapping.get(id, id), rest) for id, rest in id_rest_gen)

>>> list(file2updated_gen)
['TYPE1_10737.G1C22 +++++++++++']

All that remains is to store the generated result in a file.
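That last step could look roughly like the sketch below, which writes to a temporary file and then swaps it in for the original. `rewrite_file` is a hypothetical helper, not part of the answer above; it assumes the same line layout (id, one space, rest) and an already-built `id2name` mapping.

```python
import os
import tempfile

def rewrite_file(path, id2name):
    """Rewrite `path`, replacing each leading ID that appears in `id2name`."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file next to the target so that os.replace
    # stays on one filesystem.
    with open(path) as src, tempfile.NamedTemporaryFile(
        "w", dir=directory, delete=False
    ) as tmp:
        for line in src:
            tag_id, sep, rest = line.partition(" ")
            # Unknown IDs fall back to themselves, so the line is unchanged.
            tmp.write(id2name.get(tag_id, tag_id) + sep + rest)
    os.replace(tmp.name, path)  # swap the rewritten copy into place
```

A call such as `rewrite_file("File2.txt", id2name_mapping)` would then update the file. The temporary file is created with restrictive permissions, so the replaced file may not keep the original file mode.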

Answer 3

A simple solution in Python:

from collections import OrderedDict
LINES_PER_CYCLE = 1000

with open('output.txt', 'w') as output, open('test_2.txt', 'r') as fin:
    fin_line = ''

    # Loop until fin reaches EOF.
    while True:
        cache = OrderedDict()

        # Fill the cache with up to LINES_PER_CYCLE entries.
        for _ in range(LINES_PER_CYCLE):
            fin_line = fin.readline()
            if not fin_line:
                break

            key, rest = fin_line.strip().split(' ', 1)
            # Default to the bare id so unmatched lines are written unchanged.
            cache[key] = [key, rest]

        # Loop over test_1.txt to find tags with the cached ids.
        with open('test_1.txt', 'r') as tags:
            for line in tags:
                tag, _ = line.split(' ', 1)
                _, idx = tag.rsplit('_', 1)
                if idx in cache:
                    cache[idx][0] = tag

        # Write the lines to the output file, in the same order
        # as they were inserted into the cache.
        for tag, rest in cache.values():
            output.write('{} {}\n'.format(tag, rest))

        # If fin has reached EOF, break.
        if not fin_line:
            break

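What it does is read up to LINES_PER_CYCLE entries from file_2.txt (`test_2.txt` in the code), find the matching entries in file_1.txt (`test_1.txt`), and write them to the output. Because memory for the cache is limited, file_1.txt is searched once per cycle, i.e. repeatedly.

This assumes that the tag/id part is separated from the `-------` by a space, and that the tag and the id are separated from each other by an underscore, i.e. 'tag_idx blah blah'.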
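The batching idea on its own can be sketched with `itertools.islice`, which pulls a bounded number of items from any iterator (including an open file). `read_in_chunks` is a hypothetical helper name used only for illustration:

```python
from itertools import islice

def read_in_chunks(lines, chunk_size):
    """Yield successive lists of at most `chunk_size` items from `lines`."""
    iterator = iter(lines)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return  # the source is exhausted
        yield chunk

# With a real file this would be read_in_chunks(open_file, 1000).
for chunk in read_in_chunks(["a x", "b y", "c z"], 2):
    print(chunk)
```

Each chunk can then be turned into a cache and matched against the first file, so memory use is bounded by the chunk size rather than by the size of the second file.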

Answer 4

I would load the first file into a dictionary, then process the second file to match the keys, writing all the changes to a third file:

import re

# Pattern to match in File1 (raw string, so the backslashes reach the regex engine)
pattern1 = r"(\w+)_(\d+\.\w+)\s+.*$"

# Pattern to match in File2
pattern2 = r"(\d+\.\w+)\s+.*$"

# Load the 'master' file into a dict,
# with the number as key and 'type' as value.
file1_dict = dict()
with open("File1.txt", "r") as f:
    for line in f.readlines():
        m = re.match(pattern1, line)
        if m:
            file1_dict[m.group(2)] = m.group(1)

# Open a new output file to replace File2.txt
with open("File3.txt", "w") as fnew:
    # As you process each line in File2.txt,
    # find matching entry in above File1 list.
    # Either write the old unmatched value or new
    # matching, changed value to File3.txt
    with open("File2.txt", "r") as f:
        for line in f.readlines():
            is_found = False
            m = re.match(pattern2, line)
            if m:
                if m.group(1) in file1_dict:
                    is_found = True
                    fnew.write("{0}_{1}".format(file1_dict[m.group(1)], line))
            if not is_found:
                fnew.write(line)

# Then just overwrite File2.txt with new File3.txt contents.

# Original File1.txt
TYPE1_10737.G1C22 ---------
TYPE1_10738.G1C22 ---------
TYPE1_10739.G1C22 ---------
TYPE1_10740.G1C22 ---------
TYPE1_10741.G1C22 ---------
TYPE1_10742.G1C22 ---------
TYPE1_10799.G1C22 ---------

# Original File2.txt
10737.G1C22 ---------
10738.G1C22 ---------
10739.G1C22 ---------
10740.G1C22 ---------
10788.G1C22 ---------
10741.G1C22 ---------
10742.G1C22 ---------

# Results of new File3.txt
TYPE1_10737.G1C22 ---------
TYPE1_10738.G1C22 ---------
TYPE1_10739.G1C22 ---------
TYPE1_10740.G1C22 ---------
10788.G1C22 ---------
TYPE1_10741.G1C22 ---------
TYPE1_10742.G1C22 ---------
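To try the same logic without touching any files, it can be wrapped in a function that works on lists of lines. This is a sketch under the same assumptions as the patterns above; `rename_lines` and the shortened sample lists are illustrative only.

```python
import re

FILE1_PATTERN = re.compile(r"(\w+)_(\d+\.\w+)\s+.*$")
FILE2_PATTERN = re.compile(r"(\d+\.\w+)\s+.*$")

def rename_lines(file1_lines, file2_lines):
    """Return file2_lines with the File1 type prefix added where IDs match."""
    # Map each ID to its type prefix, e.g. '10737.G1C22' -> 'TYPE1'.
    types_by_id = {}
    for line in file1_lines:
        m = FILE1_PATTERN.match(line)
        if m:
            types_by_id[m.group(2)] = m.group(1)

    result = []
    for line in file2_lines:
        m = FILE2_PATTERN.match(line)
        if m and m.group(1) in types_by_id:
            line = "{0}_{1}".format(types_by_id[m.group(1)], line)
        result.append(line)  # unmatched lines pass through unchanged
    return result

file1 = ["TYPE1_10737.G1C22 ---------", "TYPE1_10799.G1C22 ---------"]
file2 = ["10737.G1C22 ---------", "10788.G1C22 ---------"]
print(rename_lines(file1, file2))
# ['TYPE1_10737.G1C22 ---------', '10788.G1C22 ---------']
```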

4 answers

Answer 1
1 2013-11-10 16:10:42

Answer 2
1 2013-11-10 16:15:01

Answer 3
1 accepted 2013-11-10 17:16:17

Answer 4
1 2013-11-11 02:41:37
