Remove breadcrumbs from reStructuredText files in PyCharm

Question

I have about 13,000 files from which I need to remove breadcrumbs. The pattern at the start of each file looks roughly like this:

Title
=====

| |image0| `link <link1.html>`__ |image1| ::
  `link2 <link2.html>`__ ::
  `link3 <link3.html>`__
| **Introduced** : VersionXXX

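However, in some files the section between the title line and the last line is 2 or 4 lines long, depending on the depth of the tree. Regardless of how many lines there are between the title and the last line shown here, I want the middle section removed entirely. I can't quite figure out how to do this and would appreciate some help. I am using PyCharm, which has a regex tool (I haven't had any success with it so far), but I am equally happy to use an alternative such as sed or Python to iterate over the files.

Expected result: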

Title
=====

| **Introduced** : VersionXXX

Thanks for all the great solutions. The final solution, which avoids writing to a separate file:

import os

src_dir = '/PycharmProjects/docs/testfiles'

with open('failed_file_log.txt', 'w') as logf:
    for filename in os.listdir(src_dir):
        print(filename)
        path = os.path.join(src_dir, filename)

        with open(path, 'r') as f:
            lines = f.readlines()

        try:
            # Index of the first "Introduced" line; everything from there on is kept.
            start = next(i for i, line in enumerate(lines)
                         if line.startswith('| **Introduced**'))
        except StopIteration:
            # No marker line: leave the file untouched and record it in the log.
            logf.write('Failed to rewrite {}\n'.format(filename))
            continue

        # Keep the title, its underline and the blank line, then the rest.
        with open(path, 'w') as f:
            f.writelines(lines[:3] + lines[max(start, 3):])

Answer 1

Since the question is tagged sed, here are two one-liners that give the desired result:

sed -n  '/Title/{N;N;p}; /Introduced/{p}' input
Title
=====

| **Introduced** : VersionXXX

Or, with awk:

awk '/Title/{print;getline;print;getline;print}/Introduced/{print}' input
Title
=====

| **Introduced** : VersionXXX

Answer 2

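Since you are looking for a mostly fixed pattern, I would use Python without a regex to copy the files. The procedure is quite simple: copy the first three lines, skip everything until you reach the line starting with | **Introduced**, then copy the rest over.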

with open('myfile.rst') as fin, open('myfile_out.rst', 'w') as fout:
    for _ in range(3):
        fout.write(next(fin))
    copy = False
    for line in fin:
        if copy:
            fout.write(line)
        elif line.startswith('| **Introduced**'):
            copy = True
            fout.write(line)

Applying this snippet to a file hierarchy, and moving the output back to the input name, is left as an exercise for the reader.
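One way that exercise could go, as a minimal sketch rather than a definitive implementation. The helper names strip_breadcrumbs and rewrite_tree are hypothetical, and the sketch assumes that the target files end in .rst, are UTF-8 encoded, and that a file with no | **Introduced** line should be left untouched. It writes each result to a temporary file in the same directory and then moves it over the input name with os.replace.

```python
import os
import shutil
import tempfile

MARKER = '| **Introduced**'


def strip_breadcrumbs(lines):
    """Return the first three lines plus everything from the marker line on.

    Returns None when no line starts with MARKER, so the caller can skip the file.
    """
    for index, line in enumerate(lines):
        if line.startswith(MARKER):
            return lines[:3] + lines[max(index, 3):]
    return None


def rewrite_tree(root):
    """Rewrite every .rst file under root in place; return the skipped paths."""
    skipped = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith('.rst'):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding='utf-8') as fin:
                new_lines = strip_breadcrumbs(fin.readlines())
            if new_lines is None:
                skipped.append(path)
                continue
            # Write next to the original, then move the result over the input name.
            with tempfile.NamedTemporaryFile('w', dir=dirpath, delete=False,
                                             encoding='utf-8') as fout:
                fout.writelines(new_lines)
            shutil.copymode(path, fout.name)  # keep the original permission bits
            os.replace(fout.name, path)
    return skipped
```

Calling rewrite_tree on the documentation root returns the list of files that had no marker line, which can be reviewed by hand afterwards.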

Answer 3

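You could use 2 capturing groups and match what lies between them with a repeating pattern that uses a negative lookahead, (?!...), to assert that a line does not start with | **.

Then use the two groups in the replacement. In Python, with re.sub, the replacement would be r'\1\2'.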

(\bTitle\n=+\n)(?:\n(?!\| \*\*).*)*(\n\| \*\*Introduced\*\* : Version.*)

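Explanation

(\bTitle\n=+\n) Capture group 1: match Title, a newline, = one or more times, and a newline
(?: Non-capturing group
- \n(?!\| \*\*).* Match a newline and assert with a negative lookahead that what is directly to the right is not | **. Then match any character except a newline 0+ times
)* Close the non-capturing group and repeat it 0+ times
(\n\| \*\*Introduced\*\* : Version.*) Capture group 2: match a newline and the pattern that matches the last line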
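As a quick check of the pattern described above, here is a minimal sketch that applies it with re.sub to the sample from the question. The names PATTERN, sample and cleaned are illustrative only. Note that the pattern matches the literal word Title, so real files with other titles would presumably need that part generalised.

```python
import re

# Pattern copied from this answer; it matches the literal word "Title".
PATTERN = re.compile(
    r'(\bTitle\n=+\n)(?:\n(?!\| \*\*).*)*(\n\| \*\*Introduced\*\* : Version.*)'
)

sample = (
    "Title\n"
    "=====\n"
    "\n"
    "| |image0| `link <link1.html>`__ |image1| ::\n"
    "  `link2 <link2.html>`__ ::\n"
    "  `link3 <link3.html>`__\n"
    "| **Introduced** : VersionXXX\n"
)

# Keep group 1 (title block) and group 2 (Introduced line); drop what lies between.
cleaned = PATTERN.sub(r'\1\2', sample)
print(cleaned)
```

Group 2 starts with a newline, so the blank line between the title underline and the Introduced line is preserved in the result.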

Answer 4

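This expression uses three capturing groups. The part we do not want is in the second group, so we can simply replace the match with the first and third groups ($1$3).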

(.+\s*=====\s*)([\s\S]*)(\|\s+\*\*Introduced\*\* : .+)

Test

# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility

import re

regex = r"(.+\s*=====\s*)([\s\S]*)(\|\s+\*\*Introduced\*\* : .+)"

test_str = ("Title\n"
    "=====\n\n"
    "| |image0| `link <link1.html>`__ |image1| ::\n"
    "  `link2 <link2.html>`__ ::\n"
    "  `link3 <link3.html>`__\n"
    "| **Introduced** : VersionXXX")

subst = "\\1\\3"

# The number of replacements can be limited with the count argument (0 means all)
result = re.sub(regex, subst, test_str, count=0, flags=re.MULTILINE)

if result:
    print(result)

# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.

Answer 5

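sed has its uses, but it takes crazy skills to make it do multi-line processing on demand. Here is an alternative in awk, the time-tested *nix text-processing language ;-)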

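**cleanup.awk**
#!/bin/awk -f
{
  # print "dbg:$0="$0
}
# A blank line before the Introduced line starts the unneeded section.
# The done flag stops later blank lines from hiding the rest of the file.
/^$/ && !done {
  print $0
  inside_unneeded=1
}
{
  if ($0 ~ /^\| \*\*Introduced\*\*/) {
    print $0
    inside_unneeded=0
    done=1
  }
  else if (! inside_unneeded) {
    print $0
  }
}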

You will need to

chmod 755 cleanup.awk

and run it as

cleanup.awk file > file.new && /bin/rm file

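If you can afford to keep backups (recommended), use && mv file file.sav && mv file.new file instead. Alternatively, you can redirect to a different directory and avoid the && handling altogether, i.e. cleanup.awk file > /alt/path/for/new/data/file .

This will produce the output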

Title
=====

| **Introduced** : VersionXXX

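There are ways to shrink this script considerably with awk shorthand logic, but I have kept it in a form that is decipherable for anyone familiar with if/else if/else logic.

All blocks (the code between { ... }) are executed for every line of input, while the block that starts with /^$/ is processed only for empty lines. If those empty lines contain whitespace, you will need the pattern /^[ <tab>]*$/ instead (do not type <tab>; insert a literal tab character from your keyboard).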

IHTH。
