
How to eliminate repeating sequence of string in Python

I have a complicated task: removing repeated consecutive words or sentences. Below is a sample input.

The
The Up
The Up next
The Up next we
The Up next we bring
The Up next we bring you
The Up next we bring you a
The Up next we bring you a rebroadcast
The Up next we bring you a rebroadcast of
The Up next we bring you a rebroadcast of.
of. The
of. The Diane
of. The Diane Rehm
of. The Diane Rehm radio
of. The Diane Rehm radio talk
of. The Diane Rehm radio talk show
of. The Diane Rehm radio talk show.
The Diane Rehm radio talk show. The
The Diane Rehm radio talk show. The program
The Diane Rehm radio talk show. The program is
The Diane Rehm radio talk show. The program is heard
The Diane Rehm radio talk show. The program is heard over
The Diane Rehm radio talk show. The program is heard over W.A.M.
The Diane Rehm radio talk show. The program is heard over W.A.M. you
The program is heard over W.A.M. you F.M.
The program is heard over W.A.M. you F.M. on
The program is heard over W.A.M. you F.M. on the
The program is heard over W.A.M. you F.M. on the campus
The program is heard over W.A.M. you F.M. on the campus of
The program is heard over W.A.M. you F.M. on the campus of the
The program is heard over W.A.M. you F.M. on the campus of the American
F.M. on the campus of the American University
F.M. on the campus of the American University in
F.M. on the campus of the American University in the
F.M. on the campus of the American University in the nation's
F.M. on the campus of the American University in the nation's capital
F.M. on the campus of the American University in the nation's capital.
University in the nation's capital. The
University in the nation's capital. The special
University in the nation's capital. The special Martin
University in the nation's capital. The special Martin Luther
University in the nation's capital. The special Martin Luther King
University in the nation's capital. The special Martin Luther King Day
University in the nation's capital. The special Martin Luther King Day show
The special Martin Luther King Day show recorded
The special Martin Luther King Day show recorded Monday
The special Martin Luther King Day show recorded Monday.
recorded Monday. Focused
recorded Monday. Focused on
recorded Monday. Focused on race
recorded Monday. Focused on race relations
recorded Monday. Focused on race relations.
Focused on race relations. Ms
Focused on race relations. Ms Rames
Focused on race relations. Ms Rames guests
Focused on race relations. Ms Rames guests were
Focused on race relations. Ms Rames guests were Eleanor
Focused on race relations. Ms Rames guests were Eleanor Holmes
Ms Rames guests were Eleanor Holmes Norton
Ms Rames guests were Eleanor Holmes Norton.

The current output is below:

The Up next we bring you a rebroadcast of.
of. The Diane Rehm radio talk show.
The Diane Rehm radio talk show. The program is heard over W.A.M. you
The program is heard over W.A.M. you F.M. on the campus of the American
F.M. on the campus of the American University in the nation's capital.
University in the nation's capital. The special Martin Luther King Day show
The special Martin Luther King Day show recorded Monday.
recorded Monday. Focused on race relations.
Focused on race relations. Ms Rames guests were Eleanor Holmes
Ms Rames guests were Eleanor Holmes Norton.

As you can see, even after this process we still have repetitions, such as

The Up next we bring you a rebroadcast of.
of. The Diane Rehm radio talk show.
The Diane Rehm radio talk show. The program is heard over W.A.M. you
The program is heard over W.A.M. you F.M. on the campus of the American

I just want something like

The Up next we bring you a rebroadcast of.
The Diane Rehm radio talk show.
The program is heard over W.A.M. you F.M. on the campus of the American
University in the nation's capital. The special Martin Luther King Day show
recorded Monday. Focused on race relations.
...etc

How can I accomplish this task?

Current code

import os

def load_and_discard(file_path):
        """
        Load and discard previous substrings.

        Args:
                file_path (PathLike): path to data file

        Returns:
                list[str]
        """
        data = []
                with open("./input/"+file_path) as f:
                for i, line in enumerate(f):
                        st = line.strip()
                        if i > 0 and st.startswith(data[-1]):
                                data[-1] = st
                        elif len(st) > 0:  # guard against empty string
                                data.append(st)
        return data

def find_lebms(s1, s2):
        """
        Binary search on the longest-end-begin-matching-substring (LEBMS).

        Args:
                s1 (str): 1st stripped str (match the end)
                s2 (str): 2nd stripped str (match the begin)

        Returns:
                int: length of LEBMS
        """

        # search up to this length
        n1 = min(len(s1), len(s2))

        for i in range(1, n1+1):
                if s1[-i:] == s2[:i]:
                        return i
                else:
                        return 0


def remove_repeated_substr(data):
        """
        Generate strings (in-place) ready for concatenation by
        removing the repeated substring in the first string.

        Args:
                data (list[str]): list of strings

        Returns:
                None
        """

        n0 = len(data)
        for i, st in enumerate(data):

                # guard: no chopping for the last line
                if i == n0 - 1:
                        break

                # chop the current row
                n = find_lebms(st, data[i + 1])
                if n > 0:  # guard against n = 0
                        data[i] = st[:-n]

directory = './input'
for filename in os.listdir(directory):

        infile_path = filename

        data = load_and_discard(infile_path)
        remove_repeated_substr(data)

        # (optional) prevent un-spaced ending periods
        for i, st in enumerate(data):
                if st[-1] == ".":
                        data[i] += " "

        ans = "\n".join(data)
        with open("./output/"+filename, "w") as text_file:
                        text_file.write(ans)
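
A note on the code above: the for loop in find_lebms returns during its first pass, so at most a one-character overlap is ever tested, which is likely why the overlaps survive in the output shown earlier. The following is a minimal sketch of one possible alternative, not a drop-in replacement. It compares whole words instead of characters and merges the lines into one running text. The names word_overlap and merge_lines are hypothetical, and the sketch assumes every overlap falls on whitespace-separated word boundaries.

```python
def word_overlap(prev_words, next_words):
    """Return how many trailing words of prev_words equal the leading words of next_words."""
    # Try the longest candidate first so the first hit is the longest overlap.
    for n in range(min(len(prev_words), len(next_words)), 0, -1):
        if prev_words[-n:] == next_words[:n]:
            return n
    return 0


def merge_lines(lines):
    """Merge lines into one string, skipping the words each line repeats from the text so far."""
    merged = []
    for line in lines:
        words = line.split()  # blank lines give an empty list and add nothing
        n = word_overlap(merged, words)
        merged.extend(words[n:])
    return " ".join(merged)


# Made-up sample in the same "growing line" shape as the input above.
sample = ["The", "The Up", "The Up next.", "next. Show", "next. Show time."]
print(merge_lines(sample))  # The Up next. Show time.
```

Because matching is done on whole words, a line ending in "e" is not glued to a line starting with "e". A single shared word between two unrelated lines would still be treated as an overlap, so the result is worth checking on real data.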

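If you prefer, you can use the output as the input, if that is easier, so you don't have to deal with the repeated lines. Whether you use my input or my output as your input is entirely up to you, but please tell me which one you used when you post.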

Alternative input

You can watch a representative.
Twenty three zero seven of the Rayburn Office Building.
Washington D.C. each week. C.-SPAN
Washington D.C. each week. C.-SPAN breaks
Washington D.C. each week. C.-SPAN breaks from
Washington D.C. each week. C.-SPAN breaks from its
Washington D.C. each week. C.-SPAN breaks from its public
Washington D.C. each week. C.-SPAN breaks from its public affairs
C.-SPAN breaks from its public affairs programming
C.-SPAN breaks from its public affairs programming to
C.-SPAN breaks from its public affairs programming to give
C.-SPAN breaks from its public affairs programming to give the
C.-SPAN breaks from its public affairs programming to give the viewer
C.-SPAN breaks from its public affairs programming to give the viewer updated schedule information.
Join us at eight o'clock A.M. Eastern five o'clock A.M. Pacific Time.
Six thirty P.M. Eastern three thirty P.M. Pacific Time.
Eight o'clock P.M. Eastern five o'clock P.M. Pacific Time.
One o'clock A.M. Eastern ten o'clock P.M. Pacific Time. As always C.-SPAN
P.M. Pacific Time. As always C.-SPAN scheduled
P.M. Pacific Time. As always C.-SPAN scheduled programming
As always C.-SPAN scheduled programming is preempted by live coverage of the U.S. House of Representatives.
Going on this election year.
Covering every issue in the campaign calendar.
The calendar list the network's plans for campaign.
From now through election day.
In addition to election coverage.
Other major events are cameras record.
Call toll free one eight hundred three four six. Her it to order the C.-SPAN
four six. Her it to order the C.-SPAN update for
Her it to order the C.-SPAN update for twenty four dollars.
You can use your credit card or will be glad to send you a bill.
Call one eight hundred three four six eight hundred.
And you'll receive fifty issues of the C.-SPAN update.
If you order an update subscription now.
The receive a free gift. The C.-SPAN road to the White House
The C.-SPAN road to the White House poster is twenty two by twenty eight inch pen and ink drawing.
Attractively depicts the spans grassroots approach to the campaign called.

You can use this regex, which uses a lookahead and a back-reference, to match the overlapping duplicates and remove them.

(\b[-\w\s.']+?)(?=[\s.]+\1)[\s.]+

Use an empty string as the replacement.

RegEx Demo

Code:

import re

s = re.sub(r"(\b[-\w\s.']+?)(?=[\s.]+\1)[\s.]+", "", s)

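Regex details:

  • ( : start capture group #1
    • \b : word boundary
    • [-\w\s.']+? : match 1+ word, whitespace, hyphen, dot, or ' characters, as few as possible (lazy)
  • ) : end capture group #1
  • (?=[\s.]+\1) : positive lookahead asserting that, after 1+ whitespace or dot characters, the value captured in group #1 appears again
  • [\s.]+ : match 1+ whitespace or dot characters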
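
To make the pieces above concrete, here is a small runnable sketch that applies the pattern with an empty-string replacement. The sample text is made up for illustration and only mimics the shape of the question's input. The pattern is written in double quotes because the ' inside the character class would otherwise end a single-quoted string literal.

```python
import re

# Pattern copied from the answer; group 1 is the chunk that reappears
# after one or more whitespace/dot separators.
PATTERN = re.compile(r"(\b[-\w\s.']+?)(?=[\s.]+\1)[\s.]+")

# Hypothetical sample: each line extends or overlaps the previous one.
sample = "The\nThe Up\nThe Up next.\nnext. Show\nnext. Show time."

# Each match (first copy of the chunk plus its separators) is deleted,
# leaving only the later copy in place.
result = PATTERN.sub("", sample)
print(result)  # The Up next. Show time.
```

With an empty replacement the overlapping lines are joined into one running text, since the separating newlines are part of each match.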

To preserve multiple lines, you can use 2 substitutions:

s = re.sub(r"(\b[-\w\s.']+?)(?=[\s.]+\1)[\s.]+", '\n', s)
s = re.sub(r'\A\n+|(?<=[^.] )\n+|\n+(?=\n)|\n+\Z', '', s)
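
As a rough illustration of the two substitutions working together, the sketch below wraps them in a helper and runs it on a made-up sample. The function name dedupe_keep_lines is hypothetical and the sample only mimics the question's input.

```python
import re


def dedupe_keep_lines(text):
    """Apply the two substitutions from the answer in order."""
    # Step 1: replace each repeated chunk and its separators with a newline.
    text = re.sub(r"(\b[-\w\s.']+?)(?=[\s.]+\1)[\s.]+", "\n", text)
    # Step 2: remove newlines at the very start or end, newlines that follow
    # a non-dot character plus a space (mid-sentence), and all but the last
    # newline in a run of consecutive newlines.
    return re.sub(r"\A\n+|(?<=[^.] )\n+|\n+(?=\n)|\n+\Z", "", text)


sample = "A b.\nb. C d\nC d e."
print(repr(dedupe_keep_lines(sample)))  # 'A b. \nC d e.'
```

The newline after "b. " is kept because it follows a dot and a space, so the break between sentences is preserved, while the newline after "A " is removed as a mid-sentence break. Note that the kept line ends with a trailing space.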

