Removing URLs, empty lines, and Unicode characters in Python

Question

I need to remove URLs, empty lines, and lines containing Unicode (non-ASCII) characters from a large text file (500 MiB) using Python.

This is my file:

https://removethis1.com
http://removethis2.com foobar1
http://removethis3.com foobar2
foobar3 http://removethis4.com
www.removethis5.com


foobar4 www.removethis6.com foobar5
foobar6 foobar7
foobar8 www.removethis7.com

After the regex has been applied, it should look like this:

foobar1
foobar2
foobar3 
foobar4 foobar5
foobar6 foobar7
foobar8

The code I came up with is this:

    file = open(file_path, encoding="utf8")
    self.rawFile = file.read()
    rep = re.compile(r"""
                        http[s]?://.*?\s 
                        |www.*?\s  
                        |(\n){2,}  
                        """, re.X)
    self.processedFile = rep.sub('', self.rawFile)

But the output is incorrect:

foobar3 foobar4 foobar5
foobar6 foobar7
foobar8 www.removethis7.com

I also need to remove every line that contains at least one non-ASCII character, but I could not come up with a regex for that task.

Answer 1

You can try encoding each line to ASCII to catch the non-ASCII lines, which I presume is what you want:

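import re

with open("test.txt", encoding="utf-8") as f:
    # A URL plus the whitespace that follows it, or a newline.
    rep = re.compile(r"""
                        http[s]?://.*?\s
                        |www.*?\s
                        |(\n)
                        """, re.X)
    for line in f:
        m = rep.search(line)
        if m:
            # Remove the first match (URL or trailing newline) from the line.
            line = line.replace(m.group(), "")
        try:
            # Raises UnicodeEncodeError if the line has a non-ASCII character.
            line.encode("ascii")
        except UnicodeEncodeError:
            continue
        if line.strip():
            print(line.strip())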

Input:

https://removethis1.com
http://removethis2.com foobar1
http://removethis3.com foobar2
foobar3 http://removethis4.com
www.removethis5.com

1234 ā
5678 字
foobar4 www.removethis6.com foobar5
foobar6 foobar7
foobar8 www.removethis7.com

Output:

foobar1
foobar2
foobar3
foobar4 foobar5
foobar6 foobar7
foobar8

Or use a regex to match any non-ASCII character:

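import re

with open("test.txt", encoding="utf-8") as f:
    rep = re.compile(r"""
                        http[s]?://.*?\s
                        |www.*?\s
                        |(\n)
                        """, re.X)
    # Any character outside the 7-bit ASCII range.
    non_asc = re.compile(r"[^\x00-\x7F]")
    for line in f:
        # Skip the whole line if it contains a non-ASCII character.
        if non_asc.search(line):
            continue
        m = rep.search(line)
        if m:
            # Remove the first match (URL or trailing newline) from the line.
            line = line.replace(m.group(), "")
        if line.strip():
            print(line.strip())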

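Same output as above. You cannot combine the two regexes into one, because a match for one of them drops the whole line, while a match for the other only replaces the matched text.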
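The two steps can also be kept separate inside a small generator, which may suit a 500 MiB file because it never holds the whole text in memory. The following is only a sketch under assumptions: the name clean_lines and the URL pattern are illustrative, a URL is taken to be "http://", "https://" or "www." followed by non-whitespace, and str.isascii() requires Python 3.7 or later. Unlike the code above, it removes every URL on a line rather than only the first match.

```python
import re

# Assumed URL rule (not from the original answer): scheme or "www."
# followed by any run of non-whitespace characters.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")


def clean_lines(lines):
    """Yield lines with URLs removed, skipping blank and non-ASCII lines."""
    for line in lines:
        if not line.isascii():          # drop the whole line (Python 3.7+)
            continue
        line = URL_RE.sub("", line)     # remove every URL on the line
        line = " ".join(line.split())   # trim and collapse leftover spaces
        if line:                        # skip lines that are now empty
            yield line


sample = [
    "https://removethis1.com\n",
    "http://removethis2.com foobar1\n",
    "http://removethis3.com foobar2\n",
    "foobar3 http://removethis4.com\n",
    "www.removethis5.com\n",
    "\n",
    "1234 ā\n",
    "5678 字\n",
    "foobar4 www.removethis6.com foobar5\n",
    "foobar6 foobar7\n",
    "foobar8 www.removethis7.com",
]

for cleaned in clean_lines(sample):
    print(cleaned)
```

Because a file object opened with open(..., encoding="utf-8") is itself an iterable of lines, it can be passed to clean_lines directly in place of the sample list.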

Answer 2

This will remove all links:

(?:http|www).*?(?=\s|$)

Explanation

(?:            # non-capturing group
    http|www   # match "http" OR "www"
)
.*?            # lazily match anything until...
(?=            # positive lookahead: the match must be followed by
    \s|$       # whitespace or the end of the line
)

Replace the whitespace \s with the newline \n, then strip all the empty lines afterwards.
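As a rough illustration of how this pattern might be used from Python, the sketch below applies it with re.sub and then drops the empty lines in a second pass. The sample text and variable names are assumptions made for the example. Note that the pattern has no word boundary, so it would also start matching at an "http" or "www" found in the middle of an ordinary word.

```python
import re

# Pattern quoted from the answer: "http" or "www", then a lazy match up to
# (but not including) the next whitespace character or the end of the text.
LINK_RE = re.compile(r"(?:http|www).*?(?=\s|$)")

# Shortened sample input, assumed for illustration.
text = (
    "http://removethis2.com foobar1\n"
    "foobar3 http://removethis4.com\n"
    "www.removethis5.com\n"
    "\n"
    "foobar4 www.removethis6.com foobar5\n"
    "foobar8 www.removethis7.com"
)

without_links = LINK_RE.sub("", text)

# Second pass: drop lines that are now empty and tidy the spacing of the
# rest (collapsing inner spaces is an extra step the answer does not mention).
cleaned = [
    " ".join(line.split())
    for line in without_links.splitlines()
    if line.strip()
]
print("\n".join(cleaned))
```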

Answer 3

Depending on how closely you want the result to match your sample text:

( +)?\b(?:http|www)[^\s]*(?(1)|( +)?)|\n{2,}

regex101 demo

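This pattern looks for leading spaces and captures them if they are present. It then looks for the http or www part, followed by all non-whitespace characters (I used [^\s]* instead of simply \S* in case you want to add more characters to exclude). After that, it uses a regex conditional to check whether any spaces were captured earlier. If not, it tries to capture any trailing spaces instead (so that you don't remove too many between foobar4 www.removethis6.com foobar5, for example). The last alternative matches two or more consecutive newlines.

If you replace every match with nothing, it should give you the output you requested.

Note that this regex is fairly rigid and will probably fail in many cases it was not written for. It works for the OP's sample, but if you need it to be more flexible, you may need to provide more details.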
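Python's re module supports the (?(1)...|...) conditional, so the pattern can be tried with re.sub directly. The sketch below is only an illustration: the sample string is assumed to reproduce the question's file with single newlines between lines and two empty lines after www.removethis5.com. On this input the substitution leaves one newline at the very start (the first line held only a URL), which the final strip() removes.

```python
import re

# Pattern quoted from the answer. Group 1 holds optional leading spaces;
# the conditional consumes trailing spaces only when group 1 did not match.
PATTERN = re.compile(r"( +)?\b(?:http|www)[^\s]*(?(1)|( +)?)|\n{2,}")

# Assumed reconstruction of the question's sample file.
text = (
    "https://removethis1.com\n"
    "http://removethis2.com foobar1\n"
    "http://removethis3.com foobar2\n"
    "foobar3 http://removethis4.com\n"
    "www.removethis5.com\n"
    "\n"
    "\n"
    "foobar4 www.removethis6.com foobar5\n"
    "foobar6 foobar7\n"
    "foobar8 www.removethis7.com"
)

# Replace every match with nothing, then trim the leftover leading newline.
result = PATTERN.sub("", text).strip()
print(result)
```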
