
Remove URLs, empty lines, and Unicode characters in Python

I need to remove URLs, empty lines, and lines containing Unicode characters from a big text file (500 MiB) using Python.

This is my file:

https://removethis1.com
http://removethis2.com foobar1
http://removethis3.com foobar2
foobar3 http://removethis4.com
www.removethis5.com


foobar4 www.removethis6.com foobar5
foobar6 foobar7
foobar8 www.removethis7.com

After the regex it should look like this:

foobar1
foobar2
foobar3 
foobar4 foobar5
foobar6 foobar7
foobar8

The code I came up with is this:

    file = open(file_path, encoding="utf8")
    self.rawFile = file.read()
    rep = re.compile(r"""
                        http[s]?://.*?\s 
                        |www.*?\s  
                        |(\n){2,}  
                        """, re.X)
    self.processedFile = rep.sub('', self.rawFile)

But the output is incorrect:

foobar3 foobar4 foobar5
foobar6 foobar7
foobar8 www.removethis7.com

I also need to remove every line containing at least one non-ASCII character, but I can't come up with a regex for this task.

You can try encoding each line to ASCII to catch the non-ASCII lines, which I presume is what you want:

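import re

# A URL starts with http://, https:// or www. and runs to the next whitespace;
# one trailing whitespace character is removed along with it.
rep = re.compile(r"(?:https?://|www\.)\S*\s?")

with open("test.txt", encoding="utf-8") as f:
    for line in f:
        try:
            # Raises UnicodeEncodeError if the line has any non-ASCII character.
            line.encode("ascii")
        except UnicodeEncodeError:
            continue
        # Remove every URL on the line, not just the first one.
        line = rep.sub("", line).strip()
        if line:
            print(line)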

Input:

https://removethis1.com
http://removethis2.com foobar1
http://removethis3.com foobar2
foobar3 http://removethis4.com
www.removethis5.com

1234 ā
5678 字
foobar4 www.removethis6.com foobar5
foobar6 foobar7
foobar8 www.removethis7.com

Output:

foobar1
foobar2
foobar3
foobar4 foobar5
foobar6 foobar7
foobar8

Or use a regex to match any non-ASCII character:

import re

# Same URL pattern as above.
rep = re.compile(r"(?:https?://|www\.)\S*\s?")
# Any character outside the ASCII range.
non_asc = re.compile(r"[^\x00-\x7F]")

with open("test.txt", encoding="utf-8") as f:
    for line in f:
        # Skip the whole line if it contains a non-ASCII character.
        if non_asc.search(line):
            continue
        line = rep.sub("", line).strip()
        if line:
            print(line)

The output is the same as above. You cannot combine the two regexes, because one drops the whole line on any match while the other only removes the matched text.
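For a file of around 500 MiB it may be more practical to write the result to a second file than to print it. Below is a minimal sketch of the same two-step idea (drop non-ASCII lines, then strip URLs) written as a generator. The names `clean_lines` and `clean_file` and the URL pattern are assumptions made for illustration, not part of the answer above.

```python
import re

# Assumed URL pattern: http://, https:// or www., up to the next whitespace,
# plus one trailing whitespace character.
URL = re.compile(r"(?:https?://|www\.)\S*\s?")
# Any character outside the 7-bit ASCII range.
NON_ASCII = re.compile(r"[^\x00-\x7F]")


def clean_lines(lines):
    """Yield cleaned lines: skip non-ASCII lines, strip URLs, drop empties."""
    for line in lines:
        if NON_ASCII.search(line):
            continue  # the whole line is dropped
        line = URL.sub("", line).strip()
        if line:
            yield line


def clean_file(src, dst):
    """Stream src to dst line by line, so the file is never fully in memory."""
    with open(src, encoding="utf-8") as fin, \
            open(dst, "w", encoding="utf-8") as fout:
        for line in clean_lines(fin):
            fout.write(line + "\n")
```

Because `clean_lines` accepts any iterable of strings, it can be tried on a small list before running `clean_file` on the real data.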

This will remove all the links:

(?:http|www).*?(?=\s|$)

Explanation:

(?:            #non capturing group
    http|www   #match "http" OR "www"
)
    .*?        #lazy match anything until...
(
    ?=\s|$     #it is followed by white space or the end of line (positive lookahead)
)

Replace the whitespace \s with a newline \n, then strip out all the empty lines afterwards.
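One way to apply that pattern in Python might look like the sketch below. The helper name `remove_links` is hypothetical, and collapsing the leftover double spaces is an extra step that the answer does not mention.

```python
import re

# Pattern from the answer above: "http" or "www", then lazily anything
# up to (but not including) the next whitespace or the end of the string.
LINK = re.compile(r"(?:http|www).*?(?=\s|$)")


def remove_links(text):
    """Remove links, then drop empty lines and collapse leftover spaces."""
    without_links = LINK.sub("", text)
    cleaned = []
    for line in without_links.splitlines():
        # split()/join trims the line and collapses the double space left
        # where a link sat between two words (an addition to the answer).
        line = " ".join(line.split())
        if line:
            cleaned.append(line)
    return "\n".join(cleaned)
```

Note that the pattern has no word boundary, so it also matches "http" or "www" in the middle of a word.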

Depending on how close to your sample text you want the result to match:

( +)?\b(?:http|www)[^\s]*(?(1)|( +)?)|\n{2,}

regex101 demo

This magic looks for leading spaces and captures them if present. Then it looks for the http or www portion, followed by everything that is not whitespace (I used [^\s]* instead of simply \S* in case you wanted to add more criteria to exclude). After that, it uses a regex conditional to check whether any whitespace was captured earlier. If it was NOT, it tries to capture any trailing whitespace (so you don't remove too much between foobar4 www.removethis6.com foobar5, for example). Otherwise, the alternative branch matches 2+ newlines.

If you replace all of that with nothing, it should give you the same output that you were requesting.

Now, this regex is fairly rigid and will likely have many edge cases in which it does not work. This works for the OP, but you may need to provide more details if you need it to be more flexible.
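Python's `re` module supports the `(?(1)yes|no)` conditional syntax, so the pattern can be tried there unchanged. This is a small sketch on single lines only; the helper name `strip_urls` is hypothetical.

```python
import re

# The pattern from the answer, unchanged. Group 1 captures leading spaces;
# (?(1)|( +)?) consumes trailing spaces only when group 1 did not match.
PATTERN = re.compile(r"( +)?\b(?:http|www)[^\s]*(?(1)|( +)?)|\n{2,}")


def strip_urls(text):
    """Replace every match of PATTERN with nothing."""
    return PATTERN.sub("", text)


if __name__ == "__main__":
    for sample in (
        "foobar4 www.removethis6.com foobar5",
        "http://removethis2.com foobar1",
        "foobar3 http://removethis4.com",
    ):
        print(repr(strip_urls(sample)))
```

Since the `\n{2,}` branch deletes the newlines entirely, two non-empty lines separated only by blank lines would be joined together. This is one of the edge cases to check against real data.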
