
Python: How to remove punctuation from a text corpus without removing it from special words (e.g. c++, c#, .net)

I have a big pandas dataset of job descriptions. I want to tokenize it, but first I need to remove stopwords and punctuation. Stopwords are not a problem.

If I use a regex to remove punctuation, I can lose very important words that describe the jobs (e.g. c++ developer, c#, .net).

The list of such important words is very long, because it contains not only programming language names but also company names.

For example, I want punctuation to be removed like this:

Before:

Hi! We are looking for smart, young and hard-working c++ developer. Our perfect candidate should know: - c++, c#, .NET in expert level;

After:

Hi We are looking for smart young and hard-working c++ developer Our perfect candidate should know c++ c# .NET in expert level

Can you recommend advanced tokenizers or other methods for removing punctuation?

You can use this pattern:

[!,.:;-](?= |$)

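It matches any of the characters "!", ",", ".", ":", ";" and "-" when the character is followed by a space or by the end of the string. Punctuation inside or at the start of a word (the "++" in c++, the "#" in c#, the leading "." in .NET, the "-" in hard-working) is not matched.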


In Python:

import re
text = "Hi! We are looking for smart, young and hard-working c++ developer. Our perfect candidate should know: - c++, c#, .NET in expert level;"
print(re.sub(r'[!,.:;-](?= |$)', '', text))

Prints:

Hi We are looking for smart young and hard-working c++ developer Our perfect candidate should know  c++ c# .NET in expert level
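The printed result still has a double space where the standalone "-" was removed, and the lookahead only recognises a literal space. Below is a minimal sketch that addresses both, assuming any whitespace (tab, newline) should count as a word boundary. The helper name strip_trailing_punct is hypothetical and not part of the answer above.

```python
import re

# Punctuation is removed only when followed by whitespace or the end of the
# string, so "c++", "c#" and ".NET" survive. The helper name is hypothetical.
TRAILING_PUNCT = re.compile(r"[!,.:;-](?=\s|$)")


def strip_trailing_punct(text: str) -> str:
    cleaned = TRAILING_PUNCT.sub("", text)
    # Collapse the whitespace runs left behind by removed tokens such as " - ".
    return " ".join(cleaned.split())


text = ("Hi! We are looking for smart, young and hard-working c++ developer. "
        "Our perfect candidate should know: - c++, c#, .NET in expert level;")
print(strip_trailing_punct(text))
```

For a pandas column, something like df["description"].map(strip_trailing_punct) should apply it row by row; the column name "description" is an assumption here.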

My solution

def clean(s: str, keep=None, remove=None):
    """Delete the characters in "remove" from "s", except inside the special words in "keep"."""
    if keep is None:
        keep = []

    if remove is None:
        remove = []

    protected = [False] * len(s)  # True for characters that must be kept

    # mark every character that belongs to an occurrence of a special word
    for w in keep:
        if not w:
            continue  # an empty word protects nothing
        # "+ 1" so that a special word at the very end of "s" is found too
        for i in range(len(s) - len(w) + 1):
            if s.startswith(w, i):
                for j in range(i, i + len(w)):
                    protected[j] = True

    # delete unwanted chars, keeping the protected ones
    out = []
    for ch, is_protected in zip(s, protected):
        if is_protected or ch not in remove:
            out.append(ch)

    return ''.join(out)


if __name__ == "__main__":

    test = "Hi! We are looking for smart, young and hard-working c++ developer. Our perfect candidate should know:" \
           " - c++, c# in expert level;"

    remove_chars = ['.', ',', ':', ';', '+', '-', '!', '?', '#']
    keep_words = ['c++', 'c#']

    print(clean(test, keep=keep_words, remove=remove_chars))
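The same keep/remove idea can also be written as a single regular expression, which may be faster on a large corpus than the character-by-character loop. This is only a sketch of an alternative technique, not the answer's method: the function name clean_with_regex is hypothetical, every entry of remove is assumed to be a single character, and matching is case-sensitive.

```python
import re


def clean_with_regex(text, keep=(), remove=()):
    # Hypothetical helper: one alternation that either matches a protected
    # word (and puts it back) or matches one unwanted character (and drops it).
    branches = []
    if keep:
        # Longest words first, so a longer special word wins over a shorter
        # one that starts at the same position.
        words = sorted(keep, key=len, reverse=True)
        branches.append("(?P<keep>" + "|".join(map(re.escape, words)) + ")")
    if remove:
        # Assumes each entry of "remove" is a single character.
        branches.append("[" + "".join(map(re.escape, remove)) + "]")
    if not branches:
        return text
    pattern = re.compile("|".join(branches))
    # The named group is None when the match came from the remove branch.
    return pattern.sub(lambda m: m.groupdict().get("keep") or "", text)


test = ("Hi! We are looking for smart, young and hard-working c++ developer. "
        "Our perfect candidate should know: - c++, c# in expert level;")
print(clean_with_regex(test,
                       keep=['c++', 'c#'],
                       remove=['.', ',', ':', ';', '+', '-', '!', '?', '#']))
```

Because the regex scans left to right without overlapping matches, its result can differ from the loop version when two special words overlap in the text.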
