I have a large pandas dataset of job descriptions. I want to tokenize it, but first I need to remove stopwords and punctuation. Stopwords are not a problem.
If I use a regex to remove punctuation, I can lose very important words that describe jobs (e.g. c++ developer, c#, .net, etc.).
The list of such important words is very long, because it includes not only programming language names but also company names.
For example, I want punctuation to be removed like this:
Before:
Hi! We are looking for smart, young and hard-working c++ developer. Our perfect candidate should know: - c++, c#, .NET in expert level;
After:
Hi We are looking for smart young and hard-working c++ developer Our perfect candidate should know c++ c# .NET in expert level
Can you recommend advanced tokenizers or methods for removing punctuation?
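You can use this pattern:
[!,.:;-](?= |$)
It matches any of the characters !, ,, ., :, ; and - when they are followed by a space or the end of the string. Punctuation that is followed by another character, as in .NET or hard-working, is left alone.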
In Python:
import re
text = "Hi! We are looking for smart, young and hard-working c++ developer. Our perfect candidate should know: - c++, c#, .NET in expert level;"
print(re.sub(r'[!,.:;-](?= |$)', '', text))
Prints:
Hi We are looking for smart young and hard-working c++ developer Our perfect candidate should know c++ c# .NET in expert level
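One detail worth handling: removing the standalone "- " leaves two adjacent spaces in the result, and a literal space in the lookahead does not cover tabs or newlines. A minimal sketch of a reusable helper follows, assuming you want any whitespace to count and runs of whitespace collapsed afterwards. The names TRAILING_PUNCT and strip_punct are made up for illustration, and \s is a swap for the literal space in the pattern above.

```python
import re

# Punctuation followed by any whitespace (not just a space) or the end of the string.
TRAILING_PUNCT = re.compile(r"[!,.:;-](?=\s|$)")

def strip_punct(text):
    """Remove trailing punctuation, then collapse runs of whitespace."""
    no_punct = TRAILING_PUNCT.sub("", text)
    # split() with no arguments splits on any whitespace run and drops empties.
    return " ".join(no_punct.split())

text = ("Hi! We are looking for smart, young and hard-working c++ developer. "
        "Our perfect candidate should know: - c++, c#, .NET in expert level;")
print(strip_punct(text))
```

On a pandas column this could be applied with something like df["description"].map(strip_punct), where the column name is hypothetical.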
My solution
def clean(s: str, keep=None, remove=None):
    """Delete the characters in `remove` from `s`, except inside the words in `keep`."""
    if keep is None:
        keep = []
    if remove is None:
        remove = []

    # protected[i] is True if s[i] belongs to a keep-word and must survive
    protected = [False] * len(s)
    for w in keep:  # for every special word
        # + 1 so that a keep-word at the very end of s is also found
        for i in range(len(s) - len(w) + 1):
            if s[i:i + len(w)] == w:
                for j in range(len(w)):
                    protected[i + j] = True

    # drop unprotected characters that are listed in `remove`
    return ''.join(c for i, c in enumerate(s) if protected[i] or c not in remove)


if __name__ == "__main__":
    test = ("Hi! We are looking for smart, young and hard-working c++ developer. Our perfect candidate should know:"
            " - c++, c# in expert level;")
    Remove = ['.', ',', ':', ';', '+', '-', '!', '?', '#']
    Keep = ['c++', 'c#']
    print(clean(test, keep=Keep, remove=Remove))
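The same keep/remove idea can also be written as a single regex pass, which may be faster on a large dataset. This is only a sketch under that assumption, not a drop-in replacement: clean_regex is a hypothetical name, and because regex matching is left to right and non-overlapping, it can differ from the loop version when keep-words overlap each other.

```python
import re

def clean_regex(s, keep=(), remove=()):
    """Delete every character in `remove`, except inside words listed in `keep`."""
    if not remove:
        return s
    # Character class of removable characters, escaped so '-' and ']' stay literal.
    remove_cls = "[" + "".join(re.escape(c) for c in remove) + "]"
    if not keep:
        return re.sub(remove_cls, "", s)
    # Try longer keep-words first so 'c++' is preferred over a shorter prefix.
    keep_alt = "|".join(re.escape(w) for w in sorted(keep, key=len, reverse=True))
    pattern = re.compile(f"({keep_alt})|{remove_cls}")
    # Group 1 is set when a keep-word matched; otherwise a removable char matched.
    return pattern.sub(lambda m: m.group(1) or "", s)

print(clean_regex("c++, c#; a+b!", keep=["c++", "c#"], remove=list(".,;+!#")))
```

The alternation puts the keep-words before the character class, so at each position the regex engine tries to consume a whole protected word before it considers deleting a single character.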