
Removing words in a list from a pandas column - Python 2.7

I have a text file containing strings that I want to remove from my data frame. The observations in the data frame contain the strings that are present in the text file.

Here is the text file: https://drive.google.com/open?id=1GApPKvA82tx4CDtlOTqe99zKXS3AHiuD

Here is the data: https://drive.google.com/open?id=1HJbWTUMfiBV54EEtgSXTcsQLzQT1rFgz

I am using the following code -

from nltk.tokenize import word_tokenize

# Read the filter file and split its contents into tokens
with open("D://Users/Shivam/Desktop/rahulB/fliter.txt") as f:
    result = f.read()
words = word_tokenize(result)

I loaded the text files and converted them into words/tokens.

This is my data frame:

text
0   What Fresh Hell Is This? January 31, 2018 ...A...
1   What Fresh Hell Is This? February 27, 2018 My ...
2   What Fresh Hell Is This? March 31, 2018 Trump ...
3   What Fresh Hell Is This? April 29, 2018 Michel...
4   Join Email List Contribute Join AMERICAblog Ac...

As you can see, the same phrases appear in all rows, such as "What Fresh Hell Is This?", "Join Email List Contribute Join AMERICAblog Ac", "Sign in Daily Roundup MS Legislature Elected O", etc.

I used this for loop:

for word in words:
    df['text'].replace(word, ' ')

When I try str.replace with the joined words instead, I get this error:

error                                     Traceback (most recent call last)
<ipython-input-168-6e0b8109b76a> in <module>()
----> 1 df['text'] = df['text'].str.replace("|".join(words), " ")

D:\Users\Shivam\Anaconda2\lib\site-packages\pandas\core\strings.pyc in replace(self, pat, repl, n, case, flags)
   1577     def replace(self, pat, repl, n=-1, case=None, flags=0):
   1578         result = str_replace(self._data, pat, repl, n=n, case=case,
-> 1579                              flags=flags)
   1580         return self._wrap_result(result)
   1581 

D:\Users\Shivam\Anaconda2\lib\site-packages\pandas\core\strings.pyc in str_replace(arr, pat, repl, n, case, flags)
    422     if use_re:
    423         n = n if n >= 0 else 0
--> 424         regex = re.compile(pat, flags=flags)
    425         f = lambda x: regex.sub(repl=repl, string=x, count=n)
    426     else:

D:\Users\Shivam\Anaconda2\lib\re.pyc in compile(pattern, flags)
    192 def compile(pattern, flags=0):
    193     "Compile a regular expression pattern, returning a pattern object."
--> 194     return _compile(pattern, flags)
    195 
    196 def purge():

D:\Users\Shivam\Anaconda2\lib\re.pyc in _compile(*key)
    249         p = sre_compile.compile(pattern, flags)
    250     except error, v:
--> 251         raise error, v # invalid expression
    252     if not bypass_cache:
    253         if len(_cache) >= _MAXCACHE:

error: nothing to repeat

You can use str.replace.

Example:

df['text'] = df['text'].str.replace("|".join(words), " ")
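One caveat with joining the tokens directly: if any token is a regex metacharacter, such as a "?" token, the joined pattern may not compile. Below is a minimal sketch of that failure using only the re module. The token list and the try_compile helper are hypothetical, made up for illustration, and are not the contents of the real filter file.

```python
import re

# Hypothetical tokens, similar to what a tokenizer might return for
# "What Fresh Hell Is This?" (the "?" ends up as its own token).
words = ["What", "Fresh", "Hell", "Is", "This", "?"]


def try_compile(pattern):
    """Return None if the pattern compiles, otherwise the error message."""
    try:
        re.compile(pattern)
        return None
    except re.error as exc:
        return str(exc)


# Joined as-is, the pattern ends in "|?", and "?" has nothing to repeat.
raw_error = try_compile("|".join(words))

# Escaping each token first turns "?" into a literal question mark.
escaped_error = try_compile("|".join(re.escape(w) for w in words))

print(raw_error)
print(escaped_error)
```

Here the unescaped pattern should fail with a "nothing to repeat" message, while the escaped one compiles.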

You can modify your code in this way:

for word in words:
    df['text'] = df['text'].str.replace(word, ' ')

You may use

import re

df['text'] = df['text'].str.replace(r"\s*(?<!\w)(?:{})(?!\w)".format("|".join([re.escape(x) for x in words])), " ")

The r"\s*(?<!\w)(?:{})(?!\w)".format("|".join([re.escape(x) for x in words])) expression performs these steps:

  • [re.escape(x) for x in words] - escapes all special characters in the words so they can be used safely in a regex
  • "|".join([...]) - joins the escaped words into an alternation that the regex engine will match
  • r"\s*(?<!\w)(?:{})(?!\w)".format(...) - creates a regex like \s*(?<!\w)(?:word1|word2|wordn)(?!\w) that matches the listed words only as whole words (\s* also removes any whitespace before each word).
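The steps above can be tried on a small frame. This is only a sketch under assumptions: the two rows and the token list are hypothetical stand-ins for the real data and filter file. regex=True is passed because recent pandas versions treat the pattern as a literal string by default; on older versions whose str.replace has no regex parameter (like the one in the traceback), drop that argument.

```python
import re

import pandas as pd

# Hypothetical stand-ins for the tokens read from the filter file.
words = ["What", "Fresh", "Hell", "Is", "This", "?"]

# Hypothetical rows; the second one contains "This" and "Is" only as
# parts of longer words, so it should be left alone.
df = pd.DataFrame({"text": ["What Fresh Hell Is This? January 31, 2018",
                            "Thistle Island news"]})

# Escape each token, join into an alternation, and require that the
# match is not glued to a word character on either side.
pattern = r"\s*(?<!\w)(?:{})(?!\w)".format(
    "|".join(re.escape(w) for w in words))

df["text"] = df["text"].str.replace(pattern, " ", regex=True)

print(df["text"].tolist())
```

Note that a punctuation token directly attached to a word, like the "?" in "This?", is not removed by this pattern, because the (?<!\w) lookbehind fails when the preceding character is a letter.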
