I have a text file containing strings that I want to remove from my data frame. The observations in the data frame contain the texts that are present in the text file.
Here is the text file: https://drive.google.com/open?id=1GApPKvA82tx4CDtlOTqe99zKXS3AHiuD
Here is the data: https://drive.google.com/open?id=1HJbWTUMfiBV54EEtgSXTcsQLzQT1rFgz
I am using the following code -
import nltk
from nltk.tokenize import word_tokenize

# Read the filter file and split its contents into tokens
with open("D://Users/Shivam/Desktop/rahulB/fliter.txt") as file:
    result = file.read()
words = word_tokenize(result)
I loaded the text files and converted them into words/tokens.
This is my dataframe:
text
0 What Fresh Hell Is This? January 31, 2018 ...A...
1 What Fresh Hell Is This? February 27, 2018 My ...
2 What Fresh Hell Is This? March 31, 2018 Trump ...
3 What Fresh Hell Is This? April 29, 2018 Michel...
4 Join Email List Contribute Join AMERICAblog Ac...
As you can see, these texts are present in all the rows, such as "What Fresh Hell Is This?", "Join Email List Contribute Join AMERICAblog Ac", "Sign in Daily Roundup MS Legislature Elected O", etc.
I used this for loop:
for word in words:
    df['text'].replace(word, ' ')
This is the error I get:
error Traceback (most recent call last)
<ipython-input-168-6e0b8109b76a> in <module>()
----> 1 df['text'] = df['text'].str.replace("|".join(words), " ")
D:\Users\Shivam\Anaconda2\lib\site-packages\pandas\core\strings.pyc in replace(self, pat, repl, n, case, flags)
1577 def replace(self, pat, repl, n=-1, case=None, flags=0):
1578 result = str_replace(self._data, pat, repl, n=n, case=case,
-> 1579 flags=flags)
1580 return self._wrap_result(result)
1581
D:\Users\Shivam\Anaconda2\lib\site-packages\pandas\core\strings.pyc in str_replace(arr, pat, repl, n, case, flags)
422 if use_re:
423 n = n if n >= 0 else 0
--> 424 regex = re.compile(pat, flags=flags)
425 f = lambda x: regex.sub(repl=repl, string=x, count=n)
426 else:
D:\Users\Shivam\Anaconda2\lib\re.pyc in compile(pattern, flags)
192 def compile(pattern, flags=0):
193 "Compile a regular expression pattern, returning a pattern object."
--> 194 return _compile(pattern, flags)
195
196 def purge():
D:\Users\Shivam\Anaconda2\lib\re.pyc in _compile(*key)
249 p = sre_compile.compile(pattern, flags)
250 except error, v:
--> 251 raise error, v # invalid expression
252 if not bypass_cache:
253 if len(_cache) >= _MAXCACHE:
error: nothing to repeat
You can use str.replace, for example:
df['text'] = df['text'].str.replace("|".join(words), " ")
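Note that joining the raw tokens like this is the same call that appears in the traceback above. The "nothing to repeat" error most likely comes from tokens that are regex metacharacters, such as "?". A minimal sketch using only the standard re module; the token list here is a hypothetical stand-in for the contents of the filter file:

```python
import re

# Hypothetical tokens; a tokenizer typically splits "This?" into "This" and "?".
words = ["What", "Fresh", "Hell", "Is", "This", "?"]

# Unescaped, the final alternative is a bare "?" quantifier with nothing before it.
raw_pattern = "|".join(words)
try:
    re.compile(raw_pattern)
    raw_error = None
except re.error as exc:
    raw_error = str(exc)  # the message contains "nothing to repeat"

# Escaping every token turns "?" into a literal question mark.
safe_pattern = "|".join(re.escape(w) for w in words)
cleaned = re.sub(safe_pattern, " ", "What Fresh Hell Is This? January 31, 2018")
```

With the escaped pattern the compile step succeeds and every listed token is replaced by a space.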
You can modify your code in this way:
for word in words:
    df['text'] = df['text'].str.replace(word, ' ')
You may use
import re
df['text'] = df['text'].str.replace(r"\s*(?<!\w)(?:{})(?!\w)".format("|".join([re.escape(x) for x in words])), " ")
The r"\s*(?<!\w)(?:{})(?!\w)".format("|".join([re.escape(x) for x in words])) expression performs these steps:
- [re.escape(x) for x in words] escapes all special characters in the words so they can be used safely in a regex.
- "|".join([...]) creates the alternation that the regex engine will match.
- r"\s*(?<!\w)(?:{})(?!\w)".format(...) creates a regex like \s*(?<!\w)(?:word1|word2|wordn)(?!\w) that matches the words from the list as whole words (\s* also removes zero or more whitespace characters before each word).
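The steps above can be sketched without pandas, using plain strings in place of the data frame column. The helper name build_pattern, the token list, and the sample rows are assumptions made for illustration only:

```python
import re

def build_pattern(words):
    # Escape each token, join the tokens with "|", and require that a match
    # is neither preceded nor followed by a word character.
    alternation = "|".join(re.escape(w) for w in words)
    return re.compile(r"\s*(?<!\w)(?:{})(?!\w)".format(alternation))

# Hypothetical tokens and rows standing in for the filter file and df['text'].
words = ["What", "Fresh", "Hell", "Is", "This", "?"]
pattern = build_pattern(words)

rows = [
    "What Fresh Hell Is This? January 31, 2018",
    "Hell Is Hello Island",
]
cleaned = [pattern.sub(" ", row).strip() for row in rows]
```

In the second row "Hello" and "Island" survive because "Hell" and "Is" only match as whole words. In the first row the "?" stays in place: it directly follows the word character "s", so the (?<!\w) lookbehind rejects it. If punctuation tokens also need to be removed, they may need a separate pattern without the lookarounds. The same pattern string can be passed to str.replace on the column; in pandas 2.0 and later this requires regex=True, because the default there is a literal replacement.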