Removing words from a list from a pandas column - Python 2.7

I have a text file containing strings that I want to remove from my data frame. The rows of the data frame contain the texts that are present in the text file.

Here is the text file: https://drive.google.com/open?id=1GApPKvA82tx4CDtlOTqe99zKXS3AHiuD

Here is the data: https://drive.google.com/open?id=1HJbWTUMfiBV54EEtgSXTcsQLzQT1rFgz

I am using the following code:

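import nltk
from nltk.tokenize import word_tokenize

# Read the whole filter file; the with-block closes it afterwards.
with open("D://Users/Shivam/Desktop/rahulB/fliter.txt") as f:
    result = f.read()

# Split the file contents into word and punctuation tokens.
words = word_tokenize(result)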

I loaded the text file and converted it into words/tokens.

This is my dataframe:

text
0   What Fresh Hell Is This? January 31, 2018 ...A...
1   What Fresh Hell Is This? February 27, 2018 My ...
2   What Fresh Hell Is This? March 31, 2018 Trump ...
3   What Fresh Hell Is This? April 29, 2018 Michel...
4   Join Email List Contribute Join AMERICAblog Ac...

As you can see, texts such as "What Fresh Hell Is This?", "Join Email List Contribute Join AMERICAblog Ac", "Sign in Daily Roundup MS Legislature Elected O", etc. are present in all the rows.

I used this for loop:

for word in words:
    df['text'].replace(word, ' ')

My error:

error                                     Traceback (most recent call last)
<ipython-input-168-6e0b8109b76a> in <module>()
----> 1 df['text'] = df['text'].str.replace("|".join(words), " ")

D:\Users\Shivam\Anaconda2\lib\site-packages\pandas\core\strings.pyc in replace(self, pat, repl, n, case, flags)
   1577     def replace(self, pat, repl, n=-1, case=None, flags=0):
   1578         result = str_replace(self._data, pat, repl, n=n, case=case,
-> 1579                              flags=flags)
   1580         return self._wrap_result(result)
   1581 

D:\Users\Shivam\Anaconda2\lib\site-packages\pandas\core\strings.pyc in str_replace(arr, pat, repl, n, case, flags)
    422     if use_re:
    423         n = n if n >= 0 else 0
--> 424         regex = re.compile(pat, flags=flags)
    425         f = lambda x: regex.sub(repl=repl, string=x, count=n)
    426     else:

D:\Users\Shivam\Anaconda2\lib\re.pyc in compile(pattern, flags)
    192 def compile(pattern, flags=0):
    193     "Compile a regular expression pattern, returning a pattern object."
--> 194     return _compile(pattern, flags)
    195 
    196 def purge():

D:\Users\Shivam\Anaconda2\lib\re.pyc in _compile(*key)
    249         p = sre_compile.compile(pattern, flags)
    250     except error, v:
--> 251         raise error, v # invalid expression
    252     if not bypass_cache:
    253         if len(_cache) >= _MAXCACHE:

error: nothing to repeat

You can use str.replace.

Example:

df['text'] = df['text'].str.replace("|".join(words), " ")
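One caveat when joining the raw tokens: a tokenizer typically emits punctuation such as "?" as its own token, and such characters are regex metacharacters. The sketch below uses a made-up token list (not the asker's actual file) and the hypothetical helper compiles() to show how an unescaped "?" can trigger a "nothing to repeat" error, and how re.escape avoids it.

```python
import re

# Hypothetical tokens; "?" stands in for a punctuation token from the filter file.
words = ["Fresh", "Hell", "?"]


def compiles(pattern):
    """Return True if the pattern is a valid regular expression."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


raw_pattern = "|".join(words)                         # Fresh|Hell|?
safe_pattern = "|".join(re.escape(w) for w in words)  # Fresh|Hell|\?

raw_ok = compiles(raw_pattern)    # the bare "?" has nothing to repeat
safe_ok = compiles(safe_pattern)  # the escaped "?" is a literal question mark
```

If the filter file contains any such token, escaping each one before joining is probably the safer choice.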

You can modify your code this way:

for word in words:
    df['text'] = df['text'].str.replace(word, ' ')
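Keep in mind that replacing token by token matches substrings, so a short token can also remove part of a longer word. A minimal sketch, with a plain Python string standing in for one row of the column (the sample text and tokens are invented for illustration):

```python
# Hypothetical row and filter tokens, for illustration only.
text = "This Is Fresh"
words = ["Is", "is"]

cleaned = text
for word in words:
    # Plain substring replacement: no word boundaries are checked,
    # so "is" is also removed from inside "This".
    cleaned = cleaned.replace(word, " ")

remaining = cleaned.split()
```

Here "This" is cut down to "Th", which is usually not what is wanted when filtering whole words.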

You may use:

import re
df['text'] = df['text'].str.replace(r"\s*(?<!\w)(?:{})(?!\w)".format("|".join([re.escape(x) for x in words])), " ")

The r"\s*(?<!\w)(?:{})(?!\w)".format("|".join([re.escape(x) for x in words])) expression performs these steps:

  • [re.escape(x) for x in words] - escapes all special characters in the words so they can be used safely in a regex
  • "|".join([...]) - creates the alternation that the regex engine will match
  • r"\s*(?<!\w)(?:{})(?!\w)".format(....) - creates a regex like \s*(?<!\w)(?:word1|word2|wordn)(?!\w) that matches the listed words only as whole words (\s* also removes any whitespace before the words).
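The steps above can be tried without pandas. This is only a sketch: the token list and the sample row are assumptions made for illustration, and re.sub on a plain string stands in for str.replace on the column.

```python
import re

# Hypothetical filter tokens and sample row; the real ones come from the asker's files.
words = ["Is", "This", "?"]
row = "What Is This ? Isolated Thistle"

# Escape each token, join the alternatives with "|",
# and wrap them in whole-word lookarounds.
pattern = r"\s*(?<!\w)(?:{})(?!\w)".format("|".join(re.escape(w) for w in words))

# Same pattern and replacement string as in the str.replace call above.
cleaned = re.sub(pattern, " ", row)
remaining = cleaned.split()
```

"Isolated" and "Thistle" survive because the lookarounds reject a match that is immediately followed by another word character.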
