Remove a provided list of letters from the start of a string

Question

I have a dataframe df with a column "Names" as below:

Names
AL GHAITHA & AL MOOSA
AL ASEEL ELECTRONICS T
SUNRISE SUPERMARKET-QU
EMARAT-AL SAFIYAH(6735
LULU CENTRE LLC EFT TE
THE MAX

Code:

import string

remove_letters = ['AL ', 'THE ']

# my function below :

def remove_start_words(df, col, letters):
    for l in letters:
        for i in df.index:
            x = df.at[i, col]
            if x.startswith(l):
                df.at[i, col] = x[len(l):]
            else:
                df.at[i, col] = x

def remove_strings(df, col):
    for i in df.index:
        x = df.at[i, col]
        x = x.split(' ')
        if len(x) > 1:
            if len(x[1]) > 2:
                x[1] = ''.join(e for e in x[1] if e.isalnum())
                x = ' '.join(x[0:2])
                df.at[i, col] = x
            else:
                df.at[i, col] = x[0]
        else:
            df.at[i, col] = df.at[i, col]


def remove_end_digits(df, col):
    for i in df.index:
        x = df.at[i, col]
        df.at[i, col] = x.rstrip(string.digits)

# calling my function
remove_start_words(df=df, col='Names',
                          letters=remove_letters)

remove_strings(df=df, col='Names')
remove_end_digits(df=df, col='Names')

Now the problem is that I have a dataframe with more than 1 million values in this column, and my code is not well optimized. How can I get an optimized solution?

Issue 1: I understand that I have used 2 loops (one over remove_letters and one over all the column values), and that this is causing the slowness.

Is there a better way, where I can check whether the column values start with any of the letters in the remove_letters list and strip them in one shot?

Issue 2 & 3: The objective of the function remove_strings is to keep only the first 2 strings of each value in the column. For example, for ASEEL ELECTRONICS T the output will be ASEEL ELECTRONICS.

Is there a faster way to write the functions remove_strings and remove_end_digits?

Main issue: Can all 3 of these functions be done together in one shot?

Expected dataframe:

Names
GHAITHA
ASEEL ELECTRONICS
SUNRISE SUPERMARKET
EMARAT-AL SAFIYAH
LULU CENTRE
MAX

NOTE: The function remove_start_words should check whether a value in "Names" starts with any of the listed letters and, if so, remove them. For example, "AL THEMAX" should become "THEMAX", not "MAX" (which would remove both AL and THE).

Thanks in advance.

Answer 1

You can use the replace method like this:

import pandas as pd

file_path = 'file3.xlsx'

df = pd.read_excel(file_path)

words_to_remove = ["THE", "AL"]

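# Note: str.replace removes every occurrence of the word in each value,
# not only an occurrence at the start of the string.
for word in words_to_remove:
    df.Names = df.Names.str.replace(word, "")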

print(df)

Answer 2

Since you said you only want the words removed from the beginning of the string, you could use a regular expression:

import re

import pandas as pd

file_path = 'file3.xlsx'

df = pd.read_excel(file_path)

words_to_remove = ["THE", "AL"]
# Group the alternatives so that '^' anchors every word, not just the first one
regular_expression = '^(?:' + '|'.join(words_to_remove) + ')'

df.Names = df.Names.apply(lambda x : re.sub(regular_expression, "", x))

The regular_expression variable would contain ^(?:THE|AL) in this case, meaning THE or AL at the beginning of the string. The grouping matters: without it, ^THE|AL would match THE at the beginning of the string or AL anywhere in it.
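To make the anchoring concrete, here is a minimal sketch using only the standard re module and the prefixes from the question (with their trailing spaces). The names prefix_pattern and strip_prefix are made up for this illustration, and re.escape is added on the assumption that the prefixes should be matched literally.

```python
import re

remove_letters = ['AL ', 'THE ']  # prefixes taken from the question

# Escape each prefix and group the alternation so that '^' applies to all of them.
prefix_pattern = re.compile('^(?:' + '|'.join(map(re.escape, remove_letters)) + ')')


def strip_prefix(name):
    """Remove at most one of the listed prefixes from the start of name."""
    return prefix_pattern.sub('', name, count=1)


names = ['AL GHAITHA & AL MOOSA', 'THE MAX', 'EMARAT-AL SAFIYAH(6735', 'AL THEMAX']
stripped = [strip_prefix(n) for n in names]
print(stripped)
```

Because the pattern is anchored and only one alternative can match at position 0, "AL THEMAX" loses only its "AL " prefix, which is the behaviour the NOTE in the question asks for.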

Answer 3

A couple of minutes of searching on Google tells me that

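def stripper(delete_list):
    def delete(item):
        # Remove at most one matching prefix. str.lstrip is not used here
        # because it strips a set of characters, not a whole word.
        for rm in delete_list:
            if item.startswith(rm):
                return item[len(rm):]
        return item
    return delete

df['Names'] = df['Names'].apply(stripper(['AL', 'THE']))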

should do the trick.
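For the main issue in the question (doing all three steps in one shot), one possible sketch is a single per-value function that is applied once to the column. The function name clean_name and the constant names are hypothetical. One deliberate assumption: the second word is cut at its first non-alphanumeric character, rather than having those characters removed as in the question's remove_strings, because that is what reproduces the expected dataframe ("SUNRISE SUPERMARKET" rather than "SUNRISE SUPERMARKETQU").

```python
import re
import string
from itertools import takewhile

REMOVE_LETTERS = ['AL ', 'THE ']  # prefixes from the question
PREFIX_PATTERN = re.compile('^(?:' + '|'.join(map(re.escape, REMOVE_LETTERS)) + ')')


def clean_name(name):
    """Apply the three cleaning steps from the question to a single value."""
    # Step 1: drop at most one listed prefix from the start.
    name = PREFIX_PATTERN.sub('', name, count=1)

    # Step 2: keep at most the first two space-separated words.
    words = name.split(' ')
    if len(words) > 1 and len(words[1]) > 2:
        # Assumption: cut the second word at its first non-alphanumeric character.
        second = ''.join(takewhile(str.isalnum, words[1]))
        name = (words[0] + ' ' + second).rstrip()
    else:
        name = words[0]

    # Step 3: strip trailing digits.
    return name.rstrip(string.digits)


sample = [
    'AL GHAITHA & AL MOOSA',
    'AL ASEEL ELECTRONICS T',
    'SUNRISE SUPERMARKET-QU',
    'EMARAT-AL SAFIYAH(6735',
    'LULU CENTRE LLC EFT TE',
    'THE MAX',
]
cleaned = [clean_name(n) for n in sample]
print(cleaned)
```

With pandas this could be applied as df['Names'] = df['Names'].map(clean_name). That still calls a Python function per value, but it makes a single pass over the column instead of several loops of df.at lookups; whether it is fast enough for a million rows would need to be measured.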

Solution 1
0 2019-05-27 09:23:20

Solution 2
0 Accepted 2019-05-27 09:58:18

Solution 3
0 2019-05-27 10:05:44