Remove a list of provided letters from the start of the string

I have a DataFrame df with a column "Names", as shown below:

Names
AL GHAITHA & AL MOOSA
AL ASEEL ELECTRONICS T
SUNRISE SUPERMARKET-QU
EMARAT-AL SAFIYAH(6735
LULU CENTRE LLC EFT TE
THE MAX

Code:

import string  # needed by remove_end_digits below

remove_letters = ['AL ', 'THE ']

# my function below :

def remove_start_words(df, col, letters):
    for l in letters:
        for i in df.index:
            x = df.at[i, col]
            if x.startswith(l):
                df.at[i, col] = x[len(l):]
            else:
                df.at[i, col] = x

def remove_strings(df, col):
    for i in df.index:
        x = df.at[i, col]
        x = x.split(' ')
        if len(x) > 1:
            if len(x[1]) > 2:
                x[1] = ''.join(e for e in x[1] if e.isalnum())
                x = ' '.join(x[0:2])
                df.at[i, col] = x
            else:
                df.at[i, col] = x[0]
        else:
            df.at[i, col] = df.at[i, col]


def remove_end_digits(df, col):
    for i in df.index:
        x = df.at[i, col]
        df.at[i, col] = x.rstrip(string.digits)

# calling my function
remove_start_words(df=df, col='Names',
                          letters=remove_letters)

remove_strings(df=df, col='Names')
remove_end_digits(df=df, col='Names')

Now the problem is that I have more than 1 million values in this column, and my code is not well optimized. How can I get an optimized solution?

Issue 1: I understand that I have used 2 loops (one over remove_letters and one over all the column values), which is causing the slowness.

Is there a better way, where I can check whether the column values start with any of the strings in the remove_letters list and strip them in one shot?

Issues 2 & 3: The objective of the function remove_strings is to keep only the first 2 words of each name. For example, for ASEEL ELECTRONICS T the output will be ASEEL ELECTRONICS.

Is there a faster way to do what remove_strings and remove_end_digits do?

Main issue: Can all 3 functions be done in one shot, all together?

Expected dataframe:

Names
GHAITHA
ASEEL ELECTRONICS
SUNRISE SUPERMARKET
EMARAT-AL SAFIYAH
LULU CENTRE
MAX

NOTE: The function remove_start_words should check whether a value in "Names" starts with any of the listed strings and, if so, remove it. For example, "AL THEMAX" should become "THEMAX", not "MAX" (which would remove both AL and THE).

Thanks in advance.

You can use the replace method like this:

import pandas as pd

file_path = 'file3.xlsx'

df = pd.read_excel(file_path)

words_to_remove = ["THE", "AL"]

for word in words_to_remove:
    df.Names = df.Names.str.replace(word, "")

print(df)

Since you said you only wanted the words removed from the beginning of the string, you could use a regular expression:

import re
import pandas as pd

file_path = 'file3.xlsx'

df = pd.read_excel(file_path)

# Trailing spaces keep a name such as "ALPHA" from being cut to "PHA".
words_to_remove = ["THE ", "AL "]
# Group the alternatives so that '^' applies to every word, not only the first.
regular_expression = '^(?:' + '|'.join(words_to_remove) + ')'

df.Names = df.Names.apply(lambda x: re.sub(regular_expression, "", x))

The regular_expression variable would contain ^(?:THE |AL ) in this case, meaning "THE " or "AL " at the beginning of the string. Without the group, the ^ would anchor only the first alternative, and the second one would be removed wherever it appears in the string.
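One detail worth checking with this kind of pattern is operator precedence: in a regular expression, | binds more loosely than ^, so an anchor written before the first alternative does not cover the others. The following is a small sketch with the standard re module on one name from the question; the variable names are only illustrative.

```python
import re

# '^' belongs to the first alternative only: "AL " matches anywhere.
ungrouped = re.compile(r"^THE |AL ")
# The non-capturing group makes '^' apply to both alternatives.
grouped = re.compile(r"^(?:THE |AL )")

name = "AL GHAITHA & AL MOOSA"
ungrouped_result = ungrouped.sub("", name)
grouped_result = grouped.sub("", name)

print(ungrouped_result)  # GHAITHA & MOOSA
print(grouped_result)    # GHAITHA & AL MOOSA
```

With the grouped pattern, "AL THEMAX" also becomes "THEMAX" rather than "MAX", because after the first prefix is removed the match position is no longer at the start of the string.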

A couple of minutes of searching on Google tells me that

def stripper(delete_list):
    def delete(item):
        # str.lstrip() strips a set of characters, not a prefix,
        # so test for the prefix explicitly and slice it off.
        for rm in delete_list:
            if item.startswith(rm):
                return item[len(rm):]  # remove at most one prefix
        return item
    return delete

df['Names'] = df['Names'].apply(stripper(['AL ', 'THE ']))

should do the trick.
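For the main question (doing all three steps in one go), here is a minimal sketch that uses only the standard re and string modules. It rests on an assumption about the intended rule, read off the expected output rather than the question's code: drop one leading prefix, keep the first word, keep the leading run of letters of the second word if that run has at least 3 letters, then strip trailing digits. The names clean_name, PREFIX_RE and FIRST_TWO_RE are hypothetical.

```python
import re
import string

# Prefixes taken from the question; everything else is illustrative.
PREFIXES = ("AL ", "THE ")
PREFIX_RE = re.compile("^(?:" + "|".join(map(re.escape, PREFIXES)) + ")")
# First word, optionally followed by the leading 3+ letters of the second word.
FIRST_TWO_RE = re.compile(r"(\S+)(?: ([A-Za-z]{3,}))?")


def clean_name(name):
    """Apply all three clean-up steps to a single string."""
    name = PREFIX_RE.sub("", name, count=1)  # at most one prefix is removed
    match = FIRST_TWO_RE.match(name)
    if match is None:  # empty string or leading whitespace: leave unchanged
        return name
    first, second = match.groups()
    kept = first if second is None else first + " " + second
    return kept.rstrip(string.digits)


names = [
    "AL GHAITHA & AL MOOSA",
    "AL ASEEL ELECTRONICS T",
    "SUNRISE SUPERMARKET-QU",
    "EMARAT-AL SAFIYAH(6735",
    "LULU CENTRE LLC EFT TE",
    "THE MAX",
]
cleaned = [clean_name(n) for n in names]
print(cleaned)
```

On a DataFrame this could be applied as df['Names'] = [clean_name(n) for n in df['Names']], which makes a single pass over the column and assumes every value is a string. Whether that is fast enough for a million values is worth measuring on the real data.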
