Remove a list of provided letters from the start of the string
I have a DataFrame df with a column "Names", as below:
Names
AL GHAITHA & AL MOOSA
AL ASEEL ELECTRONICS T
SUNRISE SUPERMARKET-QU
EMARAT-AL SAFIYAH(6735
LULU CENTRE LLC EFT TE
THE MAX
Code:
import string

remove_letters = ['AL ', 'THE ']

# my functions below:
def remove_start_words(df, col, letters):
    # Strip each listed prefix from the start of every value, row by row.
    for l in letters:
        for i in df.index:
            x = df.at[i, col]
            if x.startswith(l):
                df.at[i, col] = x[len(l):]

def remove_strings(df, col):
    # Keep the first word, plus the second word when it has more than 2 characters.
    for i in df.index:
        x = df.at[i, col].split(' ')
        if len(x) > 1:
            if len(x[1]) > 2:
                # Drop non-alphanumeric characters from the second word.
                x[1] = ''.join(e for e in x[1] if e.isalnum())
                df.at[i, col] = ' '.join(x[0:2])
            else:
                df.at[i, col] = x[0]

def remove_end_digits(df, col):
    # Remove any digits at the end of the value.
    for i in df.index:
        df.at[i, col] = df.at[i, col].rstrip(string.digits)

# calling my functions
remove_start_words(df=df, col='Names', letters=remove_letters)
remove_strings(df=df, col='Names')
remove_end_digits(df=df, col='Names')
Now the problem is that my DataFrame has more than 1 million values in this column, and my code is not well optimized. How can I get an optimized solution?
Issue 1: I understand that I have used two loops (one over remove_letters and one over all the column values), and that this is causing the slowness.
Is there a better way, where I can check whether a column value starts with one of the strings in the remove_letters list and strip it in one shot?
Issues 2 & 3: The objective of the function remove_strings is to keep only the first 2 words of each value in the column. For example, ASEEL ELECTRONICS T should become ASEEL ELECTRONICS.
Is there a faster way to write the functions remove_strings and remove_end_digits?
Main issue: Can all 3 of these functions be done together in one shot?
Expected dataframe:
Names
GHAITHA
ASEEL ELECTRONICS
SUNRISE SUPERMARKET
EMARAT-AL SAFIYAH
LULU CENTRE
MAX
NOTE: The function remove_start_words should check whether a value in "Names" starts with any of the listed strings and, if so, remove it. For example, "AL THEMAX" should become "THEMAX", not "MAX" (that is, AL and THE should not both be removed).
Thanks in advance.
You can use the replace method like this:
import pandas as pd
file_path = 'file3.xlsx'
df = pd.read_excel(file_path)
words_to_remove = ["THE", "AL"]
for word in words_to_remove:
    df.Names = df.Names.str.replace(word, "")
print(df)
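One caveat worth checking before running this on real data: an unanchored replace removes every occurrence of each word, not only a leading one. A minimal sketch on two of the sample names follows. The Series is built inline here rather than read from file3.xlsx, and regex=False is passed explicitly so the behaviour does not depend on the pandas version.

```python
import pandas as pd

# Two of the sample names from the question, built inline for illustration.
names = pd.Series(["AL GHAITHA & AL MOOSA", "EMARAT-AL SAFIYAH(6735"])

for word in ["THE", "AL"]:
    # Literal (non-regex) replacement of every occurrence of the word.
    names = names.str.replace(word, "", regex=False)

print(names.tolist())
# [' GHAITHA &  MOOSA', 'EMARAT- SAFIYAH(6735']
```

The "AL" inside EMARAT-AL is removed as well, and a leading space is left behind in the first name.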
Since you said you only want the words removed from the beginning of the string, you could use a regular expression:
import re
import pandas as pd
file_path = 'file3.xlsx'
df = pd.read_excel(file_path)
words_to_remove = ["THE ", "AL "]
# Group the alternatives so that ^ anchors every one of them, not just the first.
regular_expression = '^(?:' + '|'.join(words_to_remove) + ')'
df.Names = df.Names.apply(lambda x: re.sub(regular_expression, "", x))
The regular_expression variable contains ^(?:THE |AL ) in this case, meaning THE or AL (each followed by a space) at the beginning of the string. Without the group, ^THE|AL would match THE at the beginning or AL anywhere in the string.
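An anchored pattern can also be passed to pandas directly through Series.str.replace, which avoids the per-row lambda. This is a minimal sketch, assuming the prefixes keep their trailing space as in the question's remove_letters list. The sample DataFrame is illustrative, and re.escape is added only in case a prefix ever contains regex metacharacters. Whether it is noticeably faster on 1 million rows is worth timing rather than assuming.

```python
import re
import pandas as pd

# Illustrative data; the last two rows exercise the edge cases from the NOTE.
df = pd.DataFrame(
    {"Names": ["AL GHAITHA & AL MOOSA", "THE MAX", "AL THEMAX", "ALPHA AL"]}
)

prefixes = ["AL ", "THE "]
# Non-capturing group so that ^ applies to every alternative.
pattern = "^(?:" + "|".join(re.escape(p) for p in prefixes) + ")"

# One pass over the column; at most one leading prefix is removed per value.
df["Names"] = df["Names"].str.replace(pattern, "", regex=True)

print(df["Names"].tolist())
# ['GHAITHA & AL MOOSA', 'MAX', 'THEMAX', 'ALPHA AL']
```

Because the pattern is anchored and the string is not rescanned after the substitution, "AL THEMAX" loses only its first prefix, and "ALPHA AL" is left untouched.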
A couple of minutes of searching on Google suggests that
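def stripper(delete_list):
    def delete(item):
        # str.lstrip removes a set of characters, not a prefix, so test with startswith.
        for rm in delete_list:
            if item.startswith(rm):
                return item[len(rm):]  # remove at most one prefix
        return item
    return delete

df['Names'] = df['Names'].apply(stripper(['AL ', 'THE ']))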
should do the trick.
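For the main issue (doing all three steps together), one possible vectorised sketch is below. The function name clean_names is hypothetical, and it assumes the Expected dataframe in the question is the goal: the second word is cut at its first non-alphanumeric character, so SUPERMARKET-QU becomes SUPERMARKET (the question's own remove_strings would instead give SUPERMARKETQU). It also assumes the names contain only ASCII letters and digits.

```python
import re
import pandas as pd

def clean_names(names, prefixes):
    """Hypothetical helper: applies the three cleaning steps with vectorised string methods."""
    # Step 1: drop at most one leading prefix (anchored, grouped alternation).
    prefix_re = "^(?:" + "|".join(re.escape(p) for p in prefixes) + ")"
    s = names.str.replace(prefix_re, "", regex=True)

    # Step 2: capture the first two space-separated tokens; "second" is NaN
    # when the value has only one token.
    tokens = s.str.extract(r"^(?P<first>[^ ]*)(?: (?P<second>[^ ]*))?")
    first = tokens["first"]
    second = tokens["second"].fillna("")

    # Assumption: keep only the leading run of ASCII letters/digits of the
    # second token, and only when the raw token is longer than 2 characters.
    second_clean = second.str.extract(r"^([A-Za-z0-9]*)", expand=False)
    keep_second = second.str.len() > 2
    out = first.where(~keep_second, first + " " + second_clean)

    # Step 3: strip any trailing digits.
    return out.str.rstrip("0123456789")

df = pd.DataFrame({"Names": [
    "AL GHAITHA & AL MOOSA",
    "AL ASEEL ELECTRONICS T",
    "SUNRISE SUPERMARKET-QU",
    "EMARAT-AL SAFIYAH(6735",
    "LULU CENTRE LLC EFT TE",
    "THE MAX",
]})
df["Names"] = clean_names(df["Names"], ["AL ", "THE "])
print(df["Names"].tolist())
# ['GHAITHA', 'ASEEL ELECTRONICS', 'SUNRISE SUPERMARKET',
#  'EMARAT-AL SAFIYAH', 'LULU CENTRE', 'MAX']
```

Each step runs over the whole column at once instead of writing back cell by cell with df.at, which is where most of the time in the original loops is likely to go. It is still worth timing on the real data.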