Find character at which string can be differentiated from list of strings
For every string in a df column, I need the position of the character at which this string becomes unique, that is, its uniqueness point (UP). For illustration, here is a toy dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'word': ['can', 'cans', 'canse', 'canpe', 'canp', 'camp'],
    'code': ['k@n', 'k@n}', 'k@(z', np.nan, 'k@()', np.nan]})
    word  code
0    can   k@n
1   cans  k@n}
2  canse  k@(z
3  canpe
4   canp  k@()
5   camp
The expected result is given below. I computed the UP for the two columns word and code:
    word  code  wordUP  codeUP
0    can   k@n       4       4   # 'can' can be discriminated from 'cans' at the imagined fourth letter, which does not exist
1   cans  k@n}       5       4
2  canse  k@(z       5       4
3  canpe             5           # empty cells don't have a UP
4   canp  k@()       5       4
5   camp             3
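Stated as code, the UP of a string is one more than the length of the longest prefix it shares with any other distinct string in the column. The sketch below only illustrates that definition on the word column; `uniqueness_point` is a hypothetical helper name, and because it compares every pair it is quadratic, so it is not meant as a fast solution.

```python
from os.path import commonprefix


def uniqueness_point(s, strings):
    """Hypothetical helper: 1 + longest prefix shared with any other distinct string."""
    # commonprefix compares character by character, so it works on plain strings.
    shared = (len(commonprefix([s, other])) for other in strings if other != s)
    # default=0 covers the case where there is no other distinct string.
    return max(shared, default=0) + 1


words = ['can', 'cans', 'canse', 'canpe', 'canp', 'camp']
print({w: uniqueness_point(w, words) for w in words})
# {'can': 4, 'cans': 5, 'canse': 5, 'canpe': 5, 'canp': 5, 'camp': 3}
```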
My current implementation works, but it is too slow for my 100k-row dataframe. You can see it below. Can you come up with something faster?
def slice_prefix(a, b, start=0, length=1):
    while 1:
        while a[start:start + length] == b[start:start + length]:
            start += length
            length += length
        if length > 1:
            length = 1
        else:
            return start

df = df.fillna('')
# get possible orthographic and phonetic representations
all_words = df['word'].dropna().to_list()
all_codes = df['code'].dropna().to_list()
# prepare new columns
df['wordUP'] = np.nan
df['codeUP'] = np.nan
# compute UP
for idx, row in df.iterrows():
    word = row['word']
    code = row['code']
    wordUP = max([slice_prefix(word, item) for item in all_words if item != word]) + 1
    codeUP = max([slice_prefix(code, item) for item in all_codes if item != code]) + 1
    df.loc[idx, 'wordUP'] = wordUP
    df.loc[idx, 'codeUP'] = codeUP
df.loc[df['code'] == '', 'codeUP'] = 0
As it is, your code runs in 0.0012 seconds on average (10,000 iterations) on my computer.
I suggest refactoring your code in a way that is both faster (0.0008 seconds, about 33% less) and more idiomatic, and thus more readable:
import numpy as np
import pandas as pd

def slice_prefix(a, b, start=0, length=1):
    while 1:
        while a[start:start + length] == b[start:start + length]:
            start += length
            length += length
        if length > 1:
            length = 1
        else:
            return start

def find_up(x, strings):
    """New helper function"""
    return int(max([slice_prefix(x, item) for item in strings if item != x])) + 1

# use a single space as a placeholder for missing values
df = df.fillna(" ")
df = (
    df
    .assign(wordUP=df["word"].apply(lambda x: find_up(x, df["word"])))
    .assign(codeUP=df["code"].apply(lambda x: find_up(x, df["code"])))
    # blank out the placeholder rows, whose UP comes out as 1
    # (caveat: this also blanks any genuine UP of 1 elsewhere in the frame)
    .replace(1, "")
)
print(df)
# Output
    word  code  wordUP  codeUP
0    can   k@n       4       4
1   cans  k@n}       5       4
2  canse  k@(z       5       4
3  canpe             5
4   canp  k@()       5       4
5   camp             3
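The refactoring above still compares every string with every other one, so it stays quadratic in the number of rows. For a 100k-row column, one possible alternative (a sketch, not benchmarked here) relies on a property of lexicographic sorting: the longest prefix a string shares with any other distinct string is always shared with one of its two neighbours in sorted order. That leaves one sort plus one comparison per adjacent pair. The names `common_prefix_len` and `uniqueness_points` are hypothetical helpers introduced for illustration.

```python
def common_prefix_len(a, b):
    """Length of the longest common prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def uniqueness_points(values):
    """Map each distinct string to 1 + its longest prefix shared with another distinct string."""
    uniq = sorted(set(values))
    # lcp[i] is the common prefix length of uniq[i] and uniq[i + 1]
    lcp = [common_prefix_len(a, b) for a, b in zip(uniq, uniq[1:])]
    up = {}
    for i, s in enumerate(uniq):
        left = lcp[i - 1] if i > 0 else 0
        right = lcp[i] if i < len(lcp) else 0
        up[s] = max(left, right) + 1
    return up


words = ['can', 'cans', 'canse', 'canpe', 'canp', 'camp']
codes = ['k@n', 'k@n}', 'k@(z', 'k@()']  # missing values already dropped

word_up = uniqueness_points(words)
code_up = uniqueness_points(codes)
print([word_up[w] for w in words])  # [4, 5, 5, 5, 5, 3]
print([code_up[c] for c in codes])  # [4, 4, 4, 4]
```

With pandas, the resulting dictionary can be applied with something like `df['word'].map(uniqueness_points(df['word'].dropna()))`. Missing cells then stay NaN rather than becoming an empty string or 0, which also means the column becomes float when any value is missing. Duplicated strings are treated as one string, as in the original code, and a column with a single distinct string gets a UP of 1 instead of raising an error.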