I have a dataframe where the row indices and column headings should determine the content of each cell. I'm working with a much larger version of the following df:
df = pd.DataFrame(index = ['afghijklde', 'afghijklmde', 'ade', 'afghilmde', 'amde'],
columns = ['ae', 'azde', 'afgle', 'arlde', 'afghijklbcmde'])
Specifically, I want to apply the custom function edit_distance()
or equivalent (see here for function code) which calculates a difference score between two strings. The two inputs are the row and column names. The following works but is extremely slow:
for seq in df.index:
for seq2 in df.columns:
df.loc[seq, seq2] = edit_distance(seq, seq2)
This produces the result I want:
ae azde afgle arlde afghijklbcmde
afghijklde 8 7 5 6 3
afghijklmde 9 8 6 7 2
ade 1 1 3 2 10
afghilmde 7 6 4 5 4
amde 2 1 3 2 9
What is a better way to do this, perhaps using applymap()
?. Everything I've tried with applymap()
or apply
or df.iterrows()
has returned errors of the kind AttributeError: "'float' object has no attribute 'index'"
. Thanks.
Turns out there's an even better way to do this. onepan's dictionary comprehension answer above is good but returns the df index and columns in random order. Using a nested .apply()
accomplishes the same thing at about the same speed and doesn't change the row/column order. The key is to not get hung up on naming the df's rows and columns first and filling in the values second. Instead, do it the other way around, initially treating the future index and columns as standalone pandas Series.
series_rows = pd.Series(['afghijklde', 'afghijklmde', 'ade', 'afghilmde', 'amde'])
series_cols = pd.Series(['ae', 'azde', 'afgle', 'arlde', 'afghijklbcmde'])
df = pd.DataFrame(series_rows.apply(lambda x: series_cols.apply(lambda y: edit_distance(x, y))))
df.index = series_rows
df.columns = series_cols
you could use comprehensions, which speeds it up ~4.5x on my pc
first = ['afghijklde', 'afghijklmde', 'ade', 'afghilmde', 'amde']
second = ['ae', 'azde', 'afgle', 'arlde', 'afghijklbcmde']
pd.DataFrame.from_dict({f:{s:edit_distance(f, s) for s in second} for f in first}, orient='index')
# output
# ae azde afgle arlde afghijklbcmde
# ade 1 2 2 2 2
# afghijklde 1 3 4 4 9
# afghijklmde 1 3 4 4 10
# afghilmde 1 3 4 4 8
# amde 1 3 3 3 3
# this matches to edit_distance('ae', 'afghijklde') == 8, e.g.
note I used this code for edit_distance (first response in your link):
def edit_distance(s1, s2):
if len(s1) > len(s2):
s1, s2 = s2, s1
distances = range(len(s1) + 1)
for i2, c2 in enumerate(s2):
distances_ = [i2+1]
for i1, c1 in enumerate(s1):
if c1 == c2:
distances_.append(distances[i1])
else:
distances_.append(1 + min((distances[i1], distances[i1 + 1], distances_[-1])))
distances = distances_
return distances[-1]
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.