Sorting a pandas DataFrame with German umlauts

I have a dataframe which I want to sort via sort_values on one column.

The problem is that some of the words start with German umlauts.

For example: Österreich, Zürich.

These sort as Zürich, Österreich. The order should be Österreich, Zürich.

Ö should sort next to O rather than after Z.

I have found out how to do this with lists in Python using locale and strxfrm. Can I do this directly in the pandas dataframe somehow?
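For reference, the list-based approach mentioned in the question can be sketched roughly as follows. The helper name `sort_german` and the candidate locale names are assumptions for illustration; locale names differ between platforms, so the sketch falls back to plain code-point order when no German locale is installed.

```python
import locale


def sort_german(words):
    """Sort strings with German collation if such a locale is installed."""
    # Candidate locale names are assumptions; availability depends on the OS.
    for name in ("de_DE.UTF-8", "de_DE.utf8", "de_DE", "German_Germany.1252"):
        try:
            locale.setlocale(locale.LC_COLLATE, name)
            break
        except locale.Error:
            continue
    else:
        # No German locale found: fall back to plain code-point order.
        return sorted(words)
    # strxfrm turns each string into a key that compares in locale order.
    return sorted(words, key=locale.strxfrm)


words = ["Zürich", "Österreich", "Bern"]
print(sort_german(words))
```

When a German locale is in effect, Österreich should come out before Zürich; with the fallback, the umlaut still sorts after Z.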

Edit: Thank you. Stef's example worked quite well, but my real-life DataFrame contains numbers, and his version did not work there, so I used alexey's idea. I did the following; it can probably be shortened:


import pandas as pd

df = pd.DataFrame({'location': ['Österreich','Zürich','Bern', 254345],'code':['ö','z','b', 'v']})

# keep the original index as a column for joining later
df = df.reset_index(drop=False)

# convert int to str
df['location'] = df['location'].astype(str)

# sort by location with umlauts (the sorted values are the NFD-normalized strings)
df_sort_index = df['location'].str.normalize('NFD').sort_values(ascending=True).reset_index(drop=False)

# drop location so we don't have it in both tables
df = df.drop('location', axis=1)

# inner join on index
new_df = pd.merge(df_sort_index, df, how='inner', on='index')

# drop index as column
new_df = new_df.drop('index', axis=1)

You can use the Unicode NFD normal form:

>>> import pandas as pd
>>> names = pd.Series(['Österreich', 'Ost', 'S', 'N'])
>>> names.str.normalize('NFD').sort_values()
3              N
1            Ost
0    Österreich
2              S
dtype: object

# use the sorted index to reorder the rows of a dataframe that shares the same index
>>> df.loc[names.str.normalize('NFD').sort_values().index]

It's not quite what you wanted, but for proper ordering you need language knowledge (like the locale you mentioned).
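A shorter variant of the same NFD idea is sketched below. It assumes a pandas version whose sort_values accepts a key argument (1.1 or later); the helper name `nfd_key` is made up for this example. Because the key is only used for ordering, the original values, including the integer, are left untouched.

```python
import pandas as pd

df = pd.DataFrame({
    "location": ["Österreich", "Zürich", "Bern", 254345],
    "code": ["ö", "z", "b", "v"],
})


def nfd_key(col):
    # Cast to str so mixed types can be compared, then decompose umlauts
    # (Ö becomes O followed by a combining diaeresis).
    return col.astype(str).str.normalize("NFD")


# The key only affects the ordering; the stored values are not changed.
df_sorted = df.sort_values("location", key=nfd_key).reset_index(drop=True)
print(df_sorted)
```

As with the Series example, this is still code-point ordering on decomposed strings, not true German collation.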

NFD uses two code points for umlauts, e.g. Ö becomes O followed by a combining diaeresis, which is O\xcc\x88 in UTF-8 (you can see the difference with names.str.normalize('NFD').str.encode('utf-8')).
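The decomposition can be checked with the standard library alone. This is a small illustration rather than part of the original answer; the variable names are arbitrary.

```python
import unicodedata

composed = "\u00d6"  # Ö as a single precomposed code point
decomposed = unicodedata.normalize("NFD", composed)

# One code point before normalization, two afterwards.
print(len(composed), len(decomposed))  # 1 2

# 'O' followed by the combining diaeresis U+0308, shown as UTF-8 bytes.
print(decomposed.encode("utf-8"))  # b'O\xcc\x88'

# The decomposed word starts with a plain 'O', so it sorts among the O words.
print("Ost" < decomposed + "sterreich" < "P")  # True
```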

You could use sorted with a locale-aware sorting function (in my example, setlocale returned 'German_Germany.1252') to sort the column values. The tricky part is to sort all the other columns accordingly. A somewhat hacky solution is to temporarily set the index to the column to be sorted, reindex on the properly sorted index values, and then reset the index.

import functools
import locale

import pandas as pd

# '' selects the user's default locale
locale.setlocale(locale.LC_ALL, '')
df = pd.DataFrame({'location': ['Österreich','Zürich','Bern'],'code':['ö','z','b']})

df = df.set_index('location')
df = df.reindex(sorted(df.index, key=functools.cmp_to_key(locale.strcoll))).reset_index()

Output of print(df):

     location code
0        Bern    b
1  Österreich    ö
2      Zürich    z


Update for mixed-type columns: if the column to be sorted is of mixed types (e.g. strings and integers), then you have two possibilities:

a) convert the column to string and then sort as described above (the result column will be all strings):

locale.setlocale(locale.LC_ALL, '')
df = pd.DataFrame({'location': ['Österreich','Zürich','Bern', 254345],'code':['ö','z','b','v']})
df.location = df.location.astype(str)
df = df.set_index('location')
df = df.reindex(sorted(df.index, key=functools.cmp_to_key(locale.strcoll))).reset_index()
print(df.location.values)
# ['254345' 'Bern' 'Österreich' 'Zürich']

b) sort on a copy of the column converted to string (the result column will retain mixed types):

locale.setlocale(locale.LC_ALL, '')
df = pd.DataFrame({'location': ['Österreich','Zürich','Bern', 254345],'code':['ö','z','b','v']})
df = df.set_index(df.location.astype(str))
df = df.reindex(sorted(df.index, key=functools.cmp_to_key(locale.strcoll))).reset_index(drop=True)
print(df.location.values)
# [254345 'Bern' 'Österreich' 'Zürich']

Sort with locale:

import pandas as pd
import locale
# locale names are platform dependent (e.g. 'de_DE.UTF-8' on many Linux systems)
locale.setlocale(locale.LC_ALL, 'de_de')
# locale aliases: https://github.com/python/cpython/blob/3.10/Lib/locale.py

#create df
df = pd.DataFrame({'location': ['Zürich','Österreich','Bern', 254345],'code':['z','ö','b','v']})
#convert int to str
df['location']=df['location'].astype(str)
#sort
df_ord = df.sort_values(by = 'location', key = lambda col: col.map(lambda x: locale.strxfrm(x)))

Multisort with locale:

import pandas as pd
import locale
locale.setlocale(locale.LC_ALL, 'es_es')

# create df
lista1 = ['sarmiento', 'ñ', 'á', 'sánchez', 'a', 'ó', 's', 'ñ', 'á', 'sánchez']
lista2 = [10, 20, 60, 40, 20, 20, 10, 5, 30, 20]
df = pd.DataFrame(list(zip(lista1, lista2)), columns = ['Col1', 'Col2'])

#sort by Col2 and Col1
df_temp = df.sort_values(by = 'Col2')
df_ord = df_temp.sort_values(by = 'Col1', key = lambda col: col.map(lambda x: locale.strxfrm(x)), kind = 'mergesort')
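The two-pass approach above relies on the second sort being stable, which is why kind = 'mergesort' is passed. The sketch below illustrates that on a tiny frame; the data is made up, and `str.upper` is only a stand-in for `locale.strxfrm` so the example does not depend on which locales are installed.

```python
import pandas as pd

df = pd.DataFrame({"Col1": ["b", "a", "b", "a"], "Col2": [2, 2, 1, 1]})

# Pass 1: order by the secondary key.
tmp = df.sort_values(by="Col2")

# Pass 2: order by the primary key with a stable algorithm, so rows that
# tie on Col1 keep the Col2 order established in pass 1.
out = tmp.sort_values(by="Col1", key=lambda col: col.str.upper(), kind="mergesort")

print(out)
```

With an unstable algorithm in the second pass, rows that tie on Col1 are not guaranteed to stay ordered by Col2.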
