Fuzzy Searching a Column in Pandas

Is there a way to search for a value in a dataframe column using FuzzyWuzzy or a similar library? I'm trying to find a value in one column that corresponds to the value in another while taking fuzzy matching into account.

For example, if I have State Names in one column and State Codes in another, how would I find the state code for Florida, which is FL, while catering for abbreviations like "Flor"?

In other words, I want to find a match for a State Name corresponding to "Flor" and get the corresponding State Code "FL".

Any help is greatly appreciated.

If the abbreviations are all prefixes, you can use the .startswith() string method against either the short or long version of the state.

>>> test_value = "Flor"
>>> test_value.upper().startswith("FL")
True
>>> "Florida".lower().startswith(test_value.lower())
True
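
The same prefix test can be applied to a whole column at once. This is only a minimal sketch: the sample frame, the column names "states" and "st", and the helper name codes_by_prefix are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample data; the column names are assumptions.
df = pd.DataFrame(
    {"states": ["Florida", "Texas", "New York"], "st": ["FL", "TX", "NY"]}
)

def codes_by_prefix(partial, frame, name_col="states", code_col="st"):
    """Return the codes of all rows whose name starts with `partial`,
    ignoring case. The result is a list because a short prefix may
    match more than one row."""
    mask = frame[name_col].str.lower().str.startswith(partial.lower())
    return frame.loc[mask, code_col].to_list()

print(codes_by_prefix("Flor", df))  # ['FL']
print(codes_by_prefix("zz", df))    # []
```

Returning a list rather than a single value keeps the "no match" and "several matches" cases visible to the caller.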

However, if you have more complex abbreviations, difflib.get_close_matches will probably do what you want!

>>> import pandas as pd
>>> import difflib
>>> df = pd.DataFrame({"states": ("Florida", "Texas"), "st": ("FL", "TX")})
>>> df
    states  st
0  Florida  FL
1    Texas  TX
>>> difflib.get_close_matches("Flor", df["states"].to_list())
['Florida']
>>> difflib.get_close_matches("x", df["states"].to_list(), cutoff=0.2)
['Texas']
>>> df.loc[df["states"] == "Texas", "st"].iloc[0]
'TX'

You will probably want a try/except IndexError around reading the first member of the list returned by difflib, since that list is empty when nothing clears the cutoff. You may also want to tweak the cutoff to get fewer false matches between similar state names (perhaps offer all the candidate states to the user, or require more letters for similar states).
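
As a rough sketch of that advice, the empty-list case can also be handled by checking the result before indexing it, which avoids the try/except. The helper name fuzzy_code, its default column names, and the sample frame are assumptions for illustration only.

```python
import difflib
import pandas as pd

# Hypothetical sample data; the column names are assumptions.
df = pd.DataFrame({"states": ["Florida", "Texas"], "st": ["FL", "TX"]})

def fuzzy_code(partial, frame, name_col="states", code_col="st", cutoff=0.6):
    """Return the code of the closest name, or None when no name
    scores at least `cutoff` against `partial`."""
    names = frame[name_col].to_list()
    # n=1 asks difflib for the single best candidate only
    matches = difflib.get_close_matches(partial, names, n=1, cutoff=cutoff)
    if not matches:
        return None
    return frame.loc[frame[name_col] == matches[0], code_col].iloc[0]

print(fuzzy_code("Flor", df))           # FL
print(fuzzy_code("x", df))              # None (below the default cutoff)
print(fuzzy_code("x", df, cutoff=0.2))  # TX
```

Lowering the cutoff makes short inputs match, but it also raises the chance of a wrong match, so it is worth trying a few values against your real data.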

You may also see the best results by combining the two: test prefixes first, then fall back to the fuzzy match.

Putting it all together

import difflib

def state_from_partial(test_text, df, col_fullnames, col_shortnames):
    if len(test_text) < 2:
        raise ValueError("must have at least 2 characters")

    # if there's exactly two characters, try to directly match short name
    # (`in` on a Series tests the index, so compare against the values)
    if len(test_text) == 2 and test_text.upper() in df[col_shortnames].values:
        return test_text.upper()

    states = df[col_fullnames].to_list()
    match = None
    # prefix matching is left disabled: it takes the first hit, which will
    # definitely be wrong at least for some states starting with M or New
    #for state in states:
    #    if state.lower().startswith(test_text.lower()):
    #        match = state
    #        break  # leave loop and look up the short name below

    if not match:
        try:  # see if there's a fuzzy match
            match = difflib.get_close_matches(test_text, states)[0]  # cutoff=0.6
        except IndexError:
            pass  # consider matching against a list of problematic states with different cutoff

    if match:
        return df.loc[df[col_fullnames] == match, col_shortnames].iloc[0]

    raise ValueError("couldn't find a state matching partial: {}".format(test_text))

Beware of states which start with 'New' or 'M' (and probably others), which are all pretty close to each other and will probably need special handling. Testing will do wonders here.
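
One possible form of that special handling is to collect every plausible candidate and refuse to guess when more than one remains. This is a sketch under assumptions: the STATES mapping is a hypothetical subset, and the helper names candidates and resolve are invented for illustration.

```python
import difflib

# Hypothetical subset of full names mapped to codes.
STATES = {
    "New York": "NY",
    "New Jersey": "NJ",
    "New Mexico": "NM",
    "Florida": "FL",
}

def candidates(partial, names, cutoff=0.6):
    """Return every plausible name: all case-insensitive prefix
    matches if there are any, otherwise all fuzzy matches."""
    names = list(names)
    lowered = partial.lower()
    by_prefix = [name for name in names if name.lower().startswith(lowered)]
    if by_prefix:
        return by_prefix
    return difflib.get_close_matches(partial, names, n=len(names), cutoff=cutoff)

def resolve(partial, states=STATES):
    """Return the code for a unique match; raise LookupError when the
    input matches nothing or matches several names."""
    found = candidates(partial, states)
    if len(found) == 1:
        return states[found[0]]
    if not found:
        raise LookupError("no state matches {!r}".format(partial))
    raise LookupError("{!r} is ambiguous: {}".format(partial, ", ".join(found)))

print(resolve("Flor"))   # FL
print(resolve("New J"))  # NJ
```

With this sketch, resolve("New") raises LookupError and lists all three "New" states, so the caller can ask the user for more letters instead of silently picking one.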
