
Partial String Matching in Pandas Dataframe

I have a dataframe that contains a string column holding one or more 4-character codes, which may be separated by | or &, but not always. I am trying to map a dictionary onto each discrete 4-character code but am running into issues. I am using pandas ver 23.4.

The basic code I am trying to use:

df = df.replace(dict, regex=True)

or if trying to select a specific col:

df['Col'] = df['Col'].replace(dict, regex=True)

Both raise the following error:

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

The values of the dictionary are of type list. Would that cause a problem when performing the .replace?

Update With Sample df and dict

 ID       Code
ABCD      00FQ
JKFA    8LK9|4F5H
QWST    2RLA|R1T5&8LK9


dict={'00FQ':['A','B'], '8LK9':['X'], '4F5H':['U','Z'], '2RLA':['H','K'], 'R1T5':['B','G'] }

The dict will have more elements in it than in the dataframe.

Update with expected output

 ID       Code           Logic
ABCD      00FQ          ['A','B']
JKFA    8LK9|4F5H       ['X'] | ['U','Z']
QWST    2RLA|R1T5&8LK9  ['H','K'] | ['B','G'] & ['X']

The overall goal is to perform this replace on two dataframes, and then compare the IDs on both sides for equivalence.

The regex patterns defined in your dict might be matching more than one row of the dataframe, so pandas cannot tell which replacement value to take from the dict.

Also, when a NumPy array is evaluated for its truth value, this error is raised on purpose so that nobody has to guess what was meant. Would you consider an array of elements to be True if:

  • Any of its elements is True, or
  • All of its elements are True, or
  • Something else?

NumPy therefore raises this error so that the programmer states the intent explicitly with a.any() or a.all().

Go Here for more clarification.
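To see the ambiguity in isolation, here is a minimal sketch that only uses NumPy. The array `flags` is a made-up example, not something taken from your dataframe:

```python
import numpy as np

# Hypothetical two-element boolean array, used only for illustration
flags = np.array([True, False])

try:
    bool(flags)  # the same check an `if flags:` statement would perform
    outcome = "no error"
except ValueError as exc:
    outcome = str(exc)

print(outcome)      # the "truth value ... is ambiguous" message
print(flags.any())  # True: at least one element is True
print(flags.all())  # False: not every element is True
```

Calling .any() or .all() is how you tell NumPy which of the two readings you want.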

Here's a function which will allow you to parse relevant values from your strings:

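def string_to_list(string):
    """
    Parses a parent string for 4-character child strings.
    Returns a list of child strings (empty if the parent is too short).
    """
    # instantiate values
    child = ''
    children = []

    # too short to hold a code: return an empty list so that
    # a later loop over the result still works
    if len(string) < 4:
        return []

    for n in string:
        # skip the separator characters
        if n in ('|', '&'):
            continue

        child += n
        # a complete 4-character code has been collected
        if len(child) == 4:
            children.append(child)
            child = ''

    # finished
    return children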

Apply it to extract a list of values as follows:

df['Code_List'] = df['Code'].apply(string_to_list)
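If you prefer a regular expression for the extraction step, something like the sketch below should give the same lists for well-formed input. `find_codes` is a name I made up, and the pattern assumes every code is exactly 4 characters and never contains | or &:

```python
import re

def find_codes(text):
    """Return every run of 4 consecutive non-separator characters."""
    # [^|&] matches any character except the two separators
    return re.findall(r"[^|&]{4}", text)

print(find_codes("2RLA|R1T5&8LK9"))
```

Unlike the loop, this will not join characters across a separator, so the two can differ on malformed strings such as "AB|CDEF".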

Map to relevant logic values:

# Instantiate the dictionary of logic rules
logic_dict = {'00FQ':['A','B'], '8LK9':['X'], '4F5H':['U','Z'], '2RLA':['H','K'], 'R1T5':['B','G'] }

# Map the logic rules
df['Logic_List'] = df['Code_List'].apply(lambda arr: [logic_dict[x] for x in arr])

# Final output
    ID      Code            Code_List           Logic_List
0   ABCD    00FQ            [00FQ]              [[A, B]]
1   JKFA    8LK9|4F5H       [8LK9, 4F5H]        [[X], [U, Z]]
2   QWST    2RLA|R1T5&8LK9  [2RLA, R1T5, 8LK9]  [[H, K], [B, G], [X]]
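If you need the | and & operators kept in place, as in your expected output, one possible approach is re.sub with a function as the replacement. This is only a sketch: `code_to_logic` is a hypothetical helper, leaving unknown codes unchanged is my assumption, and the lists are rendered with str(), so the spacing differs slightly from what you posted:

```python
import re

logic_dict = {'00FQ': ['A', 'B'], '8LK9': ['X'], '4F5H': ['U', 'Z'],
              '2RLA': ['H', 'K'], 'R1T5': ['B', 'G']}

def code_to_logic(text, mapping=logic_dict):
    """Swap each known 4-character code for the text of its list."""
    def lookup(match):
        code = match.group(0)
        # Assumption: codes missing from the mapping are left as they are
        return str(mapping[code]) if code in mapping else code
    # Separators are never matched, so they stay where they were
    return re.sub(r"[^|&]{4}", lookup, text)

print(code_to_logic("2RLA|R1T5&8LK9"))
```

On the dataframe this could be applied with df['Logic'] = df['Code'].apply(code_to_logic). Because the result is a plain string, comparing the two dataframes afterwards is an ordinary string comparison.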
