Extracting 2 values from list with for-loop

I have a large Excel sheet in which one column contains several different identifiers (e.g. ISBNs). I have converted the sheet to a pandas dataframe and transformed the column with the identifiers to a list. A list entry for one row of the original column looks like this:

'ISBN:978-9941-30-551-1 Broschur :  GEL 14.90, IDN:1215507534'

However, the entries aren't all the same: some have ISBNs, some don't, some have more parts, some fewer (5 in the example above), and the different IDs are mostly, but not always, separated by a comma.

In the next step, I built a function that runs through the list items (each one long string like the one above) and splits each into its separate words, so I get something like:

'ISBN:978-9941-30-551-1', 'Broschur :', 'GEL', '14.90', 'IDN:1215507534'

I am looking to extract the values for ISBN and IDN, where present, to then add a designated column for ISBN and one for IDN to my original dataframe (instead of the "identifier"-column that contains the mixed data).

I now have the following code, which kind of does what it's supposed to, only I end up with lists in my dictionary and therefore a list for each entry in the resulting dataframe. I am sure there must be a better way of doing this, but cannot seem to think of it...

def find_stuff(item): 
        
    list_of_words = item.split()
    ISBN = list()
    IDN = list()
    
    for word in list_of_words:

        if 'ISBN' in word: 
            var = word
            var = var.replace("ISBN:", "")
            ISBN.append(var)
             
        if 'IDN' in word: 
            var2 = word
            var2 = var2.replace("IDN:", "")
            IDN.append(var2)

    
    sum_dict = {"ISBN":ISBN, "IDN":IDN}
    
    return sum_dict



output = [find_stuff(item) for item in id_lists]
print(output)

Any help very much appreciated :)

Since you are working in pandas I suggest using pandas' string methods to extract the relevant information and assign them to a new column directly. In the answer below I demonstrate some possibilities:

import pandas as pd

df = pd.DataFrame(['ISBN:978-9941-30-551-1 Broschur :  GEL 14.90, IDN:1215507534'], columns=['identifier'])

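def retrieve_text(lst, text):
    # Return the first word containing `text`, or None if there is none
    try:
        return [i for i in lst if text in i][0]
    except (IndexError, TypeError):
        # IndexError: no word matched; TypeError: missing cell (NaN is not iterable)
        return None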

df['ISBN'] = df['identifier'].str.split().apply(lambda x: retrieve_text(x, 'ISBN')) #use a custom function to filter the list
df['IDN'] = df['identifier'].str.split().apply(lambda x: retrieve_text(x, 'IDN'))
df['name'] = df['identifier'].str.split().str[1] #get by index
df['price'] = df['identifier'].str.extract(r'(\d+\.\d+)').astype('float') #use regex, no need to split the string here

Output:

                                                     identifier                    ISBN             IDN      name  price
0  ISBN:978-9941-30-551-1 Broschur :  GEL 14.90, IDN:1215507534  ISBN:978-9941-30-551-1  IDN:1215507534  Broschur   14.9
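If you would rather keep a plain-Python function like the one in the question, the lists in the result can be avoided by returning one value per key. The following is only a minimal sketch: `find_ids` and the two pattern names are hypothetical, the second sample row is made up, and it assumes each string holds at most one ISBN and one IDN written directly after the `ISBN:` / `IDN:` prefix.

```python
import re
import pandas as pd

# Assumed formats: digits and hyphens after "ISBN:", digits only after "IDN:"
ISBN_RE = re.compile(r"ISBN:([\d-]+)")
IDN_RE = re.compile(r"IDN:(\d+)")

def find_ids(item):
    """Return a single ISBN and IDN (or None) for one identifier string."""
    isbn = ISBN_RE.search(item)
    idn = IDN_RE.search(item)
    return {
        "ISBN": isbn.group(1) if isbn else None,
        "IDN": idn.group(1) if idn else None,
    }

id_lists = [
    "ISBN:978-9941-30-551-1 Broschur :  GEL 14.90, IDN:1215507534",
    "IDN:1234567890",  # made-up row without an ISBN
]

output = [find_ids(item) for item in id_lists]
result = pd.DataFrame(output)  # one scalar per cell instead of a list
print(result)
```

Because every dictionary holds plain strings (or None), the resulting dataframe has one value per cell and can be joined back onto the original frame by position.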

You don't need your function, just apply a regex with named groups to the original column containing the long string.

Let's imagine this example:

import pandas as pd

df = pd.DataFrame({'other_column': ['blah', 'blah'],
                   'identifier': ['ISBN:978-9941-30-551-1 Broschur :  GEL 14.90, IDN:1215507534',
                                  'ISBN:123-4567-89-012-3 blah IDN:1234567890 other'
                                 ],
                  })
  other_column                                                    identifier
0         blah  ISBN:978-9941-30-551-1 Broschur :  GEL 14.90, IDN:1215507534
1         blah              ISBN:123-4567-89-012-3 blah IDN:1234567890 other

If ISBN always comes before IDN, you can use pandas.Series.str.extract (note the raw string, so that \d is not treated as a string escape):

df['identifier'].str.extract(r'(?P<ISBN>ISBN:[\d-]+).*(?P<IDN>IDN:\d+)')

output:

                     ISBN             IDN
0  ISBN:978-9941-30-551-1  IDN:1215507534
1  ISBN:123-4567-89-012-3  IDN:1234567890
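One caveat with a single ordered pattern: a row has to match the whole regex, so a row that lacks one identifier, or lists them in the other order, yields NaN in both columns. A small sketch with made-up rows to illustrate this (the series `s` and its contents are assumptions, not data from the question):

```python
import pandas as pd

# Made-up sample: row 1 has no ISBN, row 2 lists the IDN before the ISBN
s = pd.Series([
    "ISBN:978-9941-30-551-1 Broschur :  GEL 14.90, IDN:1215507534",
    "IDN:1234567890 other",
    "IDN:1234567890, ISBN:123-4567-89-012-3",
])

ordered = s.str.extract(r"(?P<ISBN>ISBN:[\d-]+).*(?P<IDN>IDN:\d+)")

# Rows 1 and 2 do not match the full pattern, so both groups are NaN there
print(ordered)
```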

If there is a chance that they are not always in this order, use pandas.Series.str.extractall and collapse the matches back to one row per original row with groupby:

(df['identifier'].str.extractall(r'(?P<ISBN>ISBN:[\d-]+)|(?P<IDN>IDN:\d+)')
                 .groupby(level=0).first()
)

Finally, if you don't want the identifier names in the output, move the prefixes outside the named groups, i.e. change the regex to r'(?:ISBN:(?P<ISBN>[\d-]+))|(?:IDN:(?P<IDN>\d+))':

(df['identifier'].str.extractall(r'(?:ISBN:(?P<ISBN>[\d-]+))|(?:IDN:(?P<IDN>\d+))')
                 .groupby(level=0).first()
)

output:

                ISBN         IDN
0  978-9941-30-551-1  1215507534
1  123-4567-89-012-3  1234567890

NB. If you need a dictionary as output, you can append .to_dict('index') at the end of your command. This gives you

{0: {'ISBN': '978-9941-30-551-1', 'IDN': '1215507534'},
 1: {'ISBN': '123-4567-89-012-3', 'IDN': '1234567890'}}
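To get back to the goal stated in the question (separate ISBN and IDN columns in place of the mixed "identifier" column), the extracted frame can be joined onto the original one. This is a sketch under assumptions: the sample rows are made up, and the `reindex` step is added because `extractall` omits rows that have no match at all.

```python
import pandas as pd

df = pd.DataFrame({
    "other_column": ["blah", "blah", "blah"],
    "identifier": [
        "ISBN:978-9941-30-551-1 Broschur :  GEL 14.90, IDN:1215507534",
        "IDN:1234567890, ISBN:123-4567-89-012-3",  # reversed order
        "no ids here",                             # made-up row without any match
    ],
})

ids = (
    df["identifier"]
    .str.extractall(r"(?:ISBN:(?P<ISBN>[\d-]+))|(?:IDN:(?P<IDN>\d+))")
    .groupby(level=0).first()  # first non-null value per column and original row
    .reindex(df.index)         # restore rows without any match as all-NaN
)

# Replace the mixed column with the two dedicated ones
result = df.drop(columns="identifier").join(ids)
print(result)
```

The join aligns on the original index, so rows keep their position even when an identifier is missing.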
