
Extracting 2 values from list with for-loop

I have a large Excel sheet which has one column that contains several different identifiers (e.g. ISBNs). I have converted the sheet to a pandas dataframe and transformed the column with the identifiers to a list. A list entry for one row of the original column looks like this:

'ISBN:978-9941-30-551-1 Broschur :  GEL 14.90, IDN:1215507534'

However, they aren't all the same: some have ISBNs, some don't, some have more entries, some fewer (5 in the example above), and the different IDs are mostly, but not always, separated by a comma.

In the next step, I built a function that runs through the list items (each one long string like the one above) and splits each of them into its separate words, so I get something like:

'ISBN:978-9941-30-551-1', 'Broschur :', 'GEL', '14.90', 'IDN:1215507534'

I am looking to extract the values for ISBN and IDN, where present, and then add a dedicated column for ISBN and one for IDN to my original dataframe (instead of the "identifier" column that contains the mixed data).

I now have the following code, which kind of does what it's supposed to, except that I end up with lists in my dictionary and therefore a list for each entry in the resulting dataframe. I am sure there must be a better way of doing this, but I cannot seem to think of it...

def find_stuff(item): 
        
    list_of_words = item.split()
    ISBN = list()
    IDN = list()
    
    for word in list_of_words:

        if 'ISBN' in word: 
            var = word
            var = var.replace("ISBN:", "")
            ISBN.append(var)
             
        if 'IDN' in word: 
            var2 = word
            var2 = var2.replace("IDN:", "")
            IDN.append(var2)

    
    sum_dict = {"ISBN":ISBN, "IDN":IDN}
    
    return sum_dict



output = [find_stuff(item) for item in id_lists]
print(output)

Any help very much appreciated :)

Since you are working in pandas, I suggest using pandas' string methods to extract the relevant information and assign it to new columns directly. In the answer below I demonstrate some possibilities:

import pandas as pd

df = pd.DataFrame(['ISBN:978-9941-30-551-1 Broschur :  GEL 14.90, IDN:1215507534'], columns=['identifier'])

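def retrieve_text(lst, text):
    # return the first token that contains `text`, or None if there is none
    try:
        return [i for i in lst if text in i][0]
    except (IndexError, TypeError):  # no matching token, or the cell is not a list (e.g. NaN)
        return None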

df['ISBN'] = df['identifier'].str.split().apply(lambda x: retrieve_text(x, 'ISBN')) #use a custom function to filter the list
df['IDN'] = df['identifier'].str.split().apply(lambda x: retrieve_text(x, 'IDN'))
df['name'] = df['identifier'].str.split().str[1] #get by index
df['price'] = df['identifier'].str.extract(r'(\d+\.\d+)').astype('float') #use regex, no need to split the string here

Output:

                                                     identifier                    ISBN             IDN      name  price
0  ISBN:978-9941-30-551-1 Broschur :  GEL 14.90, IDN:1215507534  ISBN:978-9941-30-551-1  IDN:1215507534  Broschur   14.9
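
The ISBN and IDN columns above still carry their `ISBN:` and `IDN:` prefixes. As a minimal sketch (not part of the answer itself), one capture group per column with `str.extract` returns only the value and leaves `NaN` where an identifier is missing. The second row of the sample data is a made-up entry without an ISBN, added purely for illustration.

```python
import pandas as pd

# Sample data: the second row is a hypothetical entry without an ISBN.
df = pd.DataFrame(
    {
        "identifier": [
            "ISBN:978-9941-30-551-1 Broschur :  GEL 14.90, IDN:1215507534",
            "IDN:1234567890 other",
        ]
    }
)

# The capture group keeps only the part after the prefix.
# expand=False returns a Series, so it can be assigned to a column directly.
df["ISBN"] = df["identifier"].str.extract(r"ISBN:([\d-]+)", expand=False)
df["IDN"] = df["identifier"].str.extract(r"IDN:(\d+)", expand=False)

print(df[["ISBN", "IDN"]])
```

The character class `[\d-]` only covers digits and hyphens; an ISBN-10 ending in the check character `X` would need something like `[\dX-]` instead.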

You don't need your function; just apply a regex with named groups to the original column containing the long string.

Let's imagine this example:

import pandas as pd

df = pd.DataFrame({'other_column': ['blah', 'blah'],
                   'identifier': ['ISBN:978-9941-30-551-1 Broschur :  GEL 14.90, IDN:1215507534',
                                  'ISBN:123-4567-89-012-3 blah IDN:1234567890 other'
                                 ],
                  })

  other_column                                                    identifier
0         blah  ISBN:978-9941-30-551-1 Broschur :  GEL 14.90, IDN:1215507534
1         blah              ISBN:123-4567-89-012-3 blah IDN:1234567890 other

If ISBN always comes before IDN, you can use pandas.Series.str.extract:

df['identifier'].str.extract(r'(?P<ISBN>ISBN:[\d-]+).*(?P<IDN>IDN:\d+)')

output:

                     ISBN             IDN
0  ISBN:978-9941-30-551-1  IDN:1215507534
1  ISBN:123-4567-89-012-3  IDN:1234567890
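
To see why the order matters, here is a small sketch with two invented rows: one with IDN before ISBN and one with no ISBN at all. Because the single pattern requires `ISBN:...` followed later by `IDN:...`, both rows fail to match as a whole and come back as `NaN` in both columns, even though each contains an IDN.

```python
import pandas as pd

# Hypothetical rows: row 1 has IDN before ISBN, row 2 has no ISBN at all.
s = pd.Series(
    [
        "ISBN:978-9941-30-551-1 Broschur :  GEL 14.90, IDN:1215507534",
        "IDN:1234567890 blah ISBN:123-4567-89-012-3",
        "IDN:1111111111 other",
    ]
)

# Same ordered pattern as above: ISBN first, then anything, then IDN.
ordered = s.str.extract(r"(?P<ISBN>ISBN:[\d-]+).*(?P<IDN>IDN:\d+)")
print(ordered)
# Only row 0 matches; rows 1 and 2 are NaN in both columns.
```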

If there is a chance that they are not always in this order, use pandas.Series.str.extractall and rework the output with groupby:

(df['identifier'].str.extractall(r'(?P<ISBN>ISBN:[\d-]+)|(?P<IDN>IDN:\d+)')
                 .groupby(level=0).first()
)

Finally, if you don't want the identifier names in the values, change the regex slightly to r'(?:ISBN:(?P<ISBN>[\d-]+))|(?:IDN:(?P<IDN>\d+))':

(df['identifier'].str.extractall(r'(?:ISBN:(?P<ISBN>[\d-]+))|(?:IDN:(?P<IDN>\d+))')
                 .groupby(level=0).first()
)

output:

                ISBN         IDN
0  978-9941-30-551-1  1215507534
1  123-4567-89-012-3  1234567890
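
The question asks for the two new columns to replace the mixed "identifier" column. One way to do that, sketched here under the assumption that the dataframe keeps its default index, is to join the extracted frame back on the index and drop the original column. The third row is invented to show what happens when nothing matches: `extractall` omits such rows, and the join fills them with `NaN`.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "other_column": ["blah", "blah", "blah"],
        "identifier": [
            "ISBN:978-9941-30-551-1 Broschur :  GEL 14.90, IDN:1215507534",
            "IDN:1234567890 blah ISBN:123-4567-89-012-3",
            "no identifiers here",  # hypothetical row without any match
        ],
    }
)

pattern = r"(?:ISBN:(?P<ISBN>[\d-]+))|(?:IDN:(?P<IDN>\d+))"

# One row per match, then collapse to the first non-null value per original row.
ids = df["identifier"].str.extractall(pattern).groupby(level=0).first()

# join aligns on the index, so rows without any match get NaN in both columns.
result = df.drop(columns="identifier").join(ids)
print(result)
```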

NB. If you need a dictionary as output, you can append .to_dict('index') at the end of your command. This gives you:

{0: {'ISBN': '978-9941-30-551-1', 'IDN': '1215507534'},
 1: {'ISBN': '123-4567-89-012-3', 'IDN': '1234567890'}}
