Retrieve a specific substring from each element of a list

Question

I have been stuck on this for a few hours: I have a Series called size_col with 887 elements, and I want to retrieve the sizes from it: S, M, L, XL. I have tried two different approaches, a list comprehension and a simple if/else loop, but neither works.

sizes = ['S', 'M', 'L', 'XL']

tshirt_sizes = []
[tshirt_sizes.append(i) for i in size_col if i in sizes]

Second attempt:

sizes = []
for i in size_col:
    if len(i) < 15:
        sizes.append(i.split(" / ", 1)[-1])
    else:
        sizes.append(i.split(" - ", 1)[-1])

I created two conditions because in some cases the size follows the ' - ' and in others there is a '/'. I honestly don't know how to deal with that.

Example of the list:

T-Shirt Donna "Si dai. Ciao." - M
T-Shirt Donna "Honey" - L
T-Shirt Donna "Si dai. Ciao." - M
T-Shirt Donna "I do very bad things" - M
T-Shirt Donna "Si dai. Ciao." - M
T-Shirt Donna "Stai nel tuo (mind your business)" - White / S
T-Shirt Donna "Stay Stronz" - White / L
T-Shirt Donna "Stay Stronz" - White / M
T-Shirt Donna "Si dai. Ciao." - S
T-Shirt Donna "Je suis esaurit" - Black / S
T-Shirt Donna "Si dai. Ciao." - S
T-Shirt Donna "Teamo - Tequila" - S / T-Shirt

Answer 1

You'll need regular expressions here. Precompile a regex pattern and then use pattern.search inside a list comprehension.

import re

sizes = ['S', 'M', 'L', 'XL']
# \b word boundaries keep 'S' from matching inside words such as "Shirt"
p = re.compile(r'\b({})\b'.format('|'.join(sizes)))

tshirt_sizes = [p.search(i).group(0) for i in size_col]

print(tshirt_sizes)
['M', 'L', 'M', 'M', 'M', 'S', 'L', 'M', 'S', 'S', 'S', 'S']

For added safety, you may want a loop instead, since list comprehensions do not lend themselves to error handling:

tshirt_sizes = []
for i in size_col:
    try:
        tshirt_sizes.append(p.search(i).group(0))
    except AttributeError:
        tshirt_sizes.append(None)
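
The same idea can also be written without try/except by checking the match object directly. The following is only a minimal sketch: the helper name extract_size is made up for illustration, and the last two sample rows are hypothetical (they do not appear in the question).

```python
import re

SIZES = ['S', 'M', 'L', 'XL']
SIZE_PATTERN = re.compile(r'\b({})\b'.format('|'.join(SIZES)))

def extract_size(text):
    """Return the first standalone size token in text, or None if there is none."""
    match = SIZE_PATTERN.search(text)  # search() returns None when nothing matches
    return match.group(0) if match else None

sample = [
    'T-Shirt Donna "Si dai. Ciao." - M',
    'T-Shirt Donna "Stay Stronz" - White / L',
    'T-Shirt Donna "Honey" - XL',  # hypothetical row with a two-letter size
    'Gift card',                   # hypothetical row with no size at all
]

tshirt_sizes = [extract_size(s) for s in sample]
print(tshirt_sizes)
```

Because the helper never raises on a missing match, it can be used inside a list comprehension again.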

Really, the only reason to use regex here is to handle the last row in your data appropriately. In general, you should prefer string operations (namely, str.split) wherever they are sufficient; they are much faster and more readable than regular-expression-based pattern matching and extraction.
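
For comparison, here is a rough sketch of what a pure string-operation version might look like. It assumes that the variant part always follows the last ' - ' in the string and that its pieces are separated by ' / '; both are assumptions read off the sample rows, and the function name size_from_split is made up.

```python
SIZES = {'S', 'M', 'L', 'XL'}

def size_from_split(text):
    """Return the size found after the last ' - ', or None if there is none."""
    # Everything after the last ' - ' is the variant, e.g. 'White / S' or 'S / T-Shirt'
    variant = text.rsplit(' - ', 1)[-1]
    for part in variant.split(' / '):
        part = part.strip()
        if part in SIZES:
            return part
    return None

rows = [
    'T-Shirt Donna "Honey" - L',
    'T-Shirt Donna "Je suis esaurit" - Black / S',
    'T-Shirt Donna "Teamo - Tequila" - S / T-Shirt',
]

split_sizes = [size_from_split(r) for r in rows]
print(split_sizes)
```

Using rsplit rather than split keeps a ' - ' inside the quoted title (as in "Teamo - Tequila") from being mistaken for the separator.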

Answer 2

You can do something like this:

available_sizes = ["S", "M", "L", "XL"]
sizes = []

for i in size_col:
    for w in i.split():
        if w in available_sizes:
            sizes.append(w)

This wouldn't work if the text contains words from available_sizes more than once, for example T-Shirt Donna "La S è la più bella consonante" - M, since it would add both S and M to the list.
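
One possible way around that limitation is to keep only the last matching word of each item, so every row yields exactly one value. This is just a sketch under the assumption that the real size always comes after the quoted title; the helper name last_size_word is made up.

```python
available_sizes = {"S", "M", "L", "XL"}

def last_size_word(text):
    """Return the last whitespace-separated word that is a known size, or None."""
    matches = [w for w in text.split() if w in available_sizes]
    return matches[-1] if matches else None

rows = [
    'T-Shirt Donna "La S è la più bella consonante" - M',
    'T-Shirt Donna "Teamo - Tequila" - S / T-Shirt',
]

sizes = [last_size_word(r) for r in rows]
print(sizes)
```

Returning None for rows without a size keeps the result the same length as the input, which matters if the values are assigned back to the Series.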

Original answer, before the OP specified that the size is not always the last word:

Almost. Just split the string into words and take the last one.

sizes = []
for i in size_col:
    sizes.append(i.split()[-1])

Answer 3

There are two aspects to this question: 1) the best method of looping over the elements and 2) the correct way to split the string.

In the general case, list comprehensions are probably the right approach for this type of problem, but you have correctly identified that splitting the string is the tricky part.

For this type of problem regular expressions are very powerful, and (at the risk of complicating this compared to the previous answers) you could use something like:

import re
pattern = re.compile(r'[-/] ([A-Z]+)$') # select one or more uppercase letters after either - or / and a space, and before the end of the line (marked by $)

sizes = [pattern.search(item).group(1) for item in size_col] # group 1 selects the set of characters in the first set of parentheses (the letters)

Edited: just saw the edit to the post stating that the size is not always at the end, and COLDSPEED's answer duplicates this one...
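
To make the effect of the $ anchor concrete, the short sketch below runs an end-anchored pattern of this kind over three of the sample rows from the question. The None handling is an addition for illustration, so that a non-matching row does not raise AttributeError.

```python
import re

# One or more uppercase letters after '- ' or '/ ', anchored to the end of the string
pattern = re.compile(r'[-/] ([A-Z]+)$')

rows = [
    'T-Shirt Donna "Honey" - L',
    'T-Shirt Donna "Stay Stronz" - White / M',
    'T-Shirt Donna "Teamo - Tequila" - S / T-Shirt',  # size is not at the end
]

end_anchored = []
for row in rows:
    m = pattern.search(row)
    end_anchored.append(m.group(1) if m else None)

print(end_anchored)  # ['L', 'M', None]
```

The last row produces None because the string ends in T-Shirt rather than in a size, which is exactly the case the word-boundary pattern in Answer 1 handles.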

Answer 1: score 3, accepted, 2018-04-12 09:16:51

Answer 2: score 0, 2018-04-12 09:13:00

Answer 3: score 0, 2018-04-12 09:25:26
