How to filter certain words in a queryset

Question

I have a variable that contains stock symbols. I need to split out each symbol so that I can process it independently.

print(Symbols_Splitted)

# prints:

["'['AAPL", 'TSLA', "MSFT']'"]

I need something to filter out the relevant words. The pattern is always the same.

I tried the following, which works, but I found an issue. Some symbols contain special characters, such as "EURUSD=X", and this code removes the "=", which makes the symbol invalid.

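            import re

            def convertor(s):
                # Remove every character that is not an ASCII letter
                perfect = re.sub('[^a-zA-Z]+', '', s)
                return perfect

            # Named "cleaned" rather than "all", which would shadow the built-in
            cleaned = list(map(convertor, Symbols_Splitted))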
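The issue can be reproduced with a short, self-contained sketch. The input strings are taken from the question; the printed values follow from the character class, which keeps ASCII letters only.

```python
import re


def convertor(s):
    # Same substitution as in the question: drop everything
    # that is not an ASCII letter.
    return re.sub('[^a-zA-Z]+', '', s)


# The wrapper characters (quotes and brackets) are removed as intended...
print(convertor("'['AAPL"))   # AAPL
print(convertor("MSFT']'"))   # MSFT

# ...but the "=" inside a symbol is removed as well.
print(convertor("EURUSD=X"))  # EURUSDX
```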

So, taking the first example, I need something like this:

Some_function(Symbols_Splitted)

Symbols_Splitted[0]
> AAPL
Symbols_Splitted[1]
> TSLA
Symbols_Splitted[2]
> MSFT

SOLUTION: I added = and - inside the brackets, so my function is now:

            def convertor(s):
                # Keep ASCII letters, "=" and "-"; remove everything else
                perfect = re.sub('[^a-zA-Z=-]+', '', s)
                return perfect

            cleaned = list(map(convertor, Symbols_Splitted))
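A quick check of the updated character class, as a sketch rather than a definitive test. The class is written here with a single hyphen at the end, which should match the same characters. It keeps "=" and "-", but digits and dots are still removed, so a symbol such as "BRK.A" would lose its dot. The helper `strip_wrappers` is a hypothetical alternative, not part of the question: it assumes the unwanted characters are only quotes and brackets at either end of the string, and trims just those.

```python
import re


def convertor(s):
    # Keep ASCII letters, "=" and "-"; remove everything else.
    return re.sub(r'[^a-zA-Z=-]+', '', s)


def strip_wrappers(s):
    # Hypothetical alternative: trim only quotes and square brackets
    # from both ends, leaving the inside of the symbol untouched.
    return s.strip("'[]")


print(convertor("EURUSD=X"))       # EURUSD=X
print(convertor("MSFT']'"))        # MSFT
print(convertor("BRK.A"))          # BRKA (the dot is lost)

print(strip_wrappers("'['AAPL"))   # AAPL
print(strip_wrappers("MSFT']'"))   # MSFT
print(strip_wrappers("BRK.A"))     # BRK.A
```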

Answer 1

I don't think substitution is the best route here. I would instead try to define the pattern you are interested in: the ticker symbol.

I am not entirely sure which characters are valid in a ticker symbol or what rules apply to them. Judging from what I have read so far, the following seems to hold:

At least 2 characters long
Must start and end with a Latin letter or digit
Can contain letters, digits, dots and equals signs

With those rules, we can construct the following simple pattern:

\w[\w=.]*\w
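To see how the pattern lines up with the three rules, here is a small sketch that tests whole strings with `fullmatch`. The helper name `is_ticker` is made up for illustration. One caveat worth knowing: in Python, `\w` also matches the underscore (and, for `str` patterns, non-ASCII letters), so the pattern is slightly more permissive than "Latin letter or digit".

```python
import re

TICKER = re.compile(r"\w[\w=.]*\w")


def is_ticker(s):
    # True only if the whole string matches the pattern.
    return TICKER.fullmatch(s) is not None


print(is_ticker("AAPL"))      # True
print(is_ticker("EURUSD=X"))  # True
print(is_ticker("BRK.A"))     # True
print(is_ticker("A"))         # False: shorter than 2 characters
print(is_ticker(".AB"))       # False: starts with a dot
print(is_ticker("AB="))       # False: ends with an equals sign
print(is_ticker("A_B"))       # True: "_" counts as \w
```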

The Python code could look like this:

import re


PATTERN_TICKER_SYMBOL = re.compile(r"\w[\w=.]*\w")


def extract_symbol(string: str) -> str:
    m = PATTERN_TICKER_SYMBOL.search(string)
    if m is None:
        raise ValueError(f"Cannot find ticker symbol in {string}")
    return m.group()


test_data = [
    "'['AAPL",
    "TSLA",
    "MSFT']'",
    "''''...BRK.A",
    "[][]EURUSD=X-...",
]
cleaned_data = [extract_symbol(s) for s in test_data]
print(cleaned_data)

Output:

['AAPL', 'TSLA', 'MSFT', 'BRK.A', 'EURUSD=X']

If there are additional requirements, the pattern can of course be extended.
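As one possible extension, sketched under assumptions: the question's own fix also keeps "-", so this variant allows a hyphen inside the symbol, and it swaps `\w` for an explicit ASCII class so that underscores and non-Latin letters are not accepted. The names `PATTERN_STRICT` and `extract_symbol_strict` are hypothetical, and "BRK-B" is only an illustrative input.

```python
import re

# Start and end with an ASCII letter or digit; in between, also allow
# "=", "." and "-".
PATTERN_STRICT = re.compile(r"[A-Za-z0-9][A-Za-z0-9=.\-]*[A-Za-z0-9]")


def extract_symbol_strict(string: str) -> str:
    m = PATTERN_STRICT.search(string)
    if m is None:
        raise ValueError(f"Cannot find ticker symbol in {string}")
    return m.group()


print(extract_symbol_strict("'['BRK-B"))          # BRK-B
print(extract_symbol_strict("[][]EURUSD=X-..."))  # EURUSD=X
print(extract_symbol_strict("__AAPL__"))          # AAPL
```

Whether hyphens, underscores or other characters should be accepted depends on the data source, so the character classes here are a starting point rather than a fixed rule.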

Solution 1
1 2023-02-01 00:17:46
