
Match tickers using a regular expression

I need to extract tickers (abbreviated stock symbols) from tweets. A ticker starts with $ (dollar sign) and consists of uppercase letters and sometimes "-". Here is an example:

str = "VG Acquisition Has The Potential To Fly High $SPCE $STPK $VG-AC price is $0.88"

I tried many regexes, but none of them returned what I need:

\b\$.*\b
[$].*\s     
[$].*\b
[$].*\s$

I need to match:

$SPCE 
$STPK 
$VG-AC

I would suggest something like this: re.findall(r'\$[A-Z-]+', text)

\$ = start with $

[A-Z-]+ = match uppercase letters, with a dash as a possibility. The dash goes last in the character class so that it is read as a literal "-" rather than as part of a range. The + at the end allows one or more repetitions.

This regex works even with a pattern like ABS-DE-CE.
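As a quick check of the character-class approach, here is a minimal sketch. It assumes the intended class is `[A-Z-]` (uppercase letters plus a literal dash); the names `TICKER_CLASS`, `tickers`, `multi`, and `trailing` are made up for illustration.

```python
import re

# Assumed pattern: "$" followed by one or more uppercase letters or dashes.
TICKER_CLASS = re.compile(r"\$[A-Z-]+")

text = "VG Acquisition Has The Potential To Fly High $SPCE $STPK $VG-AC price is $0.88"

# "$0.88" is skipped because "0" is not in the character class.
tickers = TICKER_CLASS.findall(text)
print(tickers)

# Several dash-separated parts are matched as one ticker.
multi = TICKER_CLASS.findall("watching $ABS-DE-CE today")
print(multi)

# Caveat: the class does not care where the dash sits, so a trailing
# dash is kept as part of the match.
trailing = TICKER_CLASS.findall("$VG- pending")
print(trailing)
```

The last case suggests that a stricter pattern may be preferable if dangling dashes can occur in the input.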

Use

re.findall(r'\$(?!\d+\.\d)\S+', text)

See proof.

Explanation

--------------------------------------------------------------------------------
  \$                       '$'
--------------------------------------------------------------------------------
  (?!                      look ahead to see if there is not:
--------------------------------------------------------------------------------
    \d+                      digits (0-9) (1 or more times (matching
                             the most amount possible))
--------------------------------------------------------------------------------
    \.                       '.'
--------------------------------------------------------------------------------
    \d                       digits (0-9)
--------------------------------------------------------------------------------
  )                        end of look-ahead
--------------------------------------------------------------------------------
  \S+                      non-whitespace (all but \n, \r, \t, \f,
                           and " ") (1 or more times (matching the
                           most amount possible))
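The breakdown above can be tried on the sample tweet with a short sketch. The variable names and the second input string are assumptions added for illustration. The second call shows two behaviours worth knowing about: \S+ keeps trailing punctuation, and the look-ahead only rejects amounts that contain a decimal point.

```python
import re

# "$", not followed by a decimal number, then any run of non-whitespace.
LOOKAHEAD = re.compile(r"\$(?!\d+\.\d)\S+")

sample = "VG Acquisition Has The Potential To Fly High $SPCE $STPK $VG-AC price is $0.88"
found = LOOKAHEAD.findall(sample)
print(found)

# Hypothetical extra input: the comma after "$SPCE" is part of the match,
# and the whole-dollar amount "$12" is not filtered out.
edge = LOOKAHEAD.findall("Bought $SPCE, sold at $12")
print(edge)
```

If the tweets contain punctuation directly after tickers or prices without decimals, the results may need post-processing or a narrower pattern.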

pytickersymbols, if it does what it says on the tin, should serve your purpose well. From the tests:

import yfinance as yf
y_ticker = yf.Ticker('GOOG')
data = y_ticker.history(period='4d')

You can match 1 or more uppercase chars A-Z.

Then optionally repeat matching - and 1 or more uppercase chars A-Z.

\$[A-Z]+(?:-[A-Z]+)*\b

Explanation

  • \$[A-Z]+ Match $ and 1 or more uppercase chars A-Z
  • (?: Non-capturing group
    • -[A-Z]+ Match - and 1 or more uppercase chars A-Z
  • )* Close the group and repeat it 0 or more times
  • \b A word boundary

Regex demo | Python demo

For example

import re
 
regex = r"\$[A-Z]+(?:-[A-Z]+)*\b"
s = "VG Acquisition Has The Potential To Fly High $SPCE $STPK $VG-AC price is $0.88"
print(re.findall(regex, s))

Output

['$SPCE', '$STPK', '$VG-AC']
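To see how the pattern `\$[A-Z]+(?:-[A-Z]+)*\b` behaves outside the sample tweet, here is a small sketch wrapping it in a helper. The function name `extract_tickers`, the de-duplication step, and the extra input strings are assumptions for illustration, not part of the answer.

```python
import re

TICKER = re.compile(r"\$[A-Z]+(?:-[A-Z]+)*\b")

def extract_tickers(text):
    """Return tickers in order of first appearance, without duplicates."""
    # dict.fromkeys keeps insertion order and drops repeated matches.
    return list(dict.fromkeys(TICKER.findall(text)))

# Trailing punctuation is left out of the match.
punct = extract_tickers("Long $VG-AC, short $SPCE.")

# Prices start with a digit, so they never match.
prices = extract_tickers("price is $0.88 or $5")

# The word boundary rejects mixed-case words such as "$Tesla".
mixed = extract_tickers("$Tesla is not written as a ticker here")

# A dangling dash is dropped, because the group needs letters after "-".
dangling = extract_tickers("$VG- pending")

# Repeated mentions are reported once.
repeated = extract_tickers("$SPCE up, $SPCE again, $STPK flat")

print(punct, prices, mixed, dangling, repeated)
```

Whether duplicates should be collapsed depends on the use case; counting mentions would call for `TICKER.findall` directly.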

