Match tickers using regular expression
I need to extract tickers (abbreviated stock symbols) from tweets. Tickers start with $ (dollar sign) and consist of uppercase letters and sometimes "-". Here is an example:
str = "VG Acquisition Has The Potential To Fly High $SPCE $STPK $VG-AC price is $0.88"
I tried many regexes, but none of them returned what I need:
\b\$.*\b
[$].*\s
[$].*\b
[$].*\s$
I need to match:
$SPCE
$STPK
$VG-AC
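To see why the four attempts above miss, here is a small check against the sample tweet (a sketch; the variable names are illustrative). The `.*` in each pattern is greedy, so it runs to the end of the line and only backs off as far as the trailing `\s` or `\b` forces it. The `\b` before `\$` can never match here because `$` and the preceding space are both non-word characters.

```python
import re

text = "VG Acquisition Has The Potential To Fly High $SPCE $STPK $VG-AC price is $0.88"

attempts = [r"\b\$.*\b", r"[$].*\s", r"[$].*\b", r"[$].*\s$"]

# Collect what each attempted pattern actually returns.
results = {pattern: re.findall(pattern, text) for pattern in attempts}

for pattern, found in results.items():
    print(f"{pattern!r:14} -> {found}")
```

None of the patterns limits which characters may follow the `$`, which is why each one returns either nothing or a single long match.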
I would suggest something like this: re.findall(r'\$[A-Z-?]+', text)
\$ = start with $
[A-Z-?]+ = match uppercase letters, with a dash as a possibility. The + at the end allows repetition. Note that inside a character class the ? is a literal question mark rather than a quantifier, so [A-Z-]+ is enough.
This regex works even with this pattern: ABS-DE-CE
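A quick check of this character-class approach on the sample tweet. This is a sketch that uses `[A-Z-]` (uppercase letters plus a literal hyphen) and leaves out the `?`, since a `?` inside a class would also match a literal question mark.

```python
import re

text = "VG Acquisition Has The Potential To Fly High $SPCE $STPK $VG-AC price is $0.88"

# Hyphen is placed last in the class so it is read as a literal, not a range.
pattern = re.compile(r"\$[A-Z-]+")

tickers = pattern.findall(text)
print(tickers)  # ['$SPCE', '$STPK', '$VG-AC']

multi = pattern.findall("watching $ABS-DE-CE today")
print(multi)  # ['$ABS-DE-CE']
```

One caveat: because the hyphen may appear anywhere in the run, this pattern would also accept strings such as `$VG-` or `$-`.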
Use
re.findall(r'\$(?!\d+\.\d)\S+', text)
Explanation
--------------------------------------------------------------------------------
\$ '$'
--------------------------------------------------------------------------------
(?! look ahead to see if there is not:
--------------------------------------------------------------------------------
\d+ digits (0-9) (1 or more times (matching
the most amount possible))
--------------------------------------------------------------------------------
\. '.'
--------------------------------------------------------------------------------
\d digits (0-9)
--------------------------------------------------------------------------------
) end of look-ahead
--------------------------------------------------------------------------------
\S+ non-whitespace (all but \n, \r, \t, \f,
and " ") (1 or more times (matching the
most amount possible))
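The negative look-ahead pattern explained above can be tried like this (a minimal sketch; the second input string is made up for illustration). It excludes only dollar amounts with a decimal part, so anything else after `$` is accepted up to the next whitespace.

```python
import re

text = "VG Acquisition Has The Potential To Fly High $SPCE $STPK $VG-AC price is $0.88"

# '$' not followed by digits-dot-digit, then every non-whitespace character.
pattern = re.compile(r"\$(?!\d+\.\d)\S+")

tickers = pattern.findall(text)
print(tickers)  # ['$SPCE', '$STPK', '$VG-AC']

# Hypothetical edge cases: trailing punctuation and whole-dollar prices are kept.
edge = pattern.findall("$SPCE, up $5")
print(edge)  # ['$SPCE,', '$5']
```

If tweets may contain prices without decimals or tickers followed by punctuation, a stricter pattern that lists the allowed characters is likely the safer choice.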
pytickersymbols, if it does what it says on the tin, should serve your purpose well. From the tests:
import yfinance as yf
y_ticker = yf.Ticker('GOOG')
data = y_ticker.history(period='4d')
You can match 1 or more uppercase chars A-Z, then optionally repeat matching - and 1 or more uppercase chars A-Z.
\$[A-Z]+(?:-[A-Z]+)*\b
Explanation
\$[A-Z]+ Match $ and 1 or more uppercase chars A-Z
(?: Non capture group
-[A-Z]+ Match - and 1 or more uppercase chars A-Z
)* Close group and repeat 0+ times
\b A word boundary
For example
import re
regex = r"\$[A-Z]+(?:-[A-Z]+)*\b"
s = "VG Acquisition Has The Potential To Fly High $SPCE $STPK $VG-AC price is $0.88"
print(re.findall(regex, s))
Output
['$SPCE', '$STPK', '$VG-AC']
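If the symbols are needed without the leading `$` and without repeats, the same pattern can be wrapped in a small helper. This is only a sketch: the function name `extract_tickers` and the de-duplication step are assumptions for illustration, not part of the answer above.

```python
import re

# Same pattern as above, with a capturing group around the symbol itself.
TICKER_RE = re.compile(r"\$([A-Z]+(?:-[A-Z]+)*)\b")

def extract_tickers(tweet):
    """Return the unique ticker symbols in a tweet, in order of first appearance."""
    # findall returns only the captured group, so the '$' is dropped.
    symbols = TICKER_RE.findall(tweet)
    # dict.fromkeys removes duplicates while preserving order.
    return list(dict.fromkeys(symbols))

s = "VG Acquisition Has The Potential To Fly High $SPCE $STPK $VG-AC price is $0.88"
print(extract_tickers(s))  # ['SPCE', 'STPK', 'VG-AC']
```

Lowercase strings such as `$abc` and amounts such as `$12` are skipped because the first character after `$` must be an uppercase letter.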