I am trying to find close matches between a string of text and two columns of my data frame—'tickers' and/or 'company'.
This is a sample of the data frame:
cik     | tickers | company
--------+---------+----------------------------
1090872 | A       | Agilent Technologies Inc
4281    | AA      | Alcoa Inc
6201    | AAL     | American Airlines Group Inc
8177    | AAME    | Atlantic American Corp
706688  | AAN     | Aarons Inc
320193  | AAPL    | Apple Inc
And this is how some text might look:
text = 'consectetur elementum Apple Inc Agilent Inc. Aenean porttitor porta magna AA American Airlines AAMC Aarons Inc AAPL e plumbs ernum. AA'
I would like to find all close matches in this text, and make the output something like:
The following companies were found in 'text':
- AAPL: Apple Inc
- A: Agilent Technologies Inc
- AA: American Airlines Group Inc
- AAN: Aarons Inc
Here's the code I have so far, but it's incomplete and I recognize it needs a different approach:
import pandas as pd
import re
data = {'cik': ['1090872', '4281', '6201', '8177', '706688', '320193'], 'ticker': ['A', 'AA', 'AAL', 'AAME', 'AAN', 'AAPL'], 'company': ['Agilent Technologies Inc', 'Alcoa Inc', 'American Airlines Group Inc', 'Atlantic American Corp', 'Aarons Inc', 'Apple Inc']}
df = pd.DataFrame(data, columns=['cik', 'ticker', 'company'])
text = 'consectetur elementum Apple Inc Agilent Inc. Aenean porttitor porta magna AA American Airlines AAMC Aarons Inc AAPL e plumbs ernum. AA'
ticker = df['ticker']
regex = re.compile(r"\b(?:" + "|".join(map(re.escape, ticker)) + r")\b")
matches = re.findall(regex, text)
for match in matches:
    print(match)
Here's how I would tackle this. First, the setup, based on your code:
import pandas as pd
data = [['1090872', 'A', 'Agilent Technologies Inc'], ['4281', 'AA', 'Alcoa Inc'],
['6201', 'AAL', 'American Airlines Group Inc'], ['8177', 'AAME', 'Atlantic American Corp'],
['706688', 'AAN', 'Aarons Inc'], ['320193', 'AAPL', 'Apple Inc']]
df = pd.DataFrame(data, columns=['cik', 'tickers', 'company'])
text = "consectetur elementum Apple Inc Agilent Inc. Aenean porttitor porta magna AA American \
Airlines AAMC Aarons Inc AAPL e plumbs ernum. AA"
df['text'] = text
df['found'] = None
company_values = df['company'].values
for val in company_values:
    row = df.loc[df['company'] == val]
    # regex=False treats the company name as a literal string, not a pattern
    if row['text'].str.contains(val, regex=False).any():
        df.loc[df['company'] == val, 'found'] = 'Yes'
# filter the results
df.loc[df['found'] == 'Yes']
I think the simplest way is to make the text a column of the dataframe, check which company names actually occur in it, and record the result in the df['found'] column, which you can then filter to get the list of companies. Note that this only finds company names that appear verbatim in the text. Here I assume that the dataframe contains only unique company names with their tickers.
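As a variation, here is a minimal sketch that checks both columns without adding the text to the dataframe. The function name find_companies and the (ticker, company) pair format are my own choices for illustration, not something from your code. It only does exact matching (whole-word tickers, verbatim company names), so shortened names such as 'Agilent Inc' are not picked up; close matches would need some fuzzy comparison on top, for example with difflib.SequenceMatcher and a tuned threshold.

```python
import re

def find_companies(text, records):
    """Return (ticker, company) pairs whose ticker or full name occurs in text.

    records is an iterable of (ticker, company) pairs. With the dataframe
    above you could pass zip(df['tickers'], df['company']).
    """
    found = []
    for ticker, company in records:
        # Tickers must match as whole words, so 'A' does not match inside 'AAPL'.
        ticker_hit = re.search(r"\b" + re.escape(ticker) + r"\b", text)
        # Company names are matched as literal, case-sensitive substrings.
        name_hit = company in text
        if ticker_hit or name_hit:
            found.append((ticker, company))
    return found

# Same sample data as in the question, written as plain pairs.
records = [
    ('A', 'Agilent Technologies Inc'),
    ('AA', 'Alcoa Inc'),
    ('AAL', 'American Airlines Group Inc'),
    ('AAME', 'Atlantic American Corp'),
    ('AAN', 'Aarons Inc'),
    ('AAPL', 'Apple Inc'),
]
text = ("consectetur elementum Apple Inc Agilent Inc. Aenean porttitor porta "
        "magna AA American Airlines AAMC Aarons Inc AAPL e plumbs ernum. AA")

matches = find_companies(text, records)
print("The following companies were found in 'text':")
for ticker, company in matches:
    print(f"- {ticker}: {company}")
```

With the sample data this lists AA (Alcoa Inc, via the ticker), AAN (Aarons Inc, via the name) and AAPL (Apple Inc, via both).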