Python: Algorithm to find similar strings in a list
I can't work out how to structure my approach to this, and I hope you can help. I have a financial report like this:
CONSOLIDATED BALANCE SHEETS - USD ($) $ in Millions Sep. 28, 2019 Sep. 29, 2018
0 Current assets: NaN NaN
1 Cash and cash equivalents 48844 25913
2 Marketable securities 51713 40388
3 Accounts receivable, net 22926 23186
4 Inventories 4106 3956
5 Vendor non-trade receivables 22878 25809
6 Other current assets 12352 12087
7 Total current assets 162819 131339
8 Non-current assets: NaN NaN
9 Marketable securities 105341 170799
10 Property, plant and equipment, net 37378 41304
11 Other non-current assets 32978 22283
12 Total non-current assets 175697 234386
13 Total assets 338516 365725
14 Current liabilities: NaN NaN
15 Accounts payable 46236 55888
16 Other current liabilities 37720 33327
17 Deferred revenue 5522 5966
18 Commercial paper 5980 11964
19 Term debt 10260 8784
20 Total current liabilities 105718 115929
21 Non-current liabilities: NaN NaN
22 Term debt 91807 93735
23 Other non-current liabilities 50503 48914
24 Total non-current liabilities 142310 142649
25 Total liabilities 248028 258578
26 Commitments and contingencies
27 Shareholders’ equity: NaN NaN
28 Common stock and additional paid-in capital, $... 45174 40201
29 Retained earnings 45898 70400
30 Accumulated other comprehensive income/(loss) -584 -3454
31 Total shareholders’ equity 90488 107147
32 Total liabilities and shareholders’ equity 338516 365725
This is a pandas DataFrame read from Excel. I want to use some algorithm to get this output:
CONSOLIDATED BALANCE SHEETS - USD ($) $ in Millions Sep. 28, 2019 Sep. 29, 2018
0 Cash and cash equivalents 48844 25913
1 Total current assets 162819 131339
2 Property, plant and equipment, net 37378 41304
3 Total non-current assets 175697 234386
4 Total assets 338516 365725
5 Accounts payable 46236 55888
6 Total current liabilities 105718 115929
Total debt 108047 114483
7 Total non-current liabilities 142310 142649
8 Total liabilities 248028 258578
9 Total shareholders’ equity 90488 107147
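Basically, the problem is to search the first column of the DataFrame with a given set of keys and return every matching row. With a single DataFrame this is easy, because the keys are exactly the same as the values being searched. In practice that is not the case: I have thousands of reports in which the values differ slightly. For example: key = Cash, value in df = Cash and Cash equivalents; key = net sales, value in df = net revenue.
What have I tried so far? I have tried the fuzzywuzzy module, but sometimes it does not match correctly. Any ideas?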
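One way to handle this kind of search is to add classification columns so the rows are easier to narrow down. If you want the total of current assets, you can filter on 'Class1' equal to 'Current assets:' and 'flg' equal to 'total'. Using str.contains() for a loose substring search is also a good idea. Note: the column names were changed when writing the code.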
import numpy as np

# df is the balance sheet read from Excel, as shown in the question
df.replace('NaN', np.nan, inplace=True)
df.rename(columns={'CONSOLIDATED BALANCE SHEETS - USD ($) $ in Millions':'accounts','Sep. 28, 2019':'this_year','Sep. 29, 2018':'last_year'}, inplace=True)
df['NO'] = np.arange(len(df))
# Section header rows have no amounts: keep their label and fill it downwards
df['Class1'] = df['accounts'].where(df.isnull().any(axis=1))
df['Class1'] = df['Class1'].ffill()
# Flag rows whose label starts with "Total"
df['flg'] = np.where(df['accounts'].str.contains(r'^Total'), 'total', 'items')
df
| | accounts | this_year | last_year | NO | Class1 | flg |
|---:|:--------------------------------------------------|------------:|------------:|-----:|:------------------------------|:------|
| 0 | Current assets: | nan | nan | 0 | Current assets: | items |
| 1 | Cash and cash equivalents | 48844 | 25913 | 1 | Current assets: | items |
| 2 | Marketable securities | 51713 | 40388 | 2 | Current assets: | items |
| 3 | Accounts receivable, net | 22926 | 23186 | 3 | Current assets: | items |
| 4 | Inventories | 4106 | 3956 | 4 | Current assets: | items |
| 5 | Vendor non-trade receivables | 22878 | 25809 | 5 | Current assets: | items |
| 6 | Other current assets | 12352 | 12087 | 6 | Current assets: | items |
| 7 | Total current assets | 162819 | 131339 | 7 | Current assets: | total |
| 8 | Non-current assets: | nan | nan | 8 | Non-current assets: | items |
| 9 | Marketable securities | 105341 | 170799 | 9 | Non-current assets: | items |
| 10 | Property, plant and equipment, net | 37378 | 41304 | 10 | Non-current assets: | items |
| 11 | Other non-current assets | 32978 | 22283 | 11 | Non-current assets: | items |
| 12 | Total non-current assets | 175697 | 234386 | 12 | Non-current assets: | total |
| 13 | Total assets | 338516 | 365725 | 13 | Non-current assets: | total |
| 14 | Current liabilities: | nan | nan | 14 | Current liabilities: | items |
| 15 | Accounts payable | 46236 | 55888 | 15 | Current liabilities: | items |
| 16 | Other current liabilities | 37720 | 33327 | 16 | Current liabilities: | items |
| 17 | Deferred revenue | 5522 | 5966 | 17 | Current liabilities: | items |
| 18 | Commercial paper | 5980 | 11964 | 18 | Current liabilities: | items |
| 19 | Term debt | 10260 | 8784 | 19 | Current liabilities: | items |
| 20 | Total current liabilities | 105718 | 115929 | 20 | Current liabilities: | total |
| 21 | Non-current liabilities: | nan | nan | 21 | Non-current liabilities: | items |
| 22 | Term debt | 91807 | 93735 | 22 | Non-current liabilities: | items |
| 23 | Other non-current liabilities | 50503 | 48914 | 23 | Non-current liabilities: | items |
| 24 | Total non-current liabilities | 142310 | 142649 | 24 | Non-current liabilities: | total |
| 25 | Total liabilities | 248028 | 258578 | 25 | Non-current liabilities: | total |
| 26 | Commitments and contingencies | nan | nan | 26 | Commitments and contingencies | items |
| 27 | Shareholders’ equity: | nan | nan | 27 | Shareholders’ equity: | items |
| 28 | Common stock and additional paid-in capital, $... | 45174 | 40201 | 28 | Shareholders’ equity: | items |
| 29 | Retained earnings | 45898 | 70400 | 29 | Shareholders’ equity: | items |
| 30 | Accumulated other comprehensive income/(loss) | -584 | -3454 | 30 | Shareholders’ equity: | items |
| 31 | Total shareholders’ equity | 90488 | 107147 | 31 | Shareholders’ equity: | total |
| 32 | Total liabilities and shareholders’ equity | 338516 | 365725 | 32 | Shareholders’ equity: | total |
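With the classification columns in place, the "total of current assets" lookup described earlier could be sketched as follows. This is a minimal illustration, not the answer's own code: it rebuilds only a handful of rows from the table above, and the variable name `current_total` is made up for the example.

```python
import numpy as np
import pandas as pd

# A few rows copied from the balance sheet above, enough to show the filter.
df = pd.DataFrame({
    'accounts': ['Current assets:', 'Cash and cash equivalents',
                 'Total current assets', 'Non-current assets:',
                 'Total non-current assets'],
    'this_year': [np.nan, 48844, 162819, np.nan, 175697],
    'last_year': [np.nan, 25913, 131339, np.nan, 234386],
})

# Header rows have no amounts: keep their label and fill it downwards.
is_header = df[['this_year', 'last_year']].isna().any(axis=1)
df['Class1'] = df['accounts'].where(is_header).ffill()
# Flag rows whose label starts with "Total".
df['flg'] = np.where(df['accounts'].str.contains(r'^Total'), 'total', 'items')

# Total of current assets: the section is "Current assets:" and the row is a total.
current_total = df[(df['Class1'] == 'Current assets:') & (df['flg'] == 'total')]
print(current_total[['accounts', 'this_year', 'last_year']])
```

Filtering on the section first means that labels which repeat across sections, such as "Marketable securities" or "Term debt", can be told apart.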
For example, with str.contains():
df[df['accounts'].str.contains('Accounts payable')]
accounts this_year last_year NO Class1 flg
15 Accounts payable 46236.0 55888.0 15 Current liabilities: items
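For keys that differ from the report labels (Cash vs. Cash and cash equivalents, net sales vs. net revenue), one possible extension is to pass str.contains() a case-insensitive pattern built from a list of known variants. The sketch below assumes such a list is maintained by hand: `SYNONYMS` and `find_rows` are hypothetical names introduced here, and the "Net revenue" row is an illustrative label that does not appear in the balance sheet above.

```python
import re
import pandas as pd

# Hypothetical synonym table: each search key maps to label variants seen in reports.
SYNONYMS = {
    'cash': ['cash'],
    'net sales': ['net sales', 'net revenue'],
}

def find_rows(df, key, column='accounts'):
    """Return rows whose label contains any known variant of `key`, ignoring case."""
    variants = SYNONYMS.get(key.lower(), [key])
    # Escape each variant so punctuation in labels is matched literally.
    pattern = '|'.join(re.escape(v) for v in variants)
    return df[df[column].str.contains(pattern, case=False, regex=True, na=False)]

# Labels only; 'Net revenue' is an illustrative row, not taken from the report.
df = pd.DataFrame({
    'accounts': ['Cash and cash equivalents', 'Net revenue', 'Accounts payable'],
})

cash_rows = find_rows(df, 'Cash')
sales_rows = find_rows(df, 'net sales')
print(cash_rows)
print(sales_rows)
```

A substring match like this only covers variants that have been listed, so it would likely be combined with the 'Class1' and 'flg' filters, or with a fuzzy matcher for labels nobody anticipated.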