
Python: Algorithm to find similar strings in a list

I can't wrap my head around this, and I hope you can help me. I have a financial report like this:

CONSOLIDATED BALANCE SHEETS - USD ($) $ in Millions Sep. 28, 2019 Sep. 29, 2018
0                                     Current assets:            NaN           NaN
1                           Cash and cash equivalents          48844         25913
2                               Marketable securities          51713         40388
3                            Accounts receivable, net          22926         23186
4                                         Inventories           4106          3956
5                        Vendor non-trade receivables          22878         25809
6                                Other current assets          12352         12087
7                                Total current assets         162819        131339
8                                 Non-current assets:            NaN           NaN
9                               Marketable securities         105341        170799
10                 Property, plant and equipment, net          37378         41304
11                           Other non-current assets          32978         22283
12                           Total non-current assets         175697        234386
13                                       Total assets         338516        365725
14                               Current liabilities:            NaN           NaN
15                                   Accounts payable          46236         55888
16                          Other current liabilities          37720         33327
17                                   Deferred revenue           5522          5966
18                                   Commercial paper           5980         11964
19                                          Term debt          10260          8784
20                          Total current liabilities         105718        115929
21                           Non-current liabilities:            NaN           NaN
22                                          Term debt          91807         93735
23                      Other non-current liabilities          50503         48914
24                      Total non-current liabilities         142310        142649
25                                  Total liabilities         248028        258578
26                      Commitments and contingencies                             
27                              Shareholders’ equity:            NaN           NaN
28  Common stock and additional paid-in capital, $...          45174         40201
29                                  Retained earnings          45898         70400
30      Accumulated other comprehensive income/(loss)           -584         -3454
31                         Total shareholders’ equity          90488        107147
32         Total liabilities and shareholders’ equity         338516        365725

This is a pandas DataFrame read from Excel. Using some algorithm, I want to get this output:

CONSOLIDATED BALANCE SHEETS - USD ($) $ in Millions Sep. 28, 2019 Sep. 29, 2018
0                          Cash and cash equivalents          48844         25913
1                               Total current assets         162819        131339
2                 Property, plant and equipment, net          37378         41304
3                           Total non-current assets         175697        234386
4                                       Total assets         338516        365725
5                                   Accounts payable          46236         55888
6                          Total current liabilities         105718        115929
                                          Total debt         108047        114483
7                      Total non-current liabilities         142310        142649
8                                  Total liabilities         248028        258578
9                         Total shareholders’ equity          90488        107147

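Basically, the problem is to search the first column of the DataFrame for a given set of key values and return every matching row. With a single DataFrame this is easy, because the key values are exactly the same as the values being searched. In practice that is not the case: I have thousands of reports in which the searched values differ slightly. For example: key value = Cash, value in df = Cash and Cash equivalents; key value = net sales, value in df = net revenue.

What have I tried so far? I have already tried the fuzzywuzzy module, but sometimes it does not work properly. Any ideas?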

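One way to handle this kind of search is to add a classification column so that the rows are easier to narrow down. If you want the total of current assets, for example, you can select the rows whose 'Class1' is the current assets section and whose 'flg' is 'total'. Using str.contains() for the loose (substring) matching is also a good idea. Note: the column names were changed when writing the code.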

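import numpy as np

# Turn literal 'NaN' strings into real missing values, then shorten the column names
df.replace('NaN', np.nan, inplace=True)
df.rename(columns={'CONSOLIDATED BALANCE SHEETS - USD ($) $ in Millions':'accounts','Sep. 28, 2019':'this_year','Sep. 29, 2018':'last_year'}, inplace=True)
df['NO'] = np.arange(len(df))
# Rows with a missing figure are section headers: use their label as the class and carry it forward
df['Class1'] = df['accounts'][df.isnull().any(axis=1)]
df['Class1'] = df['Class1'].ffill()
# Flag rows whose label starts with "Total"
df['flg'] = np.where(df['accounts'].str.contains(r'^Total'), 'total', 'items')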

df
|    | accounts                                          |   this_year |   last_year |   NO | Class1                        | flg   |
|---:|:--------------------------------------------------|------------:|------------:|-----:|:------------------------------|:------|
|  0 | Current assets:                                   |         nan |         nan |    0 | Current assets:               | items |
|  1 | Cash and cash equivalents                         |       48844 |       25913 |    1 | Current assets:               | items |
|  2 | Marketable securities                             |       51713 |       40388 |    2 | Current assets:               | items |
|  3 | Accounts receivable, net                          |       22926 |       23186 |    3 | Current assets:               | items |
|  4 | Inventories                                       |        4106 |        3956 |    4 | Current assets:               | items |
|  5 | Vendor non-trade receivables                      |       22878 |       25809 |    5 | Current assets:               | items |
|  6 | Other current assets                              |       12352 |       12087 |    6 | Current assets:               | items |
|  7 | Total current assets                              |      162819 |      131339 |    7 | Current assets:               | total |
|  8 | Non-current assets:                               |         nan |         nan |    8 | Non-current assets:           | items |
|  9 | Marketable securities                             |      105341 |      170799 |    9 | Non-current assets:           | items |
| 10 | Property, plant and equipment, net                |       37378 |       41304 |   10 | Non-current assets:           | items |
| 11 | Other non-current assets                          |       32978 |       22283 |   11 | Non-current assets:           | items |
| 12 | Total non-current assets                          |      175697 |      234386 |   12 | Non-current assets:           | total |
| 13 | Total assets                                      |      338516 |      365725 |   13 | Non-current assets:           | total |
| 14 | Current liabilities:                              |         nan |         nan |   14 | Current liabilities:          | items |
| 15 | Accounts payable                                  |       46236 |       55888 |   15 | Current liabilities:          | items |
| 16 | Other current liabilities                         |       37720 |       33327 |   16 | Current liabilities:          | items |
| 17 | Deferred revenue                                  |        5522 |        5966 |   17 | Current liabilities:          | items |
| 18 | Commercial paper                                  |        5980 |       11964 |   18 | Current liabilities:          | items |
| 19 | Term debt                                         |       10260 |        8784 |   19 | Current liabilities:          | items |
| 20 | Total current liabilities                         |      105718 |      115929 |   20 | Current liabilities:          | total |
| 21 | Non-current liabilities:                          |         nan |         nan |   21 | Non-current liabilities:      | items |
| 22 | Term debt                                         |       91807 |       93735 |   22 | Non-current liabilities:      | items |
| 23 | Other non-current liabilities                     |       50503 |       48914 |   23 | Non-current liabilities:      | items |
| 24 | Total non-current liabilities                     |      142310 |      142649 |   24 | Non-current liabilities:      | total |
| 25 | Total liabilities                                 |      248028 |      258578 |   25 | Non-current liabilities:      | total |
| 26 | Commitments and contingencies                     |         nan |         nan |   26 | Commitments and contingencies | items |
| 27 | Shareholders’ equity:                             |         nan |         nan |   27 | Shareholders’ equity:         | items |
| 28 | Common stock and additional paid-in capital, $... |       45174 |       40201 |   28 | Shareholders’ equity:         | items |
| 29 | Retained earnings                                 |       45898 |       70400 |   29 | Shareholders’ equity:         | items |
| 30 | Accumulated other comprehensive income/(loss)     |        -584 |       -3454 |   30 | Shareholders’ equity:         | items |
| 31 | Total shareholders’ equity                        |       90488 |      107147 |   31 | Shareholders’ equity:         | total |
| 32 | Total liabilities and shareholders’ equity        |      338516 |      365725 |   32 | Shareholders’ equity:         | total |
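
The filter described in the answer (pick a section through 'Class1', then keep only its 'total' row) could look roughly like the sketch below. It rebuilds a small excerpt of the table by hand so that it runs on its own; the `.where(...).ffill()` form and the `^Current assets` pattern are assumptions made for this illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

# A small excerpt of the balance sheet (values copied from the table above)
df = pd.DataFrame({
    'accounts': ['Current assets:', 'Cash and cash equivalents', 'Total current assets',
                 'Non-current assets:', 'Marketable securities', 'Total non-current assets'],
    'this_year': [np.nan, 48844, 162819, np.nan, 105341, 175697],
    'last_year': [np.nan, 25913, 131339, np.nan, 170799, 234386],
})

# Section header rows have no figures; their label becomes the class of the rows below
is_header = df[['this_year', 'last_year']].isnull().any(axis=1)
df['Class1'] = df['accounts'].where(is_header).ffill()
df['flg'] = np.where(df['accounts'].str.startswith('Total'), 'total', 'items')

# Total of the "Current assets:" section: match the class, then keep the total row.
# The ^ anchor keeps "Non-current assets:" out of the match.
mask = df['Class1'].str.contains(r'^Current assets') & (df['flg'] == 'total')
total_current_assets = df[mask]
print(total_current_assets[['accounts', 'this_year', 'last_year']])
```

Filtering on the class first means the account label itself does not have to match exactly, which is the point of adding 'Class1' and 'flg'.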

For example, with str.contains():

df[df['accounts'].str.contains('Accounts payable')]

            accounts  this_year  last_year  NO                Class1    flg
15  Accounts payable    46236.0    55888.0  15  Current liabilities:  items
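
When the labels differ between reports (Cash vs. Cash and cash equivalents, net sales vs. net revenue), plain string similarity may not be enough, because "sales" and "revenue" share almost no characters. One possible approach, sketched here with the standard library only, is to keep one regular expression per key that lists the known variants. The `PATTERNS` mapping and the `find_rows` helper are hypothetical names introduced for this illustration, and the patterns are examples rather than a complete list.

```python
import re

# Hypothetical mapping: canonical key -> regex covering the label variants seen so far
PATTERNS = {
    'cash': r'^cash\b',
    'net sales': r'^net (?:sales|revenue)\b',
    'total current assets': r'^total current assets\b',
}

def find_rows(labels, key):
    """Return the labels matching the pattern registered for `key`, ignoring case."""
    pattern = re.compile(PATTERNS[key], flags=re.IGNORECASE)
    return [label for label in labels if pattern.search(label.strip())]

labels = ['Cash and cash equivalents', 'Net revenue',
          'Total current assets', 'Total non-current assets']

print(find_rows(labels, 'cash'))       # ['Cash and cash equivalents']
print(find_rows(labels, 'net sales'))  # ['Net revenue']
```

The same patterns can be passed to pandas, for instance `df['accounts'].str.contains(pattern, case=False)`. The groups are written as non-capturing `(?:...)` because `str.contains()` warns when a pattern contains capture groups.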


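Notice: the technical posts on this site are published under the CC BY-SA 4.0 license. If you repost them, please credit this site's URL or the address of the original post. For any questions, contact: yoyou2525@163.com.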
