Python: extracting numeric values from a string with a regular expression and a list comprehension

Question

I want to extract this:

3.76    2.35    3.30    5.08     NaN    8.44    10.00
3.76    2.35    3.30    4.99    6.63    8.42    10.00
1.50    1.50    1.60    2.00    2.60    3.35    3.85
NaN      NaN    NaN     NaN     NaN     0.00    0.00

from the following result of a bs4 operation:

[<td class="font-bold">Ergebnis je Aktie (unverwässert, nach Steuern)</td>, <td>3,76</td>, 
<td>2,35</td>, <td>3,30</td>, <td>5,08</td>, <td>-</td>, <td>8,44</td>, <td>10,00</td>, <td class="font-
bold">Ergebnis je Aktie (verwässert, nach Steuern)</td>, <td>3,76</td>, <td>2,35</td>, <td>3,30</td>,
 <td>4,99</td>, <td>6,63</td>, <td>8,42</td>, <td>10,00</td>, <td class="font-bold">Dividende je 
Aktie</td>, <td>1,50</td>, <td>1,50</td>, <td>1,60</td>, <td>2,00</td>, <td>2,60</td>, <td>3,35</td>,
 <td>3,85</td>, <td class="font-bold">Gesamtdividendenausschüttung in Mio.</td>, <td>-</td>, <td>-</td>,
 <td>-</td>, <td>-</td>, <td>-</td>, <td>0,00</td>, <td>0,00</td>]

I tried something like this:

import re
import numpy as np

def get_table_entries(element, len_colums):
    # pattern for numbers such as 3.76 or -10.00 (after the replacements below)
    _re_digits = re.compile(r"-?\d+\.?\d+")

    # find all table entries and turn the result into one string
    temp = element.findAll("td")
    temp = str(temp)

    # remove periods, then turn decimal commas into periods
    temp = temp.replace('.', '')
    temp = temp.replace(',', '.')
    print(temp)

    # extract digits from string
    entries = [n for n in _re_digits.findall(temp)]
    print(entries)

    # reshape output array to fit original table shape and return entries
    entries = np.reshape(entries, (-1, len_colums))
    return entries

But this solution also drops the minus in <td>-</td>, which I want to turn into NaN. And when I keep the minus and replace it via temp = temp.replace('-', 'NaN'), I still get an error in the list comprehension that follows.

Answer 1

Perhaps the simplest approach is to define a helper function:

def to_float(s): 
    if s == "-": 
        return float("nan") 
    else: 
        return float(s.replace(",", "."))

And then just write a basic loop over the cells:

values = []
for elem in soup.find_all("td"): 
    try: 
        values.append(to_float(elem.text)) 
    except ValueError: 
        pass

Now it is easy to convert to a NumPy array of the desired shape:

>>> np.array(values).reshape(-1, 7)
array([[ 3.76,  2.35,  3.3 ,  5.08,   nan,  8.44, 10.  ],
       [ 3.76,  2.35,  3.3 ,  4.99,  6.63,  8.42, 10.  ],
       [ 1.5 ,  1.5 ,  1.6 ,  2.  ,  2.6 ,  3.35,  3.85],
       [  nan,   nan,   nan,   nan,   nan,  0.  ,  0.  ]])
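
If you would rather stay close to the regex and list comprehension from the question, the same idea can be sketched with the standard library only. This is a minimal sketch, not a tested replacement for the loop above: the names CELL_RE and extract_rows are made up for illustration, the sample string is a shortened version of the bs4 output, and dropping periods as thousands separators is an assumption carried over from the question's code. The pattern only matches attribute-free <td> cells, so the labelled "font-bold" cells are skipped.

```python
import re

# Shortened, stringified sample of the bs4 result from the question.
html = (
    '<td class="font-bold">Ergebnis je Aktie (unverwässert, nach Steuern)</td>, '
    '<td>3,76</td>, <td>2,35</td>, <td>-</td>, '
    '<td class="font-bold">Dividende je Aktie</td>, '
    '<td>1,50</td>, <td>-</td>, <td>0,00</td>'
)

# Hypothetical pattern: a plain <td> cell holding either a lone "-" or a
# number written with a decimal comma (periods allowed as thousands separators).
CELL_RE = re.compile(r"<td>\s*(-|-?\d[\d.]*(?:,\d+)?)\s*</td>")


def to_float(s):
    """Convert a cell string to float; a lone '-' becomes NaN."""
    if s == "-":
        return float("nan")
    # Assumption: periods are thousands separators, commas are decimal marks.
    return float(s.replace(".", "").replace(",", "."))


def extract_rows(text, n_columns):
    """Return the numeric cells of text as a list of rows of n_columns floats."""
    values = [to_float(cell) for cell in CELL_RE.findall(text)]
    if len(values) % n_columns:
        raise ValueError("number of cells is not a multiple of n_columns")
    return [values[i:i + n_columns] for i in range(0, len(values), n_columns)]


rows = extract_rows(html, 3)
print(rows)
```

Because the lone "-" is captured by the pattern itself, no cell is lost before the conversion, so the row length stays intact. The nested list can be passed to np.array if an array is needed.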

Answer 1
2, accepted, 2020-04-17 08:16:11
