Python: extract digit values from a string via regular expressions and list comprehension
I want to extract this:
3.76 2.35 3.30 5.08 NaN 8.44 10.00
3.76 2.35 3.30 4.99 6.63 8.42 10.00
1.50 1.50 1.60 2.00 2.60 3.35 3.85
NaN NaN NaN NaN NaN 0.00 0.00
from the following return value of a bs4 operation:
[<td class="font-bold">Ergebnis je Aktie (unverwässert, nach Steuern)</td>,
<td>3,76</td>, <td>2,35</td>, <td>3,30</td>, <td>5,08</td>, <td>-</td>, <td>8,44</td>, <td>10,00</td>,
<td class="font-bold">Ergebnis je Aktie (verwässert, nach Steuern)</td>,
<td>3,76</td>, <td>2,35</td>, <td>3,30</td>, <td>4,99</td>, <td>6,63</td>, <td>8,42</td>, <td>10,00</td>,
<td class="font-bold">Dividende je Aktie</td>,
<td>1,50</td>, <td>1,50</td>, <td>1,60</td>, <td>2,00</td>, <td>2,60</td>, <td>3,35</td>, <td>3,85</td>,
<td class="font-bold">Gesamtdividendenausschüttung in Mio.</td>,
<td>-</td>, <td>-</td>, <td>-</td>, <td>-</td>, <td>-</td>, <td>0,00</td>, <td>0,00</td>]
I tried something like this:
import re
import numpy as np

def get_table_entries(element, len_colums):
    # pattern for (optionally negative) decimal numbers such as 3.76
    _re_digits = re.compile(r"-?\d+\.?\d+")
    # find all table entries
    entries = []
    temp = element.findAll("td")
    temp = str(temp)
    # replace elements and extract digits from string
    temp = temp.replace('.', '')
    temp = temp.replace(',', '.')
    print(temp)
    entries += [n for n in _re_digits.findall(temp)]
    # reshape output array to fit original table shape and return entries
    print(entries)
    entries = np.reshape(entries, (-1, len_colums))
    return entries
But this solution also drops the minus in <td>-</td>, which I want to transform into NaN. And when I keep the minus and replace it via temp = temp.replace('-', 'NaN'), I get an error in the list comprehension that follows.
Perhaps the simplest approach is to define a helper function:
def to_float(s):
    if s == "-":
        return float("nan")
    else:
        return float(s.replace(",", "."))
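To see how the helper behaves on the three kinds of cell text in the table, here is a small self-contained check. It repeats to_float from above so that it runs on its own; the sample strings are taken from the table in the question, and the variable names are only for illustration.

```python
import math

def to_float(s):
    if s == "-":
        return float("nan")
    return float(s.replace(",", "."))

number = to_float("3,76")    # decimal comma becomes a decimal point
missing = to_float("-")      # placeholder cell becomes NaN

try:
    to_float("Dividende je Aktie")   # a header cell is not a number
    header_rejected = False
except ValueError:
    header_rejected = True

print(number, math.isnan(missing), header_rejected)   # 3.76 True True
```

The ValueError raised for header text is what lets a loop over all cells skip the row labels.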
And then just write a basic loop over the cells:
values = []
for elem in soup.find_all("td"):
    try:
        values.append(to_float(elem.text))
    except ValueError:
        # header cells are not numbers, so skip them
        pass
Now it is easy to convert to a NumPy array of the desired shape:
>>> np.array(values).reshape(-1, 7)
array([[ 3.76,  2.35,  3.3 ,  5.08,   nan,  8.44, 10.  ],
       [ 3.76,  2.35,  3.3 ,  4.99,  6.63,  8.42, 10.  ],
       [ 1.5 ,  1.5 ,  1.6 ,  2.  ,  2.6 ,  3.35,  3.85],
       [  nan,   nan,   nan,   nan,   nan,  0.  ,  0.  ]])
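Since the question asked about regular expressions and a list comprehension, the same idea can also be sketched without bs4 by working on the stringified cells directly. This is only a sketch under assumptions: the CELL pattern is good enough for flat markup like the one in the question but is not a general HTML parser; the names CELL, NUMBER, parse_cells and chunk are made up for illustration; treating "." as a thousands separator is inferred from the temp.replace('.', '') step in the question; and the input is a shortened three-column sample built from the last two table rows.

```python
import math
import re

# Matches the text of each <td> cell; adequate only for flat, un-nested markup.
CELL = re.compile(r"<td[^>]*>(.*?)</td>", re.S)
# German-style number: optional sign, digits, optional ".ddd" groups, optional ",dd".
NUMBER = re.compile(r"-?\d+(?:\.\d{3})*(?:,\d+)?")

def parse_cells(html):
    """Return floats for numeric cells and NaN for "-" cells; skip all others."""
    texts = [t.strip() for t in CELL.findall(html)]
    return [
        float("nan") if t == "-" else float(t.replace(".", "").replace(",", "."))
        for t in texts
        if t == "-" or NUMBER.fullmatch(t)
    ]

def chunk(values, width):
    """Split a flat list into rows of `width` items (plain-Python stand-in for reshape)."""
    return [values[i:i + width] for i in range(0, len(values), width)]

# Shortened sample (assumption: three columns instead of seven).
html = (
    '<td class="font-bold">Dividende je Aktie</td>, <td>1,50</td>, <td>1,50</td>, '
    '<td>1,60</td>, <td class="font-bold">Gesamtdividendenausschüttung in Mio.</td>, '
    '<td>-</td>, <td>-</td>, <td>0,00</td>'
)
rows = chunk(parse_cells(html), 3)
print(rows)   # [[1.5, 1.5, 1.6], [nan, nan, 0.0]]
```

Filtering on whole cell texts rather than running one pattern over the entire string keeps the "-" placeholders in position, so each row still has the same number of entries.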