Python: extract digit values from a string via regular expressions and list comprehension

I want to extract this:

3.76    2.35    3.30    5.08     NaN    8.44    10.00
3.76    2.35    3.30    4.99    6.63    8.42    10.00
1.50    1.50    1.60    2.00    2.60    3.35    3.85
NaN      NaN    NaN     NaN     NaN     0.00    0.00

from the following result of a bs4 operation:

[<td class="font-bold">Ergebnis je Aktie (unverwässert, nach Steuern)</td>,
 <td>3,76</td>, <td>2,35</td>, <td>3,30</td>, <td>5,08</td>, <td>-</td>, <td>8,44</td>, <td>10,00</td>,
 <td class="font-bold">Ergebnis je Aktie (verwässert, nach Steuern)</td>,
 <td>3,76</td>, <td>2,35</td>, <td>3,30</td>, <td>4,99</td>, <td>6,63</td>, <td>8,42</td>, <td>10,00</td>,
 <td class="font-bold">Dividende je Aktie</td>,
 <td>1,50</td>, <td>1,50</td>, <td>1,60</td>, <td>2,00</td>, <td>2,60</td>, <td>3,35</td>, <td>3,85</td>,
 <td class="font-bold">Gesamtdividendenausschüttung in Mio.</td>,
 <td>-</td>, <td>-</td>, <td>-</td>, <td>-</td>, <td>-</td>, <td>0,00</td>, <td>0,00</td>]

I tried something like this:

import re

import numpy as np


def get_table_entries(element, len_colums):
    # matches an optional minus, digits, an optional dot, digits
    _re_digits = re.compile(r"-?\d+\.?\d+")

    # find all table entries and work on their string representation
    temp = str(element.find_all("td"))

    # drop thousands separators, then turn decimal commas into dots
    temp = temp.replace('.', '')
    temp = temp.replace(',', '.')
    print(temp)

    # extract digits from the string
    entries = [n for n in _re_digits.findall(temp)]
    print(entries)

    # reshape output array to fit original table shape and return entries
    entries = np.reshape(entries, (-1, len_colums))
    return entries

But this solution also drops the minus in <td>-</td>, which I want to turn into NaN. And when I keep the minus and replace it via temp = temp.replace('-', 'NaN'), I get an error in the list comprehension that follows.
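To see why the placeholder cells vanish, here is a minimal sketch that runs the question's pattern on a short, made-up stand-in for the converted string (the `sample` text is an assumption, not the real page). The pattern needs at least two digits, so it can never match a lone "-", and it would also skip a single-digit cell.

```python
import re

# Same pattern as in the question: optional minus, digits, optional dot, digits.
pattern = re.compile(r"-?\d+\.?\d+")

# Hypothetical stand-in for a slice of str(temp) after the two replace() calls.
sample = "<td>5.08</td>, <td>-</td>, <td>8.44</td>, <td>7</td>"

matches = pattern.findall(sample)
print(matches)  # ['5.08', '8.44']
```

Both the "-" cell and the "7" cell are missing from the result, so the number of entries no longer lines up with the table shape.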

Perhaps the simplest approach is to define a helper function:

def to_float(s): 
    if s == "-": 
        return float("nan") 
    else: 
        return float(s.replace(",", ".")) 
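If the table can also contain values with a thousands separator (the question's code strips "." for that reason), a variant along these lines might help. This is only a sketch: the name `to_float_de` is hypothetical, and it assumes "." is always a thousands separator and "," always the decimal mark.

```python
import math


def to_float_de(s):
    """Hypothetical variant of to_float for German-formatted numbers."""
    s = s.strip()
    if s == "-":
        return float("nan")
    # Assumption: "." only ever separates thousands, "," is the decimal mark.
    return float(s.replace(".", "").replace(",", "."))


print(to_float_de("1.234,56"))         # 1234.56
print(to_float_de("10,00"))            # 10.0
print(math.isnan(to_float_de(" - ")))  # True
```

Label cells such as "Dividende je Aktie" still raise ValueError, so they can be filtered out the same way as with the simpler helper.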

And then just write a basic loop over the cells:

values = []
for elem in soup.find_all("td"): 
    try: 
        values.append(to_float(elem.text)) 
    except ValueError: 
        pass 

Now it is easy to convert the result to a NumPy array of the desired shape:

>>> np.array(values).reshape(-1, 7)
array([[ 3.76,  2.35,  3.3 ,  5.08,   nan,  8.44, 10.  ],
       [ 3.76,  2.35,  3.3 ,  4.99,  6.63,  8.42, 10.  ],
       [ 1.5 ,  1.5 ,  1.6 ,  2.  ,  2.6 ,  3.35,  3.85],
       [  nan,   nan,   nan,   nan,   nan,  0.  ,  0.  ]])
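If you would rather stay close to the original regex-plus-list-comprehension idea, one option is to match whole <td> cells and let the pattern accept "-" as an alternative to a number. The sketch below uses only the standard library; the `html` string is a shortened, made-up excerpt, and the two-column row length is chosen just to fit it.

```python
import re

# Hypothetical excerpt of str(element.find_all("td")) from the question.
html = (
    '<td class="font-bold">Dividende je Aktie</td>, <td>1,50</td>, <td>1,50</td>, '
    '<td>-</td>, <td>0,00</td>'
)

# Match only plain cells whose whole content is "-" or a German-formatted number.
cell = re.compile(r"<td>\s*(-|\d[\d.]*(?:,\d+)?)\s*</td>")

values = [
    float("nan") if raw == "-" else float(raw.replace(".", "").replace(",", "."))
    for raw in cell.findall(html)
]
print(values)  # [1.5, 1.5, nan, 0.0]

# Split the flat list into rows; n_cols is 2 only because the sample is short.
n_cols = 2
rows = [values[i:i + n_cols] for i in range(0, len(values), n_cols)]
print(rows)  # [[1.5, 1.5], [nan, 0.0]]
```

Because the label cells carry a class attribute, the literal "<td>" in the pattern skips them. Parsing HTML with regular expressions is fragile, though, so iterating over the bs4 elements is usually the safer route.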
