How can I find a specific numeric pattern in a column of strings and replace it with the text version of that ordinal?

Question

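Forgive me, I'm new to Python. I'm building a function that I can use to clean up text from various surveys. I feel I'm close to converting the numeric version of ordinal numbers to the text version, but I'm not quite there. Here is the function I'm trying to build (note that I tried two ways of finding the regex pattern on the *nbr = * line of the function; I explain both errors below):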

import pandas as pd
from num2words import num2words
import re

my_df = pd.DataFrame({"record": [47,56,59,134,454],
                      "the_string": ["this is the first string",
                                     "this is the 2nd string",
                                     "nothing to see here",
                                     "4th string has the date: today is the 8th",
                                     "this has a typo10th"]})

def replace_ordinal_numbers(words):
    nbr = re.findall('(\d+)[st|nd|rd|th]', words) #words.str.findall('(\d+)[st|nd|rd|th]')
    
    newText = words
    for n in nbr:
        ordinal_words = num2words(n, ordinal=True)
        newText = words.replace(r'\d+[st|nd|rd|th]', ordinal_words)
    return newText

my_df['the_string_clean'] = replace_ordinal_numbers(str(my_df['the_string']))

ERRORS: When I use words.str.findall on the "nbr =" line, I get the error: AttributeError: 'str' object has no attribute 'str'. When I use re.findall, I do get a dataframe, but the 'the_string_clean' column doesn't reflect the string in each row. Instead, I get:

    record  the_string                  the_string_clean
0   47      This is the first string    "0This is the first string 1This is the 2nd string 2nothing to 
                                        see here 3 4th string has the date: today is the 8th 4This has 
                                        a typo10th"
Name: the_string, dtype: object
1   56      This is the 2nd string      "0This is the first string 1This is the 2nd string 2 nothing to 
                                        see here3 4th string has the date: today is the 8th 4This has a 
                                        typo10th"
Name: the_string, dtype: object
2   59       nothing to see here        "0This is the first string 1This is the 2nd string 2 nothing to 
                                        see here3 4th string has the date: today is the 8th 4This has a 
                                        typo10th"
Name: the_string, dtype: object
3   134      4th string has the         "0This is the first string 1This is the 2nd string 2 nothing to
             date: today is the 8th     see here3 4th string has the date: today is the 8th 4This has a 
                                        typo10th"
Name: the_string, dtype: object
4   454      this has a typo10th        "0This is the first string 1This is the 2nd string 2 nothing to 
                                        see here3 4th string has the date: today is the 8th 4This has a 
                                        typo10th"
Name: the_string, dtype: object

EXPECTED OUTPUT: This is the output I'm expecting:

record    the_string                                 the_string_clean
47        this is the first string                   this is the first string
56        this is the 2nd string                     this is the second string
59        nothing to see here                        nothing to see here
134       4th string has the date: today is the 8th  fourth string has the date: today is the eighth
454       this has a typo10th                        this has a typotenth

I hope I was clear enough. I'm new to Python, and any help would be greatly appreciated.

Answer 1

You can simplify your replace_ordinal_numbers function by using re.sub and calling num2words in a lambda function as the replacement. Then use apply on the column (that is, Series.apply) to run the function on each string:

import pandas as pd
from num2words import num2words
import re

my_df = pd.DataFrame({"record": [47,56,59,134,454],
                      "the_string": ["this is the first string",
                                     "this is the 2nd string",
                                     "nothing to see here",
                                     "4th string has the date: today is the 8th",
                                     "this has a typo10th"]})

def replace_ordinal_numbers(words):
    return re.sub(r'(\d+)(?:st|nd|rd|th)', lambda m: num2words(m.group(1), ordinal=True), words)

my_df['the_string'] = my_df['the_string'].apply(replace_ordinal_numbers)

my_df

Output

   record                                       the_string
0      47                         this is the first string
1      56                        this is the second string
2      59                              nothing to see here
3     134  fourth string has the date: today is the eighth
4     454                             this has a typotenth
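
To see how re.sub works with a function as the replacement, independent of pandas and num2words, here is a minimal sketch. The ORDINALS lookup table and the to_ordinal_word helper are hypothetical stand-ins for num2words(..., ordinal=True) and cover only a handful of values; they are not part of either library.

```python
import re

# Hypothetical stand-in for num2words(..., ordinal=True); covers only a few values
ORDINALS = {1: "first", 2: "second", 4: "fourth", 8: "eighth", 10: "tenth"}

def to_ordinal_word(match):
    # re.sub calls this once per match and passes a match object;
    # group(1) holds the captured digits
    number = int(match.group(1))
    # Fall back to the original matched text when the number is not in the table
    return ORDINALS.get(number, match.group(0))

def replace_ordinal_numbers(words):
    return re.sub(r'(\d+)(?:st|nd|rd|th)', to_ordinal_word, words)

print(replace_ordinal_numbers("4th string has the date: today is the 8th"))
print(replace_ordinal_numbers("this has a typo10th"))
```

Whatever the function returns is spliced in place of the whole match, which is why the suffix disappears along with the digits.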

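Note that you need to use an alternation, (?:st|nd|rd|th), in the regex to match one of st, nd, rd or th. The character class you were using, [st|nd|rd|th], matches exactly one character from the set d, h, n, r, s, t and |, so a suffix such as "th" is only partly consumed.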
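
The difference between the two patterns can be checked with re alone. This is a small sketch on one of the sample strings; the variable names are only for illustration.

```python
import re

text = "4th string has the date: today is the 8th"

# Character class: digits followed by exactly ONE character from {s, t, |, n, d, r, h}
char_class = re.compile(r'(\d+)[st|nd|rd|th]')
# Non-capturing alternation: digits followed by one whole suffix (st, nd, rd or th)
alternation = re.compile(r'(\d+)(?:st|nd|rd|th)')

class_matches = [m.group(0) for m in char_class.finditer(text)]
alt_matches = [m.group(0) for m in alternation.finditer(text)]

print(class_matches)  # ['4t', '8t']
print(alt_matches)    # ['4th', '8th']

# With the character class, the second letter of the suffix is left behind
print(char_class.sub('X', text))   # Xh string has the date: today is the Xh
print(alternation.sub('X', text))  # X string has the date: today is the X
```

The character class also accepts input that is not an ordinal at all, such as a digit followed by a literal "|".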
