How do I find a specific number pattern within a column of strings and replace that value with a text version of that ordinal number?
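Please forgive me, I'm new to Python. I'm building a function that I can use to clean up text from various surveys. I feel I'm close to converting the numeric form of ordinal numbers into their text form, but I'm not quite there. Here is the function I'm trying to build (note that I tried two ways of finding the regex pattern on the nbr = line of the function, and I explain the errors from both below):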
import pandas as pd
from num2words import num2words
import re

my_df = pd.DataFrame({"record": [47,56,59,134,454],
                      "the_string": ["this is the first string",
                                     "this is the 2nd string",
                                     "nothing to see here",
                                     "4th string has the date: today is the 8th",
                                     "this has a typo10th"]})

def replace_ordinal_numbers(words):
    nbr = re.findall('(\d+)[st|nd|rd|th]', words) #words.str.findall('(\d+)[st|nd|rd|th]')
    newText = words
    for n in nbr:
        ordinal_words = num2words(n, ordinal=True)
        newText = words.replace(r'\d+[st|nd|rd|th]', ordinal_words)
    return newText

my_df['the_string_clean'] = replace_ordinal_numbers(str(my_df['the_string']))
Errors: When I use words.str.findall on the nbr = line, I get the error: AttributeError: 'str' object has no attribute 'str'
When I use re.findall, I do get a dataframe, but the 'the_string_clean' column does not reflect the string in each row. Instead, I get:
record the_string the_string_clean
0 47 This is the first string "0This is the first string 1This is the 2nd string 2nothing to
see here 3 4th string has the date: today is the 8th 4This has
a typo10th"
Name: the_string, dtype: object
1 56 This is the 2nd string "0This is the first string 1This is the 2nd string 2 nothing to
see here3 4th string has the date: today is the 8th 4This has a
typo10th"
Name: the_string, dtype: object
2 59 nothing to see here "0This is the first string 1This is the 2nd string 2 nothing to
see here3 4th string has the date: today is the 8th 4This has a
typo10th"
Name: the_string, dtype: object
3 134 4th string has the "0This is the first string 1This is the 2nd string 2 nothing to
date: today is the 8th see here3 4th string has the date: today is the 8th 4This has a
typo10th"
Name: the_string, dtype: object
4 454 this has a typo10th "0This is the first string 1This is the 2nd string 2 nothing to
see here3 4th string has the date: today is the 8th 4This has a
typo10th"
Name: the_string, dtype: object
Expected output: this is the output I'm expecting:
record the_string the_string_clean
47 this is the first string this is the first string
56 this is the 2nd string this is the second string
59 nothing to see here nothing to see here
134 4th string has the date: today is the 8th fourth string has the date: today is the eighth
454 this has a typo10th this has a typotenth
I hope I've been clear enough. I'm new to Python, and any help would be appreciated.
You can simplify your replace_ordinal_numbers function by using re.sub with a lambda function that calls num2words as the replacement. Then just use Series.apply to run the function on each value in the column:
import pandas as pd
from num2words import num2words
import re

my_df = pd.DataFrame({"record": [47,56,59,134,454],
                      "the_string": ["this is the first string",
                                     "this is the 2nd string",
                                     "nothing to see here",
                                     "4th string has the date: today is the 8th",
                                     "this has a typo10th"]})

def replace_ordinal_numbers(words):
    return re.sub(r'(\d+)(?:st|nd|rd|th)', lambda m: num2words(m.group(1), ordinal=True), words)

my_df['the_string'] = my_df['the_string'].apply(replace_ordinal_numbers)
my_df
Output
record the_string
0 47 this is the first string
1 56 this is the second string
2 59 nothing to see here
3 134 fourth string has the date: today is the eighth
4 454 this has a typotenth
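As a side note on why the loop in the question left the text untouched: str.replace treats its first argument as a literal substring, not as a regular expression, so the pattern is never found. (Separately, calling str() on the whole column produces one long string of the printed Series, which is why every row showed the same text.) A minimal sketch of the difference, using only the standard library and a hard-coded replacement word instead of num2words:

```python
import re

text = "this is the 2nd string"
pattern = r'\d+(?:st|nd|rd|th)'

# str.replace searches for the literal characters of the pattern.
# They do not occur in the text, so the text comes back unchanged.
literal_result = text.replace(pattern, "second")

# re.sub interprets the pattern as a regular expression and replaces the match.
regex_result = re.sub(pattern, "second", text)

print(literal_result)  # this is the 2nd string
print(regex_result)    # this is the second string
```

This is why the answer switches to re.sub for the substitution step.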
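Note that you need to use an alternation in the regex, (?:st|nd|rd|th), to match one of st, nd, rd or th. The character class you were using, [st|nd|rd|th], matches exactly one character from the set d, h, n, r, s, t and |, not a whole suffix.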
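The difference between the two patterns can be checked with a short sketch like the one below. It uses only the re module; the sample strings and variable names are illustrative and not part of the original question.

```python
import re

text = "4th string has the date: today is the 8th"

# Character class: matches the digits plus ONE character from {s, t, |, n, d, r, h}.
char_class = re.compile(r'(\d+)[st|nd|rd|th]')
# Alternation in a non-capturing group: matches the digits plus a full suffix.
alternation = re.compile(r'(\d+)(?:st|nd|rd|th)')

class_matches = [m.group(0) for m in char_class.finditer(text)]
alt_matches = [m.group(0) for m in alternation.finditer(text)]
print(class_matches)  # ['4t', '8t']   (the trailing 'h' is left behind)
print(alt_matches)    # ['4th', '8th']

# The character class also accepts text that is not an ordinal at all.
loose_match = char_class.search("room 12d")
strict_match = alternation.search("room 12d")
print(loose_match.group(0))  # 12d
print(strict_match)          # None
```

With the character class, a substitution would consume only "4t" and leave a stray "h" in the output, which is one reason the alternation is the safer choice here.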