Extracting a pattern from a string using Python
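I am trying to read data from a column in an Excel file and then extract the user IDs that appear in each row. So far I have been able to extract the IDs with the following code and write the results to an Excel file.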
import xlrd
import pandas as pd
#Input File Path
file='file1.xlsx'
workbook = xlrd.open_workbook(file)
#open first worksheet
sheet=workbook.sheet_by_index(0)
#extract details from the column at index 4 (indices are 0-based, so this is the fifth column)
description = sheet.col_values(4)
my_series = pd.Series(description)
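# findall returns, for each cell, a list of the digit strings it contains
numbers = my_series.str.findall(r'\d+')
# convert each list of digit strings to a list of ints
# (list(...) is needed on Python 3, where map returns a lazy iterator)
All_Ids_mapped = [list(map(int, x)) for x in numbers]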
df = pd.DataFrame(All_Ids_mapped)
# Create a Pandas Excel writer using XlsxWriter as the engine.
writer = pd.ExcelWriter('extracted_ids.xlsx', engine='xlsxwriter')
# Convert the dataframe to an XlsxWriter Excel object.
df.to_excel(writer, sheet_name='Sheet1')
# Close the Pandas Excel writer and output the Excel file.
writer.close()
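My problem now is that the column contains many kinds of IDs. I want to extract only the IDs that follow the string 'user with id'. For example, a string in the column looks like this: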
The user with id '123' started discussion with the user with id '456' in the discussion thread with id '5000'.
Since I am only interested in the user IDs, I want to update the search pattern to include that text. I tried the following, but it gave no output.
numbers=my_series.str.findall('^user with id.+\d+')
Please help me write the correct expression for str.findall.
Thanks.
Using the re module, I get the following result:
>>> import re
>>> series = "The user with id '123' started discussion with the user with id '456' in the discussion thread with id '5000'."
>>> re.findall(r"user with id '\d+'", series)
["user with id '123'", "user with id '456'"]
Are these the expected matches? Since the matches come back in order, it is not hard to pick one by index and extract the ID from it.
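If only the numbers are wanted, a capture group may be simpler than post-processing each match. The sketch below assumes every user ID is wrapped in single quotes exactly as in the sample sentence; the names USER_ID_PATTERN and extract_user_ids are made up for illustration. It also shows why the pattern from the question finds nothing: `^` anchors the match to the start of the string, and the sample starts with "The", not "user".

```python
import re

# Assumption: IDs always appear as: user with id '<digits>'
# The parentheses capture only the digits, so findall returns just those.
USER_ID_PATTERN = re.compile(r"user with id '(\d+)'")


def extract_user_ids(text):
    """Return the user IDs found in text as ints, in order of appearance."""
    return [int(digits) for digits in USER_ID_PATTERN.findall(text)]


series = ("The user with id '123' started discussion with the user with id "
          "'456' in the discussion thread with id '5000'.")

# The thread ID is skipped because it is not preceded by "user with id".
print(extract_user_ids(series))  # [123, 456]

# The pattern from the question: '^' requires the match to begin at the
# very start of the string, which begins with "The", so nothing is found.
print(re.findall(r"^user with id.+\d+", series))  # []
```

Since Series.str.findall applies re.findall to each element, the same pattern should carry over to the pandas code, for example my_series.str.findall(r"user with id '(\d+)'"), which would give a list of digit strings per row.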