Extracting a pattern from a string using Python

I'm trying to read data from a column in an Excel file and then extract the user IDs that appear in each row. So far I have been able to extract IDs with the following code and write the results to an Excel file.

import xlrd
import pandas as pd


#Input File Path
file='file1.xlsx'
workbook = xlrd.open_workbook(file)

#open first worksheet
sheet=workbook.sheet_by_index(0)

#extract details from the column at index 4 (indices are zero-based)
description = sheet.col_values(4)

my_series = pd.Series(description)
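# findall returns a list of digit strings for each row
numbers = my_series.str.findall(r'\d+')
# convert each row's digit strings to ints (list() is needed on Python 3, where map is lazy)
All_Ids_mapped = [list(map(int, x)) for x in numbers]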
df = pd.DataFrame(All_Ids_mapped)

# Create a Pandas Excel writer using XlsxWriter as the engine.
writer = pd.ExcelWriter('extracted_ids.xlsx', engine='xlsxwriter')

# Convert the dataframe to an XlsxWriter Excel object.
df.to_excel(writer, sheet_name='Sheet1')

# Close the Pandas Excel writer and output the Excel file.
writer.close()

The problem is that the column contains several kinds of IDs, and I only want the ones that are preceded by the string 'user with id'. For instance, a string in the column looks like the following:

The user with id '123' started discussion with the user with id '456' in the discussion thread with id '5000'.

Since I'm only interested in user IDs, I want to update my search pattern to include the surrounding text. I tried the following, but it doesn't give me the expected output.

  numbers=my_series.str.findall('^user with id.+\d+')

Please help me write the correct expression for str.findall.

Thank you.

Using the re module, I got the following result:

>>> import re
>>> series = "The user with id '123' started discussion with the user with id '456' in the discussion thread with id '5000'."
>>> re.findall(r"user with id '\d+'", series)
["user with id '123'", "user with id '456'"]

Are these the expected matches? Since the resulting matches are ordered, it wouldn't be too hard to select one by index and extract the id.
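
As for why the attempted pattern returns nothing: `^` anchors the match at the very start of the string, and the sample string starts with "The", not "user". One way to get only the numbers is to put a capture group around `\d+`, since `re.findall` returns just the group contents when the pattern has exactly one group. The sketch below uses plain `re` on a list of sample strings (the `rows` data and the `USER_ID` name are made up for illustration). Because pandas documents `Series.str.findall` as applying `re.findall` to each element, the same pattern should work when passed to `my_series.str.findall`.

```python
import re

# Hypothetical sample rows standing in for the Excel column.
rows = [
    "The user with id '123' started discussion with the user with id '456' "
    "in the discussion thread with id '5000'.",
    "A row without any user ids.",
]

# One capture group around the digits: findall returns only the captured part.
USER_ID = re.compile(r"user with id '(\d+)'")

# Build one list of integer ids per row; rows with no match give an empty list.
user_ids = [[int(i) for i in USER_ID.findall(row)] for row in rows]

print(user_ids)  # [[123, 456], []]
```

The thread id 5000 is skipped because it is preceded by "thread with id", not "user with id".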
