
Problem with re.split(), data extraction from a string (splitting a string)

I have been trying to split this string, but I only get the last character of the username I want. For example:

In this dataset I want to separate the username from the actual message, but after running this code:

#how can we separate users from messages 
users = []
messages = []
for message in df['user_message']:
    entry = re.split('([a-zA-Z]|[0-9])+#[0-9]+\\n', message)
    if entry[1:]:
        users.append(entry[1])
        messages.append(entry[2])
    else:
        users.append('notif')
        messages.append(entry[0])
        
df['user'] = users
df['message'] = messages
df.drop(columns=['user_message'], inplace = True)

df.head(30)

I only get

Could someone please tell me why it only gives me the last character of the string I want to split, and how I can fix it? Thanks a lot.

Splitting is not really the string operation you want here. Instead, just use str.extract directly on the user_message column:

df["username"] = df["user_message"].str.extract(r'^([^#]+)')

The above logic will extract the leading part of the user message, from the beginning, until reaching the first hash symbol.
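As a rough illustration of how that extraction behaves, here is a small sketch. The first row is the sample line quoted elsewhere in this thread; the second row is a hypothetical entry with no username, added only to show the edge case. The stricter pattern and the 'notif' fallback (borrowed from the question's code) are assumptions, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    "user_message": [
        "keikeo#2720\nAdded a recipient.\n\n\n",  # sample line from this thread
        "Call started.\n",                        # hypothetical row with no username
    ]
})

# Pattern from the answer: everything up to the first '#'.
# A row without any '#' matches in full, newline included.
loose = df["user_message"].str.extract(r'^([^#]+)', expand=False)

# Assumed stricter variant: require 'name#digits' followed by a newline,
# so rows without a username come back as NaN and can be labelled.
strict = df["user_message"].str.extract(r'^([^#\n]+)#[0-9]+\n', expand=False)
df["username"] = strict.fillna("notif")

print(loose.tolist())           # ['keikeo', 'Call started.\n']
print(df["username"].tolist())  # ['keikeo', 'notif']
```

Whether the stricter form is preferable depends on whether the real data contains rows without a username.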

You could do this much more simply by using str.split() with maxsplit set to 1. See the example below.

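Note that regex is very useful, but it is easy to get incorrect results with it. I advise using an online regex tester if you really need regex; I used regex101.com for testing. As for the actual regex, your + is in the wrong place. It follows the group, so the group is repeated once per character and only the last repetition is kept, which is why you get just the final character of the username. You need to move the + inside the group: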

([a-zA-Z0-9]+)#[0-9]+\\n
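To see the difference concretely, here is a small sketch that runs both patterns through re.split on the sample line from this thread. The patterns are written as raw strings, which is equivalent to the '\\n' spelling used in the question.

```python
import re

line = "keikeo#2720\nAdded a recipient.\n\n\n"

# Question's pattern: the + repeats the group itself, so group 1
# ends up holding only the last character it matched.
original = re.split(r'([a-zA-Z]|[0-9])+#[0-9]+\n', line)

# Corrected pattern: the + sits inside the group, so group 1
# captures the whole run of letters and digits.
fixed = re.split(r'([a-zA-Z0-9]+)#[0-9]+\n', line)

print(original)  # ['', 'o', 'Added a recipient.\n\n\n']
print(fixed)     # ['', 'keikeo', 'Added a recipient.\n\n\n']
```

Note that both patterns drop the '#2720' part, because it is matched but not captured.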

str.split() example:

line = "keikeo#2720\nAdded a recipient.\n\n\n"

user, message = line.split('\n', maxsplit=1)
print(user)
print(message)
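One caveat: the two-name unpacking raises ValueError if a line has no newline at all. A minimal sketch of a more forgiving variant uses str.partition, which always returns three parts. The helper name split_user_message is hypothetical, and the 'notif' label is borrowed from the question's code; treating a newline-free line as a notification is an assumption about the data.

```python
def split_user_message(line):
    """Split at the first newline; hypothetical helper for illustration."""
    user, sep, message = line.partition("\n")
    if not sep:
        # No newline found: assume the whole line is a message without a user.
        return "notif", line
    return user, message


print(split_user_message("keikeo#2720\nAdded a recipient.\n\n\n"))
# ('keikeo#2720', 'Added a recipient.\n\n\n')
print(split_user_message("Call started."))
# ('notif', 'Call started.')
```

Unlike the regex versions, this keeps the '#2720' suffix as part of the user string.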
