
How do I remove the rest of a string after a character in a dataframe column?

I have a dataframe that contains user data. One column holds the filenames that users have accessed. The filenames look like this:

blah-blah-blah/dss_outline.pdf  
doot-doot/helper_doc.pdf
blah-blah-blah/help_file.pdf

My goal is to chop off everything after and including the / so that I can just look at the top-level programs people are examining (which the numerous different files are organized under).

So I have two challenges:

1 - How do I 'grab' everything up to the '/'? I've been looking at regex, but I'm having a hard time writing the correct expression.

2 - How do I replace all of the filenames with the truncated filename? I found that I could use df['Filename'] = df['Filename'].str.split('/')[0] to grab the proper portion, but it won't apply across the Series object. That's the logic of what I want to do, but I can't figure out how to do it.

Thanks

You may use \/.*$ (or simply /.*$ where the slash needs no escaping) to match the part you don't need and remove it.
This matches a forward slash and every character after it up to the end of the string (use the multiline flag if your input has several lines and your engine needs it for $ to match at each line end).

Or you may use ^[^/]+ to match the part you want and extract it.
This matches the run of consecutive characters other than / at the beginning of the string (again, the multiline flag is needed if the input has several lines).
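As a rough sketch of both patterns in plain Python (standard library only, using the filenames from the question): when each filename is its own string, as it is for each cell of a dataframe column, the multiline flag should not be needed. The variable names here are just for illustration.

```python
import re

filenames = [
    "blah-blah-blah/dss_outline.pdf",
    "doot-doot/helper_doc.pdf",
    "blah-blah-blah/help_file.pdf",
]

# Approach 1: delete the first slash and everything after it.
removed = [re.sub(r"/.*$", "", name) for name in filenames]

# Approach 2: keep only the leading run of non-slash characters.
# re.match already anchors at the start of the string, so ^ is optional here.
extracted = [re.match(r"[^/]+", name).group(0) for name in filenames]

print(removed)
print(extracted)
```

Both lists should come out the same for these inputs; the two patterns differ only in whether they describe the part to drop or the part to keep.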

You have a lot of solutions at hand:

1) Just with the split() method:

>>> df
                             col1
0  blah-blah-blah/dss_outline.pdf
1        doot-doot/helper_doc.pdf
2    blah-blah-blah/help_file.pdf


>>> df['col1'].str.split('/', n=1).str[0].str.strip()
0    blah-blah-blah
1         doot-doot
2    blah-blah-blah
Name: col1, dtype: object
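To see why the attempt in the question (.str.split('/')[0]) does not apply across the whole column, here is a small sketch. It assumes the column is named Filename, as in the question, and a default integer index.

```python
import pandas as pd

df = pd.DataFrame({"Filename": [
    "blah-blah-blah/dss_outline.pdf",
    "doot-doot/helper_doc.pdf",
    "blah-blah-blah/help_file.pdf",
]})

# .str.split returns a Series whose elements are lists of path segments.
parts = df["Filename"].str.split("/")

# Plain [0] on a Series looks up the row labelled 0,
# so this is a single list, not a column of first segments.
first_row = parts[0]

# .str[0] indexes into every list and gives one value per row.
top_level = parts.str[0]

print(first_row)
print(top_level)
```

So the missing piece is the second .str accessor: it is what makes the [0] element-wise.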

2) You can use apply() + split():

>>> df['col1'].apply(lambda s: s.split('/')[0])
0    blah-blah-blah
1         doot-doot
2    blah-blah-blah
Name: col1, dtype: object

3) You can use rsplit() + str[0] to strip off the unwanted part (without a split limit this gives the same result as split()):

>>> df['col1'].str.rsplit('/').str[0]
0    blah-blah-blah
1         doot-doot
2    blah-blah-blah
Name: col1, dtype: object
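split() and rsplit() only differ once a limit is given and a path contains more than one slash. A minimal sketch; note that the nested path in the first row is made up for illustration and does not come from the question:

```python
import pandas as pd

# The nested path in the first row is a hypothetical example.
s = pd.Series([
    "blah-blah-blah/reports/dss_outline.pdf",
    "doot-doot/helper_doc.pdf",
])

# split with n=1 cuts at the first slash: only the top-level name remains.
first_segment = s.str.split("/", n=1).str[0]

# rsplit with n=1 cuts at the last slash: the whole parent path remains.
parent_path = s.str.rsplit("/", n=1).str[0]

print(first_segment.tolist())
print(parent_path.tolist())
```

For the top-level program names the question asks about, the split() variant (or any variant without a limit) is probably the one you want.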

4) You can use pandas' native regex support with extract():

>>> df['col1'] = df['col1'].str.extract('([^/]+)', expand=False)
>>> df
             col1
0  blah-blah-blah
1       doot-doot
2  blah-blah-blah

Or, with attribute access to the column:

>>> df.col1.str.extract('([^/]+)', expand=False)

Use df.replace with regex=True:

df.replace(r'/.*$', '', regex=True)

              col
0  blah-blah-blah
1       doot-doot
2  blah-blah-blah
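One thing to keep in mind is that df.replace with regex=True acts on every string column of the dataframe. If only the filename column should change, Series.str.replace on that column is an alternative. A small sketch, where the second column (url) is hypothetical and only there to show the difference:

```python
import pandas as pd

df = pd.DataFrame({
    "col1": ["blah-blah-blah/dss_outline.pdf", "doot-doot/helper_doc.pdf"],
    "url": ["example.com/a", "example.com/b"],  # hypothetical second column
})

# DataFrame.replace with regex=True is applied to every string column.
all_columns = df.replace(r"/.*$", "", regex=True)

# Series.str.replace touches only the selected column.
one_column = df.copy()
one_column["col1"] = one_column["col1"].str.replace(r"/.*$", "", regex=True)

print(all_columns)
print(one_column)
```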

Use Series.apply():

>>> import pandas
>>> data = {'filename': ["blah-blah-blah/dss_outline.pdf", "doot-doot/helper_doc.pdf", "blah-blah-blah/help_file.pdf"]}
>>> df = pandas.DataFrame(data=data)
>>> df
                         filename
0  blah-blah-blah/dss_outline.pdf
1        doot-doot/helper_doc.pdf
2    blah-blah-blah/help_file.pdf
>>> def get_top_level_from(string):
...     return string.split('/')[0]
... 
>>> series = df["filename"]
>>> series
0    blah-blah-blah/dss_outline.pdf
1          doot-doot/helper_doc.pdf
2      blah-blah-blah/help_file.pdf
Name: filename, dtype: object
>>> series.apply(get_top_level_from)
0    blah-blah-blah
1         doot-doot
2    blah-blah-blah
Name: filename, dtype: object

The same code without the interpreter prompts:

def get_top_level_from(string):
    return string.split('/')[0]

results = df["filename"].apply(get_top_level_from)
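If the column can contain missing values, the function above would raise an error on them, because a missing value has no split method. A possible variant, assuming non-string entries should simply be passed through unchanged, is sketched below next to the .str accessor, which propagates missing values on its own:

```python
import pandas as pd

# The None entry stands in for a missing filename (an assumption for this sketch).
s = pd.Series(["blah-blah-blah/dss_outline.pdf", None, "doot-doot/helper_doc.pdf"])

def get_top_level_from(value):
    # Only strings can be split; anything else is returned unchanged.
    if isinstance(value, str):
        return value.split("/")[0]
    return value

with_apply = s.apply(get_top_level_from)

# The .str accessor leaves missing values as missing without extra code.
with_str = s.str.split("/", n=1).str[0]

print(with_apply)
print(with_str)
```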
