I have a DataFrame that contains user data. There is a column that includes the filenames that users have accessed. The filenames look like this:
blah-blah-blah/dss_outline.pdf
doot-doot/helper_doc.pdf
blah-blah-blah/help_file.pdf
My goal is to chop off everything after and including the / so that I can just look at the top-level programs people are examining (which the numerous different files are organized under).
So, I'm having two challenges:
1 - How do I 'grab' everything up to the '/'? I've been looking at regex, but I'm having a hard time writing the correct expression.
2 - How do I replace all of the filenames with the truncated filename? I found that I could use df['Filename'] = df['Filename'].str.split('/')[0] to grab the proper portion of a single value, but it won't apply across the whole Series. That's the logic of what I want to do, but I can't figure out how to do it.
Thanks
You may use \/.*$ to match the part you don't need and remove it.
This matches a forward slash and every following character up to the end of the string (be careful to use a multiline flag if your engine needs it!).
Or you may use ^[^/]+ to match the part you want and extract it.
This matches one or more consecutive characters other than / from the beginning of the string (again, a multiline flag may be needed!).
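For what it's worth, here is a minimal sketch of how those two patterns might be applied in pandas. The sample values and the 'Filename' column come from the question; the variable names are only illustrative. Since each cell is matched as its own string, no multiline flag should be needed here.

```python
import re

import pandas as pd

# Sample values taken from the question; "Filename" is the asker's column name.
df = pd.DataFrame({
    "Filename": [
        "blah-blah-blah/dss_outline.pdf",
        "doot-doot/helper_doc.pdf",
        "blah-blah-blah/help_file.pdf",
    ]
})

# Option 1: match the slash and everything after it, then remove the match.
removed = df["Filename"].str.replace(r"/.*$", "", regex=True)

# Option 2: match the leading run of non-slash characters and keep it.
# expand=False returns a Series because the pattern has a single group.
extracted = df["Filename"].str.extract(r"^([^/]+)", expand=False)

# The same removal pattern with the standard-library re module on one string.
single = re.sub(r"/.*$", "", "doot-doot/helper_doc.pdf")
```

Either result can be assigned back with df['Filename'] = removed (or extracted) to overwrite the column.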
You have a lot of solutions handy:
split() method:
>>> df
col1
0 blah-blah-blah/dss_outline.pdf
1 doot-doot/helper_doc.pdf
2 blah-blah-blah/help_file.pdf
>>> df['col1'].str.split('/', n=1).str[0].str.strip()
0 blah-blah-blah
1 doot-doot
2 blah-blah-blah
Name: col1, dtype: object
apply() + split():
>>> df['col1'].apply(lambda s: s.split('/')[0])
0 blah-blah-blah
1 doot-doot
2 blah-blah-blah
Name: col1, dtype: object
rsplit() + str[0] to keep the desired part:
>>> df['col1'].str.rsplit('/').str[0]
0 blah-blah-blah
1 doot-doot
2 blah-blah-blah
Name: col1, dtype: object
extract():
>>> df['col1'] = df['col1'].str.extract('([^/]+)', expand=False)
>>> df
col1
0 blah-blah-blah
1 doot-doot
2 blah-blah-blah
OR
# df.col1.str.extract('([^/]+)')
Use df.replace():
>>> df.replace(r'/.*$', '', regex=True)
col1
0 blah-blah-blah
1 doot-doot
2 blah-blah-blah
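As a side note on why the attempt in the question did not work: a plain [0] on the result of .str.split() looks up the row labelled 0 in the Series, while .str[0] takes the first item of every row's list. A small sketch, assuming the default integer index and the same sample data (the variable names are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "col1": [
        "blah-blah-blah/dss_outline.pdf",
        "doot-doot/helper_doc.pdf",
        "blah-blah-blah/help_file.pdf",
    ]
})

# .str.split() gives a Series in which every element is a list of parts.
parts = df["col1"].str.split("/")

# Plain indexing selects the row with label 0, i.e. one list for one row.
first_row = parts[0]

# The .str accessor indexes into each list, giving one value per row.
top_level = parts.str[0]
```

Here first_row is the two-item list for the first filename only, whereas top_level has one entry per row and can be assigned back to the column.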
Use Series.apply():
>>> import pandas
>>> data = {'filename': ["blah-blah-blah/dss_outline.pdf", "doot-doot/helper_doc.pdf", "blah-blah-blah/help_file.pdf"]}
>>> df = pandas.DataFrame(data=data)
>>> df
filename
0 blah-blah-blah/dss_outline.pdf
1 doot-doot/helper_doc.pdf
2 blah-blah-blah/help_file.pdf
>>> def get_top_level_from(string):
...     return string.split('/')[0]
...
>>> series = df["filename"]
>>> series
0 blah-blah-blah/dss_outline.pdf
1 doot-doot/helper_doc.pdf
2 blah-blah-blah/help_file.pdf
Name: filename, dtype: object
>>> series.apply(get_top_level_from)
0 blah-blah-blah
1 doot-doot
2 blah-blah-blah
Name: filename, dtype: object
Code:
def get_top_level_from(string):
    return string.split('/')[0]
results = df["filename"].apply(get_top_level_from)
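One caveat worth sketching: if the column can contain missing values, calling .split() on them inside apply() will fail, because a missing value is not a string. The variant below is only an illustration under that assumption (the extra None row and the isinstance guard are not part of the original answer). It also shows the .str route, which passes missing values through, and a value_counts() call to tally the top-level programs.

```python
import pandas as pd

# Same sample data as above, plus one missing value added for illustration.
df = pd.DataFrame({
    "filename": [
        "blah-blah-blah/dss_outline.pdf",
        None,
        "doot-doot/helper_doc.pdf",
        "blah-blah-blah/help_file.pdf",
    ]
})


def get_top_level_from(value):
    # Only strings have .split(); return anything else (None/NaN) unchanged.
    if isinstance(value, str):
        return value.split("/")[0]
    return value


# Element-wise function call, guarded against missing values.
with_apply = df["filename"].apply(get_top_level_from)

# The .str accessor skips missing values without an explicit guard.
with_str = df["filename"].str.split("/").str[0]

# Count accesses per top-level program (missing values are dropped by default).
counts = with_str.value_counts()
```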