I have a large data file and I need to delete rows that end in certain letters.
Here is an example of the file I'm using:
User Name DN
MB212DA CN=MB212DA,CN=Users,DC=prod,DC=trovp,DC=net
MB423DA CN=MB423DA,OU=Generic Mailbox,DC=prod,DC=trovp,DC=net
MB424PL CN=MB424PL,CN=Users,DC=prod,DC=trovp,DC=net
MBDA423 CN=MBDA423,OU=DNA,DC=prod,DC=trovp,DC=net
MB2ADA4 CN=MB2ADA4,OU=DNA,DC=prod,DC=trovp,DC=net
Code I am using:
from pandas import DataFrame, read_csv
import pandas as pd
f = pd.read_csv('test1.csv', sep=',', encoding='latin1')
df = f.loc[~(~pd.isnull(f['User Name']) & f['User Name'].str.contains("DA|PL"))]
How do I use regular expression syntax to delete the words that end in "DA" and "PL" but make sure I do not delete the other rows because they contain "DA" or "PL" inside of them?
It should delete the rows and I end up with a file like this:
User Name DN
MBDA423 CN=MBDA423,OU=DNA,DC=prod,DC=trovp,DC=net
MB2ADA4 CN=MB2ADA4,OU=DNA,DC=prod,DC=trovp,DC=net
The first three rows are deleted because they end in DA or PL.
You could use this expression:
df = df[~df['User Name'].str.contains('(?:DA|PL)$')]
It will return all rows that don't end in either DA or PL.
The ?: makes the parentheses a non-capturing group. Without it, pandas would return the following (harmless) warning:
UserWarning: This pattern has match groups. To actually get the groups, use str.extract.
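This is easy to demonstrate with a throwaway Series (a minimal sketch; the sample values here are taken from the question's data):

```python
import warnings
import pandas as pd

s = pd.Series(['MB212DA', 'MBDA423'])

# With a capturing group, pandas emits the UserWarning mentioned above
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    s.str.contains('(DA|PL)$')
print(any('match groups' in str(w.message) for w in caught))  # True

# The non-capturing form gives the same boolean result, without the warning
print(list(s.str.contains('(?:DA|PL)$')))  # [True, False]
```

Note that only the string ending in DA matches: MBDA423 contains DA, but not at the end, so the `$` anchor rejects it.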
Alternatively, the same filtering can be achieved without regular expressions, using endswith():
df = df[~df['User Name'].str.endswith(('DA', 'PL'))]
As expected, the version without regular expressions is faster. Here is a simple test using big_df, which consists of 10001 copies of your original df:
# Create a larger DF to get better timing results
# (DataFrame.append was removed in pandas 2.0; pd.concat does the same job)
big_df = pd.concat([df] * 10001, ignore_index=True)
print(big_df.shape)
>> (50005, 2)
# Without regular expressions
%%timeit
big_df[~big_df['User Name'].str.endswith(('DA', 'PL'))]
>> 10 loops, best of 3: 22.3 ms per loop
# With regular expressions
%%timeit
big_df[~big_df['User Name'].str.contains('(?:DA|PL)$')]
>> 10 loops, best of 3: 61.8 ms per loop
You can use a boolean mask: check whether the last two characters of User Name are not (~) in a set of two-character endings:
>>> df[~df['User Name'].str[-2:].isin(['DA', 'PL'])]
  User Name                                         DN
3   MBDA423  CN=MBDA423,OU=DNA,DC=prod,DC=trovp,DC=net
4   MB2ADA4  CN=MB2ADA4,OU=DNA,DC=prod,DC=trovp,DC=net
Instead of regular expressions, you can use the endswith() method to check whether a string ends with a specific suffix. For example:
for row in rows:
    if row.endswith(('DA', 'PL')):
        # do something
You should create another df from the filtered data, and then use df.to_csv() to save a clean version of your file.
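Putting that together, a sketch of the full filter-and-save step (the question's data is inlined here so the example is self-contained; in practice you would load it with pd.read_csv('test1.csv'), and the output file name is illustrative):

```python
import pandas as pd

# Sample data from the question (normally loaded via pd.read_csv)
df = pd.DataFrame({
    'User Name': ['MB212DA', 'MB423DA', 'MB424PL', 'MBDA423', 'MB2ADA4'],
    'DN': ['CN=MB212DA,CN=Users,DC=prod,DC=trovp,DC=net',
           'CN=MB423DA,OU=Generic Mailbox,DC=prod,DC=trovp,DC=net',
           'CN=MB424PL,CN=Users,DC=prod,DC=trovp,DC=net',
           'CN=MBDA423,OU=DNA,DC=prod,DC=trovp,DC=net',
           'CN=MB2ADA4,OU=DNA,DC=prod,DC=trovp,DC=net'],
})

# Keep only rows whose 'User Name' does not end in DA or PL
clean = df[~df['User Name'].str.endswith(('DA', 'PL'))]
print(list(clean['User Name']))  # ['MBDA423', 'MB2ADA4']

# Write the filtered data to a new file, without the row index
clean.to_csv('test1_clean.csv', index=False)
```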