
Drop rows in dataframe if the column matches particular string

I tried to follow the process mentioned here, but it did not work completely for me, so please point me to any duplicates I might be missing. Below is the requirement I am blocked on: I am trying to filter the data with the following condition before inserting it into Postgres.

The first name and last name columns should not contain [Jr, Sr, I, II, etc.]; if they do, drop the entire record/row.

columns = [
        'cust_last_nm',
        'cust_frst_nm',
        'cust_brth_dt',
        'cust_gendr_cd',
        'cust_postl_cd'
    ]
def push_to_pg_weekly(key):
    total_rows = int(a.split()[0])
    rows = 0
    for chunk in pd.read_csv(key, sep="|", header=None, chunksize=100000):
        rows += len(chunk)
        chunk = chunk.dropna(axis=0)
        chunk = chunk[np.where(
         (chunk[0].astype('str').str.len()>1) & 
         (chunk[1].astype('str').str.len()>1) &
         (chunk[4].astype('str').str.len()>4) &
         (chunk[4].astype('str').str.len()<8), True, False)]
        chunk[0] = ~chunk[0].str.contains("jr", na=False)
        chunk[1] = ~chunk[1].str.contains("jr", na=False)
        chunk.to_csv("/tmp/sample.csv", sep="|", header=None, index=False)
        connection = psycopg2.connect(connection details <here>)
        with connection.cursor() as cursor:
            connection.commit()

Test data that I am working on:

jane|doe|1969-01-01|F|926.0|1351127|E2sboFz4Mk2aGIKhD4vm6J9Jt3ZSoSdLm+0PCdWsJto=|YSILMFS5sPPZZF/KFroEHV77z1bMeiL/f4FqF2kj4Xc=|tNjgnby5zDbfT2SLsCCwhNBxobSDcCp7ws0zYVme5w4=|kk25p0lrp2T54Z3B1HM3ZQN0RM63rjqvewrwW5VhYcI=|cigna_TOKEN_ENCRYPTION_KEY
jr|doe|1969-01-01|M|926.0|1351127|E2sboFz4Mk2aGIKhD4vm6J9Jt3ZSoSdLm+0PCdWsJto=|YSILMFS5sPPZZF/KFroEHV77z1bMeiL/f4FqF2kj4Xc=|tNjgnby5zDbfT2SLsCCwhNBxobSDcCp7ws0zYVme5w4=|kk25p0lrp2T54Z3B1HM3ZQN0RM63rjqvewrwW5VhYcI=|cigna_TOKEN_ENCRYPTION_KEY
jane|sr|1969-01-01|F|926.0|1351127|E2sboFz4Mk2aGIKhD4vm6J9Jt3ZSoSdLm+0PCdWsJto=|YSILMFS5sPPZZF/KFroEHV77z1bMeiL/f4FqF2kj4Xc=|tNjgnby5zDbfT2SLsCCwhNBxobSDcCp7ws0zYVme5w4=|kk25p0lrp2T54Z3B1HM3ZQN0RM63rjqvewrwW5VhYcI=|cigna_TOKEN_ENCRYPTION_KEY

I know I am heading in the right direction but am still missing something, because when I try this:

chunk[0] = ~chunk[0].str.contains("jr", na=False)

I get the output below. Instead of False, I expect the entire row to be dropped:

True|True|1969-01-01|F|926.0|1351127|E2sboFz4Mk2aGIKhD4vm6J9Jt3ZSoSdLm+0PCdWsJto=|YSILMFS5sPPZZF/KFroEHV77z1bMeiL/f4FqF2kj4Xc=|tNjgnby5zDbfT2SLsCCwhNBxobSDcCp7ws0zYVme5w4=|kk25p0lrp2T54Z3B1HM3ZQN0RM63rjqvewrwW5VhYcI=|cigna_TOKEN_ENCRYPTION_KEY
False|True|1969-01-01|M|926.0|1351127|E2sboFz4Mk2aGIKhD4vm6J9Jt3ZSoSdLm+0PCdWsJto=|YSILMFS5sPPZZF/KFroEHV77z1bMeiL/f4FqF2kj4Xc=|tNjgnby5zDbfT2SLsCCwhNBxobSDcCp7ws0zYVme5w4=|kk25p0lrp2T54Z3B1HM3ZQN0RM63rjqvewrwW5VhYcI=|cigna_TOKEN_ENCRYPTION_KEY
True|True|1969-01-01|F|926.0|1351127|E2sboFz4Mk2aGIKhD4vm6J9Jt3ZSoSdLm+0PCdWsJto=|YSILMFS5sPPZZF/KFroEHV77z1bMeiL/f4FqF2kj4Xc=|tNjgnby5zDbfT2SLsCCwhNBxobSDcCp7ws0zYVme5w4=|kk25p0lrp2T54Z3B1HM3ZQN0RM63rjqvewrwW5VhYcI=|cigna_TOKEN_ENCRYPTION_KEY

Expected output:

jane|doe|1969-01-01|F|926.0|1351127|E2sboFz4Mk2aGIKhD4vm6J9Jt3ZSoSdLm+0PCdWsJto=|YSILMFS5sPPZZF/KFroEHV77z1bMeiL/f4FqF2kj4Xc=|tNjgnby5zDbfT2SLsCCwhNBxobSDcCp7ws0zYVme5w4=|kk25p0lrp2T54Z3B1HM3ZQN0RM63rjqvewrwW5VhYcI=|cigna_TOKEN_ENCRYPTION_KEY

Another question I have: can I pass multiple patterns to str.contains to filter on more conditions? I tried the two methods below, but neither worked; both yielded True/False results as well.

chunk[0] = ~chunk[0].str.contains("jr", “sr”, “|”, “||”, na=False)
chunk[1] = ~chunk[1].str.contains("jr", “sr”, “|”, “||”,  na=False)
or
searchfor = [‘jr’, ’sr’,’|’,’||’]
chunk[0] = ~chunk.chunk[0].str.contains('|'.join(searchfor))]
chunk[1] = ~chunk.chunk[1].str.contains('|'.join(searchfor))]

Or should I be using the drop method to drop rows? Any suggestions or comments will be appreciated, thanks.

Essentially, you are forgetting to pass the boolean Series (True/False) into brackets [...] or, better, into .loc[...]. Instead, you are re-assigning the values in those chunk columns to the result of your conditions, without applying the conditions as a filter on the data frame.

Therefore, consider calling .loc[] with the intersection of both conditions:

# ASSIGN BOOLEAN SERIES
fname_jr = ~chunk[0].str.contains("jr", na=False)
lname_jr = ~chunk[1].str.contains("jr", na=False)

# PASS INTO .loc
chunk_sub = chunk.loc[fname_jr & lname_jr]
chunk_sub

#       0    1   ...                                            9                          10
# 0  jane  doe  ...  kk25p0lrp2T54Z3B1HM3ZQN0RM63rjqvewrwW5VhYcI=  cigna_TOKEN_ENCRYPTION_KEY
# 2  jane   sr  ...  kk25p0lrp2T54Z3B1HM3ZQN0RM63rjqvewrwW5VhYcI=  cigna_TOKEN_ENCRYPTION_KEY
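
The difference between assigning a boolean Series to a column and indexing with it can be reproduced on a small frame. The following is a minimal sketch; the `raw` string is a hypothetical, shortened version of the test data (first five fields only), and the names `keep`, `overwritten`, and `filtered` are illustrative.

```python
import io

import pandas as pd

# Hypothetical, shortened version of the test data (first five fields only).
raw = (
    "jane|doe|1969-01-01|F|926.0\n"
    "jr|doe|1969-01-01|M|926.0\n"
    "jane|sr|1969-01-01|F|926.0\n"
)
chunk = pd.read_csv(io.StringIO(raw), sep="|", header=None)

# Boolean Series: True where column 0 does NOT contain "jr".
keep = ~chunk[0].str.contains("jr", na=False)

# Assigning the Series overwrites the column values and keeps every row.
overwritten = chunk.copy()
overwritten[0] = keep

# Indexing with the Series keeps only the rows where it is True.
filtered = chunk.loc[keep]

print(overwritten[0].tolist())  # [True, False, True]
print(filtered[0].tolist())     # ['jane', 'jane']
```

The first print mirrors the True/False output from the question; the second shows the rows that survive the filter.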

To combine multiple patterns, call str.join to join a list of items with pipe delimiters:

# ASSIGN BOOLEAN SERIES
fname_jr_sr = ~chunk[0].str.contains("|".join(["sr", "jr"]), na=False)
lname_jr_sr = ~chunk[1].str.contains("|".join(["sr", "jr"]), na=False)

# PASS INTO .loc
chunk_sub = chunk.loc[fname_jr_sr & lname_jr_sr]
chunk_sub
#       0    1   ...                                            9                          10
# 0  jane  doe  ...  kk25p0lrp2T54Z3B1HM3ZQN0RM63rjqvewrwW5VhYcI=  cigna_TOKEN_ENCRYPTION_KEY
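
One caveat with plain substring patterns: "sr" also matches inside ordinary names, and matching is case-sensitive by default. If the suffixes are expected to appear as separate words, a word-boundary pattern with case=False may be closer to the stated requirement. This is only a sketch; the sample names and the suffix list are assumptions, not taken from the real data.

```python
import pandas as pd

# Hypothetical sample names; "israel" contains "sr" but is not a suffix.
names = pd.Series(["jane", "Jr", "israel", "doe II", "irene", None])

# Assumed suffix list; \b anchors each alternative to word boundaries,
# and the non-capturing group (?:...) avoids pandas' capture-group warning.
suffix_pattern = r"\b(?:jr|sr|i|ii|iii)\b"

# Plain substring match: case-sensitive, matches anywhere in the value.
plain = names.str.contains("sr|jr", na=False)

# Word-bounded, case-insensitive match.
bounded = names.str.contains(suffix_pattern, case=False, na=False)

print(plain.tolist())    # [False, False, True, False, False, False]
print(bounded.tolist())  # [False, True, False, True, False, False]
```

The substring version flags "israel" and misses "Jr" and "doe II"; the bounded version does the opposite. Negate either mask with ~ before passing it to .loc, as in the snippets above.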

Relatedly, your np.where call is not necessary, as .loc accepts a boolean Series directly. Be sure to also escape | with a backslash (written \\| in a regular Python string), since the pipe symbol is the regex alternation operator. Altogether:

chunk = chunk.loc[(chunk[0].astype('str').str.len()>1) & 
                  (chunk[1].astype('str').str.len()>1) &
                  (chunk[4].astype('str').str.len()>4) &
                  (chunk[4].astype('str').str.len()<8) & 
                  ~chunk[0].str.contains("|".join(["sr", "jr", "\\|", "\\|\\|"]), na=False) & 
                  ~chunk[1].str.contains("|".join(["sr", "jr", "\\|", "\\|\\|"]), na=False)]

chunk.to_csv("/tmp/sample.csv", sep="|", header=None, index=False)
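
As an alternative to writing the backslashes by hand, re.escape from the standard library can build the pattern from literal tokens. The sketch below applies the combined filter chunk by chunk; the in-memory `raw` data, the `tokens` list, and the helper name `filter_chunk` are hypothetical, and the Postgres insert is left out.

```python
import io
import re

import pandas as pd

# Hypothetical in-memory stand-in for the pipe-delimited file.
raw = (
    "jane|doe|1969-01-01|F|926.0\n"
    "jr|doe|1969-01-01|M|926.0\n"
    "jane|sr|1969-01-01|F|926.0\n"
)

# Literal tokens to reject; re.escape adds backslashes where regex needs them.
tokens = ["sr", "jr", "|"]
pattern = "|".join(re.escape(t) for t in tokens)


def filter_chunk(chunk):
    """Return only the rows that pass the length and token checks."""
    col0 = chunk[0].astype(str)
    col1 = chunk[1].astype(str)
    col4 = chunk[4].astype(str)
    mask = (
        (col0.str.len() > 1)
        & (col1.str.len() > 1)
        & (col4.str.len() > 4)
        & (col4.str.len() < 8)
        & ~col0.str.contains(pattern, na=False)
        & ~col1.str.contains(pattern, na=False)
    )
    return chunk.loc[mask]


# Read in small chunks, as the question does, and filter each one.
reader = pd.read_csv(io.StringIO(raw), sep="|", header=None, chunksize=2)
result = pd.concat(filter_chunk(c) for c in reader)

print(result[[0, 1]].values.tolist())  # [['jane', 'doe']]
```

Keeping the filter in its own function makes it easy to test on a few rows before wiring it into the loop that writes to Postgres.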


