Use VLOOKUP using Python
I have four columns in my Excel data file:
CUI ICD9/10 Out Lookup
C0161894 39 4000001 C0000005
C0029730 398 4000002 C0000039
C0176693 398 4000003 C0000052
C0029730 3989 4000004 C0000074
I want to match the 4th column against the 1st column and get the 3rd column as the output using Python. Because the data is large, I want to do the equivalent of a VLOOKUP, but I don't have a specific value to look up: I need to search the whole column.
If I understand you correctly, you want to compare the values in column 4 and column 1, and if they are equal, output a new column with the value from column 3.
To do this, simply use np.where as follows:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'CUI':['C0161894','C0029730','C0176693','C0029730','C0000074'],
'ICD9/10':[39,398,398,3989,3989],
'Out':[4000001,4000002,4000003,4000004,4000005],
'Lookup':['C0000005','C0000039','C0000052','C0000074','C0000074']})
df1['Match'] = np.where(df1.Lookup == df1.CUI, df1.Out, 'No Match')
Output:
CUI ICD9/10 Lookup Out Match
0 C0161894 39 C0000005 4000001 No Match
1 C0029730 398 C0000039 4000002 No Match
2 C0176693 398 C0000052 4000003 No Match
3 C0029730 3989 C0000074 4000004 No Match
4 C0000074 3989 C0000074 4000005 4000005
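Note that np.where compares the two columns row by row. If the goal is closer to Excel's VLOOKUP, where each Lookup value is searched for anywhere in the CUI column, one possible approach is Series.map. The following is only a minimal sketch on the same sample frame. Keeping the first Out value for a duplicated CUI is an assumption made here to mirror VLOOKUP's first-match behaviour, and the names df and lookup_table are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'CUI': ['C0161894', 'C0029730', 'C0176693', 'C0029730', 'C0000074'],
    'ICD9/10': [39, 398, 398, 3989, 3989],
    'Out': [4000001, 4000002, 4000003, 4000004, 4000005],
    'Lookup': ['C0000005', 'C0000039', 'C0000052', 'C0000074', 'C0000074'],
})

# Build a CUI -> Out table. Duplicated CUIs keep their first Out value
# (assumption: this mirrors VLOOKUP returning the first match).
lookup_table = df.drop_duplicates(subset='CUI').set_index('CUI')['Out']

# For every Lookup value, fetch the Out value of the row whose CUI equals it.
# Lookup values that never occur in CUI become NaN.
df['Match'] = df['Lookup'].map(lookup_table)

print(df[['Lookup', 'Match']])
```

Rows 3 and 4 both look up C0000074, which appears as a CUI in the last row, so both receive 4000005. The other rows have no matching CUI and are left as NaN.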
Edit:
In response to your comment, you can use the chunksize parameter in pandas.read_csv to read in only parts of your dataframe at a time:
For a csv file data as follows:
CUI ICD9/10 Lookup Out
C0161894 39 C0000005 4000001
C0029730 398 C0000039 4000002
C0176693 398 C0000052 4000003
C0029730 3989 C0000074 4000004
C0000074 3989 C0000074 4000005
See https://stackoverflow.com/a/25962187/2254228. You can do:
chunksize = 1000
for chunk in pd.read_csv('data.csv', chunksize=chunksize):
    # process(chunk) using the solution above
    chunk['Match'] = np.where(chunk.Lookup == chunk.CUI, chunk.Out, 'No Match')
    # then write the chunk to a new csv using `chunk.to_csv(...)`
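As a runnable illustration of the chunked loop, here is a small sketch that reads the sample data from an in-memory string instead of a file. The comma-separated layout, the chunk size of 2, and the names csv_text, processed and result are assumptions chosen for the example.

```python
import io

import numpy as np
import pandas as pd

# Stand-in for a file on disk (assumption: comma-separated with a header row).
csv_text = """CUI,ICD9/10,Lookup,Out
C0161894,39,C0000005,4000001
C0029730,398,C0000039,4000002
C0176693,398,C0000052,4000003
C0029730,3989,C0000074,4000004
C0000074,3989,C0000074,4000005
"""

processed = []
# With chunksize set, read_csv yields DataFrames of at most that many rows.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    # Same row-by-row comparison as in the np.where solution.
    chunk['Match'] = np.where(chunk['Lookup'] == chunk['CUI'], chunk['Out'], 'No Match')
    processed.append(chunk)

# Stitch the chunks back together (only sensible if the result fits in memory).
result = pd.concat(processed)
print(result)
```

For a file that is genuinely too large for memory, each chunk would be written out inside the loop instead of being collected in a list.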
Edit 2: Here is a complete code sample. Replace the file names data and new_data with the names of your own data files, and replace the file paths with your own paths. Reading the file in chunks avoids memory errors caused by the data file being too big.
For a sample data.csv:
CUI ICD9/10 Lookup Out
C0161894 39 C0000005 4000001
C0029730 398 C0000039 4000002
C0176693 398 C0000052 4000003
C0029730 3989 C0000074 4000004
C0000074 3989 C0000074 4000005
Create a target csv file new_data as an empty csv file to store your new data frame:
CUI ICD9/10 Lookup Out
Then import the old data, splitting it into chunks, where chunksize is the number of lines of the file to read in at a time:
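import numpy as np
import pandas as pd

# Read in line by line = set chunksize = 1 (larger chunks are much faster)
chunksize = 1
# Open the new file; "w" overwrites any existing content
with open("Pathtonewfile/new_data.csv", "w", newline="") as f:
    # Iterate over the old data.csv file, reading in one chunk at a time
    reader = pd.read_csv('Pathtooldfile/data.csv', index_col=False, chunksize=chunksize)
    for i, chunk in enumerate(reader):
        # Carry out the lookup calculation as above
        chunk['Match'] = np.where(chunk.Lookup == chunk.CUI, chunk.Out, 'No Match')
        # Write the chunk to "new_data.csv"; the header is written for the first chunk only.
        # `columns` (not `cols`) selects the output columns, including the new 'Match'.
        chunk.to_csv(f, header=(i == 0), index=False,
                     columns=['CUI', 'ICD9/10', 'Lookup', 'Out', 'Match'])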
This gives you output in new_data.csv as follows (shown as a DataFrame):
CUI ICD9/10 Lookup Out Match
0 C0161894 39 C0000005 4000001 No Match
1 C0029730 398 C0000039 4000002 No Match
2 C0176693 398 C0000052 4000003 No Match
3 C0029730 3989 C0000074 4000004 No Match
4 C0000074 3989 C0000074 4000005 4000005
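If each Lookup value has to be searched for in the whole CUI column rather than only in its own row, a chunk on its own does not hold all the candidate keys. One possible way around this, sketched below, is a two-pass read: first load only the CUI and Out columns to build a lookup table, then stream the file in chunks and map each Lookup value through it. This assumes those two columns fit in memory, that the file is comma-separated, and that the first Out value wins for a duplicated CUI. The temporary paths and the names src, dst and lookup_table are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Write a small sample file so the sketch is self-contained (hypothetical paths).
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, 'data.csv')
dst = os.path.join(workdir, 'new_data.csv')
with open(src, 'w') as f:
    f.write(
        'CUI,ICD9/10,Lookup,Out\n'
        'C0161894,39,C0000005,4000001\n'
        'C0029730,398,C0000039,4000002\n'
        'C0176693,398,C0000052,4000003\n'
        'C0029730,3989,C0000074,4000004\n'
        'C0000074,3989,C0000074,4000005\n'
    )

# Pass 1: read only the key and value columns and build a CUI -> Out table.
# Assumption: these two columns fit in memory even if the full file does not.
keys = pd.read_csv(src, usecols=['CUI', 'Out'])
lookup_table = keys.drop_duplicates(subset='CUI').set_index('CUI')['Out']

# Pass 2: stream the file in chunks and look every Lookup value up in the table.
for i, chunk in enumerate(pd.read_csv(src, chunksize=2)):
    chunk['Match'] = chunk['Lookup'].map(lookup_table)
    # Create the file with a header for the first chunk, then append the rest.
    chunk.to_csv(dst, mode='w' if i == 0 else 'a', header=(i == 0), index=False)

result = pd.read_csv(dst)
print(result)
```

Lookup values with no matching CUI end up as empty cells in the output file, which pandas reads back as NaN.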