Perform matching between two lists in Python
I have two tables. I want to compare two columns and get, for each match, the number of matching rows and their row numbers. How can I get the expected result using Python?
df1:

Name | Score | Year
---|---|---
Pat | 82 | 1990
Chris | 38 | 1993
Pat | 92 | 1994
Noris | 88 | 1997
Mit | 62 | 1999
Chen | 58 | 1996

df2:

Applicant |
---|
Pat |
Chris |
Meet |

Expected result:

Applicant | Match (Y/N) | Matched Row reference | Count
---|---|---|---
Pat | Y | 1,3 | 2
Chris | Y | 2 | 1
Meet | N | NA | 0

Approach based on a pandas outer merge

Code

    import pandas as pd

    def process(df1, df2):
        ' Overall function for generating desired output '

        def create_result(df, columns = ["Match (Y/N)", "Matched Row reference", "Count"]):
            '''
            Creates the desired columns of the result

            Input:
                df      - DataFrame for one group from groupby
                columns - column names for the result
            Output:
                Pandas Series corresponding to one row of the result
            '''
            cnt = df['Name'].count()  # Number of matched df1 rows in group (NaN is not counted)
            if cnt > 0:
                # Convert index to comma delimited list, numbered from 1 (i.e. int(x) + 1)
                indexes = ','.join(str(int(x) + 1) for x in df.index.to_list())
            else:
                indexes = "NA"  # applicant has no match in df1

            lst = ["Y" if cnt > 0 else 'N',
                   indexes,
                   df.shape[0] if cnt > 0 else 0]

            return pd.Series(lst, index = columns)

        # Merge df1 with df2, using the reset_index()/set_index('index') method from
        # [to keep index after merge](https://stackoverflow.com/questions/11976503/how-to-keep-index-when-using-pandas-merge/11982843#11982843)
        # so that the merge result keeps the index of df1
        return (df1
                .reset_index()
                .merge(df2, left_on = "Name", right_on = 'Applicant', how = "outer")
                .set_index('index')
                .groupby(['Applicant'])
                .apply(lambda grp_df: create_result(grp_df)))

Usage

    from io import StringIO

    s = '''Name Score Year
    Pat 82 1990
    Chris 38 1993
    Pat 92 1994
    Noris 88 1997
    Mit 62 1999
    Chen 58 1996'''

    # Columns are separated by whitespace, so split on one or more whitespace characters
    df1 = pd.read_csv(StringIO(s), sep = r'\s+', engine = 'python')

    s = '''Applicant
    Pat
    Chris
    Meet'''

    df2 = pd.read_csv(StringIO(s), sep = r'\s+', engine = 'python')

    print(process(df1, df2))  # process and print result

Output

              Match (Y/N) Matched Row reference  Count
    Applicant
    Chris               Y                     2      1
    Meet                N                    NA      0
    Pat                 Y                   1,3      2
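
The groupby result above is sorted by applicant name rather than following the order of df2. If the original order matters, one possible alternative is to look up the row positions per name first and then map them onto df2. This is only a sketch under that assumption, not part of the answer above; the variable names `rows_by_name`, `matches`, and `result` are made up for illustration, and the sample data is retyped from the question.

```python
import pandas as pd

# Sample data retyped from the question
df1 = pd.DataFrame({
    "Name": ["Pat", "Chris", "Pat", "Noris", "Mit", "Chen"],
    "Score": [82, 38, 92, 88, 62, 58],
    "Year": [1990, 1993, 1994, 1997, 1999, 1996],
})
df2 = pd.DataFrame({"Applicant": ["Pat", "Chris", "Meet"]})

# GroupBy.indices maps each name to the 0-based positions of its rows;
# add 1 so that row numbers start from 1 as in the expected result
rows_by_name = {
    name: [int(pos) + 1 for pos in positions]
    for name, positions in df1.groupby("Name").indices.items()
}

# For every applicant, fetch the list of matching row numbers (empty if none)
matches = df2["Applicant"].map(lambda name: rows_by_name.get(name, []))

# Build the three result columns from those lists, keeping df2's row order
result = df2.assign(**{
    "Match (Y/N)": matches.map(lambda rows: "Y" if rows else "N"),
    "Matched Row reference": matches.map(
        lambda rows: ",".join(str(r) for r in rows) if rows else "NA"
    ),
    "Count": matches.map(len),
})

print(result)
```

Because the lookup is built once from df1, each applicant is resolved with a dictionary access instead of a merge, and the rows of `result` stay in the same order as df2.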

I would use numpy and pandas for this, because I believe pandas is a great library for dealing with large amounts of data. Even though you don't have much data, I would still recommend using pandas.
For information about pandas, see https://pandas.pydata.org/
You can create a DataFrame from lists with pandas:

    data = {'Name': ListForName,
            'Score': ListForScore,
            'Year': ListForYear}
    df = pd.DataFrame(data)

For more information about creating a DataFrame, see https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html
I would use a basic for loop for matching. For example:

    match = 0
    # Compare every row of the first table with every row of the second
    for i in range(len(FirstList)):
        for j in range(len(SecondList)):
            if FirstList['Column'].iloc[i] == SecondList['Column'].iloc[j]:
                match += 1
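
The loop above only counts matches in total. To also get the row numbers and a per-applicant count as in the expected result, the same nested-loop idea could be extended roughly as follows. This is a minimal sketch using plain Python lists, with the data retyped from the question; the function name `match_applicants` is hypothetical.

```python
def match_applicants(names, applicants):
    """Return one result dict per applicant, in the order of `applicants`."""
    results = []
    for applicant in applicants:
        rows = []
        # Row numbers start from 1, as in the expected result
        for row_number, name in enumerate(names, start=1):
            if name == applicant:
                rows.append(row_number)
        results.append({
            "Applicant": applicant,
            "Match (Y/N)": "Y" if rows else "N",
            "Matched Row reference": ",".join(str(r) for r in rows) if rows else "NA",
            "Count": len(rows),
        })
    return results


names = ["Pat", "Chris", "Pat", "Noris", "Mit", "Chen"]  # df1 "Name" column
applicants = ["Pat", "Chris", "Meet"]                    # df2 "Applicant" column

for row in match_applicants(names, applicants):
    print(row)
```

With DataFrames, the two lists could be obtained with `df1['Name'].tolist()` and `df2['Applicant'].tolist()`. The nested loop compares every pair of rows, so it is fine for small tables but becomes slow as both tables grow.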