SQL equivalent of R query
I have two data sets, author_data and paper_author.
author_data:
author_id  author_name        author_affiliation
25         William H. Nailon
37         P. B. Littlewood   Cavendish Laboratory|Cambridge University
44         A. Kuroiwa         Department of Molecular Biology
paper_author:
paper_id  author_id  author_name     author_affiliation
1         521630     Ayman Kaheel    Cairo Microsoft Innovation Lab
1         972575     Mahmoud Refaat  Cairo Microsoft Innovation Lab
I have run the following query in R:
author_data[which(author_data$author_id %in% paper_author$author_id &
author_data$author_name %in% paper_author$author_name &
author_data$author_affiliation %in% paper_author$author_affiliation), ]
That is, I want to find the matches between author_data and paper_author for which the three columns author_id, author_name and author_affiliation match.
I have written a query to get this result in SQL, but I am not getting it right. The query I tried is:
statement <- "select
author_data.author_id,
author_data.author_name,
author_data.author_affiliation
FROM author_data
INNER JOIN paper_author
ON author_data.author_id = paper_author.author_id
AND author_data.author_name = paper_author.author_name
AND author_data.author_affiliation = paper_author.author_affiliation"
With this query I get more rows than there are in author_data, whereas the result should be a subset of author_data. I am not able to figure out what is wrong, as I am new to SQL.
What is wrong with this query?
Thanks
There is a difference between which in R and join in SQL. While which effectively subsets the given data frame, join returns one row for every pair of rows that meets the join condition. I am almost sure that in your case the combination author_id, author_name, author_affiliation occurs multiple times in paper_author. As a result, each matching row in author_data is repeated once for every matching row in paper_author.
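The row multiplication can be reproduced on a toy example. The following is a minimal sketch in Python using the standard-library sqlite3 module; the rows below are made-up illustration data (author 37 listed on two hypothetical papers), not the asker's actual tables.

```python
import sqlite3

# Hypothetical toy data: author 37 appears on two papers in paper_author.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE author_data (author_id INTEGER, author_name TEXT, author_affiliation TEXT);
    CREATE TABLE paper_author (paper_id INTEGER, author_id INTEGER,
                               author_name TEXT, author_affiliation TEXT);
    INSERT INTO author_data VALUES
        (37, 'P. B. Littlewood', 'Cavendish Laboratory'),
        (44, 'A. Kuroiwa', 'Department of Molecular Biology');
    INSERT INTO paper_author VALUES
        (1, 37, 'P. B. Littlewood', 'Cavendish Laboratory'),
        (2, 37, 'P. B. Littlewood', 'Cavendish Laboratory');
""")

# Same shape as the query in the question: plain INNER JOIN on all three columns.
join_rows = conn.execute("""
    SELECT a.author_id, a.author_name, a.author_affiliation
    FROM author_data a
    INNER JOIN paper_author p
        ON a.author_id = p.author_id
       AND a.author_name = p.author_name
       AND a.author_affiliation = p.author_affiliation
""").fetchall()

# The single matching author_data row comes back once per matching paper.
print(len(join_rows))
for row in join_rows:
    print(row)
```

With this data the join returns the same author twice, which is how the result can end up with more rows than author_data.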
Your query was almost correct; you need to add distinct or group by, or use exists:
Distinct:
select
distinct
author_data.author_id,
author_data.author_name,
author_data.author_affiliation
from
author_data
INNER JOIN paper_author
ON author_data.author_id = paper_author.author_id
AND author_data.author_name = paper_author.author_name
AND author_data.author_affiliation = paper_author.author_affiliation
Group by:
select
author_data.author_id,
author_data.author_name,
author_data.author_affiliation
from
author_data
INNER JOIN paper_author
ON author_data.author_id = paper_author.author_id
AND author_data.author_name = paper_author.author_name
AND author_data.author_affiliation = paper_author.author_affiliation
group by
author_data.author_id,
author_data.author_name,
author_data.author_affiliation
You can also use exists:
select
author_data.author_id,
author_data.author_name,
author_data.author_affiliation
from
author_data
where
exists (select 1 from paper_author where
author_data.author_id = paper_author.author_id
AND author_data.author_name = paper_author.author_name
AND author_data.author_affiliation = paper_author.author_affiliation
)
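To compare the three variants side by side, here is a minimal sketch in Python with the standard-library sqlite3 module. The table contents are made-up illustration data, and the aliases a and p and the helper names COLS, MATCH and queries are introduced only for this example.

```python
import sqlite3

# Hypothetical toy data: author 37 appears twice in paper_author.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE author_data (author_id INTEGER, author_name TEXT, author_affiliation TEXT);
    CREATE TABLE paper_author (paper_id INTEGER, author_id INTEGER,
                               author_name TEXT, author_affiliation TEXT);
    INSERT INTO author_data VALUES
        (37, 'P. B. Littlewood', 'Cavendish Laboratory'),
        (44, 'A. Kuroiwa', 'Department of Molecular Biology');
    INSERT INTO paper_author VALUES
        (1, 37, 'P. B. Littlewood', 'Cavendish Laboratory'),
        (2, 37, 'P. B. Littlewood', 'Cavendish Laboratory');
""")

COLS = "a.author_id, a.author_name, a.author_affiliation"
MATCH = ("a.author_id = p.author_id "
         "AND a.author_name = p.author_name "
         "AND a.author_affiliation = p.author_affiliation")

queries = {
    "distinct": f"SELECT DISTINCT {COLS} FROM author_data a "
                f"INNER JOIN paper_author p ON {MATCH}",
    "group by": f"SELECT {COLS} FROM author_data a "
                f"INNER JOIN paper_author p ON {MATCH} GROUP BY {COLS}",
    "exists":   f"SELECT {COLS} FROM author_data a "
                f"WHERE EXISTS (SELECT 1 FROM paper_author p WHERE {MATCH})",
}

# Run each variant and collect its rows.
results = {name: conn.execute(sql).fetchall() for name, sql in queries.items()}
for name, rows in results.items():
    print(name, rows)
```

On this data all three return the matching author exactly once. One difference worth keeping in mind: distinct and group by also collapse rows that are duplicated within author_data itself, while exists keeps every author_data row that has a match, which is closer to how subsetting a data frame in R behaves.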
Try this.
SELECT author_data.author_id,author_data.author_name,author_data.author_affiliation
FROM author_data, paper_author
WHERE author_data.author_id = paper_author.author_id
AND author_data.author_name=paper_author.author_name
AND author_data.author_affiliation=paper_author.author_affiliation
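A comma-separated FROM list with the conditions in WHERE is the older implicit syntax for an inner join, so this query would be expected to return the same repeated rows as the query in the question whenever paper_author repeats a combination. A minimal check in Python with the standard-library sqlite3 module, on made-up illustration data:

```python
import sqlite3

# Hypothetical toy data: author 37 appears twice in paper_author.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE author_data (author_id INTEGER, author_name TEXT, author_affiliation TEXT);
    CREATE TABLE paper_author (paper_id INTEGER, author_id INTEGER,
                               author_name TEXT, author_affiliation TEXT);
    INSERT INTO author_data VALUES
        (37, 'P. B. Littlewood', 'Cavendish Laboratory'),
        (44, 'A. Kuroiwa', 'Department of Molecular Biology');
    INSERT INTO paper_author VALUES
        (1, 37, 'P. B. Littlewood', 'Cavendish Laboratory'),
        (2, 37, 'P. B. Littlewood', 'Cavendish Laboratory');
""")

# Implicit (comma) join, as in the answer above.
implicit_rows = conn.execute("""
    SELECT author_data.author_id, author_data.author_name, author_data.author_affiliation
    FROM author_data, paper_author
    WHERE author_data.author_id = paper_author.author_id
      AND author_data.author_name = paper_author.author_name
      AND author_data.author_affiliation = paper_author.author_affiliation
""").fetchall()

# Explicit INNER JOIN, as in the question.
explicit_rows = conn.execute("""
    SELECT author_data.author_id, author_data.author_name, author_data.author_affiliation
    FROM author_data
    INNER JOIN paper_author
        ON author_data.author_id = paper_author.author_id
       AND author_data.author_name = paper_author.author_name
       AND author_data.author_affiliation = paper_author.author_affiliation
""").fetchall()

# Both forms produce the same rows, duplicates included.
print(sorted(implicit_rows) == sorted(explicit_rows))
print(len(implicit_rows))
```

Adding distinct to the select list, or switching to exists, removes the repeats in either form.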