SQL equivalent of an R query

Question

I have two data sets, author_data and paper_author.

author_data:

author_id    author_name          author_affiliation
25           William H. Nailon
37           P. B. Littlewood     Cavendish Laboratory|Cambridge University
44           A. Kuroiwa           Department of Molecular Biology

paper_author:

paper_id    author_id    author_name       author_affiliation
1           521630       Ayman Kaheel      Cairo Microsoft Innovation Lab
1           972575       Mahmoud Refaat    Cairo Microsoft Innovation Lab

I ran the following query in R:

author_data[which(author_data$author_id %in% paper_author$author_id &
                  author_data$author_name %in% paper_author$author_name & 
                  author_data$author_affiliation %in% paper_author$author_affiliation), ]

That is, I want to find the matches between author_data and paper_author for which the three columns author_id, author_name and author_affiliation all match.

I have written a query to get this result in SQL, but I am not getting it right. The query I tried is:

statement <- "select
              author_data.author_id,
              author_data.author_name,
              author_data.author_affiliation
        FROM author_data
        INNER JOIN paper_author
          ON author_data.author_id = paper_author.author_id
            AND author_data.author_name = paper_author.author_name
            AND author_data.author_affiliation = paper_author.author_affiliation"

With this query I get more rows than there are in author_data, even though the result should be a subset of author_data. I cannot figure out what is wrong, as I am new to SQL.

What is wrong with this query?

Thanks

Answer 1

There is a difference between which in R and join in SQL. While which effectively subsets the given data frame, a join returns every pair of rows for which the join condition is met. I am almost sure that in your case the combination of author_id, author_name and author_affiliation occurs multiple times in paper_author. As a result, each row in author_data is repeated once for every matching row in paper_author.
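To see the multiplication concretely, here is a minimal sketch that runs the join from the question against an in-memory SQLite database. Python's sqlite3 module is used only as a convenient harness. The two author_data rows come from the question; the three paper_author rows (the same author on papers 1, 2 and 3) are hypothetical data made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE author_data (
        author_id INTEGER, author_name TEXT, author_affiliation TEXT);
    CREATE TABLE paper_author (
        paper_id INTEGER, author_id INTEGER,
        author_name TEXT, author_affiliation TEXT);

    -- Two rows taken from the question's author_data sample.
    INSERT INTO author_data VALUES
        (37, 'P. B. Littlewood', 'Cavendish Laboratory|Cambridge University'),
        (44, 'A. Kuroiwa', 'Department of Molecular Biology');

    -- Hypothetical: the same author appears on three different papers.
    INSERT INTO paper_author VALUES
        (1, 37, 'P. B. Littlewood', 'Cavendish Laboratory|Cambridge University'),
        (2, 37, 'P. B. Littlewood', 'Cavendish Laboratory|Cambridge University'),
        (3, 37, 'P. B. Littlewood', 'Cavendish Laboratory|Cambridge University');
""")

# The join from the question, unchanged.
join_sql = """
    SELECT author_data.author_id,
           author_data.author_name,
           author_data.author_affiliation
    FROM author_data
    INNER JOIN paper_author
      ON author_data.author_id = paper_author.author_id
     AND author_data.author_name = paper_author.author_name
     AND author_data.author_affiliation = paper_author.author_affiliation
"""
joined_rows = conn.execute(join_sql).fetchall()
author_count = conn.execute("SELECT COUNT(*) FROM author_data").fetchone()[0]

# One output row per matching paper_author row, so the single matching
# author is returned three times and the result outgrows author_data.
print(len(joined_rows), "joined rows vs", author_count, "rows in author_data")
```

With this made-up data the join returns the same author three times, which is more rows than author_data holds.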

Your query was almost correct; you need to add distinct or group by, or use exists:

Distinct:

select
   distinct
   author_data.author_id,
   author_data.author_name,
   author_data.author_affiliation
from
   author_data
   INNER JOIN paper_author
          ON author_data.author_id = paper_author.author_id
            AND author_data.author_name = paper_author.author_name
            AND author_data.author_affiliation = paper_author.author_affiliation

Group by:

select
   author_data.author_id,
   author_data.author_name,
   author_data.author_affiliation
from
   author_data
   INNER JOIN paper_author
          ON author_data.author_id = paper_author.author_id
            AND author_data.author_name = paper_author.author_name
            AND author_data.author_affiliation = paper_author.author_affiliation
group by
   author_data.author_id,
   author_data.author_name,
   author_data.author_affiliation

You can also use exists:

select
   author_data.author_id,
   author_data.author_name,
   author_data.author_affiliation
from
   author_data
where
   exists (select 1 from paper_author where
       author_data.author_id = paper_author.author_id
       AND author_data.author_name = paper_author.author_name
       AND author_data.author_affiliation = paper_author.author_affiliation
       )
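As a quick check that the three variants agree, the sketch below runs each of them against an in-memory SQLite database. Python's sqlite3 module is only a harness here. The author_data rows are from the question, while the paper_author rows are hypothetical and chosen so that one author matches twice and another once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE author_data (
        author_id INTEGER, author_name TEXT, author_affiliation TEXT);
    CREATE TABLE paper_author (
        paper_id INTEGER, author_id INTEGER,
        author_name TEXT, author_affiliation TEXT);

    -- Rows from the question (empty affiliation stored as '').
    INSERT INTO author_data VALUES
        (25, 'William H. Nailon', ''),
        (37, 'P. B. Littlewood', 'Cavendish Laboratory|Cambridge University'),
        (44, 'A. Kuroiwa', 'Department of Molecular Biology');

    -- Hypothetical: author 37 on two papers, author 44 on one, 25 on none.
    INSERT INTO paper_author VALUES
        (1, 37, 'P. B. Littlewood', 'Cavendish Laboratory|Cambridge University'),
        (2, 37, 'P. B. Littlewood', 'Cavendish Laboratory|Cambridge University'),
        (2, 44, 'A. Kuroiwa', 'Department of Molecular Biology');
""")

cols = ("author_data.author_id, author_data.author_name, "
        "author_data.author_affiliation")
match = """
    author_data.author_id = paper_author.author_id
    AND author_data.author_name = paper_author.author_name
    AND author_data.author_affiliation = paper_author.author_affiliation
"""

queries = {
    "distinct": f"SELECT DISTINCT {cols} FROM author_data "
                f"INNER JOIN paper_author ON {match}",
    "group by": f"SELECT {cols} FROM author_data "
                f"INNER JOIN paper_author ON {match} GROUP BY {cols}",
    "exists":   f"SELECT {cols} FROM author_data "
                f"WHERE EXISTS (SELECT 1 FROM paper_author WHERE {match})",
}

# Sort each result because none of the queries specifies an ORDER BY.
results = {name: sorted(conn.execute(sql).fetchall())
           for name, sql in queries.items()}

for name, rows in results.items():
    print(name, [row[0] for row in rows])
```

On this data all three return authors 37 and 44 exactly once. One difference worth keeping in mind: if author_data itself contained duplicate rows, exists would keep them, whereas distinct and group by would collapse them.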

Answer 2

Try this.

SELECT author_data.author_id,author_data.author_name,author_data.author_affiliation
FROM author_data, paper_author
WHERE author_data.author_id = paper_author.author_id 
AND author_data.author_name=paper_author.author_name 
AND author_data.author_affiliation=paper_author.author_affiliation

2 answers

Answer 1: 1, accepted, 2014-03-20 17:28:30

Answer 2: 0, 2014-03-20 15:27:34
