Though "NOT EXISTS" and "NOT IN" sounds similar there is quite a lot of difference between them. To start with check out the blog post by Mladen.
In continuation to what Mladen has already written I thought I would show the differences it makes on the Execution Plan and the IO / Time when we use NOT EXISTS or NOT IN in our queries. Let's see few of the differences between them.
Case 1: Lets use them in columns which are declared as NOT NULL
SET NOCOUNT ON
GO
CREATE TABLE PackageInformation
(
Sno INT IDENTITY(1,1) PRIMARY KEY,
PackageID INT NOT NULL,
PackageName VARCHAR(20)
)
GO
-- I am using the random records generator which I wrote few days back to populate data into this table.
-- generating 10000 records
INSERT INTO PackageInformation (PackageID, PackageName)
SELECT CAST(RAND(CHECKSUM(NEWID())) * 10000 AS INT),
dbo.udf_StringGenerator('A', 20)
GO 10000
CREATE TABLE ChildTable
(
RID INT IDENTITY(1,1) PRIMARY KEY,
PackageID INT NOT NULL
)
GO
-- Lets take some 40% (approx 4000 records) from the PackageInformation table to populate this table
INSERT INTO ChildTable (PackageID)
SELECT PackageID
FROM dbo.PackageInformation
TABLESAMPLE (40 PERCENT); -- This would work only on SQL Server version 2005 or above
GO
--Press Control + M to display the Actual Execution Plan of the queries
Result:
I think it would be safe to say that we should use NOT EXISTS instead of NOT IN as it seems to work as expected by us in all the scenarios which we saw in this post.
--Cleanup
DROP TABLE ChildTable
GO
DROP TABLE PACKAGEINFORMATION
GO
In continuation to what Mladen has already written I thought I would show the differences it makes on the Execution Plan and the IO / Time when we use NOT EXISTS or NOT IN in our queries. Let's see few of the differences between them.
Case 1: Lets use them in columns which are declared as NOT NULL
SET NOCOUNT ON
GO
CREATE TABLE PackageInformation
(
Sno INT IDENTITY(1,1) PRIMARY KEY,
PackageID INT NOT NULL,
PackageName VARCHAR(20)
)
GO
-- I am using the random records generator which I wrote few days back to populate data into this table.
-- generating 10000 records
INSERT INTO PackageInformation (PackageID, PackageName)
SELECT CAST(RAND(CHECKSUM(NEWID())) * 10000 AS INT),
dbo.udf_StringGenerator('A', 20)
GO 10000
CREATE TABLE ChildTable
(
RID INT IDENTITY(1,1) PRIMARY KEY,
PackageID INT NOT NULL
)
GO
-- Lets take some 40% (approx 4000 records) from the PackageInformation table to populate this table
INSERT INTO ChildTable (PackageID)
SELECT PackageID
FROM dbo.PackageInformation
TABLESAMPLE (40 PERCENT); -- This would work only on SQL Server version 2005 or above
GO
Let's write a query to list Package details from PackageInformation table which is not present in ChildTable.
--Clear out the cache (DONT TRY THIS IN PRODUCTION ENVIRONMENT)
DBCC FREEPROCCACHE
GO
SET STATISTICS IO ON
SET STATISTICS TIME ON
GO
--Press Control + M to display the Actual Execution Plan of the queries
--Query1: Using NOT IN
SELECT PackageID, PackageName
FROM dbo.PackageInformation
WHERE PackageID NOT IN (SELECT PackageID FROM ChildTable)
GO
--Query2: Using NOT EXISTS
SELECT PackageID, PackageName
FROM dbo.PackageInformation
WHERE NOT EXISTS
(
SELECT PackageID FROM ChildTable
WHERE ChildTable.PackageID = PackageInformation.PackageID
)
GO
SET STATISTICS IO OFF
SET STATISTICS TIME OFF
GO
Result:
i) Query using NOT IN : Returned 4690 records
ii) Query using NOT EXISTS : Returned 4690 records
iii) Let's see the Actual execution plan for both the queries.
iv) Let's also check on the logical reads and CPU time taken for these queries.
So when the column is declared as NOT NULL then both NOT IN and NOT EXISTS seems to perform the same way.
Case 2: Let's change the PackageID column as NULL
ALTER TABLE dbo.PackageInformation
ALTER COLUMN PackageID INT NULL
GO
ALTER TABLE dbo.ChildTable
ALTER COLUMN PackageID INT NULL
GO
--Lets insert some 100 null values into the PackageInformation Table
INSERT INTO PackageInformation
SELECT NULL,
dbo.udf_StringGenerator('A', 5)
GO 100
Let's run the same query which we used in Case 1 to list Package details from PackageInformation table which is not present in ChildTable.
--Clear out the cache (DONT TRY THIS IN PRODUCTION ENVIRONMENT)
DBCC FREEPROCCACHE
GO
SET STATISTICS IO ON
SET STATISTICS TIME ON
GO
--Query1: Using NOT IN
SELECT PackageID, PackageName
FROM dbo.PackageInformation
WHERE PackageID NOT IN (SELECT PackageID FROM ChildTable)
GO
--Query2: Using NOT EXISTS
SELECT PackageID, PackageName
FROM dbo.PackageInformation
WHERE NOT EXISTS
(
SELECT PackageID FROM ChildTable
WHERE ChildTable.PackageID = PackageInformation.PackageID
)
GO
SET STATISTICS IO OFF
SET STATISTICS TIME OFF
GO
Result:
i) Query using NOT IN : Returned 4690 records (It hasn't considered those 100 new NULL records which we added!!)
ii) Query using NOT EXISTS : Returned 4790 records
iii) Let's see the Actual execution plan for both the queries.
iv) Let's also check on the logical reads and CPU time taken for these queries.
So when column is declared as NULL then NOT IN seems to generate a pretty complicated execution plan and does NUMEROUS number of logical reads more than NOT EXISTS. So the winner here is NOT EXISTS.
Case 3: Adding NULL values into ChildTable
INSERT INTO ChildTable (PackageID)
SELECT NULL
Result:
i) Query using NOT IN : Returned 0 records!
ii) Query using NOT EXISTS : Returned 4790 records
So if the Subquery returns even one NULL then NOT IN operator would not return any result which isn't right. So again the winner is NOT EXISTS.
I think it would be safe to say that we should use NOT EXISTS instead of NOT IN as it seems to work as expected by us in all the scenarios which we saw in this post.
--Cleanup
DROP TABLE ChildTable
GO
DROP TABLE PACKAGEINFORMATION
GO
Comments