The following is a list of good practices I follow when querying large sets of data.
SELECT *
Avoid selecting all columns from a table. Queries should explicitly list each column they require:
- Changing column definitions on the underlying tables can have an adverse effect on dependencies such as views, or can introduce name clashes
- Avoids reading and returning unnecessary data
- Can help avoid table scans
- Explicitly selecting columns allows the optimizer to better tailor the query plan
- Selecting only the required columns can allow the optimised query to be satisfied from an index alone, without cross-referencing the entire table (see the example below)
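For illustration (the `dbo.Customer` table, its columns, and the covering index are hypothetical):

```sql
-- Avoid: returns every column and is more likely to require a full scan
SELECT *
FROM dbo.Customer;

-- Prefer: list only the columns required; if a nonclustered index covers
-- LastName, FirstName and Email, the optimizer can answer the query from
-- the index alone, without cross-referencing the base table
SELECT CustomerId, FirstName, LastName, Email
FROM dbo.Customer
WHERE LastName = N'Smith';
```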
Temp tables
Avoid `SELECT INTO`:
- It creates the table as a heap, with no clustered index
- It is not supported on Azure SQL Database
Use `CREATE TABLE` instead:
- Create a temp table with a clustered index
- Insert the identifying keys for all the required rows, along with any other columns available from the tables used to build the initial set of keys
- Index the table for use by subsequent queries (see the sketch below)
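A sketch of the two approaches, using a hypothetical `dbo.[Order]` table:

```sql
-- Avoid: SELECT INTO creates the temp table as a heap
-- SELECT o.OrderId, o.CustomerId, o.OrderDate
-- INTO   #Orders
-- FROM   dbo.[Order] AS o;

-- Prefer: define the temp table explicitly with a clustered primary key
CREATE TABLE #Orders
(
    OrderId    int  NOT NULL PRIMARY KEY CLUSTERED,
    CustomerId int  NOT NULL,
    OrderDate  date NOT NULL
);

-- Insert the identifying keys plus other columns from the same tables
INSERT INTO #Orders (OrderId, CustomerId, OrderDate)
SELECT o.OrderId, o.CustomerId, o.OrderDate
FROM dbo.[Order] AS o
WHERE o.OrderDate >= '2024-01-01';

-- Index the table for use by subsequent queries
CREATE NONCLUSTERED INDEX IX_Orders_CustomerId ON #Orders (CustomerId);
```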
Wild cards
Minimise the use of `%` at the beginning of search strings; use more specific filters such as `_` and `[a-e]%` where possible. A `%` at the start of the search string will result in a complete index or table scan, as shown below.
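For example, against a hypothetical `dbo.Customer` table with an index on `LastName`:

```sql
-- Avoid: a leading % defeats the index, so every row must be examined
SELECT CustomerId, LastName
FROM dbo.Customer
WHERE LastName LIKE '%son';

-- Prefer: an anchored pattern lets the optimizer seek on the index
SELECT CustomerId, LastName
FROM dbo.Customer
WHERE LastName LIKE '[a-e]%';
```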
Scalar functions
Avoid scalar functions in `JOIN` and `WHERE` clauses. Calling a function against a column can result in a full table scan, because the function must be applied to the whole data set before the clause can be evaluated.
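A common case is wrapping a date column in a function. Assuming a hypothetical `dbo.[Order]` table with an index on `OrderDate`:

```sql
-- Avoid: YEAR() must be evaluated for every row, so the index on
-- OrderDate cannot be used for a seek
SELECT OrderId
FROM dbo.[Order]
WHERE YEAR(OrderDate) = 2024;

-- Prefer: an equivalent range predicate can seek on the index
SELECT OrderId
FROM dbo.[Order]
WHERE OrderDate >= '2024-01-01'
  AND OrderDate <  '2025-01-01';
```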
A simple set of steps for writing performant queries
When working with large data sets, the preferred approach to writing performant queries is to minimise the use of `OUTER` joins; queries that use `OUTER` joins are often much slower than `INNER` joins. By separating the query as detailed below you can minimise `OUTER` joins and create efficient, performant queries that are also much easier for others to understand.
- Define a temp table to hold all of the columns required in the results; define a clustered primary key for the table based on the natural unique identifier of the data being selected
- Separate the requirements into `INNER` and `OUTER` joins - write the initial `SELECT` combining as many `INNER` joins as possible and filtering the result set to the minimum possible rows; insert the results into the temp table; exclude tables that do not have a useful index for this operation
- `CREATE INDEX`es on field combinations that support subsequent join operations (this may need to be done between the different `UPDATE` queries, dependent on having the data for the index)
- `INNER` join the temp table to each of the remaining tables and `UPDATE` the temp table with the new bits of data
- `SELECT` the final data from the temp table (a worked sketch follows this list)
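All table and column names below (`dbo.[Order]`, `dbo.OrderStatus`, `dbo.Customer`, `dbo.Coupon`) are hypothetical; treat this as a sketch of the pattern rather than a definitive implementation:

```sql
-- 1. Temp table holding all result columns, clustered on the natural key
CREATE TABLE #Result
(
    OrderId      int           NOT NULL PRIMARY KEY CLUSTERED,
    CustomerId   int           NOT NULL,
    OrderDate    date          NOT NULL,
    CustomerName nvarchar(100) NULL,  -- populated later
    CouponCode   nvarchar(20)  NULL   -- populated later; may be absent
);

-- 2. Initial SELECT: combine the INNER joins and filter to the minimum rows
INSERT INTO #Result (OrderId, CustomerId, OrderDate)
SELECT o.OrderId, o.CustomerId, o.OrderDate
FROM dbo.[Order] AS o
INNER JOIN dbo.OrderStatus AS s ON s.OrderStatusId = o.OrderStatusId
WHERE s.StatusCode = 'SHIPPED'
  AND o.OrderDate >= '2024-01-01';

-- 3. Index the field combinations used by the subsequent joins
CREATE NONCLUSTERED INDEX IX_Result_CustomerId ON #Result (CustomerId);

-- 4. INNER join the temp table to each remaining table and UPDATE it
UPDATE r
SET    r.CustomerName = c.CustomerName
FROM   #Result AS r
INNER JOIN dbo.Customer AS c ON c.CustomerId = r.CustomerId;

UPDATE r
SET    r.CouponCode = cp.CouponCode
FROM   #Result AS r
INNER JOIN dbo.Coupon AS cp ON cp.OrderId = r.OrderId;

-- 5. Final SELECT from the temp table
SELECT OrderId, CustomerId, OrderDate, CustomerName, CouponCode
FROM   #Result;
```

Note how the second `UPDATE` stands in for what would otherwise be a `LEFT OUTER JOIN` to `dbo.Coupon`: orders without a coupon are simply left with a `NULL` `CouponCode`.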