The following is a list of good practices I follow when querying large sets of data.
Avoid selecting all columns from a table. Queries should explicitly list each column they require
- Changing column definitions on the underlying tables can have an adverse effect on dependencies such as views, and can introduce name clashes
- Avoid reading and returning unnecessary data
- Can help avoid table scans
- Explicitly selecting columns allows the optimizer to better tailor the query plan
- Explicitly selecting columns can allow the query to be satisfied from an index alone, without needing to read the rest of the table
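As a minimal sketch of the difference, assuming a hypothetical Orders table with a nonclustered index on CustomerId that includes OrderDate and TotalAmount:

```sql
-- Avoid: reads every column, so the index alone cannot satisfy the query
SELECT * FROM dbo.Orders WHERE CustomerId = 42;

-- Prefer: the explicit column list can be served entirely from a
-- covering index on (CustomerId) INCLUDE (OrderDate, TotalAmount)
SELECT OrderId, OrderDate, TotalAmount
FROM dbo.Orders
WHERE CustomerId = 42;
```

The table, columns, and index here are illustrative only; the point is that the second query gives the optimizer the option of an index-only plan.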
Avoid using SELECT ... INTO to create temporary tables
- creates a heap
- no clustered index
- not supported on Azure
- Create a temp table with a clustered index
- Insert the identifying keys for all the data, including any other columns from the same tables used to build the initial set of keys
- Index the table for use by subsequent queries
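A sketch of this pattern in T-SQL, using hypothetical table and column names (dbo.Orders, OrderId, CustomerId):

```sql
-- Explicit temp table with a clustered primary key, instead of SELECT ... INTO
CREATE TABLE #OrderKeys
(
    OrderId    INT  NOT NULL PRIMARY KEY CLUSTERED,
    CustomerId INT  NOT NULL,
    OrderDate  DATE NULL
);

-- Insert the identifying keys plus columns available from the same tables
INSERT INTO #OrderKeys (OrderId, CustomerId, OrderDate)
SELECT o.OrderId, o.CustomerId, o.OrderDate
FROM dbo.Orders AS o
WHERE o.OrderDate >= '2024-01-01';

-- Index the temp table to support the joins in subsequent queries
CREATE NONCLUSTERED INDEX IX_OrderKeys_CustomerId ON #OrderKeys (CustomerId);
```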
Minimise the use of % at the beginning of search strings; use more specific filters such as [a-e]% where possible. A % at the start of the search string will result in a complete index or table scan
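For illustration, with a hypothetical Customers table indexed on Surname:

```sql
-- Avoid: the leading wildcard forces a scan of the whole index or table
SELECT CustomerId FROM dbo.Customers WHERE Surname LIKE '%son';

-- Prefer: a bounded pattern that can seek on an index over Surname
SELECT CustomerId FROM dbo.Customers WHERE Surname LIKE '[a-e]%';
```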
Avoid scalar functions in WHERE clauses. Calling a function can result in a full table scan, as the function needs to be applied to the whole data set before the clause can be evaluated.
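A common example of this, assuming a hypothetical Orders table with an index on OrderDate, is wrapping the column in a function rather than rewriting the filter as a range:

```sql
-- Avoid: YEAR() must run for every row, so the index on OrderDate is not used
SELECT OrderId FROM dbo.Orders WHERE YEAR(OrderDate) = 2024;

-- Prefer: an equivalent range predicate the optimizer can seek on
SELECT OrderId
FROM dbo.Orders
WHERE OrderDate >= '2024-01-01' AND OrderDate < '2025-01-01';
```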
A simple set of steps for writing performant queries
When working with large data sets the preferred approach to writing performant queries is to minimise the use of OUTER joins. Queries that use OUTER joins are often much slower than those using only INNER joins. By separating the query as detailed below you can minimise the OUTER joins required and create efficient, performant queries that are also much easier for others to understand.
- Define a temp table to hold all of the columns required in the results; define a clustered primary key for the table based on the natural unique identifier of the data being selected
- Separate the requirements into a series of discrete steps
- Write the initial SELECT, combining as many INNER joins as possible and filtering the result set to the minimum possible rows; insert the results into the temp table; exclude tables that do not have a useful index for this operation
- CREATE INDEXes on field combinations that support subsequent join operations (this may need to be done between the different UPDATE queries dependent on having the data for the index)
- INNER join the temp table to each of the remaining tables and UPDATE the temp table with the new bits of data
- SELECT the final data from the temp table
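The steps above can be sketched end to end in T-SQL. All table and column names here (dbo.Orders, dbo.Customers, dbo.Shipments and their columns) are hypothetical:

```sql
-- Step 1: temp table holding all result columns, with a clustered primary key
-- on the natural unique identifier of the data being selected
CREATE TABLE #Results
(
    OrderId      INT  NOT NULL PRIMARY KEY CLUSTERED,
    CustomerId   INT  NOT NULL,
    OrderDate    DATE NULL,
    CustomerName NVARCHAR(100) NULL,
    ShipCity     NVARCHAR(50)  NULL
);

-- Step 2: initial SELECT using only INNER joins, filtered to the minimum rows
INSERT INTO #Results (OrderId, CustomerId, OrderDate)
SELECT o.OrderId, o.CustomerId, o.OrderDate
FROM dbo.Orders AS o
INNER JOIN dbo.Customers AS c ON c.CustomerId = o.CustomerId
WHERE o.OrderDate >= '2024-01-01';

-- Step 3: index the field combinations used by the UPDATE joins below
CREATE NONCLUSTERED INDEX IX_Results_CustomerId ON #Results (CustomerId);

-- Step 4: INNER join the temp table to each remaining table and UPDATE it
UPDATE r
SET r.CustomerName = c.Name
FROM #Results AS r
INNER JOIN dbo.Customers AS c ON c.CustomerId = r.CustomerId;

UPDATE r
SET r.ShipCity = s.City
FROM #Results AS r
INNER JOIN dbo.Shipments AS s ON s.OrderId = r.OrderId;

-- Step 5: SELECT the final data from the temp table
SELECT OrderId, CustomerId, OrderDate, CustomerName, ShipCity
FROM #Results
ORDER BY OrderDate;
```

Rows with no matching shipment simply keep ShipCity as NULL after the second UPDATE, which is how this pattern avoids an OUTER join in the main query.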