In this second article about SQL functions, we will look at 11
SQL-related functions commonly used in statistics: count, sum, average,
standard deviation, variance and covariance (standard deviation and variance
have three each; covariance has two).
Aside from other reasons (mentioned in the last SQL function
article) about why a DBA may need to be familiar with these tools, here is a
reason that pertains directly to what a DBA does: using "real"
statistics for performance tuning or design purposes. Oracle's array of SQL
functions enables a DBA to compute meaningful statistics about almost any set
of input data. Computing the count, sum and average of some item of interest is
very straightforward, but how do you compute the variability of that
data?
As a point of clarification, when someone refers to
statistics about a set of data, the type of statistics being discussed is that
of descriptive statistics or simple data analysis. The counterpart of
descriptive statistics is inferential statistics. Inferential statistics
typically deal with a sample from a population, and from that sample, we try to
infer or answer questions about the entire population. The answers to questions
about the population are couched in terms of probability. For our purposes,
simple descriptive statistics will suffice because we have all the data.
When looking through the list of SQL functions in the SQL
Reference documentation, you will see "POP" and "SAMP"
appended to covariance, standard deviation and variance-related functions. If
you are dealing with a set of data consisting of more than 30 to 31 elements,
observations, or readings, you can use either one of the function-name_POP
or function-name_SAMP functions. The "SAMP" functions use a
population correction factor to provide a "better" or unbiased value
of the true population parameter. In simple terms, the denominator of whichever
value is being computed uses n-1 instead of n, where "n" is the
number of data points, readings, observations, and so on.
A quick numerical example shows the practical equivalence of
dividing by n-1 versus dividing by n. A large number divided by n-1 is
practically the same (for our purposes, anyway) as dividing by n when n is
greater than 30. The value of 34.483 (1000/(30-1)) versus that of 33.333
(1000/30) is around a 3% difference. However, when n is much higher, like 100,
then the "error" falls to less than 1%. We will see this lack of a
difference in the following examples using the sample schemas that ship with
Oracle9i.
Previous
Next
Back to DBAsupport.com