# 2 Sampling strategy

ARIC implemented a stratified sampling design with unequal sampling probabilities across strata. Over time, the sampling probabilities were adjusted to increase or decrease the number of cases sampled from each stratum, or simply to keep the case load at field sites at a reasonable level.

The two variables needed for any analysis with a complex sampling design are:

- the sampling strata; and
- the sampling weights, or how many people each observation in the dataset represents in the population.
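To illustrate what "how many people each observation represents" means in practice, here is a minimal sketch in Python. The records, stratum labels, and weights are made up for illustration and are not ARIC data; the function names are hypothetical.

```python
# Hypothetical records: (stratum label, sampling weight, event indicator).
# These values are invented for illustration and are not ARIC data.
records = [
    ("A", 2.0, 1),
    ("A", 2.0, 0),
    ("B", 5.0, 1),
    ("B", 5.0, 1),
    ("B", 5.0, 0),
]


def represented_population(rows):
    """Sum of the weights: the number of people the sample stands for."""
    return sum(weight for _, weight, _ in rows)


def weighted_total(rows):
    """Estimated population total of the indicator: sum of weight * value."""
    return sum(weight * value for _, weight, value in rows)


population_size = represented_population(records)
estimated_events = weighted_total(records)
```

Each sampled record counts as many times as its weight says, so observations from lightly sampled strata (large weights) contribute more to the estimate than observations from heavily sampled strata.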

## 2.1 Sampling strata

The strata for sampling CHD/MI and HF cases were based on:

- center;
- race;
- gender;
- ICD discharge code group; and
- date of discharge.

The variable `CODESTRAT` in the `s14evt` dataset is the stratum variable for CHD/MI surveillance data. The variable `SAMSTRAT2` in the `hfsocc1` dataset is the stratum variable for HF surveillance.

### 2.1.1 Modified sampling strata variables

Some of the levels of `CODESTRAT` have few or no observations. This creates a problem for calculating variances within strata, since the estimates will be either highly variable or non-existent (i.e., if there are 0 observations in a stratum).

There are several ways to address this issue statistically, one of which is to combine strata with low counts with other similar strata.

Therefore, modified versions of `CODESTRAT` were created using this method, so that ARIC investigators do not have to decide which method of handling low-count strata to use. These modified variables are:

- `NESTVAR2`: used when analyzing data among 35-74 year olds from 1987 - 2014
- `NESTVAR_COMBN`: used when analyzing data from 35-84 year olds from 2005 - 2014
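The idea of combining low-count strata with similar ones can be sketched as follows. This is only an illustration in Python, not the procedure used to derive the ARIC variables: the stratum labels, the `merge_map` of "similar" strata, and the `min_count` threshold are all hypothetical, and deciding which strata are similar is a judgment the analyst has to make.

```python
from collections import Counter


def collapse_strata(labels, merge_map, min_count=2):
    """Relabel observations from sparse strata to an analyst-chosen target.

    labels    -- stratum label for each observation
    merge_map -- hypothetical mapping {sparse stratum: similar stratum}
    min_count -- strata with fewer observations than this are merged
    """
    counts = Counter(labels)
    sparse = {s for s, n in counts.items() if n < min_count}
    missing = sparse - set(merge_map)
    if missing:
        raise ValueError(f"no merge target given for strata: {sorted(missing)}")
    # Single pass: only observations in sparse strata are relabeled.
    return [merge_map[s] if s in sparse else s for s in labels]


# Invented labels: stratum "S3" has a single observation.
original = ["S1", "S1", "S2", "S2", "S2", "S3"]
collapsed = collapse_strata(original, merge_map={"S3": "S2"})
```

Strata with zero observations never appear in the data, so they need no relabeling. The sketch makes a single pass, so it is worth re-checking the counts after collapsing.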

There is no similar derived variable for HF surveillance. The smallest strata (`SAMSTRAT2`) for HF contain 2 observations, so if any subpopulation analysis is done, you will likely need to create a variable that combines certain low-count strata, or use other methods that handle this situation. Examples include excluding that stratum when calculating the standard error, or using the average contribution of a single stratum to the standard error for that particular stratum. See here for examples in `R`.
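The two alternatives mentioned above (dropping a single-observation stratum from the standard error, or giving it the average contribution of the other strata) can be sketched in Python. This is a simplified illustration, not the ARIC or `R` implementation: it estimates the variance of a weighted total under a stratified, with-replacement approximation, ignores any finite population correction, and uses invented data and a hypothetical `lonely` argument.

```python
from collections import defaultdict


def stratified_total_variance(rows, lonely="remove"):
    """Variance of the weighted total; rows are (stratum, weight, value).

    lonely="remove"  -- single-observation strata contribute nothing
    lonely="average" -- they get the mean contribution of the other strata
    """
    if lonely not in ("remove", "average"):
        raise ValueError("lonely must be 'remove' or 'average'")

    by_stratum = defaultdict(list)
    for stratum, weight, value in rows:
        by_stratum[stratum].append(weight * value)

    contributions = []  # one variance term per stratum with n >= 2
    n_lonely = 0
    for z in by_stratum.values():
        n = len(z)
        if n < 2:
            n_lonely += 1  # within-stratum variance is undefined here
            continue
        mean = sum(z) / n
        contributions.append(n / (n - 1) * sum((zi - mean) ** 2 for zi in z))

    fill = 0.0
    if lonely == "average" and contributions:
        fill = sum(contributions) / len(contributions)
    return sum(contributions) + fill * n_lonely


# Invented data: stratum "C" has only one observation.
rows = [("A", 2.0, 1), ("A", 2.0, 0),
        ("B", 5.0, 1), ("B", 5.0, 1), ("B", 5.0, 0),
        ("C", 3.0, 1)]
var_remove = stratified_total_variance(rows, lonely="remove")
var_average = stratified_total_variance(rows, lonely="average")
se_remove = var_remove ** 0.5
```

Removing the stratum can understate the standard error, while the average option assumes the sparse stratum is about as variable as the others. Neither choice is free of assumptions.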

## 2.2 Sampling weights

The main sampling weight variable for CHD/MI analyses is `SAMWT_TRIM`. This variable is a modified version of the original sampling weight variable, `SAMWT`, with the ceiling for values set to 15.88. `SAMWT_TRIM` is the recommended variable to use, since strata with low numbers of observations that have very high weights can cause computational issues.
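As a minimal sketch of what a ceiling on the weights means, the snippet below caps each weight at 15.88. The input weights and the function name are invented for illustration. The sketch shows only the capping step, and it is an assumption that nothing else (such as redistributing the trimmed weight) was done in deriving `SAMWT_TRIM`.

```python
WEIGHT_CEILING = 15.88  # ceiling stated for SAMWT_TRIM


def trim_weights(weights, ceiling=WEIGHT_CEILING):
    """Cap each sampling weight at the ceiling; smaller weights are unchanged."""
    return [min(w, ceiling) for w in weights]


# Invented example weights, not ARIC data.
raw_weights = [1.0, 7.5, 15.88, 42.0]
trimmed_weights = trim_weights(raw_weights)
```

After trimming, the sum of the weights is smaller than before, so trimmed weights trade a small amount of bias for more stable estimates.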

The sampling weight variable for HF analyses is `SAMWTHF`. There is no trimmed version of this weight.