# DesignDiagnostics.m: Want to know how efficient your experimental design is?

Are you undecided as to which is the best design for your task? Block or event-related? Jittered or regular? Do you need a rest period? Is this design good, or is that other one better? Well, this tool can help you decide!

**Design Diagnostics** is a Matlab script (I’ve included all non-standard dependencies) that you can easily use to get (1) information about potential multicollinearity in your design matrix (e.g., whether any of your regressors are substantially correlated); (2) the efficiency of your design/contrasts (which you can compare across designs to pick the best one); and (3) simulations of how well the true effects in your design can be recovered.

You can download the **DesignDiagnostics** scripts here (zip file).

In the folder there is also a **Demo.m** script. If you just want to see what this thing does, just type Demo in the Matlab command window and after a little bit (very much depending on your computer’s prowess..) you should get results in the terminal for (A) *Multicollinearity analysis*; (B) *Efficiency of your design/contrasts*; (C) the results of the *simulations*. You also get a number of (very sexy..) figures with heatmaps, squiggly lines, and all sorts of real-science-looking stuff.

If you use it in your work, please let me know if you have any feedback: monti@ucla.edu

Here is the help from the function, with a few quick sketches of the key computations interleaved:

```matlab
% The aim of this function is to run some diagnostics on your experimental
% design and contrasts, you know, to see how good they are! The real aim
% is to make sure your design (1) is the best possible in terms of its
% efficiency, and (2) does not suffer from too much multicollinearity.
% Considering how underpowered fMRI is, and how small the effect sizes
% typically are, this is a simple way of trying to make sure you get the
% most bang for your (supervisor's) buck before you ever scan a single
% participant.
%
% NOTE 1: Efficiency is scale-less, so the right way of using this function
% is to come up with different designs, see how efficient each one is, and
% then pick the most efficient one.
%
% NOTE 2: There are many questions that you should evaluate when you are
% setting up your next experiment, for instance: "Should I use a blocked
% design? Should I use a slow or a fast event-related design? What is the
% best ordering of the trials/conditions?" etc. There are two aspects to
% each of these questions:
% (1) The psychological answer: for that, sorry, but you are on your own.
%     Pilot the task (on yourself too), see how you like it, and check
%     whether you find yourself adopting unwanted or unanticipated
%     strategies and the like.
% (2) The statistical answer: this is what the present function is handy
%     for. As a guideline, you want designs with the highest efficiencies
%     (both overall and for the specific contrasts you are interested in),
%     and you want to avoid lots of collinearity, because collinearity can
%     lead to unstable estimates (which can even go in the wrong
%     direction!)
```
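To make the statistical guideline in NOTE 2 concrete: a common definition of contrast efficiency in the fMRI literature is eff = 1/trace(c'(X'X)^-1 c) for a contrast c. Here is a minimal sketch of comparing two toy designs with that formula; it illustrates the general idea and is not necessarily the exact formula the script uses:

```matlab
% Compare the efficiency of two toy designs for the same contrast.
rng(0);
Xa = randn(200, 2);                               % stand-in for design A (t x p)
Xb = Xa; Xb(:, 2) = Xb(:, 1) + 0.1*randn(200, 1); % design B: highly collinear
effic = @(X, c) 1 / trace(c' * pinv(X' * X) * c); % common efficiency formula
c = [1; -1];                                      % contrast of interest
fprintf('Design A: %.3f   Design B: %.3f\n', effic(Xa, c), effic(Xb, c));
% Efficiency is scale-less: only the comparison across designs matters,
% so you would pick the design with the larger value.
```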

```matlab
% This function will do two things for you (print the results to screen and
% generate some sexy figures):
%
% 1. MULTICOLLINEARITY DIAGNOSTICS:
% It will calculate various measures diagnostic of multicollinearity. If
% your design is substantially multicollinear, then the estimates for the
% collinear regressors might be "wild" and even go in the wrong direction,
% so in that case you probably want to fix it. Ways of fixing it include:
% (1) eliminating one of the two collinear regressors; (2) performing some
% kind of data reduction (e.g., PCA) and then feeding the factors you
% derive from it in as regressors; (3) in some circumstances, particularly
% if you are only interested in group-level statistics, you might still
% pull it off even with a collinear design (and if you are an FSLer, the
% "de-weight outliers" option will help decrease the importance of very
% crazy estimates; though, if you can fix it with (1) or (2), why not?)
%
% 2. SIMULATIONS:
% This part of the function will take your design, assign a random "true"
% 'activation' value to each regressor (i.e., a true beta), then run a
% number of simulations (i.e., GLMs) and see how well it can recover, on
% average, that "true" beta value. Along with the average estimated betas
% (which you'd hope would be pretty close to the "true" betas), it will
% also tell you the variance associated with the estimates, the average
% t-value for each beta, the average F-value for the whole model, and the
% adjusted R2.
```
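If you are curious what such a simulation looks like under the hood, here is a bare-bones sketch of the general recipe (the unit-variance white noise and the plain OLS fit are illustrative assumptions; the script's actual settings may differ):

```matlab
% Minimal beta-recovery simulation.
rng(0);
t = 200; p = 2; nsim = 1000;
X = randn(t, p);                  % stand-in for a convolved design matrix
btrue = randn(p, 1);              % random "true" betas
bhat = zeros(p, nsim);
for s = 1:nsim
    y = X * btrue + randn(t, 1);  % data = signal + white noise (assumption)
    bhat(:, s) = X \ y;           % ordinary least squares estimate
end
disp([btrue mean(bhat, 2)]);      % true vs. average estimated betas
disp(std(bhat, 0, 2));            % variability of the estimates
```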

```matlab
% Have tons of fun!
%
% USAGE:
%
%   [simdata,multicol] = DesignDiagnostics(X,C,tr)
%
% INPUT:
%   - X is a t (rows) by p (columns) design matrix, with each row
%     being a sample (e.g., timepoint/volume) and each column being
%     a regressor (e.g., task).
%     NOTE: the script assumes that you have not convolved your
%     regressors yet, so your matrix should be composed of 1s and
%     0s. If you don't like the (default) double-gamma convolution
%     kernel, you can do the convolution yourself (i.e., feed in a
%     convolved X matrix) and comment out the convolution below in
%     the code; see the sketch after this input list for one way
%     to do that.
%
%   - C is a p (rows) by c (columns) contrast matrix. This is the
%     matrix specifying which comparisons you want to perform.
%
%   - tr is the TR you expect to use.
```
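If you do want to roll your own convolution, here is a minimal sketch using the standard double-gamma HRF; the parameters below are the usual canonical ones and are not necessarily identical to the script's default kernel:

```matlab
% Convolve 0/1 regressors with a canonical double-gamma HRF (a sketch).
tr = 2;                                       % TR in seconds
X  = kron([0 1 0 1 0 1 0 1]', ones(25, 1));   % toy boxcar regressor, 200 vols
t  = 0:tr:32;                                 % HRF support in seconds
hrf = t.^5 .* exp(-t) / gamma(6) ...          % positive peak (gamma pdf, shape 6)
    - (1/6) * t.^15 .* exp(-t) / gamma(16);   % undershoot (shape 16)
hrf = hrf / sum(hrf);                         % normalize to unit sum
hrf = hrf(:);                                 % make it a column vector
Xc = zeros(size(X));
for j = 1:size(X, 2)
    tmp = conv(X(:, j), hrf);
    Xc(:, j) = tmp(1:size(X, 1));             % trim back to original length
end
```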

```matlab
% OUTPUT:
%   - simdata is a p (rows) by s (columns) matrix containing the
%     results of the simulations. The columns give the following
%     summary information over the simulations that were run
%     (matching the printed output of this function):
%     Regressor#, True Beta, Average Estimated Beta, Average
%     Standard Error, Average T-value.
%
%   - multicol is a structure containing all sorts of information
%     about multicollinearity in your design, including the
%     diagnostic measures (VIF and CondInd; see below for an
%     explanation of each).
%
%   NOTE: The data stored in simdata and multicol are exactly the
%   same data that get printed to the screen; they are returned in
%   case you want to do something fancy with them.
```
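Putting the pieces together, a hypothetical call might look like this (the toy design, contrasts, and TR below are made up purely for illustration):

```matlab
% Toy blocked design: A / rest / B / rest, 10 volumes per block, 200 volumes.
blk = 10;
lab = kron(repmat([1 0 2 0], 1, 5), ones(1, blk));  % condition label per volume
X   = [(lab == 1)' (lab == 2)'];  % t x p matrix of 1s and 0s (unconvolved)
C   = [1 0  1; ...
       0 1 -1];                   % columns = contrasts: A, B, A - B (p x c)
tr  = 2;                          % expected TR in seconds
[simdata, multicol] = DesignDiagnostics(X, C, tr);
```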

```matlab
% There are three main diagnostics for collinearity used in this
% function: Pearson correlations, the Variance Inflation Factor, and the
% Condition Index.
%
% Here is an explanation of VIF and CI, if you want to understand your
% results better.
%
% The Variance Inflation Factor (VIF, a measure calculated for each
% variable) is simply the reciprocal of the tolerance. It measures the
% degree to which the interrelatedness of a variable with the other
% predictor variables inflates the variance of the estimated regression
% coefficient for that variable. Hence the square root of the VIF is the
% degree to which collinearity has increased the standard error for that
% variable. A high VIF value indicates high multicollinearity of that
% variable with the other predictors and instability of the regression
% coefficient estimation process. There are no formal statistical tests
% for multicollinearity based on the tolerance or VIF measures.
% VIF=1 is ideal, and many authors use VIF=10 as a suggested upper limit
% indicating a definite multicollinearity problem for an individual
% variable (VIF=10 inflates the standard error by a factor of 3.16). Some
% would consider VIF=4 (a doubling of the standard error) the threshold
% for indicating a possible multicollinearity problem.
% CALCULATION: given a regression Y=B0+X1B1+X2B2+...+XnBn, do:
%   1. Try to predict each predictor variable from the other predictor
%      variables, e.g., X1=X2A2+X3A3+...+XnAn.
%   2. Calculate the VIF (after having done #1) as 1/(1-R2j), where R2j
%      is the R2 for the regression predicting variable Xj from the
%      others (you do this for each variable, so you get a VIF for
%      each variable). Note that (1-R2j) is referred to as the
%      tolerance.
% INTERPRETATION: The square root of the VIF tells you how much larger
% the standard error is, compared with what it would be if that variable
% were uncorrelated with the other predictor variables in the model. So
% a VIF of 4 means the SE has been doubled (i.e., sqrt(4)) by the
% presence of multicollinearity, relative to what it would be if the
% variable were not collinear with one or more of the others.
```
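In code, the VIF recipe described above amounts to the following bare-bones sketch (not the script's exact implementation; the toy design here is made up):

```matlab
% VIF for each column of a design matrix X (t x p), as described above.
rng(0);
X = randn(100, 3);
X(:, 3) = X(:, 1) + 0.2*randn(100, 1);      % make column 3 collinear with column 1
[t, p] = size(X);
vif = zeros(p, 1);
for j = 1:p
    y  = X(:, j);
    Z  = [ones(t, 1) X(:, [1:j-1, j+1:p])]; % the other predictors (+ intercept)
    b  = Z \ y;                             % regress X_j on the rest
    R2 = 1 - sum((y - Z*b).^2) / sum((y - mean(y)).^2);
    vif(j) = 1 / (1 - R2);                  % 1 / tolerance
end
disp(sqrt(vif))  % factor by which each coefficient's SE is inflated
```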

```matlab
% Condition Index values are calculated from the eigenvalues of a rescaled
% crossproduct X'X matrix. Hence these measures are not for individual
% variables (like the tolerance and VIF measures) but for individual
% dimensions/components/factors, and they measure the amount of
% variability each dimension accounts for in the rescaled crossproduct
% X'X matrix. The rescaled crossproduct X'X matrix is obtained by dividing
% each original value by the square root of the sum of squared original
% values for that column of the original matrix, including the intercept
% column. This yields an X'X matrix with ones on the main diagonal.
% Eigenvalues close to 0 indicate dimensions which explain little
% variability. A wide spread in the eigenvalues indicates an
% ill-conditioned crossproduct matrix, meaning there is a problem with
% multicollinearity. A condition index is calculated for each
% dimension/component/factor as the square root of the ratio of the
% largest eigenvalue to the eigenvalue for that dimension.
% A common rule of thumb is that a condition index over 15 indicates a
% possible multicollinearity problem, and a condition index over 30
% suggests a serious multicollinearity problem. Since each dimension is a
% linear combination of the original variables, an analyst using OLS
% regression cannot simply exclude the problematic dimension; hence a
% guide is needed to determine which variables are associated with the
% problematic dimension. Another way of putting it is: (1) decompose the
% correlation matrix into linear combinations of the variables (you make
% the first linear combination by taking the dimension that has the most
% variance, then you take the second dimension with the most variance,
% contingent on it being independent of the first dimension, then you
% take the third, contingent on it being independent of the first and
% second, ...); (2) for each linear combination, the eigenvalue
% represents how much variance it explains; (3) for each linear
% combination, calculate the CI as the square root of the ratio between
% the maximum of the obtained eigenvalues and the eigenvalue for that
% linear combination. If the maximum is much, much bigger than the other
% eigenvalues, it means there is one linear combination that explains
% most of the variability of the original variables, hence they are
% likely to be highly correlated.
```
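And here is the condition index computation in code (a sketch of the textbook recipe above, not necessarily the script's exact implementation; the toy design is made up):

```matlab
% Condition indices from the rescaled crossproduct matrix.
rng(0);
X  = randn(100, 3);
X(:, 3) = X(:, 1) + 0.2*randn(100, 1);            % toy collinear design
Xi = [ones(100, 1) X];                            % include the intercept
Xs = Xi ./ (ones(100, 1) * sqrt(sum(Xi.^2, 1)));  % rescale columns to unit length
ev = sort(eig(Xs' * Xs), 'descend');              % eigenvalues of rescaled X'X
ci = sqrt(ev(1) ./ ev);                           % one condition index per dimension
disp(ci)  % rule of thumb: >15 possible problem, >30 serious problem
```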

```matlab
% Script by Martin M Monti (monti@psych.ucla.edu, http://montilab.psych.ucla.edu)
% Version 07/22/2014
```