Vowels MJ. Prespecification of Structure for the Optimization of Data Collection and Analysis. Collabra: Psychology. 2023;9(1). doi:10.1525/collabra.71300

Methodology and Research Practice

Prespecification of Structure for the Optimization of Data Collection and Analysis

Matthew J. Vowels
Institute of Psychology, University of Lausanne, Switzerland

Keywords: Markovicity, data collection, conditional independence, causality, path modeling, structural equation modeling

https://doi.org/10.1525/collabra.71300
Collabra: Psychology, Vol. 9, Issue 1, 2023

Data collection and research methodology represent a critical part of the research pipeline. On the one hand, it is important that we collect data in a way that maximises the validity of what we are measuring, which may involve the use of long scales with many items. On the other hand, collecting a large number of items across multiple scales results in participant fatigue, and expensive and time consuming data collection. It is therefore important that we use the available resources optimally. In this work, we consider how the representation of a theory as a causal/structural model can help us to streamline data collection and analysis procedures by not wasting time collecting data for variables which are not causally critical for answering the research question. This not only saves time and enables us to redirect resources to attend to other variables which are more important, but also increases research transparency and the reliability of theory testing. To achieve this, we leverage structural models, and the Markov conditional independency structures implicit in these models, to identify the substructures which are critical for a particular research question. To demonstrate the benefits of this streamlining we review the relevant concepts and present a number of didactic examples, including a real-world example.
Imagine you want to estimate the effect of a therapeutic treatment on depressive symptoms, and how this effect may be mediated via another variable, say, therapeutic alliance. One might suspect that these variables are linked through a complex causal web involving multiple other factors - but which of these other factors are necessary, in terms of data collection, for estimating the main effect of interest? Collecting too many variables increases the cost and time required to complete data collection, having an impact on participant fatigue as well as draining valuable project resources. Conversely, collecting too few may render the results of the statistical tests invalid. In this manuscript, we describe how to identify those variables which are strictly necessary to arrive at unbiased answers to pre-specified questions. Of course, other interests may influence data collection (such as subsequent applications and usage), but knowing what is strictly necessary allows one to make more informed decisions about what to include.

In this paper, we argue that the data collection and research project methodology can be optimized by specifying the causal structure underlying a theory in graphical form. Using rules from the structural modeling framework, one can then use the graph to identify variables or scales which are either causally necessary or which can be omitted from the data collection process. This liberates resources either to improve the quality of the remaining scales (e.g., by using scales with a more comprehensive set of items), and/or to reduce participant fatigue by shortening the duration of a questionnaire and using these resources to increase the overall sample size. Indeed, concerns about inadequate statistical power are growing in response to the replication crisis [2–5], and researchers are thus encouraged to make sure they have sufficient data to estimate the effects of interest.

Furthermore, even if a researcher decides not to undertake any analyses (perhaps they are not able to collect data, for whatever reason) the process of reflecting a theory graphically nonetheless helps with transparency, reproducibility, and the meaningfulness of subsequent interpretation. Psychology has been accused of being 'not even wrong' on the basis that the theories are too vague to be adequately tested. By reflecting our theories in a graphical form, we thus improve the clarity and reduce the one-to-many relationship between our theories and our statistical models. Translating our theories to graphs also forces researchers to think carefully about the underlying process, and the concomitant implications for data collection. The specification can then be made explicit, preregistered, and compared unambiguously against other work. This, in turn, facilitates more precise replication by subsequent researchers, as well as a clearer understanding of the relationships between the hypotheses being tested and the assumptions and theory which underpin the model specification and results [8–10].

In this work we show how four related concepts - conditional independencies, Markov Blankets, projection, and causal identification - can be used to judiciously shrink the number of variables required to answer a research question, without impacting downstream analyses and without impacting the congruity of the model with the underlying theory. The process is not data-driven and is not the same as seeking model 'parsimony' - our approach does not fundamentally change the complexity of the underlying processes reflected by the 'full' model. Instead, using a set of rules which are consistent with the assumptions of the original graph being specified, our initial graphical representation can be reduced to focus in on the effects we really care about. Thus whilst the complexity of the statistical model reduces, it does so without introducing any additional simplifying assumptions beyond those which already existed in the original theory.

The techniques are relevant to a broad range of problems amenable to specification in graphical form. For example, the didactic examples given by Rohrer involve health problems and work satisfaction, genetics and child's depressiveness, or educational attainment and income. Additionally, social psychologists interested in complex, mediated processes and multiple baseline control variables could also benefit from the proposal presented here. To this end, as well as providing a set of experimental results to demonstrate the performance characteristics in a general and non-domain-specific way, we also provide an example application to a graph used in organizational behavior. Our hope is that researchers can use the techniques presented in this work to optimize their data collection and analysis in a more transparent way which is tailored specifically to the particular relationships of interest.

We begin by motivating the specification of our theories in graphical form. Then, we introduce the relevant statistical/structural concepts needed to understand the process for reducing this model. We then walk through a number of didactic examples, comparing an assumed 'real-world' Data Generating Process (DGP) against the minimal required model for estimating a set of causal effects of interest. We also provide the associated multiple linear regression models where a single regression model can be used to provide the same information, and present a real-world example. In the supplementary material, we also provide simulation results to demonstrate that the approach does not introduce bias, and in some cases can improve model fit and reduce standard error. Finally, the supplementary material also provides the code for an automatic tool for reducing the graph (along with a description of the associated algorithm). The code for reproducing the simulations, as well as the automatic tool, is provided here: https://github.com/matthewvowels1/minSEM.

Correspondence concerning this article should be addressed to Matthew J. Vowels, Institute of Psychology, University of Lausanne, Switzerland. E-mail: firstname.lastname@example.org

TERMINOLOGY AND CONCEPTUAL OVERVIEW

In this work, we assume that psychologists/researchers are principally concerned with estimating a particular causal effect (e.g., the effect of treatment on an outcome). Indeed, this goal aligns with the causal nature of psychological theories (which, in general, describe causal processes), as well as the goal to design and implement effective interventions which improve people's lives. As such, we assume that a researcher wishes to test a particular hypothesis which concerns a (causal) effect size of interest.

We will refer to a number of objects which deserve to be defined up-front. In Figure 1 we present examples of these objects for reference. Firstly, we assume that there exists some (potentially highly complex) real-world Data Generating Process (DGP). According to our existing theories, we wish to model this DGP in such a way that we are able to meaningfully represent it. One option for doing so involves the use of Structural Equation Models (SEMs). SEM provides us with a powerful and popular statistical framework to unambiguously reflect and test causal theories and relationships [10,11,14–17]. In particular, the SEM can be represented in an intuitive graphical (and therefore visual) way, thereby specifying our domain knowledge about the DGP.

The graphical representation of the theory, which we will refer to as the graphical or structural model, can be used early on in the research pipeline to inform the data collection methodology, by helping us specify which constructs we need to measure. Furthermore, early specification of a statistical model helps us with preregistration and research transparency. Such transparency is increasingly important in the fields of psychology and social science, where attention has been drawn to numerous problems with theory testing, research methodology, and analytical practice [5,14,19–23].

As we will discuss, we will apply the rules of a type of graphical model known as a Directed Acyclic Graph (DAG) to the graphical representations of our SEM. These rules are actually more general than those specific to SEM, because whilst SEM assumes linear relationships between variables, the rules we use are applicable to problems with almost arbitrarily non-linear relationships. Using these rules, in combination with a Research Question expressed as a set of target causal effects of interest, we can reduce the complexity of the SEM (yielding what we refer to as the Reduced SEM) without sacrificing our ability to estimate what we care about for a particular research question or hypothesis. This reduced model then determines which variables we are required to collect data for. In some cases, we may not need to use the typical SEM estimation techniques to answer our research questions, and a simple multiple regression model may suffice. However, it is worth emphasising that this work is not concerned with the estimation of the coefficients themselves, but rather with how we can use the graphical modeling rules to simplify the representation of a theory, and in turn streamline our data collection and study design.

Figure 1. Top level terminology. Note.
We assume (left) there exists a real-world causal Data Generating Process (DGP), which we wish to model using a structural model. This structural model can be represented graphically (see SEM graph for the DGP in the figure). Using our proposed approach, this SEM can be simplified in such a way that does not jeopardise the estimation of a particular (causal) effect size which is of interest to our research. For example, we may be interested in estimating particular path coefficients/effects in a mediation model. Finally, the effect sizes may be estimated using straightforward regression models.

MOTIVATION

In this section we provide two principal motivations for our proposed approach: statistical power, and model under- or mis-specification. In light of these motivations, we then provide a top-level overview of our proposal.

STATISTICAL POWER AND MODEL SPECIFICATION

Psychological research is frequently underpowered [24–26], and the theory and analysis are often poorly specified [6,10,11,14,20]. The studies are underpowered to the extent that the sample sizes are insufficient to test a target hypothesis. For example, for a minimum assumed true effect size of interest, it is generally recommended that enough data are collected to yield a power of 80%, meaning that, if an effect of at least that size exists, there is an 80% probability that we will find a statistically significant result (at a given threshold such as 0.05). Researchers are thus encouraged to ensure that their studies are adequately powered, and have been encouraged to do so for some time [24,28]. However, depending on the complexity of the theory under test, researchers may need to measure a large number of constructs, each with a large number of items. For example, depending on the format, the IPIP-NEO Big 5 inventory contains between 120 and 300 items [29,30] and therefore takes considerable time to complete. Besides the associated cost and time required to measure constructs using such comprehensive scales, the participants may also experience fatigue, lowering the quality of the responses [1].

The second problem of under-specification has prompted meta-researchers to describe research in psychology as 'not even wrong' [6]. That is to say, if the theories are too vague to be specified unambiguously, then it is not clear what it is that any particular statistical test is actually testing. If we are concerned with understanding the real-time process of dyadic support, for instance, we might need to develop a statistical model which can capture the intricacies of back-and-forth, multi-modal (verbal, para-verbal, non-verbal) interactions between partners. Without unambiguously reflecting the complexity of the process in our statistical model, it is not clear what a typical model in psychology (e.g., a multiple linear regression model) is really doing for us. The structural representation of this process can be a helpful aid to understand (a) what data we need to collect, and (b) whether the data can even be collected in principle (the acquisition of real-time, multi-modal data may in some cases be infeasible).

Furthermore, a single theory may admit multiple statistical models, each of which tests something slightly different but all of which are valid given the malleability of the underlying theory. Few psychological theories make it clear which variables are necessary to include as control variables, for instance. And yet, the inclusion of different control variables can have a large impact on the resulting parameter estimates, and it is not usually clear how these control variables are chosen or how they relate to the tested theory [14,31,32]. As an example, in medical studies older patients may be more likely to choose medication over surgery, but also be less likely to recover. This makes age a key confounder that must be controlled/adjusted for to evaluate the treatment effects. However, perhaps there exist other, less obvious confounders which we have not collected and which we therefore cannot adjust for. Some variables may need to be controlled for but be unattainable, some may be inconsequential (and can be omitted without consequence), and still others may actually bias the model if included. In order to determine which control variables should or should not be included, and therefore to avoid what is known as structural misspecification, researchers need to somehow formalise their theories.

THE PROPOSED SOLUTION

With respect to statistical power, there exists a need for compromise - maximising the quality of a survey such that it measures all that we need, at a sufficient level of quality, for a sufficient number of participants. Of course, we acknowledge that there often exist multiple goals for studies in which new data will be collected - they may have either confirmatory or exploratory research questions, or both; they may wish to compare and contrast multiple competing hypothesized structures; they may want to 'future-proof' the study, such that additional variables are collected with a view that they may be necessary for answering research questions which are not yet specified.
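To make the power side of this compromise concrete, the sketch below approximates how many participants are needed to detect a correlation of a given assumed size with 80% power at a 0.05 threshold. It uses the textbook Fisher z approximation, which is an illustrative choice and not a method taken from this paper; the function name `n_for_correlation` and the example effect sizes are likewise assumptions made for illustration.

```python
from math import atanh, ceil
from statistics import NormalDist


def n_for_correlation(r, alpha=0.05, power=0.80):
    """Approximate sample size needed to detect a correlation of size r.

    Uses the Fisher z approximation for a two-sided test. The inputs are
    illustrative assumptions, not values taken from the paper.
    """
    if not 0 < abs(r) < 1:
        raise ValueError("r must be non-zero and strictly between -1 and 1")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value of the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the target power
    return ceil(((z_alpha + z_power) / atanh(abs(r))) ** 2 + 3)


if __name__ == "__main__":
    for r in (0.1, 0.3, 0.5):
        print(f"assumed r = {r}: roughly {n_for_correlation(r)} participants")
```

The required sample grows quickly as the assumed effect shrinks, which is why questionnaire time freed by dropping unnecessary scales can be valuable when redirected towards recruiting more participants.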
At the same time, and in order to correctly specify a model with respect to a psychological theory, it is important that psychologists consider not only the structure between the primary constructs central to their theory, but also the full data-generating process (DGP) which leads to a set of observations. The theory can then be translated into a graphical/structural model which reflects this DGP, which we can use to make sure we are not missing variables which are key to answering a particular research question. The process of deriving a structural model from our theory has been previously discussed by Rohrer [11] and others [33,34], and we do not describe the procedure in this work, but note that the graphical framework (more about this in later sections) makes the process quite intuitive.

The advantages of reflecting the theory unambiguously in a structural model include reproducibility (it is clear what exactly is being tested) and an increase in the interpretability and validity of the resulting effect sizes. Rather than the effect sizes being arbitrary consequences of ad hoc models loosely connected to theory, they reflect specific causal effects within a fully specified structural/causal process. Whilst the causal validity of effect sizes estimated using these models still depends on whether a number of strong assumptions hold (e.g., whether the hypothesized structure is correctly specified with respect to the actual, real-world structure), the transparent specification of the model makes subsequent criticisms and revisions more precise. The task of translating our theories may also highlight possible weaknesses in the theory, or call attention to possibly insurmountable difficulties for data collection. For instance, theories which involve dynamic processes that unfold at irregular intervals over time may require very specific, expensive, and challenging data collection procedures. Identifying the specifics of such challenges in advance could save a lot of wasted time and effort.

Unfortunately, the task of identifying all relevant variables will likely implicate a large number of secondary variables (such as demographics and other theoretically related constructs), and thus require longer questionnaires. The problems of statistical power, comprehensive scale inventories, and the need to collect a broad range of variables and constructs relevant to our theory put a lot of pressure on researchers to find a suitable 'Goldilocks' design, and one or multiple methodological facets are likely to be compromised as a consequence. As such, after the specification of the full DGP, we should examine the resulting model to identify possible shortcuts in the data collection process. Indeed, and as we will show, even if a variable or construct is relevant to a particular causal process, it may not be required for the actual analysis. To know this, however, the variable needs to be transparently situated in a causal model for us to understand whether or not it is essential for answering a target research question.

Once the structure of the DGP is fully specified, and as we will describe in detail below, we are able to identify essential substructures which are sufficient for testing our intended hypotheses. The substructures, by definition, exclude certain variables. Thus, if we can identify these substructures in advance of data collection, we may be able to significantly reduce the number of constructs we need to measure. Indeed, in example 2i in Figure 4 below, we show that it is possible to reduce the number of variables/constructs by two thirds, although this depends on how much of the causal process we are interested in testing. It goes without saying that any simplification must be done carefully. Indeed, the potential consequences of any resultant model misspecification can be severe, and include heavily biased parameter estimates which are almost impossible to meaningfully interpret [14,31]. However, there are no requirements for researchers to 'go all the way' with the simplification, and the proposal is flexible insofar as the degree of desired reduction can be determined by the researcher and their specific requirements.

We thus advocate that researchers consider the DGP upfront, before the data collection stage. Such prespecification in the form of a structural (or, as we will present, graphical) model represents a beneficial step in terms of preregistration and transparency, helps researchers distill their theories into testable models, thereby increasing the validity and meaningfulness of downstream statistical inference and results interpretations, and provides us with an opportunity to 'prune' the structure to optimize for statistical power during data collection.

BACKGROUND

In this section, we introduce a number of relevant technical concepts for reducing our structural models. In general, we assume that the model is being specified in graphical form as a path model, or a Structural Equation Model (SEM), where directed paths/arrows correspond with causal links. As we mention above, the techniques we use are more general than the SEM framework, and come from the graphical models literature. A number of existing resources discuss the implications of changes in causal structures on statistical estimation. For example, Matthew J. Vowels discusses the problems that arise due to misspecification of causal models, and notes the potential to focus on specific effects within a causal process; and Cinelli, Forney, and Pearl provide a laconic summary of how to choose control variables such that the choice does not induce bias in our parameter estimation. Unfortunately, these resources do not discuss the possibility of reducing our SEMs to the simplest model which can still yield unbiased estimates of (possibly multiple) causal effects.

To best communicate our approach, we begin with a brief review of the relevant background. We aim to review four related concepts in particular: causal identification, conditional independence, Markov Blankets, and projection. Briefly, identification is the goal of isolating causal from non-causal statistical dependencies, and, when possible, facilitates the estimation of causal effects. It relies on conditional independencies, which describe how statistical dependencies arise due to the underlying causal process, and how conditioning on particular variables enables us to isolate or disentangle different sources of dependence. Markov blankets show that, through the use of conditional independencies, we can completely isolate an entire substructure in a graph, thereby making it clear that not all variables are necessarily required for a particular research question. Finally, projection enables us to combine/reduce the number of paths. This is particularly useful in the case of mediation, where a mediator can be excluded entirely if the researcher is not interested in estimating the mediation per se. Interested readers are encouraged to consult useful resources by Hünermund and Bareinboim (2021; see also [17,32–34,37–40]).

In terms of notation, we use $X$ (or, e.g., $Y$, $Z$, etc.) to denote a random variable, and bold font $\mathbf{X}$ (or, e.g., $\mathbf{Y}$, $\mathbf{Z}$, etc.) to denote a set of random variables. We use the symbols $\perp\!\!\!\perp$ and $\not\!\perp\!\!\!\perp$ to denote statistical independence and statistical dependence, respectively. For linear systems, such statistical dependence may be identified using correlation, but the majority of our discussions are
general and non-parametric. We use directed arrows to denote a directional structural/causal dependence, and $U$ (or $\mathbf{U}$) for a single (or set of) unobserved variable(s).

For example, in SCM terminology $A := f(B, C, U_A)$ indicates that $A$ is some general function of $B$ and $C$. Here, $U_A$ tells us that $A$ is also a function of an exogenous random process $U_A$. Indeed, it is this $U_A$ which prevents the relationship between $A$ and $B$ and $C$ from being deterministic. Structural Equation Models (SEMs), on the other hand, assume that all endogenous variables are the result of a linear weighted sum of others, such that $A := \beta_B B + \beta_C C + U_A$. Here, the $\beta$s are structural parameters (also called path coefficients or effect sizes) which we wish to estimate. The walrus-shaped assignment operator $:=$ tells us that the left hand side is a structural outcome of the right hand side; the equations are not intended to be rearranged and there is very much a directional relationship involved.

Figure 2. A set of demonstrative graphs.
Note. This figure provides a number of example graphical models. Solid black lines indicate causal dependencies, dashed red lines indicate statistical dependence, parallel red bars indicate a 'break' in statistical dependence (example (e)), bold font indicates a set of variables, and the letter $U$ is reserved to denote unobserved variables.

As we construct a system of equations representing our SEM (or, indeed, our SCM) it is often convenient to represent these relationships graphically/visually. For example, consider the following set of (linear) structural equations:

[…]

These can be represented simply as the mediation model depicted in black, solid arrows in Figure 2(a). The variables $U$ are generally not included in the graph unless they are statistically dependent. Of course, they frequently are dependent in psychology, and this may be denoted using a curved, bidirected edge, as in Figure 2(c), or by explicitly including the relationship as in Figure 2(d). Such relationships can, of course, also be included in the system of equations comprising the SEM. Note that, as a result of the causal structures present in the DGP, a number of statistical dependencies are induced, indicated in Figure 2 by the red dashed lines. By induced statistical dependency, we mean that the variables are correlated, or, more generally, statistically dependent, as a consequence of the causal relationships between the variables in the underlying causal process.

THE DATA GENERATING PROCESS

It is worth maintaining conceptual separation between: (1) the process occurring in the real world, which we consider to be the true Data Generating Process (DGP), (2) our SEM, which we generally want to sufficiently capture the process in the real world, and (3) the specification of a multiple linear regression. Note that (1) and (2) do not have to match precisely. Indeed, when we create our SEM we expect it to be a significant simplification of the real-world process, but it needs to be somewhat consistent with the true process (and achieving this consistency is one of the primary aims of our research). If it is not sufficiently consistent, we might deem it to be misspecified, and it will not yield meaningful statistical estimates.

For example, if we have a strong theory that the true DGP can be adequately represented by a fully mediated process $X \rightarrow M \rightarrow Y$, then we would be advised to employ an SEM which is consistent with this structure. By consistent we mean that the model we use facilitates the unbiased estimation of the parameters of interest, and that these estimated parameters correspond with something meaningful in the real world (e.g., causal effect sizes). One option we have is to specify everything about our theory explicitly using an SEM, and this can be done in graphical form to aid formalisation. However, what we aim to show is that if we are primarily concerned with a subset of parameters (vis-à-vis all path coefficients in the model), then in some cases we can significantly reduce the complexity of our model without affecting the consistency of our resulting model.
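To make the distinction between the DGP, the SEM and a regression specification concrete, the sketch below simulates a fully mediated process of the kind described above and recovers each path with a simple regression. The structural equations in the code, the coefficient values (0.5 and 0.8), the sample size and the helper names are hypothetical choices made for illustration; they are not taken from the paper or its repository.

```python
import random


def simulate_full_mediation(n, b_xm=0.5, b_my=0.8, seed=0):
    """Draw n samples from a hypothetical linear DGP X -> M -> Y."""
    rng = random.Random(seed)
    xs, ms, ys = [], [], []
    for _ in range(n):
        x = rng.gauss(0, 1)             # X := U_X
        m = b_xm * x + rng.gauss(0, 1)  # M := b_xm * X + U_M
        y = b_my * m + rng.gauss(0, 1)  # Y := b_my * M + U_Y
        xs.append(x)
        ms.append(m)
        ys.append(y)
    return xs, ms, ys


def ols_slope(predictor, outcome):
    """Slope from a simple regression of outcome on predictor (with intercept)."""
    mp = sum(predictor) / len(predictor)
    mo = sum(outcome) / len(outcome)
    sxy = sum((p - mp) * (o - mo) for p, o in zip(predictor, outcome))
    sxx = sum((p - mp) ** 2 for p in predictor)
    return sxy / sxx


if __name__ == "__main__":
    x, m, y = simulate_full_mediation(20000)
    print("X -> M path estimate:", round(ols_slope(x, m), 3))
    print("M -> Y path estimate:", round(ols_slope(m, y), 3))
    print("total effect of X on Y:", round(ols_slope(x, y), 3))
```

In this linear setting the total effect of X on Y is the product of the two path coefficients, so a researcher who only cares about the total effect could, under these assumptions, estimate it without measuring the mediator at all.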
By induced statistical de- vis all path coefficients in the model), then in some cases we 1 Note that the theory we discuss is applicable to models with latent constructs (such as factor or measurement models), as well as those without (such as path and structural models), and generalises beyond linear models. The theory we discuss is part of the general Struc- tural Causal Modeling (SCM) and Directed Acyclic Graph (DAG) frameworks . Path models and SEMs both represent a subset of the family of SCM and DAG models, where the functional relationships between variables are assumed to be linear. In other words SCMs and DAGs make no assumptions about whether one variable is an arbitrarily complex function of another (strictly, there are exceptions to this, as discussed by ). 2 For the estimation task itself, we can either use the SEM estimation framework (and estimate all the included paths), or alternatively, we can derive a set of equivalent regression equations. Collabra: Psychology 5 Prespecification of Structure for the Optimization of Data Collection and Analysis can significantly reduce the complexity of our model with- the data, which is obviously entails more stringent require- out affecting the consistency of our resulting model. In the ments than does the estimation of only one of these paths. case of the full mediation, it is interesting to note, for ex- A detailed description of how to use identification is be- ample, that including a direct path in the SEM (in addition yond the scope of this paper, but we describe below how to the indirect effect) does not bias our estimates of the in- to isolate/disentangle statistical influence using the condi- direct path parameters. This is because the direct path will tional independency properties below. 
For now, let us con- have an estimated effect of zero if it does not exist in the sider the case where we are interested in estimating only real-world, and its inclusion does not influence the value one path coefficient / causal effect - the rules generalize of the coefficient estimated for the indirect path. This is to multiple coefficients. Consider the graphs in Figure 2(g) an example of how increasing the complexity of the SEM and (h). Graph (g) represents the canonical Randomized does not necessarily result in ‘disagreement’ or misspeci- Control Trial setup, where represents some treatment, fication with respect to the SEM and the real-world DGP. some outcome, and some set of covariates which help to In contrast, failing to include a direct path which does ex- explain the outcome . In this graph, the covariates are ist in the real-world DGP, can affect the resulting path esti- independent of treatment because of the random assign- mates. As such, in some cases assumptions which simplify ment of treatment. Such a structure means the only sta- the graph can be more ‘dangerous’ than those which in- tistical dependence that exists between the treatment and crease the complexity of the graph, and it is especially im- the outcome is a result of the treatment itself. This statisti- portant any simplification be done with care to avoid bias- cal dependence is thus equivalent to the causal dependence ing the estimates of the remaining path coefficients. we are interested in. As such, the effect can be directly es- Finally, note that the effect sizes of interest in the final timated by comparing the outcome under different treat- SEM can be estimated using multiple regression. Indeed, ments. Note that one may still wish to consider too - it the specification of an SEM using the popular lavaan R li- can be used to explain additional variance in in order to brary follows a very similar syntax to that used to esti- tighten the estimate of the treatment effect. 
In other words, mate each path using the lm regression library. Note that the inclusion of these variables may reduce the standard er- this may not always be possible, particularly if one needs to rors associated with a particular causal effect size estimate. estimate latent factors. However, we provide the equivalent In contrast, in observational studies patients may select regression syntax to highlight the equivalence between the their own treatment, and graph Figure 2(h) is more appro- techniques, and to show that even if a structural model is priate. For instance, if age is one of the covariates, older used to specify the DGP, it may be possible to use a straight- patients may prefer medication and have a lower chance of forward linear regression model for the actual estimation. recovery, whilst younger patients may prefer surgery and have a higher chance of recovery. Thus, if we wish to esti- IDENTIFICATION AND DISENTANGLING STATISTICAL mate the causal influence of treatment on the outcome , INFLUENCE we cannot simply compare the outcomes of the two treat- ment groups, but now also need to somehow adjust or ‘con- Identification concerns whether or not, for a given graph, trol’ for the additional statistical dependence that exists the causal effect we are interested in is actually estimable between and which results from the ‘backdoor’ non- from the observed data, even in the absence of an exper- causal path . This is non-causal because there 42,43 iment . In the case where the full graph is given and is no directed path between and via (the arrow points there are no unobserved confounders, all causal effects are from to , not the other way around). Knowing the rules technically identifiable from the data. 
This means that of conditional independencies described below, we will be there exists a mathematical expression which expresses the able to isolate the causal effect of interest such that the causal effect(s) of interest as a function of the observed sta- remaining statistical dependence between and corre- tistical associations. If a causal effect is identifiable, it may sponds with the causal dependence we actually wish to es- be possible to estimate it with only a fraction of all the ob- timate. served variables. Furthermore, if researchers are only inter- Note that we will use the term control variables to mean ested in estimating a single path coefficient in a structural variables which we wish to adjust for to identify causal ef- model, it may not be necessary to run the full SEM estima- fects of interest, and which would otherwise leave an open- tion process, and instead researchers can run a multiple re- ing for non-causal, statistical association. For example, the gression (possibly employing machine learning techniques) set of variables in Figure 2(h) could be considered to be a 44,45 to directly estimate the effect of interest . set of relevant control variables which enables us to get un- In the case where researchers are interested in the es- biased estimation of the effect of treatment on the out- timation of multiple paths (for example, in a mediation come . However, it is worth considering that a set of con- model), one can choose either to undertake a series of trol variables itself may comprise a complicated structure multiple regression analyses (and we provide examples of in its own right, and we consider two cases in the examples this below), or to estimate them simultaneously using the section below. SEM estimation framework. In both cases, however, all ef- fects of interests must fulfil the requirements for identifi- CONDITIONAL INDEPENDENCIES cation. 
In other words, the estimation multiple causal ef- fects (e.g., from treatment to mediator and from mediator The visual graphs provide us with a way to directly read to outcome) requires that all effects can be identified from off the conditional independency structure of the model. Collabra: Psychology 6 Prespecification of Structure for the Optimization of Data Collection and Analysis Conditional independencies tell us whether the inclusion of than we already knew. This renders statistically indepen- additional information changes anything about our knowl- dent of given , which can be expressed as: . edge. For instance, consider the (illustrative) fully mediated This is known as a conditional independence statement, be- model Testosterone Bone Length Height. This model cause it tells us which sets of variables are independent of tells us that, in the absence of a direct path from Testos- each other given a set of conditioning variables. It is worth terone to Height, if we already know someone’s Bone noting that when we run a regression (logistic or otherwise) Length, knowing their Testosterone in addition changes we are estimating some expected outcome conditioned on nothing about their likely height. In other words, no more some set of predictors. Running the regression to estimate of the statistical dependency between Testosterone and (i.e., the expected value of , controlling for Height is left to explain once Bone Length is known. Equiv- and ) from data generated according to a fully mediated alently, if we condition our knowledge on Bone Length, DGP will result in the same consequences as above: the fact Testosterone is rendered conditionally independent of we have included means that the importance given to Height. Indeed, if a linear regression is used to estimate will be zero (notwithstanding finite sample deviations). 
the effect of Testosterone on Height, but we include Bone Clearly, therefore, an understanding of the structure is ab- Length as a control variable, the coefficient on Testosterone solutely crucial for constructing the regression models . will tend towards zero. This is a useful example which high- For instance, if is a treatment variable and we do not lights the importance of a consideration for structure and recognise as a mediator, the inclusion of in the model the associated conditional independencies - if we do not al- will result in a negligible coefficient estimate for which ready know that the process is fully mediated, we might in- may well mislead us to think the treatment is ineffective. correctly arrive at the conclusion that Testosterone is unre- To generalise this result to other graph structures, it is lated to Height. worth committing some rules to memory. If a graph con- If our graph Testosterone Bone Length Height is a tains these two substructures: sufficient representation of the process in reality, and if the statistical relationships hold in the data we observe, then the graph is also said to be Markovian (i.e., the ‘Markov con- then knowing/conditioning on renders and statisti- dition’ holds). In fact a Markovian graph is simply a graph cally independent. Of course, without this conditioning, , for which its implied conditional independencies hold in , and are all statistically dependent. These two graphs the data it is being used to model. Conversely, if their exists are known, respectively, as a chain and a fork. One can start one or more unobserved variables which we have failed to to write the complete list of conditional independencies include in our model, and which influence the statistical which are implied by both of these two graphs is: dependencies in our data such that the Markov condition no longer holds, the graph is said to be semi-Markovian. 
If The first, , means that is not statistically indepen- we suspect a graph is semi-Markovian because of the pres- dent of (because causes ), the second means that ence of some unobserved confounder(s), we should do our is not statistically independent of (because causes best to update our graph and include this unobserved fac- through ), and so on. Importantly, both of the graphs in tor, so that the rules apply to our (now Markovian) model. Eq. 2 imply the same set of conditional independencies, and If we find this unobserved variable is necessary for identifi- therefore there is no way to tell them apart using statistical cation, but we simply cannot collect data for it (it might not dependencies alone. Alternatively, if a graph is structured be an easily measurably factor), then it may not be possible as follows: to estimate the causal effects of interest. Whether or not a causal effect of interest is identifiable is important to un- we have what is known as a collider. Unlike the examples in derstand early on, because it may determine the feasibility Eq. 2, variables and are actually already independent of the study. This is another reason why a graphical specifi- such that . A collider is also depicted in Figure 2(e), cation of a theory can be useful. and the parallel vertical red lines depict the ‘break’ in sta- We can use conditional independencies to isolate causal tistical dependence between and . Furthermore, con- from non-causal statistical dependence (the task of identi- ditioning on in this structure actually induces statistical fication described above), as well as to identify which vari- dependence between and - a phenomenon known as ables we need to include or exclude in our SEM. Starting 17,40 explaining away . 
A corresponding list of conditional with the example in the full mediation model of Figure 2(b), independency statements for this collider is therefore: we see that variable cannot contain information about which does not already ‘pass’ through . Therefore, if we already know , knowing tells us nothing more about 3 One might consider sensitivity analysis as a means to quantify the extent to which a causal effect can be explained by unobserved third variables . 4 Given that the chain and the fork are yield statistically equivalent data, it is worth considering the implications for testing for mediation structures. Collabra: Psychology 7 Prespecification of Structure for the Optimization of Data Collection and Analysis Variables are known as ancestors of downstream descen- dants if there exists a directed path between the variables. A direct descendent is also called a child, and the direct as- cendant is called a parent. Note that conditioning on de- scendants of the variable in the two graphs depicted in Eq. 2 can partially render and independent (because it essentially contains critical information from via ). Similarly, conditioning on a descendent of the collider vari- able in Eq. 4 can also render variables and partially dependent. Of course, two variables are either dependent Figure 3. An illustration of ‘infinite mediation’. or not, and the partial terminology is used here to com- Note. This figure illustrates that between any two cause-effect pairs, there exists an al- municate that the effect of conditioning is not as strong as most infinitely decomposable chain of intermediate mediators. would be the case using itself, as opposed to one of its descendants. We can actually test for these conditional in- An SEM model can be reduced in size to comprise only dependencies using conditional independence tests (which, the variables and paths necessary to estimate set of paths in the linear Gaussian setting are essentially partial corre- of interest. 
Considering, again, Figure 2(f), if we are only lations). These tests can then be used to discover the under- interested in the path coefficients proximal to the variables lying structure in the data - a task known as causal discov- and , we do not need variables or , thus reducing ery, for which many methods exist . the number of estimated paths from ten (if we include the Finally, returning to Figure 2(h), which was discussed paths from unobserved ) to five. We discuss more oppor- above in relation to estimating the effect of treatment on tunities below. outcome given some confounders , we know that for the substructure , we can achieve in PROJECTION order to essentially simulate the structure of the graph for the RCT in Figure 2(g). In other words, by conditioning on A cause-effect relationship can often be broken down into we ‘block the backdoor’ path of confounding statistical smaller and smaller subdivisions, until one starts talking dependence which ‘flows’ from treatment to outcome by about the effect of one molecule on the next to explain a conditioning on . This leaves only the one statistical path, simple game of billiards. As per Figure 3, each subdivision which is also the causal path we care about. In this case, the of the cause-effect relationships between and could be statistical dependence is equivalent to the causal depen- represented as a mediating path with an infinite number dence we wish to estimate. Thus, we have used conditional of intermediate mediating paths. By consequence of the independency rules to isolate the causal statistical depen- Markov assumption (described above) it is thankfully not dencies, and disentangle them from the non-causal statis- necessary to model all these intermediate mediators, and it tical dependencies. suffices to abstract to the key ‘beginning and end points’. 
For instance, it is not necessary to know the intermediate MARKOV BLANKET position and velocity of a billiard ball (assuming these are well known), but it may be important to know when/if it The conditional independency rules introduced above can changes course following a collision. One can, for example, be used to define a Markov Blanket. Essentially, the blanket 47(p40) reduce simply to . Of course, if constitutes a set of variables which yield conditional inde- one is specifically interested in a mediating variable then pendence between variables ‘within’ the blanket, and those one can collect the relevant data and explore the process outside it. The notion of a Markov Blanket confirms the (such examples are provided below). Some reductions may idea that not all variables are necessarily needed to esti- yield an intractably blunt abstraction, or, in the extreme, mate or identify a particular causal effect. The implication a form of infinite causal regress (e.g. regressing all first of this is that if we have knowledge of a set of conditioning causes to our birth or the beginning of time), and one might variables, other variables which are causally ‘downstream’ instead consider more modest examples, such as whether of these conditioning variables become effectively ‘discon- 5 a treatment is mediated by some psychological mecha- nected’ from those which are upstream. nism(s). In this case, one can nonetheless reduce the prob- Consider Figure 2(f) which depicts a Markov blanket lem (via projection) to an estimation of the total effect of around variables and . The underlined variables , , treatment on the outcome, thus aggregating the interme- and constitute the Markov blanket - knowing or condi- diate direct and indirect effects and thereby reducing the tioning on these variables renders and independent of complexity of the graphical representation. variables and , which are outside of the blanket. 
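The rules reviewed above (chain, collider, backdoor adjustment, and projection) can be checked numerically with a small simulation. The sketch below is an illustration only and is not taken from the paper or its supplementary material: the function names `ols` and `simulate`, the variable names, the path coefficients (0.8, 0.5, 0.7, 0.9, 1.0), and the sample size are all assumptions chosen for the example. It uses only the Python standard library and fits each regression by ordinary least squares, which is what the corresponding lm() call would do in the linear case.

```python
import random


def ols(y, *predictors):
    """Least-squares coefficients [intercept, b1, b2, ...] via the normal equations."""
    n = len(y)
    cols = [[1.0] * n] + [list(p) for p in predictors]
    k = len(cols)
    # Augmented normal-equation matrix [X'X | X'y].
    a = [
        [sum(u * v for u, v in zip(cols[i], cols[j])) for j in range(k)]
        + [sum(u * v for u, v in zip(cols[i], y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for i in range(k):
        pivot = max(range(i, k), key=lambda r: abs(a[r][i]))
        a[i], a[pivot] = a[pivot], a[i]
        for r in range(k):
            if r != i:
                factor = a[r][i] / a[i][i]
                a[r] = [u - factor * v for u, v in zip(a[r], a[i])]
    return [a[i][k] / a[i][i] for i in range(k)]


def simulate(n=20000, seed=0):
    """Simulate three hypothetical linear-Gaussian DGPs and return key coefficients."""
    rng = random.Random(seed)
    noise = lambda: rng.gauss(0.0, 1.0)
    out = {}

    # Chain (full mediation): x -> m -> y, with assumed coefficients 0.8 and 0.5.
    x = [noise() for _ in range(n)]
    m = [0.8 * xi + noise() for xi in x]
    y = [0.5 * mi + noise() for mi in m]
    out["chain_total"] = ols(y, x)[1]       # projection: total effect, near 0.8 * 0.5
    out["chain_given_m"] = ols(y, x, m)[1]  # near 0 once the mediator is included

    # Collider: a -> k <- b. The two causes are independent until k is conditioned on.
    a = [noise() for _ in range(n)]
    b = [noise() for _ in range(n)]
    k = [ai + bi + noise() for ai, bi in zip(a, b)]
    out["collider_marginal"] = ols(b, a)[1]    # near 0
    out["collider_given_k"] = ols(b, a, k)[1]  # clearly non-zero (explaining away)

    # Backdoor: c -> t, c -> y2, t -> y2, with an assumed true treatment effect of 1.0.
    c = [noise() for _ in range(n)]
    t = [0.7 * ci + noise() for ci in c]
    y2 = [1.0 * ti + 0.9 * ci + noise() for ti, ci in zip(t, c)]
    out["naive"] = ols(y2, t)[1]        # biased upwards by the open backdoor path
    out["adjusted"] = ols(y2, t, c)[1]  # near 1.0 after conditioning on c
    return out


if __name__ == "__main__":
    for name, value in simulate().items():
        print(f"{name}: {value:.3f}")
```

Under these assumed coefficients, the coefficient on the cause in the chain collapses towards zero when the mediator is included, the collider regression shows a dependence that is absent marginally, and only the adjusted regression recovers the simulated treatment effect. The ordinary least squares solver is written out by hand purely to keep the sketch dependency-free; any regression routine would serve.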
5 It is possible to have variables which fall into the set of defining Markov blanket variables but which do not need to be explicitly conditioned on. This can occur, for example, in the presence of a collider structure which may already render upstream variables (which are outside of the blanket) statistically independent of those within the blanket, without conditioning (recall that conditioning on a collider can open up an otherwise ‘closed’ path).

Figure 4. Finding the reduced model.
Note. This figure presents a number of examples for taking the full ‘true’ Data Generating Process (DGP) and finding the reduced graph and minimal linear/logistic regression required to answer a given research question.

REDUCING SEMS - WORKED EXAMPLES

In the previous section we reviewed four concepts which we will use for simplifying our SEMs without introducing bias into our effect estimates: (1) causal identification, (2) conditional independencies, (3) Markov Blankets, and (4) projection. In order to demonstrate these various techniques, we will walk through a number of examples which are presented in Figure 4. For each example, we specify (a) a full DGP as our starting point which we assume to be true and complete (‘Full DGP’ in Figure 4), (b) a set of causal effects of interest, which must be identifiable for subsequent estimation (‘Research Question’ in Figure 4), (c) a minimal SEM (denoted Reduced in Figure 4), and (d) syntax for the R lm() function for a multiple regression. Five example DGPs are shown in Figure 4. Again, whilst we are not concerned with the estimation itself, note that one can choose to either use the SEM framework to estimate all the path coefficients in the resulting model, or one can undertake (possibly multiple) regressions to arrive at the same goal. In both cases, the graphical representation of the theory is what enables us to reduce the model in a way which does not invalidate the subsequent analysis (as well as increasing transparency, helping us to think more deeply and concretely about the causal process, etc.).

In practice, the graphical representation of our DGP will be developed using domain knowledge and/or causal discovery techniques [14,37,48]. For now, we provide general examples with a view to demonstrating the ways in which the concepts reviewed above can be used to reduce our SEM. Similarly, in practice the set of paths of interest will be determined by our research questions and our hypotheses. Note that it may be possible to simplify SEMs bearing in mind other techniques which are applicable to linear models (such as instrumental variables), but we focus on those techniques reviewed above because they are generally applicable to a much broader family of problems. Finally, it is worth remembering that if a set of variables and paths are not needed for the SEM, then we also do not need to collect these variables to begin with, thus saving additional time and expense which could be used to, for example, collect more samples of the variables that really matter. Note that some variables may not strictly be necessary for the estimation of the effect but may nonetheless be worthwhile including. For example, proximal causes of an outcome which do not interfere with our estimation of other desired causes can be used to increase the precision/tightness of our estimates.

Unobserved variables and/or latent constructs may also be integrated into the specification of the graph. In terms of the planning, these objects can be considered in the same way as other observed variables, at least insofar as they relate to the estimation of the causal dependence we are interested in. One may find, for example, that the existence of certain unobserved variables fundamentally precludes identification (i.e., the estimation of the target effect), perhaps because they induce a backdoor/confounding path between the ‘treatment’ and the outcome. Conversely, one may find that either certain unobserved variables, or particular latent constructs, are not necessary for the identification of the target effect. We later consider a number of worked examples involving unobserved variables (Examples 3 and 4).

To motivate the examples, we will attempt to describe semi-plausible DGPs for psychological processes, but note that these examples are likely to be overly simplistic, and are only intended to illustrate the process. We will discuss each of the examples in Figure 4 in turn. Finally, in the supplementary material we also provide simulation results for DGPs 2-5 in Figures A1-A3.

6 We omit simulations for DGP 1 because it represents a reduction of the other examples, and so including it is somewhat redundant.

EXAMPLE 1: MEDIATED TREATMENT

Starting with the first example depicted in Figure 4, let us begin by considering what this graph could possibly represent. The outcome variable could be, for example, depressive symptoms following a therapy (the treatment), the effect of which is mediated by therapeutic alliance. The set of control variables represents covariates that influence the choice of therapy modality as well as the likelihood of recovery, and includes factors such as age, gender, history of mental health problems, and so on. Finally, a further variable could represent a personal attitude which influences the choice of treatment but which does not influence whether the person recovers.

For this example, let us assume that our research question concerns estimation of the efficacy of treatment on the outcome, i.e., the path treatment → outcome. The reduced model (denoted in Figure 4 as Reduced) requires three fewer paths to estimate this effect. Firstly, if we are not interested in the particulars of the mediated path treatment → mediator → outcome then we do not need to include the mediator, or therefore to collect data for it (afforded by the projection concept reviewed above). Secondly, even though there exists a spurious/confounding/backdoor path treatment ← covariates → outcome, we do not need to estimate the actual path from the covariates to the treatment so long as we include the path from the covariates to the outcome. The inclusion of the covariates facilitates identification of the principal effect of interest, treatment → outcome. Note that in this case we do not have to use SEM for the estimation procedure. Indeed, in this example we are not interested in the path coefficients linking the covariates to the outcome either, even though these paths must be included to acknowledge the dependence that the outcome has on the covariates and to block the backdoor path. Given we are only interested in the path from treatment to outcome, we can simply run a multiple regression, using the covariates as control variables and restricting interpretation to the coefficient on the treatment. Note that the resulting lm() syntax contains only the two necessary components as predictors: the treatment and the set of control variables.

Finally, we do not need to include the attitude variable in the model (neither do we need to collect data for it) because it is not necessary for the causal identification of the target causal effect of interest. Adding the path from attitude to treatment into the model is superfluous to the effect we are interested in.

7 Indeed, its inclusion can even increase the standard errors on the effect of treatment, because it makes it ‘harder’ to disentangle the variance in the treatment that stems from the attitude variable and the residual variation of the treatment.

EXAMPLE 2: STRUCTURED CONTROLS

The first graph with structured controls is given as example 2 in Figure 4. We can consider the meaning of the first four variables to be the same as in Example 1, that is, attitude, treatment, treatment-outcome mediator, and outcome, respectively. The difference now is that we also have a mediation child, an outcome child, and a structured set of control variables. If, as indicated in example 2i, we are only interested in estimating the effect of treatment on outcome then, as in the first example, we can ignore the attitude and the mediator. Similarly, we can also exclude the mediation child and the outcome child from our reduced model, as their existence in the DGP does not change the principal relationship we are interested in.

There still exists a backdoor path through the control variables, and so we need to understand which of the associated variables and paths to include in our reduced model to adjust for this spurious path. There exist several alternative sets of control variables which block this path. Note that one of the control variables is not an option by itself, because conditioning on it alone would leave a path through another control variable open. Note also that, as before, not every path along this backdoor route needs to be estimated, because we are not interested in those effects. Thus, overall, our initial/complete model reduces to the estimation of only two paths (reduced from ten), as in the previous example. The linear regression also remains equivalent.

If our research question involved the estimation of the mediation, as in example 2ii in Figure 4, then the only change to the model needs to be the inclusion of the mediation path from treatment, through the mediator, to the outcome. The linear regression now involves two stages to decompose the problem into two sets of paths (one for the path into the mediator, and the other comprising the paths into the outcome).

EXAMPLE 3: COLLIDING CONTROLS

One might be forgiven for thinking that the safest thing to do with a set of control variables is to always include them in the model to make sure we are blocking the backdoor paths. In the previous example, for instance, we could just play it safe by including all of the control variables. However, example 3 in Figure 4 shows that some putative control variables may include collider structures. Let us consider that three of the variables are class size, math exam score, and language exam score, respectively. The remaining variables are a mediator, such as whether a student does their homework; Socio-Economic Status (SES), since perhaps children with higher SES attend schools with smaller class sizes and have better grades overall; an unobserved attribute of intelligence; a measured attribute of intelligence; and musical ability.

Based on example 3i we are interested in the effect of class size on math exam score. It might be tempting to include the paths concerning the other related scores (such as language score, or musical ability). In the case of musical ability, we could include the associated paths without causing any problems, but it doesn’t actually help us estimate the effect we are interested in. Indeed, the collider structure prevents any backdoor information affecting our estimation of the effect of class size on math exam score, so we do not need these paths for causal identification. Another collider exists elsewhere in the graph, and even though the structure is the same, the fact that one of the variables in this structure is unobserved means we cannot and should not include the collider in the model. Indeed, if the collider were to be included (without the unobserved variable, which cannot be included) we would induce a spurious path between class size and math exam score via the collider and the unobserved variable. Thus, whilst these might appear to be tempting control variables which we might think would, at best, increase precision and, at worst, do nothing, in fact they should not be included owing to the collider structure with an unobserved variable.

We have no need to include paths relevant to the other peripheral variables in our model. Including one of these paths may improve the precision of our desired estimate, but it is not necessary. The partial mediation through homework, if not part of our research question, also does not need to be included. The only path we have to be concerned about is the backdoor path through SES, and we can deal with the induced statistical path by simply including the path from SES to math exam score. In this case, the reduced model contains two paths, whilst the full model (including the unobserved paths) involves thirteen. The corresponding linear regression is equally simple, and only includes class size and SES as predictors.

If we are interested in the partial mediation of class size, homework, and math exam score, then we can simply augment the reduced model from example 3i to include this additional structure. The linear regression also changes to accommodate the estimation of the additional paths, as with example 2ii.

EXAMPLE 4: SIMPLE UNOBSERVED CONFOUNDING

The fourth example is relatively straightforward. Here, the three observed variables could represent relationship satisfaction, partner support, and communication style, respectively, and there is an unobserved confounder between support and communication. The unobserved confounder induces a non-causal statistical dependence between the cause of interest and the outcome through the third observed variable, and the reduced model therefore needs to include the path from that variable to the outcome. The linear regression, similarly, needs only two observed variables as predictors.

EXAMPLE 5: LONGITUDINAL DYADIC EFFECTS

The final example concerns a longitudinal dyadic process, whereby variables for relationship satisfaction for two individuals are collected at three timepoints, but there exist intermediate opportunities where confounding could occur. This confounding could represent, for example, shared stressful events. The target causal effects involve all of the ‘actor effects’ (that is, autocorrelation in each individual’s variables which results in similar values across consecutive timepoints), as well as two partner effects, one of which is a ‘concurrent’ effect. This example demonstrates when the use of SEM may be less complicated than undertaking a series of multiple regression tasks; our research question concerns the estimation of six separate causal effects, all of which have to be identified.

8 Even though the causal framework does not strictly admit simultaneity (there must be some time delay between the cause and the effect), we assume that this concurrence is permitted according to the data collection procedure (i.e., within wave three, one partner can influence the other with some arbitrary time delay which is not distorted by the otherwise cross-sectional nature of the data collection methodology).

Some of the paths do not need to be estimated, so long as we include the path which enables us to block the relevant backdoor path and thereby identify the target effect. For the same reasons, a further path does not need to be estimated. In this example, we are not able to make any data collection savings (i.e., we need to collect all variables), even though some of the path coefficients are not needed for estimation of the principal causal effects of interest.

REAL WORLD EXAMPLE

To motivate the application of the techniques to non-synthetic examples, we have chosen a graph adapted from a paper published in the domain of business psychology and organizational behaviour. The graph is shown in Figure 5, and was presented to test the relationship between personality (‘P’ in the graph) and salary (‘S’). First, let us consider the model required in the case where our research question solely concerns P → S. The only non-causal path from personality to salary, assuming the graph shown in Figure 5, is via gender: P ← gender → S. The reduced graph is shown in Figure 6i. In this case, the simple regression of S on P and gender would suffice, and the graphical representation of the SEM would be P → S ← gender. Once again, it is only possible to confirm this if we already have a representation of our model which enables us to identify the required control variables.

Figure 5. Real-world example graph.
Note. Real-world example graph adapted from Spurk and Abele.

Figure 6. Reduced real-world example graph.
Note. Reduced real-world example graphs for the real-world DGP assumed in Figure 5. Bold black lines are those key to a multiple-mediation research question, whereas red dashed lines are those that may be excluded from a graphically specified SEM without affecting the estimation of the target paths.

In the original work, the researchers were specifically interested in a double-mediation by occupational self-efficacy (‘OSE’) and career advancement goals (‘CAG’), which represent the first set of mediating variables, and working hours (‘WH’), which represents a second mediation of the effect of personality on salary. In this case, all variables are required for the analysis, and no savings can be made at the data collection stage, but we can nonetheless reduce the number of paths to be estimated. The reduced graph is shown in Figure 6ii. Identifying this reduced solution by eye is already becoming challenging, and automated tools (such as the one provided in supplementary material) are helpful in ensuring the reduction is correct. In addition, identifying the set of multiple regressions which can yield unbiased estimates of each of the target paths is also quite involved, and this example demonstrates how the SEM estimation framework might provide a more convenient alternative. In any case, it can be seen that six out of a total of 24 paths were not required.

DISCUSSION

We have provided a number of didactic examples showing that if we are presented with a specific question regarding a relatively complex process, we can simplify our SEMs considerably. The simplification process takes advantage of a number of graphical rules, and does not introduce any additional assumptions to those which already apply to the full model. Furthermore, researchers are also free to choose whether they actually wish to estimate all the path coefficients using the SEM framework itself, or whether a multiple regression would be more straightforward. Indeed, in cases where only a single causal effect needs to be estimated, one might consider using the graphical representation first, and then estimating it using a multiple regression instead. In this work we provided both the graphical representation of the SEM that one needs to estimate in order to answer a research question relating to one or more causal effects, as well as the equivalent multiple regression equation(s).

In one of the demonstrative examples, an SEM with upwards of thirteen paths was reduced to only two. The simulation results provided and discussed in the supplementary material highlight unsurprising improvements in adjusted model fit metrics (unsurprising because simpler models are penalised less than complex models according to such metrics). Importantly, note that the simplification process does not bias the effect size estimates.

Even without the simplification process, translating a psychological theory into a graph is a worthy exercise, particularly when undertaken before the data collection stage. It helps us be transparent and unambiguous about our model and assumptions, increases specificity for preregistration, and can highlight potential methodological challenges and difficulties before any resources have been expended. It may even highlight cases where estimation is not possible, and this relates to the problem of causal identification. For example, if there exists an unobserved confounder between a cause and its effect, the causal effect cannot be estimated because the non-causal statistical association induced by the confounder cannot be adjusted for without access to the confounder.

Acyclic Graphs (DAGs). DAGs do not make assumptions about the parametric (e.g., normally distributed vs. non-parametric) form of the variables, nor about the functional (linear vs. non-linear) form relating variables. This means that when one uses our proposed method to construct and subsequently simplify a graphical structure, they can also consider themselves to be working directly with a DAG. If the researcher then wishes to avoid making assumptions about the functions and distributions, they do not have to use the SEM framework to do the estimation, but can instead use non-parametric regression or machine learning techniques (a discussion about which is beyond the scope of this paper). Indeed, another reason that we provide the multiple regression syntax is because its specification can be generalized relatively straightforwardly to non-parametric settings. For example, the specification of a regression of an outcome on a treatment and a set of control variables relates to the estimation of the conditional expectation of the outcome given the treatment and the controls. The conditioning set given on the right hand side of the tilde in the regression syntax, or the right hand side of the conditional expectation, comprises the variables/predictors in the regression which are being used to identify the causal effect(s) of interest, and this can be done in both linear parametric as well as non-linear, non-parametric settings.

The reduction which is achievable depends on the research questions being asked, as well as the requirements of the researcher. We foresee that some researchers may wish to collect more variables than are strictly required for identification to future-proof their datasets, thereby facilitating the testing of currently unspecified hypotheses. The collection of extra variables can not only provide the opportunity for researchers to answer potentially unforeseen research questions, but it also enables researchers to include
These ‘hedge’ variables, in cases where the theory specification problems can, again, be seen by an inspection of the graph, is uncertain and researchers do not want to risk variable and it is worthwhile identifying these problems sooner omission. Indeed, if the researcher is contending with mul- rather than later. In practice, such problems may be com- tiple hypothesized graph structures, they may wish to avoid mon, and either a researcher must do all they can to ac- putting all their eggs in one basket by collecting only the count for the possible unobserved confounders, or they smallest set of variables relevant for one particular graph must assume that a sufficient number have been already and one particular research question. By ‘over-collecting’ collected to assume that the problem is ‘ignorable’ . In variables, they may also open up opportunities to under- general, it is important to remember that the goal of esti- take causal discovery - a data driven approach to the vali- mating causal effects rests on a number of strong (and of- dation of putative causal structures. Without the extra vari- ten untestable) assumptions. However, it is only by taking ables, researchers would be somewhat stuck with what they causality seriously that we can understand what these as- have. sumptions are and whether they are reasonable. Finally, researchers should be mindful that the success of the approach rests on the degree of correct specification LIMITATIONS achieved when the DGP model is constructed. However, this limitation applies to all statistical approaches which con- We have used SEM throughout the text because researchers cern the estimation of interpretable / causal effects, and in psychology may be familiar with this framework . Fur- this approach does not alleviate the consequences of model thermore, if they wish/need to estimate latent variables, misspecification. Furthermore, reducing model complexity the SEM framework readily facilitates this. 
Note, however, may reduce the precision of the estimation because less that SEM is generally considered to be an estimation frame- explanatory power may be available to estimate an effect. work, rather than a means to graphically represent one’s This is evidenced by a review of the simulation results for causal theory. Furthermore, SEM usually assumes linear the -values in the supplementary material. This downside (or at least pre-specified) functional relationships between is somewhat offset by the possibility that, with a simpler variables. Fortunately, and as we briefly discussed earlier, model, a larger sample size may be acquired for equivalent all the rules and techniques discussed in this work belong cost. For example, if the simplification process indicates to a broader class of graphical model known as Directed Collabra: Psychology 13 Prespecification of Structure for the Optimization of Data Collection and Analysis that a number of constructs with large inventories are no about and formal specification of the causal structure of longer required, we may gain back significant data collec- data generating process itself, and does not concern redun- tion time which can be put towards the recruitment of more dancies in the scales used to measure the constructs/vari- participants. Such possibilities therefore enable us to in- ables within this structure. The data generating process can crease statistical power for estimating the effects we really therefore be considered independently of scale-item redun- care about. Furthermore, the specification of larger mod- dancy. Similarly, planned missingness techniques include els increases the chances of misspecification (simply put, in split form designs which split large questionnaires into the specification of larger graphical models, there is more multiple smaller blocks, each of which is completed by par- opportunity for error). 
Reducing the model and being spe- ticipants at different stages of a longitudinal design. Alter- cific and less ambitious about the number of primary effect natively, multiple imputation provides researchers with a sizes of interest (as opposed to wishing to estimate as many way to leverage statistical associations to compensate for effects as possible) increases the likelihood that, at the end instances of missing data. Again, in contrast with our pro- of the project, we have estimated something meaningful. posal, this approach does not consider the opportunities al- ready implicit in the specification of our theory. RELATED OPTIONS CONCLUSION It is worth noting that other approaches for streamlining data collection and reducing study cost, such as the tools In summary, graphical representations of our theories pro- 50,51 for the development of short-form scale design and vide us with an opportunity to encode our domain knowl- planned missingness design . In the case of the former, edge about a particular phenomenon of interest. In this researchers can use statistical techniques to identify re- paper we showed that, by using graphical modeling rules duced scale designs which provide similar performance in (in particular, the concept of conditional independencies), terms of certain scale quality measures, such as validity. In we can significantly shrink the required causal structural the case of the latter, there are a number of planned miss- model without affecting the validity of the associated esti- ingness techniques which enable researchers to amortize mates, thereby reducing the required sample size and en- data collection cost over the course of a longitudinal de- abling us to redirect resources and funds towards the col- sign, or to leverage statistical associations to compensate lection of variables which are critical to answering the for foreseen missing data. These methods differ signifi- questions we care about. 
cantly from our proposal, and can even be used in combi- nation with ours. Specifically, the short-form scale design Submitted: September 23, 2022 PST, Accepted: January 20, approaches are motivated by the fact that there may exist 2023 PST redundant information in a scale which is already repre- sented by other items (or combinations, thereof). In con- trast, our approach is concerned with the assumptions This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License (CCBY-4.0). View this license’s legal deed at http://creativecommons.org/licenses/by/4.0 and legal code at http://creativecom- mons.org/licenses/by/4.0/legalcode for more information. Collabra: Psychology 14 Prespecification of Structure for the Optimization of Data Collection and Analysis REFERENCES 1. Lavrakas P. Encyclopedia of survey research 12. Spurk D, Abele AE. Who earns more and why? A methods. 2008;1. doi:10.4135/9781412963947 multiple mediation model from personality to salary. J Bus Psychol. 2010;26(1):87-103. doi:10.1007/s1086 9-010-9184-3 2. Sassenberg K, Ditrich L. Research in social psychology changed between 2011 and 2016: Larger sample sizes, more self-report measures, and more 13. Blanca MJ, Alarcón R, Bono R. Current practices online studies. Advances in Methods and Practices in in data analysis procedures in psychology: What has Psychological Science. 2019;2(2):107-114. doi:10.1177/ changed? Front Psychol. 2018;9. doi:10.3389/fpsyg.20 2515245919838781 18.02558 3. Baker DH, Vilidaite G, Lygo FA, et al. Power 14. Vowels MJ. Misspecification and unreliable contours: Optimising sample size and precision in interpretations in psychology and social science. experimental psychology and human neuroscience. Psychological Methods. Published online October 14, Psychological Methods. Published online 2020. 2021. doi:10.1037/met0000429 4. Correll J, Mellinger C, McClelland GH, Judd CM. 15. Wright S. Correlation and causation. 
Journal of Avoid Cohen’s ‘Small’, ‘Medium’, and ‘Large’ for Agriculture Research. 1921;20:557-585. Power Analysis. Trends in Cognitive Sciences. 2020;24(3):200-207. doi:10.1016/j.tics.2019.12.009 16. Wright S. The theory of path coefficients: A reply to Niles’ criticism. Genetics. 1923;8(3):239-255. doi:1 5. Aarts AA et al. Estimating the reproducibility of 0.1093/genetics/8.3.239 psychological science. Science. 2015;349(6251):943-950. doi:10.1126/science.aac471 17. Pearl J. Causality. Cambridge University Press; 2009. doi:10.1017/cbo9780511803161 6. Scheel AM. Why most psychological research 18. Wagenmakers EJ, Wetzels R, Borsboom D, van der findings are not even wrong. Infant and Child Maas HLJ, Kievit RA. An agenda for purely Development. 2022;31(1). doi:10.1002/icd.2295 confirmatory research. Perspect Psychol Sci. 2012;7(6):632-638. doi:10.1177/1745691612463078 7. Nosek BA, Ebersole CR, DeHaven AC, Mellor DT. The preregistration revolution. Proc Natl Acad Sci 19. Flake JK, Fried EI. Measurement USA. 2018;115(11):2600-2606. doi:10.1073/pnas.1708 schmeasurement: Questionable measurement practices and how to avoid them. Advances in Methods and Practices in Psychological Science. Published 8. Navarro DJ. If mathematical psychology did not online January 17, 2019. doi:10.31234/osf.io/hs7wm exist we might need to invent it: A comment on theory building in psychology. Perspect Psychol Sci. 20. Scheel AM, Tiokhin L, Isager PM, Lakens D. Why 2021;16(4):707-716. doi:10.1177/1745691620974769 hypothesis testers should spend less time testing hypotheses. Perspectives on Psychological Science. 9. Haslbeck JMB, Ryan O, Robinaugh DJ, Waldorp LJ, Borsboom D. Modeling psychopathology: From data 21. Gigerenzer G. Statistical rituals: The replication models to formal theories. Psychological Methods. delusion and how we got there. Advances in Methods Published online November 4, 2021. doi:10.1037/met and Practices in Psychological Science. 2018;1(2):198-218. 
doi:10.1177/2515245918771329 10. Grosz MP, Rohrer JM, Thoemmes F. The taboo 22. McShane BB, Gal D, Gelman A, Robert C, Tackett against explicit causal inference in nonexperimental JL. Abandon statistical significance. The American psychology. Perspect Psychol Sci. Statistician. 2019;73(sup1):235-245. doi:10.1080/0003 2020;15(5):1243-1255. doi:10.1177/174569162092152 1305.2018.1527253 23. Marsman M, Schönbrodt FD, Morey RD, Yao Y, 11. Rohrer JM. Thinking clearly about correlations Gelman A, Wagenmakers EJ. A Bayesian bird’s eye and causation: Graphical causal models for view of ‘Replications of important results in social observational data. Advances in Methods and Practices psychology.’ R Soc open sci. 2017;4(1):160426. doi:1 in Psychological Science. 2018;1(1):27-42. doi:10.1177/ 0.1098/rsos.160426 Collabra: Psychology 15 Prespecification of Structure for the Optimization of Data Collection and Analysis 24. Vankov I, Bowers J, Munafò MR. Article 37. Vowels MJ, Camgoz NC, Bowden R. D’ya like Commentary: On the Persistence of Low Power in DAGs? A survey on structure learning and causal Psychological Science. Quarterly Journal of discovery. ACM Comput Surv. 2022;55(4):1-36. doi:1 Experimental Psychology. 2014;67(5):1037-1040. doi:1 0.1145/3527154 0.1080/17470218.2014.885986 38. Peters J, Janzing D, Scholkopf B. Elements of 25. Maxwell SE. The Persistence of Underpowered Causal Inference. MIT Press; 2017. Studies in Psychological Research: Causes, Consequences, and Remedies. Psychological Methods. 39. Koller D, Friedman N. Probabilistic Graphical 2004;9(2):147-163. doi:10.1037/1082-989x.9.2.147 Models: Principles and Techniques. MIT Press; 2009. 26. Crutzen R, Peters GJY. Targeting next generations 40. Pearl J, Glymour M, Jewell NP. Causal Inference in to change the common practice of underpowered Statistics: A Primer. Wiley; 2016. research. Front Psychol. 2017;8. doi:10.3389/fpsyg.201 7.01184 41. Rosseel Y. An R Package for Structural Equation Modeling. J Stat Soft. 
2012;48(2):1-36. doi:10.18637/js 27. Gelman A, Hill J, Vehtari A. Regression and Other s.v048.i02 Stories. Cambridge University Press; 2021. 42. Huang Y, Valtorta M. Pearl’s calculus of 28. Sedlmeier P, Gigerenzer G. Do studies of intervention is complete. Proceedings of the Twenty- statistical power have an effect on the power of Second Conference on Uncertainty in Artificial studies? Psychological Bulletin. 1989;105(2):309-316. Intelligence. 2006;arXiv:1206.6831:217-224. doi:10.55 doi:10.1037/0033-2909.105.2.309 55/3020419.3020446 29. Goldberg LR. Personality psychology in europe. 43. Shpitser I, Pearl J. Complete identification In: Mervielde I, Deary I, De Fruyt F, Ostendorf F, eds. methods for the causal hierarchy. Journal of Machine Tilburg University Press; 1999:7-28. Learning Research. 2008;9:1941-1979. doi:10.5555/13 90681.1442797 30. Goldberg LR, Johnson JA, Eber HW, et al. The international personality item pool and the future of 44. Vowels MJ, Camgoz NC, Bowden R. Targeted VAE: public-domain personality measures. Journal of Variational and Targeted Learning for Causal Research in Personality. 2006;40(1):84-96. doi:10.101 Inference. 2021 IEEE International Conference on 6/j.jrp.2005.08.007 Smart Data Services (SMDS). Published online September 2021. doi:10.1109/smds53860.2021.00027 31. Hullman J, Kapoor S, Nanayakkara P, Gelman A, Narayanan A. The worst of both worlds: A 45. van der Laan MJ, Rose S. Targeted Learning - comparative analysis of errors in learning from data Causal Inference for Observational and Experimental in psychology and machine learning. arXiv preprint. Data. Springer International; 2011. 2022;arXiv:2203.06498. 46. Vowels MJ. Trying to outrun causality with 32. Cinelli C, Forney A, Pearl J. A crash course in machine learning: Limitations of model good and bad controls. SSRN Journal. Published explainability techniques for identifying predictive online 2020. doi:10.2139/ssrn.3689437 variables. arXiv preprint. 2022;arXiv:2202.09875. 33. 
Kline RB. Principles and Practice of Structural 47. Glymour C. The Mind’s Arrows. The MIT Press; Equation Modeling. Guilford Press; 2005. 2001. doi:10.7551/mitpress/4638.001.0001 34. Loehlin JC, Beaujean AA. Latent Variable Models: 48. Glymour C, Zhang K, Spirtes P. Review of causal An Introduction to Factor, Path, and Structural discovery methods based on graphical models. Front Equation Analysis. Routledge Taylor and Francis; Genet. 2019;10. doi:10.3389/fgene.2019.00524 49. Bollen KA. Model implied instrumental variables 35. Hilpert P, Brick TR, Flueckiger C, et al. What can (MIIVs): An alternative orientation to structural be learned from couple research: Examining equation modeling. Multivariate Behavioral Research. emotional co-regulation processes in face-to-face 2018;54(1):31-46. doi:10.1080/00273171.2018.148322 interactions. Journal of Counseling Psychology. Published online 2019. 36. Hünermund P, Bareinboim E. Causal inference and data fusion in econometrics. arXiv preprint. 2021;arXiv:1912.09104v3. Collabra: Psychology 16 Prespecification of Structure for the Optimization of Data Collection and Analysis 50. Greer F, Liu J. Pinciples and methods of test 54. Maruyama G. Basics of Structural Equation construction: Standards and recent advances. In: Modeling. SAGE Publications, Inc.; 1998. doi:10.4135/ Schweizer K, DiStefano C, eds. Hogrefe Publishing; 9781483345109 2016:272-287. 55. Hoyle RH, Panter AT. Structural equation 51. Smith GT, Combs JL, Pearson CM. Brief modelling: COncepts, issues, and applications. In: instruments and short forms. APA handbook of Hoyle RH, ed. SAGE Publications; 1995:158-176. research methods in psychology, Vol 1: Foundations, planning, measures, and psychometrics. Published 56. Maclaren OJ, Nicholson R. What can be online 2012:395-409. doi:10.1037/13619-021 estimated? Identifiability, estimability, causal inference and ill-posed inverse problems. arXiv 52. Wood J, Matthews GJ, Pellowski J, Harel O. preprint. 
2020;arXiv:1904.02826v4. Comparing different planned missingness designs in longitudinal studies. Sankhya B. 2018;81(2):226-250. 57. Díaz I, van der Laan MJ. Sensitivity analysis for doi:10.1007/s13571-018-0170-5 causal inference under unmeasured confounding and measurement error problems. The International 53. Raghunathan TE, Grizzle JE. A split questionnaire Journal of Biostatistics. 2013;9(2):149-160. doi:10.151 survey design. Journal of the American Statistical 5/ijb-2013-0004 Association. 1995;90(429):54-63. doi:10.1080/0162145 9.1995.10476488 Collabra: Psychology 17 Prespecification of Structure for the Optimization of Data Collection and Analysis no surprise because here the complexity of the model im- pacts our ability to reduce error for the path coefficients we APPENDIX: SIMULATION RESULTS are estimating (reducing the degrees of freedom). For simi- lar reasons, it is also not surprising that the differences for The purpose of the simulation is to illustrate the differ- the full and reduced models for DGP 5 were not different ences in , Root Mean Squared Error of Approximation - the reduced model did not differ greatly in its reduction (RMSEA), Comparative Fit Index (CFI), Mean Absolute Error of complexity. In this sense, reducing the complexity of the (MAE) and p-values, between two models which differ in model can have an effect on the resulting , in such a way complexity but which are otherwise correctly specified that yields a value which is considered desirable (of course, (with respect to the true, underlying DGP. It is worth noting in practice we should specify theories based on more than that is known as an ‘absolute’ fit index, and is not ad- just the resulting fit-statistics). justed for model complexity. A lower value indicates bet- In Figure A1 we provide estimates for the target effect ter fit and provides a measure of how much our sample co- size ‘Coefs’, on top of the true effect size ‘True Coef’. 
Im- variance matrix differs from our fitted covariance matrix. In portantly, the results confirm that the simplification contrast, RMSEA adjusts for the model complexity (favour- process does not bias the estimates - all model variants cor- ing model parsimony), and here a lower value is preferred. rect estimate the effect size. Finally, CFI is not adjusted for model complexity, and Results for CFI (higher is better) and RMSEA (lower is higher values are preferred. For more information on these better) are shown in Figure A2. Once again, the smaller metrics, readers are pointed towards works by Maruyama models are preferred and yield higher CFI values. This again (1998; ). comes as a consequence of the complexity of the larger It is important to note that under these conditions (and models and the concomitant impact on estimation. This when researchers use the process/tools presented in this notwithstanding, as the sample size increases, the results work), the causal effect size estimates are unbiased regard- converge fairly quickly. The RMSEA results indicate a great less of whether the full model or the reduced models are improvement with the use of the reduced models, particu- used. As such, even though the use of these tools can have larly for smaller sample sizes. This is not surprising beacuse an effect on the standard errors (and therefore also the RMSEA is an adjusted metric, and so the results are consis- -values and null-hypothesis significance testing), it does tent with the expectation that lower RMSEA values are as- not affect the large-sample performance of the model. In- sociated with smaller models. deed, this is evidence in the lower four plots of Figure A1, Finally, the p-values and MAEs for the target effect size which confirm that the choice of model does not affect the estimates are shown in Figure A3. For DGP 2 (top left plot), effect size estimates (all are unbiased). 
Nonetheless, it is the -values are higher for the reduced model than the important to understand the possible impact on the var- complete model. This is consistent with the expectation ious model metrics to understand that two different cor- that the inclusion of more variables can help increase the rectly specified models can yield different finite-sample be- precision of our estimates. Indeed, in general we expect haviours. These differences are discussed in more detail in that the inclusion of variables into a structural equation this section. model will reduce the standard error and, by the mathemat- Simulation results for DGP examples 2-5 in Figure 4 are ical expressions relating these quantities, also reduce the shown in Figures A1-A3. We use the sem function in the p-values. However, this is only reliably the case if the model lavaan library to estimate a single target effect for each is correctly specified, and the reason it happens is because variant. For the MAE and the p-values, we provide results we are able to partial out the variance more completely. For for a single effect of interest. For example, for the DGP re- example, consider the graph . Here, has search question 2ii in Figure 4, we specify the SEM models two causes, but let’s say that we actually only care about given in the ‘Full DGP’ and ‘Reduced’ columns and generate the link . In this case we have two options: create MAEs and -values for the total effect of on . Similarly, an SEM which includes (in addition to the for DGP research question 3ii, we specify the SEM models link), or create an SEM which does not. Note, however, that given in the ‘Full DGP’ and ‘Reduced’ columns, and gener- the inclusion of can help us estimate be- ate MAEs and -values for the total effect of on . Fi- cause it partials out variance in which, in a finite sample, nally, for example 5, we specify the SEM models given in might otherwise be attributable to . 
Unfortunately, in the ‘Full DGP’ and ‘Reduced’ columns, and generate MAEs practice it may not be as simple as this, because every time and -values for the total effect of on . we include a new variable and a new path, we also increase For each of the example DGPs, we generate data across the chances that we incorrectly specify the graph. Thus, a range of sample sizes (10-200), and for each sample size whilst the option to reduce standard error by the inclusion we undertake 100 simulations. The results of these 100 sim- of more paths is perhaps still a good thing to consider/un- ulations are used to derive means and standard deviations derstand in general, doing so requires us to be more and for each of the metrics, thus allow us to compare the results more confident that our specification is correct as we in- when specifying the full DGP model compared with the re- clude more and more paths in our model. duced models. Returning to the examples in the figure, the reduced Starting with the results for the model fit metrics in model in DGP 2i only includes two effects of the outcome Figure A1, we see that for DGPs 2-4 the reduced models , which is and . However, other more proximal vari- have better fit (lower indicates better fit). This comes as Collabra: Psychology 18 Prespecification of Structure for the Optimization of Data Collection and Analysis ables and exist, and their inclusion would improve the quality of the estimate. In this case, and would be dou- bling as both control variables (adjusting for the backdoor path from to , as well as variables which aid in precision . Note also that the standard deviation of these p-values is higher, indicating greater variation across simulations. This increased variance also results in a higher MAE, which is also evidence in the DGP2 - MAE plot in Figure A3 (third row, first column). 
Thus, even though the effect size esti- mates will be unbiased (owing to correct specification of the reduced model with respect to the full DGP), the removal of explanatory variables can impact the precision of the es- timates. In order to compensate for this, one can choose to retain variables which have explanatory power so long as their inclusion does not contradict the full, underlying model. DGP 2 represents a useful example insofar as vari- ables and can be included (optionally in addition to ), to help explain the effect of on . Collabra: Psychology 19 Prespecification of Structure for the Optimization of Data Collection and Analysis Figure A1. Simulation and Coefficient Estimation Results. Note. Averages and standard errors over 100 simulations with varying sample sizes for and estimated coefficient values for data generated from Data Generating Processes (DGPs) 2-5 in Figure 4. Collabra: Psychology 20 Prespecification of Structure for the Optimization of Data Collection and Analysis Figure A2. Simulation CFI and RMSEA Results. Note. Averages and standard errors over 100 simulations with varying sample sizes for Comparative Fit Index (CFI) and Root Mean Squared Error of Approximation (RMSEA) for data generated from Data Generating Processes (DGPs) 2-5 in Figure 4. Collabra: Psychology 21 Prespecification of Structure for the Optimization of Data Collection and Analysis Figure A3. Simulation p-value and MAE Results. Note. Averages and standard errors over 100 simulations with varying sample sizes for p-values and Mean Absolute Error (MAE) for data generated from Data Generating Processes (DGPs) 2-5 in Figure 4. 
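The finite-sample point made in this appendix (a correctly specified reduced model stays unbiased, but dropping a second cause of the outcome can cost precision) can be illustrated with a small simulation. The following is only a minimal sketch under stated assumptions, not the simulation reported in the paper: it swaps lavaan's sem function for ordinary least squares written in plain Python, and the variable names (X, Z, Y), the graph X → Y ← Z, the coefficients 0.5 and 1.5, and the helper names ols, simulate and run are all hypothetical choices made for illustration.

```python
import random
import statistics


def ols(y, columns):
    """Ordinary least squares via the normal equations (intercept added).

    Returns [intercept, b1, b2, ...], one slope per predictor column.
    """
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)]
        + [sum(X[r][i] * y[r] for r in range(n))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]


def simulate(n, rng, beta=0.5):
    """Draw one data set from the assumed linear DGP X -> Y <- Z."""
    x = [rng.gauss(0, 1) for _ in range(n)]
    z = [rng.gauss(0, 1) for _ in range(n)]  # second, independent cause of Y
    y = [beta * xi + 1.5 * zi + rng.gauss(0, 1) for xi, zi in zip(x, z)]
    return x, z, y


def run(n=50, reps=500, seed=0, beta=0.5):
    """Compare the X -> Y estimate from Y ~ X (reduced) and Y ~ X + Z (full)."""
    rng = random.Random(seed)
    reduced, full = [], []
    for _ in range(reps):
        x, z, y = simulate(n, rng, beta)
        reduced.append(ols(y, [x])[1])   # slope on X, Z omitted
        full.append(ols(y, [x, z])[1])   # slope on X, Z included
    return {
        "reduced_mean": statistics.mean(reduced),
        "reduced_sd": statistics.stdev(reduced),
        "full_mean": statistics.mean(full),
        "full_sd": statistics.stdev(full),
    }


if __name__ == "__main__":
    for name, value in run().items():
        print(f"{name}: {value:.3f}")
```

Under these assumptions both specifications should recover a slope close to the assumed true value of 0.5 on average, while the spread of the estimates across replications should be smaller when Z is included. Because Z is not a confounder in this hypothetical graph, omitting it does not bias the estimate; it only leaves more unexplained variance in Y.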
SUPPLEMENTARY MATERIALS

Peer Review History
Download: http://collabra.pr-11647.scholastica-test.com/article/71300-prespecification-of-structure-for-the-optimization-of-data-collection-and-analysis/attachment/148422.docx?auth_token=xkgpNrDLLX2LywifUK-P

COI_DAS
Download: http://collabra.pr-11647.scholastica-test.com/article/71300-prespecification-of-structure-for-the-optimization-of-data-collection-and-analysis/attachment/148423.docx?auth_token=xkgpNrDLLX2LywifUK-P
Collabra Psychology – University of California Press
Published: Feb 24, 2023
Keywords: Markovicity; data collection; conditional independence; causality; path modeling; structural equation modeling