In this chapter we discuss two major design choices that must be made in the planning process for an impact evaluation. These choices concern: whether to use an experimental (i.e., randomized program assignment) or non-experimental design, or some hybrid; and whether to evaluate each individual site independently or to pool the data from multiple sites and evaluate them jointly. Given our limited knowledge of fatherhood programs as well as the resources that might be available to evaluate them, it is not appropriate to recommend which alternatives to select for an evaluation. Instead, we describe the options and discuss criteria to be considered in making the choices.
The criteria we discuss include:
We conclude the chapter with a summary of the most important points with respect to these criteria for each design feature.
A rigorous evaluation will require a "treatment" group (the group that receives program services) and a control or comparison group of some sort. In this section, we present three alternative designs for these two groups. The three designs are: a classic experimental design, with randomized assignment to treatment and control groups; a non-experimental design that uses a non-randomly selected group of fathers who do not receive program services as a comparison group; and an intermediate design that we call the "randomized outreach" design. The last design is a modified experimental design that preserves enough of the experimental design's features to address what is likely to be the most problematic aspect of a non-experimental design (the bias in the estimates due to unobserved differences between the treatment and comparison groups), but also avoids some of the problems inherent in the experimental design.
We describe each design below, and discuss its strengths and weaknesses. Which design is best for an evaluation depends on both characteristics of the program and on the resources available for the evaluation. We discuss criteria for selecting among the three designs at the end of the section.
All three designs would use the same primary data collection methodology: a baseline survey with at least one follow-up for both the treatment and control or comparison groups. For the treatment group, the baseline survey would collect information about the characteristics of fathers, including their relationships with their children before program participation, while the first follow-up would collect outcome data and information on study participants' receipt of services from other programs shortly after the father has completed the program. For the control or comparison group, the baseline and follow-up surveys would collect the same information at comparable points in time. Details of the data collection plan appear in Chapter Six.
One other common feature of all three designs deserves mention here. Before selection of treatment and control or comparison group subjects, the evaluators would identify volunteers from specified populations to participate in a "long-term study of non-custodial fathers," not telling them that the purpose was to evaluate a particular program, and would offer incentives for volunteering to participate in the baseline and follow-up surveys. The purposes of this feature are: to reduce differences in the measured outcomes for the treatment and comparison or control group that are due to differences in the willingness of fathers to volunteer for, and complete, the study; and to disguise the fact that success of a particular program will be judged, in part, on the basis of their behavior.
Target Population
The experimental design (Exhibit 3.1) begins with the identification of the target population for the evaluation, that is, the population of fathers that the program targets for service. In general, this is the population of non-custodial fathers in the community that is served by the program, but it may be defined as the population of non-custodial fathers who come in contact with one or more recruitment or referral sources. An example is the maternity ward at a local hospital, in which case the target population is, at least in part, the non-custodial fathers of infants born to unwed mothers in that hospital who are contacted by the referral sources. Other target populations may include: fathers of children participating in a local welfare program, fathers residing in a specific geographic area, or fathers who are incarcerated.
Study Volunteers
To conduct an evaluation, it is necessary to contact fathers in the target population and ask them to volunteer to participate in a study of non-custodial fathers and their children. This could be most easily accomplished by an outside referral source that, in the absence of the evaluation, would be in contact with the same fathers and would refer them to the program. The referral source would be asked, instead, to refer the father to the researchers conducting the study.
It may be necessary to offer fathers an inducement to participate in the study in order to obtain an adequate number and mix of volunteers. This could be a payment for responding to the baseline survey. An alternative is to give them an item that would benefit the child, but this might result in fewer volunteers per dollar spent on incentives and would skew the mix of volunteers towards those that are most motivated to benefit their children.
Baseline Survey and Random Assignment
Fathers who volunteer would then be contacted by the evaluators for the administration of the baseline survey. Following completion of the survey, the evaluators would refer randomly selected fathers to the program. These fathers would constitute the treatment group, and those not referred would be the control group. All fathers completing the baseline survey would be asked to provide the name, address and telephone number of at least one contact person (individuals who "always know how to contact the father") so that the fathers may be included in the follow-up survey. An incentive for completing the follow-up survey may be necessary to obtain a high participation rate, and the father should be informed of that incentive at this point.
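The random referral step described above can be sketched in a few lines of code. This is a minimal illustration only: the function name, the 50/50 split, and the use of a fixed seed (so that the assignment can be audited later) are assumptions made for the example, not features of any particular evaluation.

```python
import random

def assign_groups(volunteer_ids, treatment_share=0.5, seed=None):
    """Randomly split volunteers who completed the baseline survey.

    Returns (treatment, control). Fathers in `treatment` would be
    referred to the program; fathers in `control` would not.
    The 50/50 default split is an assumption for illustration.
    """
    rng = random.Random(seed)          # seeded so the draw is reproducible
    shuffled = list(volunteer_ids)
    rng.shuffle(shuffled)
    n_treatment = round(len(shuffled) * treatment_share)
    return shuffled[:n_treatment], shuffled[n_treatment:]

# Hypothetical example: 200 volunteers identified by study ID numbers.
treatment, control = assign_groups(range(200), seed=1)
print(len(treatment), len(control))
```

Because assignment happens only after the baseline survey is complete, every father in either group has baseline data, and the evaluator, not the program or the referral source, controls who is referred.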
Treatment group fathers would not necessarily participate in the program just because the evaluators refer them to the program. The evaluators could ensure a high participation rate among treatment group fathers by several means. First, a screen could be used to exclude potential study volunteers who are very unlike the program's participants. Characteristics of fathers that are rarely or never observed among participants could be determined with the assistance of the program, and used as the basis for the screen. Questions concerning the fathers' interest in obtaining specific types of assistance might also be asked. Fathers identified as unlikely to participate would be screened out of both the treatment and control groups. Screened-out fathers could be dropped from the study entirely, or the data collected from them might be used for an auxiliary, descriptive analysis.
Second, with the permission of the father, the evaluators would help the father get in touch with program staff, who would then use any means at their disposal to encourage participation. Note that any methods used to encourage participation of referred fathers become part of the treatment, because they are not offered to control group fathers.
An Alternative Experimental Design
An alternative experimental design that might achieve higher participation rates among treatment group members would ask fathers to volunteer for program participation before random assignment. This would screen out all fathers who, at least initially, did not want to participate. It would not, however, guarantee participation from all treatment group fathers because some might change their mind at a later date. Further, it would not give the program an opportunity to encourage participation by fathers who might otherwise not participate. This assumes that participation is largely voluntary. In a situation where fathers are "required" to participate, perhaps by court order as a condition of parole or visitation, this would not be an issue.
Another problem with the alternative approach is that control group fathers would be made aware of the program, would likely be disappointed at their assignment to the control group, and might significantly change their behavior as a result of that assignment. Further, both treatment and control group fathers would know that they are part of a study to evaluate the program, which might also change their behavior, whereas under the recommended approach volunteers would only be told that they are participating in a study of non-custodial fatherhood. These last two problems can also arise under the recommended approach, but to a substantially lesser degree.
Follow-Up Data Collection
Follow-up data should be collected from as many study volunteers as it is feasible to reinterview. Follow-up data should be collected after a specified interval following the baseline interview. The length of the interval should be long enough so that those who participate in the program are likely to have completed their participation, but not so much later that participants are likely to have forgotten significant information about their participation in the program or about immediate post-program outcomes. It is necessary to define a fixed interval after random assignment, rather than interview participants shortly after they complete the program, so that data collection for the control group will be comparable to data collection for the treatment group.
It is important to collect follow-up data from "non-participating treatment" fathers (fathers who were referred to the program but did not participate) in order to adjust for estimation bias that would likely result if participating fathers alone were compared to control group fathers. Participating fathers are a self-selected subset of all treatment group fathers, and are likely to have a higher probability of positive outcomes than the average control group father even in the absence of participation.
Data Analysis
A full discussion of data analysis is deferred until later in the report (Chapters Seven and Eight). The discussion here is intended to indicate the nature of the analysis and to provide background for the discussion of criteria for selecting a design that appears later in this chapter.
Analysis of the data under an experimental design can be very simple because random assignment would eliminate all but chance differences between the baseline characteristics of the treatment and control groups. Differences between treatment and control group means of the outcome variables from the follow-up survey are the simplest measures of the program's impact.(1) There are, however, several reasons to use more complex analyses.
First, we assume that some treatment group members would not participate in the program. Difference in means estimates that exclude non-participating treatment group members from treatment group means likely overstate the effect of participation, due to the self-selection problem mentioned above. If, instead, non-participant treatment group members are included in calculating the treatment group mean, the difference in means is likely to understate the impact of the program for those who actually participated.
A simple way to obtain an unbiased impact for those who participated would be to divide the difference in mean outcomes between all treatment group and control group fathers by the proportion of treatment group fathers who participate. This approach follows from the expectation that the sample means for the treatment group will satisfy the following equation:
treatment mean - control mean = participation impact x % participating
where "treatment mean" refers to the mean of an outcome variable for the full treatment group, "control mean" is the corresponding mean for the control group, the "participation impact" is the percent effect of participation on the outcome variable, and "% participating" is the share of treatment group members who participate.(2)
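The adjustment in the equation above can be illustrated with a short sketch. The function name and the example numbers are hypothetical. The calculation also rests on an assumption that should be stated plainly: referral by itself has no effect on the outcomes of treatment group fathers who do not participate.

```python
def participation_impact(treatment_outcomes, control_outcomes, participated):
    """Estimate the impact of participation from full-group means.

    treatment_outcomes: follow-up outcomes for ALL treatment group
        fathers, participants and non-participants alike.
    control_outcomes: follow-up outcomes for control group fathers.
    participated: True/False flags aligned with treatment_outcomes.
    Assumes referral alone does not change non-participants' outcomes.
    """
    treatment_mean = sum(treatment_outcomes) / len(treatment_outcomes)
    control_mean = sum(control_outcomes) / len(control_outcomes)
    share = sum(participated) / len(participated)
    if share == 0:
        raise ValueError("no treatment group father participated")
    # treatment mean - control mean = participation impact x share
    return (treatment_mean - control_mean) / share

# Hypothetical monthly child support payments (dollars).
treatment = [140, 140, 100, 100]
took_part = [True, True, False, False]
control = [100, 100, 100, 100]
print(participation_impact(treatment, control, took_part))  # 40.0
```

In this made-up example the full-group difference in means is 20, which understates the impact on participants because half of the treatment group never took part. Dividing by the participation share of 0.5 recovers an impact of 40 for those who did.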
Second, some members of both the treatment and control groups will drop out of the study, and there may be substantially more attrition from the control group than from the treatment group because of the latter's participation in the program. Hence, it may be important to study the determinants of attrition, and to make adjustments for differences in attrition between the control and treatment groups.
Third, more precise impact estimates can be obtained by controlling for characteristics of treatment and control group members that are measured in the baseline survey, including baseline outcome measures. Multivariate analyses that incorporate this information can explain a significant share of the random variation in outcomes across fathers, reducing the size of any remaining difference between treatment and control group means that is plausibly due to chance. This analysis would also include a multivariate analysis of program participation among treatment group members. As we discuss further in Chapter Seven, the results of the participation analysis would be interesting in their own right, as well as useful in improving the quality of the impact estimates.(3)
Example of a Possible Experimental Design
One of the programs we visited, the Racine Goodwill Industries Program, primarily serves fathers who are "referred" by the court system for failure to comply with child support orders. Services include employment services that are provided through an arrangement with another organization as part of Wisconsin's Children First Program, and a variety of other services, such as parenting and fatherhood responsibility courses, that are provided by Goodwill (see Appendix B for a more detailed description). Approximately 50 to 60 fathers are referred by the courts each month.
A randomized evaluation of the employment service component of the program alone is currently being conducted by the State as part of an evaluation of the Children First Program. This represents the only effort of which we are aware to formally evaluate a specific component of a fatherhood program. It serves as an example of an experimental design approach and illustrates some of the kinds of issues that fatherhood programs will face in conducting impact evaluations.
For this evaluation, fathers who are sent to the Goodwill by the court are randomly assigned into control and treatment groups. Treatment fathers receive employment services as well as other services provided by the program, while control fathers receive the other services alone. Thus, this evaluation focuses on the impact of the employment services conditional on receipt of the other services.
Preliminary findings from the State's evaluation were provided to us by the State (see Appendix D). They show that the mean child support paid by treatment group fathers increased by 76 percent from the six months before referral to six months after referral, while mean payments from control group fathers increased by 62 percent. By the second six months after referral, mean payments from control group fathers outpaced those from treatment group fathers: up 82 percent from the six months before referral, compared to 77 percent for treatment fathers. Other related outcome measures (number of payments made and number of fathers making payments) show similar findings.
There are several possible explanations for these results. One is that court enforcement per se, rather than services provided, accounts for improved payments. Another is that the fatherhood services provided by Goodwill, rather than the employment services, are the critical determinant of increased support. The latter conclusion is discounted by the fact that pre-post increases in support payments for the Children First Program in other Wisconsin counties appear to be as large as in Racine during this period, but these counties do not provide services that are comparable to fatherhood services provided by the Goodwill Industries Program in Racine.
Another explanation of the small differences in results for the treatment and control groups is possible spillover problems. The Goodwill counselors knew who the control subjects were, and, as reported to us, were uncomfortable with denying the subjects services that they thought would be beneficial. The counselors faced an ethical problem, and the immediate needs of their clients may understandably have taken precedence over the evaluation's needs. While the counselors could not send their clients to obtain the employment services, they could provide compensating services.
It might be feasible to conduct the "reverse evaluation" by random assignment (an evaluation of fatherhood services conditional on receipt of the employment services), although there may be institutional obstacles to such an evaluation. This evaluation would show whether the package of fatherhood services provided directly by Goodwill "adds value" to the employment services. The reverse evaluation would also examine a broader range of outcome measures, rather than focusing on child support. A spillover problem could arise here too. To reduce this problem, the courts might refer the control subjects (those receiving employment services only) directly to the provider of those services, avoiding contact with Goodwill Industries staff. Of course, staff providing employment services might find themselves in the same bind as Goodwill staff did in the evaluation of the employment services.
An experimental evaluation of the combined services might be more useful to program funders, but would be more problematic. According to the child support enforcement office, the alternative to assigning fathers to the program is sending them to jail, something they are prepared to recommend! Further, the program can accommodate all fathers who are currently referred, so the program's manager is not willing to deny services to fathers who would otherwise be clients.
Discussion
It may not be feasible to implement an experimental design. A randomized design would likely require cooperation from referral sources including asking them to not refer some clients who might otherwise be referred. Referral sources and others are likely to object to this on ethical grounds because some fathers would be denied services that, in the absence of the evaluation, they might receive. This is especially likely to be true if the program has the capacity to accommodate all referrals.
While many sources of estimator bias are avoided with an experimental design, potential bias remains because the study is not "blind." Program staff are likely to know their program is being evaluated, and they may behave differently as a result. Study volunteers, staff at the referral sources, and others may also learn about the purpose of the study and alter their behavior to influence the outcome.
The "non-blind" nature of the study will be a problem for any of the designs we are considering, but may be more problematic for this design than for a non-experimental design because treatment and control subjects may be likely to come in contact with one another: they come from the same target population and are in contact with the same referral sources. "Spillovers" (information obtained by control group fathers from treatment fathers, competition between control and treatment fathers, disparagement of the treatment by control group fathers, alternative services obtained by control group fathers, etc.) will all affect impact estimates.
The size of the program to be evaluated may be too small to generate a sample size that is large enough to yield sufficiently precise estimates. Based on the two programs we have examined to date, the evaluators would be fortunate to obtain 200 subjects from a single site over a one-year period. If 100 were assigned to treatment and 100 to control, a simple difference in percent would have to be at least 12 percentage points to be statistically significant at the five percent level.(4) This can be improved upon to some extent by using multivariate methods, but the estimates are likely to be inadequately precise for many purposes with groups of this size. Other options to improve precision would be to pool data from multiple sites or extend the sample collection period, both of which may have other problems. Problems with pooling data from multiple sites are discussed later in the chapter. Lengthening the sample collection period would delay completion of the study and would increase the chance that the evaluation would be compromised by significant changes in the program, its environment, or the evaluator's staff.
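A figure close to 12 percentage points can be reproduced with a back-of-the-envelope calculation. The sketch below is illustrative only. The text does not spell out the calculation, so the function name, the assumed base rate of 50 percent (the most conservative case for a percentage outcome), and the one-sided test are assumptions chosen here because they yield roughly that figure.

```python
import math
from statistics import NormalDist

def min_significant_difference(n_treatment, n_control,
                               base_rate=0.5, alpha=0.05, one_sided=True):
    """Smallest difference in proportions that reaches significance.

    base_rate: assumed share with the outcome in both groups under the
        null hypothesis (0.5 gives the largest standard error).
    alpha: significance level; one_sided selects the test type.
    Both defaults are assumptions made for this illustration.
    """
    std_error = math.sqrt(
        base_rate * (1 - base_rate) * (1 / n_treatment + 1 / n_control)
    )
    tail = alpha if one_sided else alpha / 2
    critical_value = NormalDist().inv_cdf(1 - tail)
    return critical_value * std_error

# 100 treatment and 100 control fathers, as in the example above.
print(round(min_significant_difference(100, 100), 3))                   # about 0.116
print(round(min_significant_difference(100, 100, one_sided=False), 3))  # about 0.139
```

Under these assumptions the threshold is about 11.6 percentage points for a one-sided test and about 13.9 for a two-sided test. Because the standard error shrinks only with the square root of the sample size, halving the threshold would require roughly four times as many fathers.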
Any high quality impact evaluation will be costly. For an experimental design, significant cost sources will include: developing a detailed plan, including instruments; implementing the methods for soliciting volunteers; conducting the baseline survey and randomly assigning volunteers to treatment and control groups; maintaining contact with study participants and conducting the follow-up survey; preparing the data; analyzing the data; and disseminating the findings. Except for costs incurred to randomly assign volunteers, the costs for each component would likely be no larger than they would be under alternative designs.
Target Populations
For the non-experimental design, the evaluators would identify two distinct target populations, one for the treatment group and one for the comparison group (Exhibit 3.2). The treatment group target population would be the population served by the program to be evaluated, the same population that would be the target population for the whole evaluation under an experimental design. The comparison group target population would be a population that is not served by the program or a comparable program, but is otherwise very similar to the program's target population. Thus, for instance, if the target population for the program is non-custodial fathers of newborns at a specific hospital, the target population for the comparison group could be non-custodial fathers of newborns at one or more similar hospitals that are not served by the program or a comparable program. If, instead, the target population is non-custodial fathers within a specific geographic area, the comparison population would be the corresponding population in a geographic area that is similar in socioeconomic characteristics.(5)
Study Volunteers
Under this design it would be necessary to solicit study volunteers from both treatment and comparison populations in an identical way. The purpose of identical solicitation is to obtain two sets of volunteers that are as comparable as is feasible. Informing potential volunteers from the treatment population that they will have an opportunity to participate in the program is likely to yield a set of volunteers that differs from the comparison volunteers in a way that is difficult to measure, namely in the volunteers' desire to participate in the program. Incentives to participate in the study might be required to obtain a desirable number and mix of study participants, just as in the experimental design.
Baseline Survey
As in the experimental design, a baseline survey would be conducted by the evaluator once contact is established with the study volunteer. Following the completion of each interview, the respondent would be asked to keep in touch through a contact person, and to eventually participate in a follow-up survey. Respondents from the treatment population would all be referred to the program, through the same process used for randomly assigned treatment fathers in the experimental design.
Follow-up Data Collection
Follow-up data would be collected in the same manner as was described under the experimental design, including data for volunteers from the treatment population who elect not to participate in the program.
Data Analysis
While differences in means of outcome variables could be used to estimate program impacts, such estimates are likely to be biased because of systematic differences between the underlying treatment and comparison populations. Many such differences are likely to be reflected in baseline characteristics of the treatment and comparison group volunteers. Just as in the experimental design, these characteristics can be incorporated in a multivariate analysis to control for observed baseline differences between the groups.
Collection of high quality baseline data and multivariate analysis of outcomes is more critical for the non-experimental design than for the experimental design because baseline differences between the non-experimental treatment and comparison groups are not just due to chance, may be substantial, and may have a strong association with key outcomes. Even after controlling for observed differences in baseline characteristics, remaining differences between outcomes for the two groups may reflect unobserved differences in baseline characteristics. The main weakness of the non-experimental design is that it is not possible to adjust for those differences which are not observed in the baseline data.
Example of a Non-Experimental Design
It may be feasible to conduct a non-experimental impact evaluation of the Racine Goodwill Industries Program, using one or more other counties in Wisconsin as comparison counties. Recall that the program primarily provides services to fathers who are referred by the courts as a means to increase child support payments. As mentioned previously, other counties in Wisconsin offer more limited services (employment services only) under the Children First Program.
The State has already collected data that could be used for a limited version of such an evaluation: pre-referral and post-referral child support data for non-custodial fathers who have been referred by the courts to the Children First Program.(6) This program is operational in other Wisconsin counties and provides limited employment services to fathers who are referred by the county courts.
An evaluation that would be more in line with the non-experimental design presented here, and that would broaden the outcome variables beyond measures of child support, would require interviews of fathers who are involved in court actions concerning child support at the time these actions are beginning (i.e., the baseline survey), with follow-up interviews several months later (i.e., the follow-up survey). The baseline interviews are especially important because the characteristics of fathers who are subject to court actions, and the nature of those actions, may differ markedly across counties.
This type of an evaluation would compare the effectiveness of Racine's program to the effectiveness of the Children First Programs that are in place in the comparison county(ies). Hence, the evaluation would be limited to analyzing the added impact of the services provided by Goodwill Industries that augment the "customary" Children First employment services. Note that this is also the limited goal of the experimental design for the Racine program that was outlined in the previous section.
A non-experimental design might also be considered for the Baltimore City Healthy Start Men's Services Program. This program is established in two Baltimore areas, East and West Baltimore, and together they serve from 50 to 100 men each year. The Baltimore site is one of 15 Healthy Start programs nationwide. The fathers who participate in the Baltimore Men's Services are non-custodial fathers who are recruited through their children's mothers; the latter are participants in the Healthy Start program. The program is well established. It is obviously too small to apply any experimental design. The number of fathers served annually is small for a non-experimental design also, and efforts to increase the number served during the evaluation period would be desirable. Alternatively, Healthy Start programs that provide similar services to fathers in other cities may exist and could be evaluated jointly with the Baltimore program if the programs are sufficiently similar.
The main goal of the overall Healthy Start program is to reduce adverse birth outcomes, through increased use of appropriate prenatal, post-partum, and pediatric care. A non-experimental evaluation of the main program is already being conducted, using an adjacent Baltimore area as the comparison site. The comparison area has changed considerably since program implementation, however, and may no longer be suitable as a comparison area, but others may be available.
To find fathers for the study from the comparison area, it would be desirable to recruit them in a manner similar to the manner used by Healthy Start. This will be difficult because there is no set of Healthy Start mothers in the comparison area. One approach would be to use Healthy Start's methods for identifying mothers, then use the mothers to find the fathers. This is cumbersome, however.
Another potential problem with this approach to evaluating Healthy Start Men's Services is that it would really evaluate the impacts of all Healthy Start services, including the Men's Services, because children and mothers in the comparison areas would not be receiving other Healthy Start services. If, instead, comparison mothers were selected from Healthy Start programs in other cities that do not provide men's services, the impact of Men's Services alone could be evaluated. Determining whether this is possible would require review of the programs in other cities. Differences in the economic, cultural, and policy climate between Baltimore and other cities would also make this design problematic.
Discussion
It is usually more feasible to implement a non-experimental design than an experimental design. This type of design does not normally have an impact on the services that fathers would be getting; i.e., those in the treatment group would participate in the program just as they would or would not in the absence of the evaluation, and those in the comparison group would presumably receive the same services, if any, that they would have received in the absence of the evaluation.
There may be other challenges to feasibility, however. First, a reasonable comparison group must be found, and it may be difficult to find one that is sufficiently similar to the treatment group before treatment in all important respects. Second, collection of data from the comparison group is likely to require cooperation from agencies that serve the comparison population, that is, agencies that would refer fathers to the program were the program located in their community. Their cooperation seems less likely than the cooperation of agencies that actually make referrals to the program. Generally, collecting comparable data from members of two different target populations is likely to be more problematic than collecting data from members of a single population, as would be required under an experimental design.
The non-experimental design would be much less vulnerable to the spillover effects that might bias estimates under an experimental design, but bias may be a significant problem for other reasons. The most serious is likely to be differences between the separate target populations from which the two groups are drawn. While baseline data can be used to control for the effects of observed differences between treatment and comparison group members on outcomes, this control will be imperfect. Another source of bias is environmental factors (the local labor market, other community services, etc.), which may differ substantially across the two groups. Differences in outcomes may reflect differences in environmental factors. Differences that remain constant throughout the evaluation period can be controlled for by comparing changes in outcome variables (i.e., follow-up outcome values minus baseline values) for the two groups, rather than the levels of follow-up values. Changes in environmental factors that are different for the two groups (e.g., labor market improvement in one area, but not the other, or changes in the policy environment) would be difficult to control for in the analysis.
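The comparison of changes rather than levels can be sketched as follows. The function names and the payment figures are hypothetical, and the calculation assumes that the environmental differences between the two areas stay constant between the baseline and follow-up surveys.

```python
def mean(values):
    return sum(values) / len(values)

def change_score_impact(treatment_pairs, comparison_pairs):
    """Difference between the groups' mean changes in an outcome.

    Each argument is a list of (baseline, follow_up) values, one pair
    per father. Differences between the groups that are constant over
    time cancel out when each father's baseline is subtracted.
    """
    treatment_change = mean([f - b for b, f in treatment_pairs])
    comparison_change = mean([f - b for b, f in comparison_pairs])
    return treatment_change - comparison_change

# Hypothetical monthly child support payments (dollars). The comparison
# area starts from a higher level, e.g. because of a stronger labor market.
treatment = [(100, 160), (120, 180)]
comparison = [(150, 190), (170, 210)]

follow_up_gap = mean([f for _, f in treatment]) - mean([f for _, f in comparison])
print(follow_up_gap)                               # -30.0
print(change_score_impact(treatment, comparison))  # 20.0
```

In this made-up example, comparing follow-up levels suggests that the program reduced payments by 30, while comparing changes suggests an increase of 20. The change-score estimate would still be biased if, say, the labor market improved in only one of the two areas during the study period.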
For a given sample size, the estimates from a non-experimental design will be less precise than those from an experimental design, depending on how well matched the two groups are.(7) It is likely, however, that a larger sample size can be achieved over a given period of time because the constraint imposed by the program's size applies only to the treatment group, rather than to the combined treatment and control groups. If the comparison group is the same size as the treatment group, then the sample size is potentially twice as large as for an experimental design with equal size treatment and control groups. Thus, in our hypothetical program that has 200 participants per year, the size of the study sample over a one-year period would be 400, rather than 200. This reduces the size of a difference in percent that is statistically significant from 12 percentage points to eight.
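The sample size arithmetic can be approximated with a standard two-sample calculation. The sketch below is not taken from the report: it assumes two equal-size groups, a one-tailed test at the 5 percent level, and an outcome rate of about 50 percent in both groups, which are assumptions that happen to reproduce the 12 and 8 percentage point figures. It also ignores any loss of precision from imperfect matching of the comparison group.

```python
import math

def min_detectable_difference(total_n, base_rate=0.5, z_crit=1.645):
    """Smallest difference in proportions that reaches significance.

    Assumes two equal-size groups, a one-tailed 5 percent test
    (z_crit = 1.645), and an outcome rate near base_rate in both groups.
    """
    n_per_group = total_n / 2
    std_err = math.sqrt(2 * base_rate * (1 - base_rate) / n_per_group)
    return z_crit * std_err

mdd_200 = min_detectable_difference(200)  # experimental design, one year
mdd_400 = min_detectable_difference(400)  # non-experimental design, one year
print(f"{mdd_200:.3f} {mdd_400:.3f}")
```

Under these assumptions the function returns roughly 0.12 for a sample of 200 and roughly 0.08 for a sample of 400.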
For a sample of given size, it may cost more to collect data under this design than under the experimental design because the volunteers would be obtained from a greater number of sources (e.g., referral agencies or geographic areas). Data collection costs will be increased further if the larger sample size that can be achieved under the non-experimental design is sought. Also, because controlling for baseline characteristics is more important for preventing bias under the non-experimental design than under the experimental design, the evaluator may wish to put more effort into designing and conducting the baseline survey.
C. Randomized Outreach Design
Random Outreach vs. Random Referral
The randomized outreach design (Exhibit 3.3) modifies the experimental design in the following simple way. Under the experimental design, randomly selected volunteers are referred to the program, while those not selected are not referred at all. Under the randomized outreach design, all volunteers are referred to the program, but extraordinary efforts are made to encourage participation of a randomly selected subgroup, the "outreach treatment" group. For instance, while all volunteers would be offered an incentive to continue to participate in the study through follow-up, those selected for the outreach treatment group might be offered a larger incentive if they also participated in the program. Alternatively, researchers or program staff might: more actively "sell" the program to randomly selected volunteers; contact volunteers a few days after the interview to check if they have enrolled and, if not, encourage them further; offer transportation to the program office, etc.
There are several reasons for modifying the experimental design in this way. First, it addresses the ethical problem that may thwart implementation of the experimental design by giving every volunteer an opportunity to participate. Relative to the existing program, it will not deny or discourage anyone from participating; instead it will provide added encouragement to a subset of potential participants.
Second, the spillover effects that might occur under the experimental design are largely avoided. Control group fathers who decide they want to participate will be allowed to participate, so the potential for rivalry between the two groups is greatly reduced. In fact, members of both groups may be unaware that they have been assigned to one group or the other, or even that the purpose of the study is to evaluate the program. "Blindness" of subjects is most likely to be achieved if the treatment is limited to extraordinary follow-up marketing activities that would be difficult for volunteers to detect. Use of special incentive payments would be easier for volunteers to detect and, if detected, could have an impact on their behavior.
This modification achieves these advantages over the experimental design but preserves the most important feature of the experimental design: differences in outcomes between the control and treatment groups that are not caused by the treatment are due to chance and will be small if the sample is sufficiently large. Here, however, the treatment is not the program, but rather the outreach.
An additional analytical step is necessary to convert the outcome differences into estimates of program effects, as discussed further below.
Participants and Non-participants
Under this design, a substantial number of control group members will participate in the program. If the randomized outreach is effective in increasing participation, the share of control group members who participate will be smaller than the share of treatment group members who participate. Data must be collected for both participants and non-participants from both groups.
Data Analysis
The outcome analysis under this design would compare outcomes from the participant and non-participant groups, using the randomized outreach feature to correct for bias due to self-selection of volunteers into the participant and non-participant groups. If we assume that the effect of participation on an outcome is the same for all volunteers who participate, we would need to divide the difference between treatment and control outcomes by the difference in participation rates to obtain the estimated participation impact. This follows from the expectation that:
treatment mean - control mean = participation impact x % treatment participation - participation impact x % control participation
where "% treatment participation" is the percent of the treatment group that chooses to participate, "% control participation" is the analogous control group variable, and other variables are as defined previously.(8) The formula presented previously for estimating participation impacts under the experimental design is the special case of this formula when "% control participation" is zero.
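Rearranging the expectation above gives the estimator directly: the participation impact equals the difference in mean outcomes divided by the difference in participation rates. A minimal sketch follows; the function name and the numbers are hypothetical, and the constant-impact assumption stated in the text is taken as given.

```python
def participation_impact(treat_mean, control_mean, treat_rate, control_rate):
    """Outcome difference divided by the difference in participation rates.

    Rates are expressed as fractions (0.70 for 70 percent).
    """
    rate_gap = treat_rate - control_rate
    if rate_gap <= 0:
        raise ValueError("outreach must raise participation in the treatment group")
    return (treat_mean - control_mean) / rate_gap

# Hypothetical: 46% vs. 40% employed at follow-up; 70% vs. 30% participating.
impact = participation_impact(0.46, 0.40, 0.70, 0.30)

# Special case: no control group participation, as in the experimental design.
impact_exp = participation_impact(0.46, 0.40, 0.70, 0.0)
print(impact, impact_exp)
```

With these illustrative inputs, a 6 percentage point difference in outcomes combined with a 40 percentage point difference in participation implies a participation impact of about 15 percentage points. The guard against a non-positive rate gap reflects the point that the estimator is undefined when the outreach does not raise participation.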
As in the experimental model, more accurate estimates of participation effects can be gained through multivariate analysis of outcomes, incorporating control variables from the baseline survey. The outcome analysis would be preceded by a participation analysis that would examine the effect of the randomized outreach method and other variables on participation. The results of this preliminary analysis would be incorporated in the estimation of multivariate outcome models to adjust for self-selection into the program, with a variable identifying which subjects received the random outreach.(9) As in the experimental design, the participation analysis itself would be of interest, perhaps more so because one objective of the evaluation could be to test the outreach methodology. Further, if the number of volunteers is sufficiently large, two or more outreach methodologies could be tried.
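A small simulation may help show why the randomized outreach indicator matters for correcting self-selection. The sketch below is a simplified stand-in for the multivariate models described above: it uses no control variables, and every parameter (the constant impact of 10, the participation probabilities, the role of unobserved "motivation") is an assumption made purely for illustration.

```python
import random
from statistics import mean

random.seed(0)
TRUE_IMPACT = 10.0  # assumed constant effect of participation

rows = []
for _ in range(40000):
    motivation = random.gauss(0, 1)       # unobserved by the evaluator
    outreach = random.random() < 0.5      # randomized outreach indicator
    # Motivated fathers and fathers given outreach participate more often.
    p_join = 0.2 + 0.4 * outreach + (0.2 if motivation > 0 else 0.0)
    joins = random.random() < p_join
    outcome = 50 + 10 * motivation + TRUE_IMPACT * joins + random.gauss(0, 5)
    rows.append((outreach, joins, outcome))

def group_mean(index, flag, column):
    """Mean of one column over rows whose indicator at `index` equals `flag`."""
    return mean(float(r[column]) for r in rows if r[index] == flag)

# Naive comparison of participants with non-participants (self-selected).
naive = group_mean(1, True, 2) - group_mean(1, False, 2)

# Comparison by randomized outreach status, scaled by the participation gap.
outcome_gap = group_mean(0, True, 2) - group_mean(0, False, 2)
rate_gap = group_mean(0, True, 1) - group_mean(0, False, 1)
adjusted = outcome_gap / rate_gap

print(f"naive={naive:.2f} adjusted={adjusted:.2f} true={TRUE_IMPACT}")
```

Because motivated fathers both participate more and have better outcomes in this simulated population, the naive comparison overstates the impact, while the estimate based on the randomized indicator should land near the assumed true value, subject to sampling error.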
There are at least two threats to the success of this approach that reduce its potential usefulness. First, if the outreach is ineffective, participation rates and outcomes for the two groups will be very similar and the measured effect of the program will be insignificant even if the true impact of the program is substantial. Hence, success of this approach requires a treatment outreach that is very effective in comparison to the control outreach.
Second, the program probably does not have the same impact on all fathers, and it may be that the impacts on participants from the treatment group who would not have participated had they received the control outreach are substantially greater or less than those on other participants. On the one hand, these "marginal" participants might be fathers who are motivated by the outreach and not by a strong desire to become responsible fathers, in which case impacts may be small. On the other hand, in comparison to other participants, marginal participants may be fathers who would be least likely to achieve desirable outcomes on their own, in which case impacts may be large. To minimize any potential bias, the evaluators will need to examine differences in baseline characteristics between treatment participants and control participants and investigate whether program impacts are related to these observed differences. Evaluators will not be able to adjust for differences between marginal participants and other participants that are not observed.
Examples of Random Outreach Designs
It might be feasible to conduct an evaluation of the Racine Goodwill Industries program using a random outreach design. According to staff we interviewed, it would not be difficult to find many more fathers to participate in the program in a short period, and the program would welcome an opportunity to reach out to more fathers, even if only some fathers reached are referred to the program.
For this evaluation, the evaluator and the program would cooperate to recruit study volunteers. Recruitment could be accomplished through the many AFDC mothers who are in contact with Goodwill Industries because Goodwill administers the JOBS program in Racine. Alternatively, fathers who are program clients might be employed to recruit other fathers they can contact through informal connections. This might be especially useful for obtaining volunteers from among fathers who are the most difficult to reach.
Volunteers would be asked to participate in the baseline survey. Randomly selected volunteers would be encouraged to participate in the program by the interviewer. Follow-up outreach to these same "treatment group" volunteers could be conducted by the program or program clients. Inevitably some of the volunteers who do not receive the outreach (the control group) will participate in the program, but this is not a threat to the evaluation as long as the outreach efforts applied to the randomly selected volunteers produce a substantially higher participation rate among volunteers assigned to the treatment group.
It should be recognized that this evaluation would not result in estimates of the impact of the program on outcomes for fathers recruited through the program's main referral mechanism, the courts. Instead, it would estimate the impact of the program on fathers recruited through whatever mechanism is adopted. This estimate may be no less interesting than an estimate for fathers referred by the courts would be, but it must be recognized that results are dependent on the recruiting process.
One interesting "side-effect" of a new recruitment effort might be a reduction in referrals from the courts. This can easily be tested by the evaluator, by comparing the number of court referrals for fathers who are in the control group to the number for those in the treatment group.
A random outreach design might also work for the Baltimore Healthy Start Men's Services Program. This program recruits fathers through mothers who are Healthy Start participants themselves. It may be feasible to recruit a much larger set of fathers by these means to volunteer for a study on non-custodial fathers. Financial or other inducements might be used to recruit randomly selected volunteers for participation in the program. It would be necessary to increase the number of participants recruited well above current levels to make such an evaluation viable, but this may be possible. One advantage this design would have over the non-experimental design for the Baltimore program outlined in the previous section, using fathers from adjacent areas in Baltimore for the comparison group, is that it would evaluate the impact of the Men's Services conditional on the other Healthy Start services, rather than the impact of all Healthy Start Services provided by the Baltimore program.
Discussion
While this design has some very positive features, other factors may make it less attractive relative to other designs. First, as with the experimental design, this design will require some cooperation from normal referral sources, whose help may be needed to implement the randomized outreach. Second, for a sample of given size, estimator precision may be substantially lower under this design than under either the experimental or non-experimental designs. How much lower will depend on the effectiveness of the treatment outreach relative to that of the control outreach: the less effective, the lower the precision. Severe sample size constraints due to program size or costs would make this design unattractive.
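The dependence of precision on outreach effectiveness can be approximated by dividing the minimum detectable difference in outcomes by the gap in participation rates. The sketch below is a rough illustration, not a prescribed power calculation: it assumes equal-size groups, a one-tailed 5 percent test, an outcome rate near 50 percent, and it treats the participation gap as known rather than estimated.

```python
import math

def min_detectable_participation_impact(total_n, rate_gap,
                                        base_rate=0.5, z_crit=1.645):
    """Approximate smallest participation impact that reaches significance.

    rate_gap is the treatment participation rate minus the control
    participation rate, as a fraction; a gap of 1.0 corresponds to an
    experimental design with full participation and no crossovers.
    """
    if not 0 < rate_gap <= 1:
        raise ValueError("rate_gap must be in (0, 1]")
    n_per_group = total_n / 2
    std_err = math.sqrt(2 * base_rate * (1 - base_rate) / n_per_group)
    return z_crit * std_err / rate_gap

# Hypothetical sample of 200 volunteers under three levels of effectiveness.
results = {gap: min_detectable_participation_impact(200, gap)
           for gap in (1.0, 0.5, 0.25)}
for gap, mdi in results.items():
    print(f"participation gap {gap:.2f}: detectable impact {mdi:.2f}")
```

Under these assumptions, halving the participation gap doubles the smallest impact that can be detected: roughly 12, 23, and 47 percentage points for gaps of 100, 50, and 25 percentage points.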
Two factors other than sample size increase the cost of this design relative to the experimental design: the cost of the randomized outreach and some additional complexity in the analysis of the data.
A potentially important advantage of this design over the alternatives is that it provides the opportunity to study the impact of the treatment outreach relative to the control outreach. The evaluator could determine the impact of the treatment on participation and could also determine its eventual effect on outcomes. Outreach may be a cost-effective method of improving outcomes for non-custodial fathers. The treatment outreach need not be limited to a single outreach method; multiple outreach methods could be randomly assigned.
Funding may permit evaluation of multiple responsible fatherhood programs in the future, and this design is intended as a road map for conducting evaluations of many different sites. The evaluation of each site could be conducted independently. This would allow the evaluation design and data collection methodology to be tailored to each site's circumstances. Tailoring the evaluation in this way might maximize information gained about each site, but would also make it difficult to compare findings across sites. Under a non-experimental design, the evaluator may want to use a design that does not require a separate comparison group for each site; in the extreme, a single comparison group may be used for all sites.
If the programs are sufficiently homogeneous, there would be a significant advantage to evaluating multiple sites jointly; pooling the data across sites would increase sample sizes and contribute to more precise estimates of program and other effects. This may be especially valuable because the programs we are familiar with are all small, and sample sizes from individual sites will be small unless the evaluation is conducted over a very long period. Sample size constraints are of greatest concern if either the experimental or the randomized outreach design is used. Evaluation of homogeneous programs in multiple sites can also provide information about the impact of local environments on the efficacy of the program.
The programs are clearly not homogeneous, however, so it is less obvious that joint evaluation of multiple sites would be advantageous.(10) Heterogeneity across programs has many dimensions. A multi-site evaluation can be designed to accommodate some dimensions of heterogeneity successfully, but not others. We discuss three major dimensions of heterogeneity below: program services, program objectives, and target populations. Of these, heterogeneity in target populations poses the greatest challenge to a multi-site evaluation. We would not rule out joint evaluations of programs with heterogeneous populations, but would urge that caution be exercised before proceeding.
Differences across sites in the types of services offered by programs can easily be accommodated in an evaluation of sites that are homogeneous in other key respects. The evaluator can easily allow for different programmatic impacts across sites. The evaluator can determine whether differences in impacts across sites are statistically significant, but in general will not be able to determine whether differences are due to specific program features or to environmental factors.(11) If this is the only substantial difference between multiple sites, no matter how many, it would make statistical sense to pool their evaluations because the evaluator can take advantage of the fact that effects of other factors (i.e., control variables) on outcome variables are likely to be similar across sites to improve the precision of the estimates (see Chapter Eight). This may be true for selected groups of responsible fatherhood programs.
If there are a very large number of sites that differ only in services provided, and if their programs can be classified in a meaningful way, the evaluator might also be able to demonstrate that some program features, and/or some local factors, are important determinants of success. This scenario appears unlikely for responsible fatherhood programs, however, because they are small in number and heterogeneous in other respects that are less amenable to multi-site evaluations.
We have observed substantial variation in program objectives for clients across the programs with which we are familiar. This variation is likely to be reflected in the impact of the program on various outcome variables that might be used in an evaluation. For instance, a program that places primary emphasis on helping the father obtain employment is likely to have a different impact on employment outcomes than one that focuses more directly on establishing or improving the relationship between the father, his child, and the child's mother. In an extreme case, an outcome variable that seems an appropriate one for one program, given the program's objectives, may seem inappropriate for another program that has different objectives.
Differences in program objectives alone, however, should not stand in the way of multi-site evaluations. It must be recognized that differing objectives result in variation in services (see above) and are likely to be reflected in variation in measured program impacts across the multiple outcome variables. Effects of other factors (i.e., control variables) on outcomes are likely to be similar across sites, so it makes statistical sense to pool the data, but allow for cross-site variation in impacts.
Although multi-site evaluations of programs with differing objectives may be statistically advantageous, there are some negative aspects of such evaluations. First, the program staff may not want to have the impacts of their programs compared to those for other programs on outcomes they may regard as tangential to their primary objectives. Second, a multi-site evaluation will require collection of common data at all sites, some of which might not be collected from all sites if individual evaluations were conducted. Further, data that may have unique importance to one site might be collected for an evaluation of that site alone, but might not be collected for a multi-site evaluation.
There is substantial variation in target populations across the sites that we have observed, and this variation is the greatest challenge to multi-site evaluations. Evaluations of programs that have similar target populations may be pooled successfully, whereas pooling evaluations of programs with dissimilar target populations would not be very useful and could be misleading. As in the previous section, target populations are sometimes implicitly defined by methods used by programs to identify and recruit fathers, so similarity in these methods across programs may be required to make joint evaluation attractive.
It might be reasonable, for instance, to pool the evaluations of programs that target non-custodial fathers of newborns, especially if those fathers are from communities that have similar demographic and socioeconomic characteristics, and are identified and recruited in a similar fashion (e.g., through the maternity ward at a community hospital). As another example, it might also be reasonable to pool the evaluations of programs that target all low-income non-custodial fathers in a defined geographic area, especially if the areas have similar demographic and socioeconomic characteristics and if fathers are identified and recruited in a similar fashion. For instance, joint evaluation of the multiple IRFFR sites may be reasonable, although further review of the target populations and methods used to identify and recruit fathers at IRFFR sites may be advisable before making such a determination.
The reason that pooling data from sites with similar target populations is attractive, regardless of variation in program objectives and/or services, is that the effects of other, non-programmatic variables on outcome variables are likely to be similar across populations and pooling the data will improve the evaluator's ability to control for such factors. If, however, the target populations are very dissimilar, the effects of non-programmatic variables on outcome variables may vary substantially across the target populations. There would then be no advantage to pooling, and potential harm. We would be skeptical, for instance, about joint evaluation of a program that targets non-custodial fathers of children in Head Start programs with a program that targets non-custodial fathers who have been identified through the criminal justice system. This would not be as big a concern, however, if the primary purpose of the evaluation were to determine if the program treatment works equally well in different populations. Information on the characteristics of both populations would still need to be collected in order to explain why the impact of the program differed between the groups, if significant differences were observed.
Potential harm from joint evaluations of programs with disparate target populations may occur in several ways. First, the evaluators may unnecessarily constrain their data collection activities in each site because they plan to use the data for a common evaluation. Second, data may not be comparable across sites because of differences in data collection methodologies that must be implemented to accommodate differences in the target populations. Third, the evaluators may pool the data without testing whether the effects of control variables in the multiple populations are similar, which could lead to biased estimates of program impacts. The last problem can be avoided through appropriate testing, but if sample sizes are small the power of the tests (their ability to detect important differences in the effects of the control variables) may be low.
In the introduction we presented five broad criteria for selecting the major design features. The most important aspects of each design feature with respect to each of the criteria are summarized in Exhibit 3.4.
We recommend that an experimental design be carefully considered before considering alternatives. A carefully implemented experimental design will provide the highest quality findings, and findings that are least open to challenge. It may be that ethical or practical considerations will make an experimental design unattractive, or that potential sample sizes are too small. The randomized outreach design addresses some of the ethical and practical issues that may make the experimental design infeasible, while preserving the use of randomization to control for differences in unobserved factors. It, too, may not be feasible or may be too costly. Impediments to implementing a non-experimental design are easier to overcome, and sample sizes obtainable may be larger, but questions concerning the adequacy of controls for differences between treatment and comparison groups are likely to arise.
We also recommend that joint evaluations of multiple sites be carefully considered. While there may be important reasons not to pursue this option, the gains from increasing sample sizes and improving comparability of findings across sites could be very large.
| Alternative | Feasibility | Impact Estimator Bias | Estimator Precision | Cost | Other |
|---|---|---|---|---|---|
| Experimental Design | Problematic for programs with excess capacity; ethical concerns likely; may require cooperation of referral sources | Spillover effects likely; best way to control for participant/non-participant differences | Likely to be constrained by small sample size | Likely to be least expensive alternative | Easiest to interpret |
| Non-Experimental Design | Requires identification of reasonable comparison population; requires collection of data from comparison population | Unobserved differences between treatment and comparison groups may not be adequately controlled; outcome differences may reflect environmental differences | Less likely to be constrained by small sample size than experimental design | Data collection will be more expensive than under an experimental design, holding sample size constant, because it will come from two populations; obtaining the larger sample that this design makes possible will also add to cost | |
| Random Outreach Design | Requires implementation of random outreach; may require cooperation of referral sources | May be biased if program has different impact on participants induced by random outreach than on others; preserves use of randomization to control for participant/non-participant differences | Relies on effectiveness of random outreach; more likely to be constrained by small sample size than experimental design | Requires larger sample than experimental design for given precision; outreach may be costly; analysis is somewhat more complex | Can analyze effectiveness of experimental outreach; little or no experience in use of method to evaluate other programs |
| Independent Evaluation of Multiple Sites | Evaluation may be tailored for each site's program and circumstances; design and data collection constraints in one site need not constrain design in other sites | No problems other than as above | Will be very poor for small sites | Same comparison group may be used for multiple sites in non-experimental design | Cross-site differences in data collection and analyses may make comparison of results problematic |
| Joint Evaluation of Multiple Sites | Requires reasonable comparability of target populations; common evaluation methodology for all sites may not be the best design for any single site | Inappropriate pooling can cause bias, but this can be tested | Precision is substantially enhanced through pooling of data if target populations are sufficiently similar across sites | Costs to resolve cross-site differences and coordination requirements may be significant; economies of scale from multi-site evaluation will be realized; joint analysis of data is slightly more costly than separate analyses | Comparability of results across sites will be assured; cross-site variation in data collection methodology may cause bias or reduce the potential gains in estimator precision |
1. Differences in means include differences in percents for outcome variables that indicate whether or not an outcome for an individual satisfies a specific condition (e.g., has visited with the child at least once in the past week).
2. See Bloom, H.S. (1984). "Accounting for No-Shows in Experimental Evaluation Designs." Evaluation Review, vol. 8 (April), pp. 225-246. This simple formula relies on the assumption that participation has a constant effect on the outcome expected for an individual in the absence of participation, which may be incorrect. An equally simple formula is applicable under the assumption that the size of the impact for an individual is proportional to the individual's outcome in the absence of participation. See Chapter Eight for a discussion of other possibilities that allow for interactions between the magnitude of the impact and baseline characteristics of fathers.
3. As mentioned in a previous footnote, the magnitude of the program's impact may vary with baseline characteristics of the father. This issue could be conveniently studied in the context of the multivariate analysis.
4. This assumes a one-tailed test. See Exhibit VI.1.
5. Some program evaluations use non-participants from a program's target population, and/or program dropouts, for the comparison group. This approach is problematic because participants are self selected. Clever use of the data can sometimes solve the self-selection problem. See Bell, S. et al. (1995) Program Applicants as a Comparison Group in Evaluating Training Programs, Upjohn Institute: Kalamazoo, MI.
6. An earlier evaluation compared similar data for Racine and Fond du Lac Counties, the two pilot counties for Children's First. At the time (before 1991), the Racine program offered substantially more employment services than the Fond du Lac program, through JOBS, but not other substantial services. Measured impacts for the Racine program were substantially greater than for the Fond du Lac program.
7. See Goldberger, A.S. (1972) "Selection Bias in Evaluating Treatment Effects," Discussion Paper 123-72, Institute for Research on Poverty, University of Wisconsin-Madison.
8. As in the formula presented previously for the experimental design, this formula assumes that the impact of participation is the same for all participants. Interactions between participation impacts and baseline characteristics of participants can be incorporated in multivariate models.
9. For those familiar with multivariate selectivity models, the participation results would be used to construct an instrument for a dummy variable that identifies participants. The instrument's value would, in part, depend on the randomized outreach indicator and would be key to avoiding high collinearity between the instrument and control variables that might appear in the outcome equation.
10. Even if there is no statistical advantage to joint evaluation of multiple sites, it may be economically efficient to have a single evaluator evaluate multiple sites simultaneously. There will be many common features of data collection instruments and other aspects of the evaluation, and experience gained in implementing an evaluation of one site will benefit evaluations of other sites.
11. A process evaluation of each site would likely provide explanations for variation in program impacts across sites, although they would not be definitive.