## Grant/Protocol Development

**Contact Information**

We are committed to assisting you as you develop your grant/protocol. See Role of Biostatistics for the many ways in which we can contribute to your research goals. Although you may be collaborating with several members of the department, your primary contact should be the lead statistician assigned to work with you. Over the years, we have established relationships with many of the research groups on campus. To see if someone is already designated to assist you, please refer to the document Contact Information. Although certainly not exhaustive, this list contains many of the major project areas in which we are involved.

If you do not feel your project fits into any of the applied areas outlined in the Contact Information section, please contact Dr. Barry Katz (274-2674) or George Eckert (274-2884) for further assistance.

**Information Needed for Grant/Protocol Development**

The primary information we need to begin developing the data management, sample size justification, and analysis plan for the proposed grant/protocol is a draft of the grant/protocol itself. The Specific Aims and Research Design and Methods sections are of utmost importance, but having the Background and Significance and Preliminary Studies sections is also important because they can provide valuable information for developing both the sample size justification and the overall analysis plan.

**General Data Management Methods**

A complete grant application/protocol needs to have a comprehensive data management plan that describes how the data will be collected, stored, manipulated, and processed. The data management plan should include details about the choice of software, a description of the data entry systems, proposed quality control methods, and security precautions.

*Data Management Team*

The data management team, which is often identified during the proposal stage, will generally include a primary data manager and a database technician. For larger proposals, a second data manager and a data management assistant may also be included. If the Department is responsible for any data entry, a data entry clerk will also be part of the team. The primary data manager is responsible for all aspects of data management, while the database technician assists with programming, verification, and printing tasks. Data entry clerks are responsible for the entry, verification, and filing of data.

**General Statistical Methods**

A complete grant application/protocol needs to have a sample size justification and a thorough analysis plan that tests the hypotheses presented in the specific aims and includes the choice of technique, validation of assumptions, and alternative plans for when the assumptions are violated. This may even require some mention of new methodological development. External review committees will evaluate the appropriateness of these sections, but so will internal review panels such as the IRB and the Animal Care Committee, as well as specific scientific review groups such as those associated with the General Clinical Research Center (GCRC) and the Cancer Center.

*Statistical Analysis Team*

The analysis team is usually identified during the proposal stage. Large research projects will generally involve a statistical team made up of a biostatistics faculty member and a master's-level staff biostatistician. It is not unusual for the analysis team of a center, program project, or large data-oriented study to include a second faculty member and/or staff biostatistician. The team may also include a faculty member with special expertise in a particular statistical area required for the project. A senior staff biostatistician might handle smaller projects independently. Occasionally, the team might include only the faculty member, when the analysis role is less intense but additional input on the final design or the choice of techniques is needed during the project.

*Study Design*

The design of the research study is a key part of the collaborative process. Clearly, the design must produce data that can be used to meet all of the aims and test all of the hypotheses in the study. Most projects require simple, straightforward designs that are often arrived at early in the process. Yet even for these, a statistical approach may serve to minimize the required sample size through matching or adjustment for important demographic or clinical factors. The protocol might also be streamlined in other ways, such as reducing the length of the study or the number of data collection points.

It is impossible to list all of the factors that might be considered for every study, but some general examples follow. In epidemiological studies, the prevalence of a disease or condition is often a determining factor when choosing between a cohort and a case-control design, although the strengths and weaknesses of each must be carefully considered as well. Clinical databases on campus also allow us to consider a third option, the retrospective cohort study, when the needed data are collected routinely. For randomized clinical trials, the randomization plan is often a key element, and the need for a multi-center trial often hinges on issues of statistical power. In some cases, where subjects are few and intervention effects transitory, a crossover design can be used. Finally, even in laboratory-based studies, efficient designs can save both effort and resources. For example, when multiple factors are being studied and higher-order interactions can be ignored, a fractional factorial design can greatly reduce the needed sample size while preserving statistical power for the tests of key hypotheses. In the design stage, it is never too soon to begin collaborating with a biostatistician.
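To make the fractional factorial idea concrete, here is a minimal Python sketch (purely illustrative; the factor count and defining relation are assumptions, not part of any specific protocol) that builds a 2^(4-1) half-fraction by aliasing the fourth factor with the three-way interaction of the first three:

```python
from itertools import product

def half_fraction(k=4):
    """Build a 2^(k-1) fractional factorial design in -1/+1 coding.

    The first k-1 factors form a full factorial; the k-th factor is set
    to the product of the others (defining relation D = ABC for k = 4),
    so it is aliased only with the highest-order interaction.
    """
    runs = []
    for levels in product([-1, 1], repeat=k - 1):
        last = 1
        for v in levels:
            last *= v  # product of the first k-1 columns
        runs.append(levels + (last,))
    return runs

for run in half_fraction(4):
    print(run)
```

This yields 8 runs instead of the 16 required by a full 2^4 factorial; main effects remain estimable as long as the aliased three-way interaction can safely be ignored.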

*Sample Size Justification*

All studies should have a justification for the sample size to be used, whether the samples are people, rats, cell cultures, or any other experimental unit. In some cases a review committee will accept historical precedent (i.e., "I always use 6 per group, my mentor used 6 per group, and his mentor used 6 per group"), but a statistical justification based on the power of the analysis to detect a scientifically important difference is certainly preferable and is now being required by more and more funding agencies. This is understandable, since sample sizes that are too large waste resources and those that are too small lead to inconclusive studies. For simple studies with continuous outcomes, calculations depend on the expected means (or the minimally important difference between means) and the standard deviation/variance. For comparing proportions, only the expected proportions are needed. These quantities are generally available from pilot data or past studies with which the investigator was involved, or can often be obtained from the literature. When they are available, sample size estimates can be easily calculated. Similarly, calculations for studies to detect linear association using correlation coefficients require only the magnitude of the correlation coefficient to be detected. Links to some simple web-based sample size and power calculators are http://calculators.stat.ucla.edu/ and http://www.stat.uiowa.edu/%7Erlenth/Power/. Generally, at least 80% power, and sometimes 90%, is needed for a well-designed study.
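For the simple two-group continuous-outcome case described above, the standard normal-approximation formula can be sketched in a few lines of Python (an illustrative calculation only, using the Python standard library rather than our usual statistical software; the effect size and standard deviation below are hypothetical):

```python
import math
from statistics import NormalDist

def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Per-group sample size for a two-sided, two-sample comparison of
    means: n = 2 * ((z_{1-alpha/2} + z_{power}) * sd / delta)^2,
    rounded up to the next whole subject."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) * sd / delta) ** 2)

# Detect a difference of half a standard deviation:
print(n_per_group(delta=0.5, sd=1.0))              # 63 per group at 80% power
print(n_per_group(delta=0.5, sd=1.0, power=0.90))  # 85 per group at 90% power
```

Note how the required sample size grows as the target power rises from 80% to 90%; the same formula shows it growing quadratically as the detectable difference shrinks.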

More complicated studies generally require specialized software and additional biostatistical input to calculate sample size. Studies involving multiple regression models (linear or logistic) usually require additional knowledge about the strength of the associations of all the factors with the outcome and with each other. Similarly, calculations for survival analysis require knowledge of the distribution of deaths (or failures) over time and the pattern of censoring. Correlated data often result from multiple observations on the same individual; this happens over time in longitudinal studies and also within the same time point, such as measures on two eyes in ophthalmology or multiple teeth in dental studies. Data from some studies have both of these issues, and some also have correlations among individuals who are clustered (e.g., treated by the same physician). Knowledge of the strength of these correlations is needed to calculate power, and these estimates usually come from past data as well. Even armed with all of these estimates, direct methods for estimating power may not exist for a particular method, and the only solution may be to simulate data under the expected conditions to calculate the needed sample size. In these cases, the biostatistician needs more lead time to write the programs. Sometimes a situation that seems simple hides statistical issues, so it is always a good idea to discuss the sample size with the statistical analysis team.
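When no closed-form power calculation exists, simulation is the fallback described above. The following sketch is a toy version of the idea, with hypothetical parameter values and a normal approximation standing in for the exact t-test; it estimates power as the rejection rate across simulated studies:

```python
import math
import random
import statistics
from statistics import NormalDist

def simulated_power(n, delta, sd=1.0, alpha=0.05, n_sims=2000, seed=42):
    """Estimate power by simulation: repeatedly draw two normal samples
    whose means differ by `delta`, test the difference with a two-sample
    z-style test (a normal approximation to the t-test, adequate for
    moderate n), and report the fraction of rejections."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(n_sims):
        a = [rng.gauss(0.0, sd) for _ in range(n)]
        b = [rng.gauss(delta, sd) for _ in range(n)]
        se = math.sqrt(statistics.variance(a) / n + statistics.variance(b) / n)
        if abs((statistics.fmean(b) - statistics.fmean(a)) / se) > z_crit:
            rejections += 1
    return rejections / n_sims

print(simulated_power(n=63, delta=0.5))  # close to the 80% target
```

The same skeleton extends to correlated or clustered data by changing only the data-generation step, which is exactly why simulation-based power calculations require extra programming lead time.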

*Choice of Statistical Techniques*

The critical issue in choosing the statistical methods for the analysis plan is whether the specific aims are fulfilled and the key hypotheses tested. Once these scientific questions and statements are translated into statistical hypotheses, other data-related issues can be considered. The initial consideration is the distribution of the outcome measure. Data that are dichotomous or categorical should be analyzed with methods designed for categorical data. Similarly, "classical" methods are generally appropriate for data that are approximately normal, or for moderate to large samples where the means are approximately normal. In some cases, such as survival data with censored observations or counts, specialized modeling techniques are required. Most statistical methods assume that the observations are independent. This assumption is violated for repeated observations on the same person (or animal) and sometimes when observations are clustered, such as utilization data from patients treated by the same physician. In these cases, mixed effects models or generalized estimating equations are often used. Since many studies have multiple outcome measures or subgroups of interest, another issue that often needs to be addressed is the effect of multiple tests on the Type I error (i.e., the chance of falsely rejecting the null hypothesis). A final, nearly universal issue is the handling of missing data. Solutions depend on why the data are missing and can range from simply using the data that exist, to multiple imputation, to computer-intensive methods that model the missing data mechanism.
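To illustrate the multiple-testing point, here is a small sketch (the p-values are hypothetical, and Bonferroni and the less conservative Holm step-down procedure are just two of many possible corrections) of how raw p-values might be adjusted to control the family-wise Type I error rate:

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(p * m, 1.0) for p in p_values]

def holm(p_values):
    """Holm step-down adjustment: controls the family-wise error rate
    like Bonferroni, but is uniformly less conservative."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, (m - rank) * p_values[i])
        adjusted[i] = min(running_max, 1.0)
    return adjusted

raw = [0.01, 0.04, 0.03, 0.20]  # hypothetical p-values from four tests
print(bonferroni(raw))
print(holm(raw))
```

With four tests, a raw p-value of 0.04 is no longer significant after either correction, which is exactly the kind of consequence an analysis plan should anticipate.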

*Assessment of Statistical Assumptions*

A complete data analysis plan must also include methods for verifying the assumptions needed to implement the proposed statistical techniques. These usually involve graphical and descriptive methods for assessing distributions, equality of variances, and independence of observations, applied both to the original data and to the residuals from the statistical model. The plan should also outline related methods for assessing the fit of the statistical models being used, as well as alternative methods to be used when the assumptions do not hold.
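Graphical diagnostics cannot be shown here, but the numeric side of such checks can be sketched. The following example (standard library only; the simulated data and the crude variance-ratio comparison are illustrative assumptions, not our standard diagnostic battery) computes residuals from a simple least-squares fit and a rough constant-variance check on them:

```python
import random
import statistics

def ols_residuals(x, y):
    """Residuals from an ordinary least-squares line y = a + b*x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]

def diagnostics(x, resid):
    """Crude numeric checks on the residuals: the mean should be near
    zero, and the residual variance should be similar in the lower and
    upper halves of x (a rough constant-variance check)."""
    pairs = sorted(zip(x, resid))
    half = len(pairs) // 2
    lo = [r for _, r in pairs[:half]]
    hi = [r for _, r in pairs[half:]]
    return {
        "resid_mean": statistics.fmean(resid),
        "var_ratio": statistics.variance(hi) / statistics.variance(lo),
    }

rng = random.Random(1)
x = [i / 10 for i in range(100)]
y = [2 + 0.5 * xi + rng.gauss(0, 1) for xi in x]  # simulated, well-behaved data
resid = ols_residuals(x, y)
print(diagnostics(x, resid))
```

A variance ratio far from 1 would suggest heteroscedasticity and point toward the alternative methods the plan should already have named.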

*Development of New Statistical Methodology*

Sometimes a protocol or proposal requires the development of new statistical methods or extensions of existing methods to new situations. This occurs when the issues considered in the choice of statistical techniques lead to situations where no appropriate methods exist, or as an alternative when the assessment of statistical assumptions reveals a problem. In either case, the application must propose a new method or extension, although the details will often need to be completed as the data are collected and fully examined. Nevertheless, the proposal must contain enough information to give reviewers confidence that the new method can be developed. A track record in methodological research by members of the biostatistical team is extremely helpful in these situations.

*Plans for Implementing the Analyses*

When the analysis plan involves methods beyond commonly used techniques, it is often important to also describe the software that will be used to implement them. Although we use SAS for most of our analyses and it contains a wide array of techniques, other software packages are sometimes necessary. S-PLUS is also used frequently, since it often implements algorithms for new methods sooner than SAS. Specific techniques are often handled by specialized programs such as SOLAS for missing data analysis or BUGS for Markov Chain Monte Carlo methods, which can be computationally intensive.

**Required Information Needed for Budget Development**

Basic background information we must have in order to prepare a budget estimate:

- Draft of Grant/Protocol
- Project Title
- Primary Investigator (with phone number and email address)
- Funding Agency
- Submission Date
- Funding Date
- Length of Funding
- Budget Contact (with phone number and email address)
- Cancer Center Membership (yes or no)

Other information that we may need, depending on the situation, or that would greatly help us produce the most accurate estimate possible:

- Drafts of all data collection tools
- Approximate number of interviewers/data collectors (if applicable)
- Ballpark estimate of how much money/percent of time will be available for Biostatistics services