Federal statistical agencies must fulfill two nearly contradictory missions. On the one hand,
they must extract and disseminate to other agencies, the research community and the public useful
information derived from sample surveys and censuses. On the other hand, they must also protect the confidentiality
of the data and the privacy of data subjects. Protecting confidentiality may be mandated by law,
prescribed by agency practices or promised to respondents. Often, confidentiality must be preserved
in order to ensure the quality of the data: respondents may not answer truthfully if they believe that
their privacy is threatened.
The tutorial is an overview of methods known collectively as statistical disclosure limitation (SDL)
that attempt to resolve this contradiction.
The tutorial will introduce participants to fundamental problems and methods of SDL, the latter
ranging from limiting access to data, to altering data prior to release, to releasing only the results of
"safe" statistical analyses of the data.
In particular, the development of computing and statistical technologies and the emergence of the
Internet as the principal mode for disseminating federal data both exacerbate the problems and offer
new kinds of solutions. The tutorial will describe the problems, especially record linkage to external
databases, as well as solutions such as analysis servers that account for the interactions among multiple
queries on the same database.
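As a rough illustration of the record linkage threat mentioned above, the following sketch joins a "de-identified" release to an external file on shared quasi-identifiers. The records, the field names, and the choice of quasi-identifiers are all hypothetical and chosen only for illustration; real linkage attacks typically use probabilistic matching rather than the exact matching assumed here.

```python
# Hypothetical toy data: a release with names removed, and an external
# file (e.g. a public list) that still carries names.
released = [
    {"zip": "27709", "birth_year": 1961, "sex": "F", "diagnosis": "asthma"},
    {"zip": "27709", "birth_year": 1975, "sex": "M", "diagnosis": "diabetes"},
    {"zip": "27514", "birth_year": 1961, "sex": "F", "diagnosis": "flu"},
]
external = [
    {"name": "Person A", "zip": "27709", "birth_year": 1961, "sex": "F"},
    {"name": "Person B", "zip": "27514", "birth_year": 1980, "sex": "M"},
]

# Assumed quasi-identifiers: fields present in both files.
QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")


def key(record):
    """Tuple of quasi-identifier values used as the exact-match key."""
    return tuple(record[q] for q in QUASI_IDENTIFIERS)


def link(released_records, external_records):
    """Return (name, diagnosis) pairs for external records that match
    exactly one released record on the quasi-identifiers."""
    index = {}
    for r in released_records:
        index.setdefault(key(r), []).append(r)
    matches = []
    for e in external_records:
        candidates = index.get(key(e), [])
        if len(candidates) == 1:  # a unique match is an identity disclosure
            matches.append((e["name"], candidates[0]["diagnosis"]))
    return matches


linked = link(released, external)
print(linked)
```

In this toy case only "Person A" matches a single released record, so the sensitive attribute of that record is disclosed even though the release contains no names.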
No deep prior knowledge of data confidentiality, statistics or computer science will be assumed.
This tutorial will focus on essential aspects of data confidentiality and SDL in the electronic world:
- Basics of data confidentiality: fundamental abstractions such as identity, attribute and inferential
  disclosures and disclosure risk.
- Means of breaking confidentiality, especially record linkage to databases containing identifiers.
- A primer on SDL, focusing on the strengths and limitations of techniques for "preventing"
  disclosure, which preserve low-dimensional statistical characteristics of the data, but distort
  disclosure-inducing high-dimensional characteristics. These include aggregation, cell suppression
  in tabular data, data swapping, jittering, "top-coding" (to prevent disclosure on the basis
  of extreme data values), and use of entirely synthetic databases that preserve some characteristics
  of the original data, but whose records simply do not correspond to real individuals or
- Risk-utility formulations, in which quantified measures of disclosure risk and data utility are
  used to provide principled ways of constructing data dissemination strategies.
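To make the last two items concrete, here is a minimal sketch of top-coding together with a toy risk-utility comparison. The income values, the candidate caps, and both measures (share of unique values as "risk", mean absolute distortion as "utility loss") are assumptions made purely for illustration; they are not the measures used in any particular agency's practice.

```python
from collections import Counter


def top_code(values, cap):
    """Top-coding: replace every value above `cap` with `cap`."""
    return [min(v, cap) for v in values]


def unique_fraction(values):
    """Toy risk measure: share of records whose value is unique in the file."""
    counts = Counter(values)
    return sum(1 for v in values if counts[v] == 1) / len(values)


def mean_abs_change(original, released):
    """Toy utility loss: average absolute distortion per record."""
    return sum(abs(o - r) for o, r in zip(original, released)) / len(original)


# Hypothetical incomes in thousands of dollars; the two largest are outliers.
incomes = [30, 30, 45, 45, 60, 60, 250, 900]

# Compare candidate caps by their (risk, utility loss) pairs.
for cap in (1000, 250, 100):
    released = top_code(incomes, cap)
    risk = unique_fraction(released)
    loss = mean_abs_change(incomes, released)
    print(f"cap={cap}: risk={risk:.2f}, utility loss={loss:.2f}")
```

With these numbers, a cap of 250 removes all unique values at a smaller distortion than a cap of 100, which is the kind of trade-off a risk-utility formulation is meant to expose.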
The tutorial will conclude by addressing the increasingly important problem, arising not only in "traditional"
settings but also in the context of homeland security and for proprietary corporate data, of
safely conducting informative statistical analyses on distributed databases whose owners cannot or
will not allow the data to be integrated.
Alan F. Karr is Director of the National Institute of Statistical Sciences (NISS),
a position he has held since 2000; prior to that he was Associate Director (1992-2000). He is also
Professor of Statistics/Operations Research and Biostatistics at the University of North Carolina at
Chapel Hill (since 1993), as well as Associate Director of the Statistical and Applied Mathematical
Sciences Institute (SAMSI).
His research activities are cross-disciplinary collaborations involving statistics and such other fields
as data confidentiality, data integration, data quality, education statistics, software engineering, information
technology, transportation, materials science and e-commerce. He is the author of three books
and nearly 100 scientific papers, a fellow of the American Statistical Association and the Institute of
Mathematical Statistics, a member of the Council of the latter and the Board of Governors of the Interface
Foundation of North America, and served as a member of the Army Science Board from 1990
Alan F. Karr, National Institute of Statistical Sciences, PO Box 14006, Research
Triangle Park, NC 27709-4006; Tel: 919-685-9300; FAX: 919-685-9310; E-mail: firstname.lastname@example.org