%\VignetteEngine{knitr::knitr} %\VignetteIndexEntry{diffpriv} \documentclass[twoside,11pt]{article} % Any additional packages needed should be included after jmlr2e. % Note that jmlr2e.sty includes epsfig, amssymb, natbib and graphicx, % and defines many common macros, such as 'proof' and 'example'. % % It also sets the bibliographystyle to plainnat; for more information on % natbib citation styles, see the natbib documentation, a copy of which % is archived at http://www.jmlr.org/format/natbib.pdf \usepackage{jmlr2e} \usepackage{xspace} % Definitions of handy macros can go here \newcommand{\pckPlain}{diffpriv\xspace} \newcommand{\pck}{\textsf{\pckPlain}\xspace} \newcommand{\R}{\textsf{R}\xspace} \newcommand{\term}[1]{\emph{#1}\xspace} \newcommand{\cf}{\emph{cf.}\xspace} \newcommand{\eg}{\emph{e.g.},\xspace} \newcommand{\ie}{\emph{i.e.},\xspace} \newcommand{\cB}{\ensuremath{\mathcal{B}}\xspace} \newcommand{\cD}{\ensuremath{\mathcal{D}}\xspace} \newcommand{\cR}{\ensuremath{\mathcal{R}}\xspace} \newcommand{\reals}{\ensuremath{\mathbb{R}}\xspace} \renewcommand{\v}[1]{\ensuremath{\mathbf{#1}}} \renewcommand{\Pr}[1]{\ensuremath{\mathrm{Pr}\left(#1\right)}} \newcommand{\Exp}[1]{\ensuremath{\mathbb{E}\left[#1\right]}} \newcommand{\indic}[1]{\ensuremath{\mathbf{1}\left[#1\right]}} \newcommand{\code}[1]{\texttt{#1}\xspace} \newcommand{\myurl}[1]{\texttt{#1}} % \url for JMLR, \texttt for CRAN % Heading arguments are {volume}{year}{pages}{submitted}{published}{author-full-names} \jmlrheading{18}{2017}{1--5}{7/17}{}{}{Benjamin I. P. Rubinstein and Francesco Ald\`{a}} % Short headings should be running head and authors last names \ShortHeadings{\pckPlain: An R Package for Practical Differential Privacy}{Rubinstein and Ald\`{a}} \firstpageno{1} \begin{document} %%\SweaveOpts{concordance=TRUE} \title{\pckPlain: An R Package for Easy Differential Privacy} \author{\name Benjamin I. P. Rubinstein \email brubinstein@unimelb.edu.au \\ \addr School of Computing and Information Systems\\ The University of Melbourne, Parkville, VIC 3010, Australia \AND \name Francesco Ald\`{a} \email francesco.alda@rub.de \\ \addr Horst G\"ortz Institute for IT Security and Faculty of Mathematics\\ Ruhr-Universit\"at Bochum, D-44801 Bochum, Germany} \editor{TBD} \maketitle \begin{abstract}% <- trailing '%' for backward compatibility of .sty file The \R package \pck provides tools for statistics and machine learning under differential privacy. Suitable for releasing analyses on privacy-sensitive data to untrusted third parties, differential privacy has become the framework of choice for privacy-preserving learning. \pck delivers: (a) implementations of generic mechanisms for privatizing non-private target functions, including the Laplace, Gaussian, exponential and Bernstein mechanisms; (b) a recent sensitivity sampler due to \citet{rubinstein2017sampler} that empirically estimates the sensitivity of non-private targets---obviating mathematical analysis for exact sensitivity bounds needed for most generic mechanisms; (c) an extensible framework for implementing differentially-private mechanisms. Together, the components of \pck permit easy high-utility privatization of complex analyses, learners and even black-box software programs. \end{abstract} \begin{keywords} differential privacy, empirical process theory, R, open-source software \end{keywords} \section{Introduction} <>= library(knitr) opts_chunk$set(fig.width=12, fig.height=4, fig.path='', comment="#>", warning=FALSE, message=FALSE, tidy=FALSE, size="small") options(width=60) set.seed(53079239) # install package if necessary: #if(!require("qtl")) install.packages("qtl", repos="http://cran.us.r-project.org") @ Differential privacy~\citep{dwork2006calibrating} has quickly become a key framework for semantic guarantees of data privacy when releasing analysis on privacy-sensitive data to untrusted third parties. The framework's popularity is in part due to a suite of generic mechanisms for privatizing non-private target functions of data \ie statistics, estimation procedures, and learners. Common to most generic mechanisms is the requirement that the non-private target's sensitivity to data set perturbation is known and bounded. Unfortunately, bounding sensitivity can be prohibitively complex for potential end users. This paper describes the \pck \R package that implements generic mechanisms for differential privacy, along with our recent sensitivity sampler that replaces exact sensitivity bounds with empirical estimates~\citep{rubinstein2017sampler}. As a result, \pck privatizes a wide range of procedures under random differential privacy~\citep{hall2012random}, automatically without mathematical analysis and in many cases achieving high utility. \pck v0.4.2 is available from \myurl{https://brubinstein.github.io/diffpriv/} the project homepage and CRAN from \myurl{https://cran.r-project.org/package=diffpriv} under an open-source license. \section{Generic Mechanisms for Differential Privacy} \label{sec:generic} Fundamental to differential privacy is a privacy-sensitive \term{data set} (or \term{database}) $D\in\cD^n$ on \term{domain} \cD. In \pck a data set can be a \texttt{list}, \texttt{matrix}, \texttt{data.frame} or \texttt{vector}. We say that a pair of databases $D,D'\in\cD^n$ is \term{neighboring} if they differ on exactly one record. While individual records should not be revealed, we aim to release aggregate information on $D$ with a mechanism. A \term{mechanism} $M: \cD^n \to \cR$ is a random-valued function of databases taking values in a response set \cR; implemented in \pck as virtual (abstract) class \code{DPMech}. \begin{definition}[\citealp{dwork2006calibrating}] For $\epsilon>0$, mechanism $M: \cD^n \to \cR$ preserves \term{$\epsilon$-differential privacy} if for all neighboring pairs $D,D'\in\cD^n$, measurable response sets $R\subseteq\cR$, $\Pr{M(D)\in R}\leq \exp(\epsilon)\cdot\Pr{M(D')\in R}$. Alternatively for $\delta\in(0,1)$, relaxed \term{$(\epsilon,\delta)$-differential privacy} holds if $\Pr{M(D)\in R}\leq \exp(\epsilon)\cdot\Pr{M(D')\in R}+\delta$. \end{definition} A mechanism preserves DP if its response distributions are close on neighboring pairs: an adversary cannot determine an unknown record from responses, even with knowledge of the remaining records. Privacy parameters $\epsilon$ ($\epsilon,\delta$) are encapsulated in class \code{DPParamsEps} (\code{DPParamsDel}). \code{DPMech} generic method \texttt{releaseResponse(mechanism, privacyParams, X)} takes privacy parameters and a sensitive dataset, to private response. Most generic mechanisms in DP share a number of properties leveraged by the \pck package as follows. \textbf{Privatizing a target function.} Many mechanisms $M: \cD^n\to\cR$ seek to privatize a non-private \term{target} function $f: \cD^n\to\cB$, with range \cB often coinciding with \cR. Accordingly \code{DPMech} objects are initialized with \code{target} slot of type \code{function}. A mechanism's \code{releaseResponse()} method calls \code{target} to form privacy-preserving responses. \textbf{Normed target range space.} Target $f$'s output space \cB is typically imbued with a norm, denoted $\|\cdot\|_{\cB}$, needed for measuring the sensitivity of $f$'s outputs to input perturbation. \pck flexibly represents this norm within \code{DPMech} objects as described next. \textbf{Sensitivity-induced privacy.} Many mechanisms achieve differential privacy by calibrating randomization to the sensitivity of target $f$. Insensitive targets need less response randomization. On a pair of neighboring databases $D,D'\in\cD^n$ the \term{sensitivity} of $f$ is measured as $\Delta(D,D')= \|f(D) - f(D')\|_{\cB}$. \term{Global sensitivity} is the largest such value $\overline{\Delta}=\sup_{D,D'}\|f(D)=f(D')\|_{\cB}$ over all possible neighboring pairs. As we discuss in \citep{rubinstein2017sampler}, a broad class of generic mechanisms, taking sensitivity $\Delta$ as a parameter, are \term{sensitivity-induced private}: for any neighboring pair $D,D'\in\cD^n$ if $\Delta(D,D')\leq\Delta$ then the mechanism $M_{\Delta}$ run with parameter $\Delta$ achieves $\Pr{M_{\Delta}(D)\in R}\leq\exp(\epsilon)\cdot\Pr{M_{\Delta}(D')\in R}$ for all $R\subseteq\cR$. When run with $\Delta=\overline{\Delta}$, this condition holds for all neighboring pairs: $M_{\overline{\Delta}}$ satisfies $\epsilon$-DP. Similarly for $(\epsilon,\delta)$-DP. In fact this is how differential privacy is typically proved for such generic mechanisms. \code{DPMech} objects can be initialized with a \code{sensitivity} argument, stored in an S4 slot of the same name. If the user provides a manually-derived global sensitivity bound $\overline{\Delta}$, then \code{releaseResponse()} responses preserve $\epsilon$- or ($\epsilon,\delta$)-DP (depending on the specific mechanism). We now demonstrate this use case. \subsection{Example: Laplace Mechanism Release of the Sample Mean} \pck implements Laplace~\citep{dwork2006calibrating}, Gaussian~\citep{dwork2014algorithmic}, exponential~\citep{mcsherry2007mechanism}, and Bernstein~\citep{alda2017bernstein} mechanisms as \code{DPMech} subclasses \code{DPMechLaplace}, \code{DPMechGaussian}, \code{DPMechExponential}, and \code{DPMechBernstein}. An exponential example is included below. \code{DPMechLaplace} releases numeric vectors, adopting the $L_1$ norm (sum of absolutes) for $\|\cdot\|_{\cB}$. The mechanism releases vectors in $\cR=\cB=\reals^d$ by adding an i.i.d. sample of $d$ Laplace-distributed random variables with means 0 and scales $\overline{\Delta}/\epsilon$ to $f(D)$ to achieve $\epsilon$-DP. \code{DPMechGaussian} also privatizes numeric responses but under $L_2$-sensitivity and weaker $(\epsilon,\delta)$-DP; \code{DPMechBernstein} leverages the Laplace mechanism to release multivariate real-valued functions. We next demonstrate Laplace privatization of the sample mean on bounded data $\cD^n=[0,1]^n$, for which \cB dimension \code{dims} is one. Global sensitivity is readily bounded as $1/n$. %: For any %neighboring pair $D,D'\in[0,1]^n$, %$\Delta(D,D')=n^{-1}\left|\sum_{i=1}^n D_i - \sum_{i=1}^n D'_i\right|$. %Since $n-1$ records are identical, and all records are in $[0,1]$, this is %$|D_n - D'_n|/n \leq 1/n$. <>= library(diffpriv) f <- function(X) mean(X) ## target function n <- 100 ## dataset size mechanism <- DPMechLaplace(target = f, sensitivity = 1/n, dims = 1) D <- runif(n, min = 0, max = 1) ## the sensitive database in [0,1]^n pparams <- DPParamsEps(epsilon = 1) ## desired privacy budget r <- releaseResponse(mechanism, privacyParams = pparams, X = D) cat("Private response r$response:", r$response, "\nNon-private response f(D): ", f(D)) @ \vspace{-1em} \section{Sensitivity Sampling for Random Differential Privacy} \label{sec:sampler} When \code{target} global sensitivity is supplied as \code{sensitivity} within \code{DPMech} construction, responses are differentially private. Global sensitivity is known for \emph{idealizations} of \eg coefficients for regularized logistic regression~\citep{chaudhuri2009logistic} and the SVM~\citep{rubinstein2012svm,chaudhuri2011erm}. In complex applications such as privatizing a software function, however, \code{target}'s global sensitivity may not be readily available. For such cases, \pck implements the sensitivity sampler of \citet{rubinstein2017sampler} which forms a high-probability estimate of \code{target} sensitivity by repeated probing of sensitivity on random neighboring database pairs, leveraging tools from empirical process theory. Like sensitivity estimates, resulting privacy holds with high probability. \begin{definition}[\citealp{hall2012random}] A mechanism $M$ preserves \term{$(\epsilon,\gamma)$-random differential privacy} (with a corresponding form for $\epsilon,\delta,\gamma$) if $\forall R\subseteq\cR, \Pr{M(D)\in R}\leq\exp(\epsilon)\cdot\Pr{M(D')\in R}$ holds with probability at least $1-\gamma$ over random database pairs $D,D'$. \end{definition} While weaker than $\epsilon$-DP, RDP is arguably more natural than $(\epsilon,\delta)$-DP: The later safeguards all databases but not unlikely responses, while RDP protects against all responses but not pathological databases (as defined by the database sampling distribution). The sampling distribution can be anything meaningful \eg uniform, a Bayesian prior, a density from data privately fit by the Bernstein mechanism~\citep{alda2017bernstein}, etc. The \code{DPMech} method \code{sensitivitySampler(object, oracle, n, m, gamma)} requires: a mechanism \code{object}; a function \code{oracle} which outputs i.i.d. random databases of given size \code{n} which should match the size of input data supplied later in calls to \code{releaseResponse()}; a sensitivity sample size \code{m}; and desired privacy confidence \code{gamma}. Either (but not both) of \code{m}, \code{gamma} can be omitted: the omitted resource will be optimized automatically. For example \code{m} taken small (few hundred) reflects limited sampling time; small \code{gamma} (\eg 0.05) prioritizes privacy. The sensitivity sampler calls appropriate \code{DPMech} \code{sensitivityNorm()} which implements $\Delta(D,D')$ for the mechanism's norm and stored \code{target}. New subclasses of \code{DPMech} need only implement this method in order to take advantage of the sensitivity sampler. After \code{sensitivitySampler()}, subsequent \code{releaseResponse()} results have a privacy parameter slot of type \code{DPParamsGam} indicating response RDP. \subsection{Example: Frequent Characters with Sensitivity Sampling} All \code{DPMech} subclasses are sensitivity-induced private and can be sensitivity sampled. We demonstrate the exponential mechanism, which privately optimizes a given objective $s(r)$ of candidate response $r\in\cR$. Its response distribution is proportional to $\exp(\epsilon \cdot s(r) / (2\Delta))$. Typically $s$ depends on input $D$, and so \code{DPMechExponential} is initialized with \code{target} that takes $D$ and outputs score function $s(r)$. That is, $\cB=\reals^{\cR}$ is a real-valued function space on \cR and the class's \code{sensitivityNorm()} implements the sup-norm (\cf \citealp{rubinstein2017sampler}). In practice, users supply \code{target} as an \R closure as demonstrated below. Given \code{sensitivity} of \code{target}, the mechanism preserves $\epsilon$-DP; if \code{sensitivitySampler()} estimates sensitivity with some \code{gamma}, then confidence $\gamma=$\code{gamma} RDP is preserved. We can apply these ideas to find the most frequent a--z character within the top-10 computer scientist names from Semantic Scholar, subject to individual privacy. Exponential privately maximizes total frequency. But without bounded name lengths, this function has unbounded global sensitivity. We therefore use the sensitivity sampler for $(1,0.1)$-RDP, with an oracle that samples representative U.S. names with package \code{randomNames}. <>= library(randomNames) ## a package that generates representative random names oracle <- function(n) randomNames(n) D <- c("Michael Jordan", "Andrew Ng", "Andrew Zisserman","Christopher Manning", "Jitendra Malik", "Geoffrey Hinton", "Scott Shenker", "Bernhard Scholkopf", "Jon Kleinberg", "Judea Pearl") n <- length(D) f <- function(X) { function(r) sum(r == unlist(base::strsplit(X, ""))) } rSet <- as.list(letters) ## the response set, letters a--z, must be a list mechanism <- DPMechExponential(target = f, responseSet = rSet) mechanism <- sensitivitySampler(mechanism, oracle = oracle, n = n, gamma = 0.1) pparams <- DPParamsEps(epsilon = 1) r <- releaseResponse(mechanism, privacyParams = pparams, X = D) cat("Private response r$response: ", r$response, "\nNon-private f(D) maximizer: ", letters[which.max(sapply(rSet, f(D)))]) @ \vspace{-2em} %Since mechanisms, notably Laplace and exponential, tend to use the norm $\|\cdot\|_{\cB}$ only for computing sensitivity $\Delta(D,D')$, and tend to assume one particular norm, \pck represents a minimal interface to the norm via \code{DPMech} method \code{sensitivityNorm()}. In implemented generic mechanisms that require a fixed norm, this norm is built into the corresponding method. For (future) user-defined mechanisms, function-dependent norms can also be supplied at initialization by extended the class constructor. %pressure phenotype, will consider just the \Sexpr{1+2} individuals with %%<>= %%hist(rnorm(100)) %%@ %\section{R and package versions used} % %<>= %sessionInfo() %@ % Acknowledgements should go at the end, before appendices and references \acks{B. Rubinstein and F. Ald\`a acknowledge the support of the Australian Research Council (DE160100584) and the DFG Research Training Group GRK 1817/1 respectively.} \vskip 0.2in \bibliography{diffpriv} \end{document}