%\VignetteEngine{knitr::knitr}
%\VignetteIndexEntry{diffpriv}

\documentclass[twoside,11pt]{article}

% Any additional packages needed should be included after jmlr2e.
% Note that jmlr2e.sty includes epsfig, amssymb, natbib and graphicx,
% and defines many common macros, such as 'proof' and 'example'.
%
% It also sets the bibliographystyle to plainnat; for more information on
% natbib citation styles, see the natbib documentation, a copy of which
% is archived at http://www.jmlr.org/format/natbib.pdf

\usepackage{jmlr2e}
\usepackage{xspace}

% Definitions of handy macros can go here

\newcommand{\pckPlain}{diffpriv\xspace}
\newcommand{\pck}{\textsf{\pckPlain}\xspace}
\newcommand{\R}{\textsf{R}\xspace}
\newcommand{\term}[1]{\emph{#1}\xspace}
\newcommand{\cf}{\emph{cf.}\xspace}
\newcommand{\eg}{\emph{e.g.},\xspace}
\newcommand{\ie}{\emph{i.e.},\xspace}
\newcommand{\cB}{\ensuremath{\mathcal{B}}\xspace}
\newcommand{\cD}{\ensuremath{\mathcal{D}}\xspace}
\newcommand{\cR}{\ensuremath{\mathcal{R}}\xspace}
\newcommand{\reals}{\ensuremath{\mathbb{R}}\xspace}
\renewcommand{\v}[1]{\ensuremath{\mathbf{#1}}}
\renewcommand{\Pr}[1]{\ensuremath{\mathrm{Pr}\left(#1\right)}}
\newcommand{\Exp}[1]{\ensuremath{\mathbb{E}\left[#1\right]}}
\newcommand{\indic}[1]{\ensuremath{\mathbf{1}\left[#1\right]}}
\newcommand{\code}[1]{\texttt{#1}\xspace}
\newcommand{\myurl}[1]{\texttt{#1}} % \url for JMLR, \texttt for CRAN

% Heading arguments are {volume}{year}{pages}{submitted}{published}{author-full-names}

\jmlrheading{18}{2017}{1--5}{7/17}{}{}{Benjamin I. P. Rubinstein and Francesco Ald\`{a}}

% Short headings should be running head and authors last names

\ShortHeadings{\pckPlain: An R Package for Practical Differential Privacy}{Rubinstein and Ald\`{a}}
\firstpageno{1}

\begin{document}
%%\SweaveOpts{concordance=TRUE}

\title{\pckPlain: An R Package for Easy Differential Privacy}

\author{\name Benjamin I. P. Rubinstein \email brubinstein@unimelb.edu.au \\
       \addr School of Computing and Information Systems\\
       The University of Melbourne,
       Parkville, VIC 3010, Australia
       \AND
       \name Francesco Ald\`{a} \email francesco.alda@rub.de \\
       \addr Horst G\"ortz Institute for IT Security and Faculty of Mathematics\\
       Ruhr-Universit\"at Bochum,
       D-44801 Bochum, Germany}

\editor{TBD}

\maketitle

\begin{abstract}%   <- trailing '%' for backward compatibility of .sty file
The \R package \pck provides tools for statistics and machine
learning under differential privacy. Suitable for releasing analyses
on privacy-sensitive data to untrusted third parties, differential
privacy has become the framework of choice for privacy-preserving
learning. \pck delivers: (a) implementations of generic
mechanisms for privatizing non-private target functions, including the
Laplace, Gaussian, exponential and Bernstein mechanisms; (b) a recent
sensitivity sampler due to \citet{rubinstein2017sampler} that empirically
estimates the sensitivity of non-private targets---obviating mathematical
analysis for exact sensitivity bounds needed for most generic mechanisms;
(c) an extensible framework for implementing differentially-private mechanisms.
Together, the components of \pck permit easy high-utility privatization
of complex analyses, learners and even black-box software programs.
\end{abstract}

\begin{keywords}
differential privacy, empirical process theory, R, open-source software
\end{keywords}

\section{Introduction}

<<knitr_options, include=FALSE>>=
library(knitr)
opts_chunk$set(fig.width=12, fig.height=4, fig.path='', comment="#>",
               warning=FALSE, message=FALSE, tidy=FALSE, size="small")
options(width=60)
set.seed(53079239)
# install package if necessary:
#if(!require("qtl")) install.packages("qtl", repos="http://cran.us.r-project.org")
@

Differential privacy~\citep{dwork2006calibrating} has quickly become a
key framework for semantic guarantees of data privacy when releasing
analysis on privacy-sensitive data to untrusted third parties.
The framework's popularity is in part due to a suite of generic mechanisms
for privatizing non-private target functions of data \ie statistics,
estimation procedures, and learners. Common to most generic mechanisms
is the requirement that the non-private target's sensitivity to data set
perturbation is known and bounded. Unfortunately, bounding sensitivity can be
prohibitively complex for potential end users. This paper
describes the \pck \R package that implements generic
mechanisms for differential privacy, along with our recent sensitivity
sampler that replaces exact sensitivity bounds with empirical
estimates~\citep{rubinstein2017sampler}. As a result, \pck privatizes a wide
range of procedures under random differential privacy~\citep{hall2012random},
automatically without mathematical analysis and in many cases achieving high
utility. \pck v0.4.2 is available from
\myurl{https://brubinstein.github.io/diffpriv/} the project homepage and CRAN
from \myurl{https://cran.r-project.org/package=diffpriv} under an open-source
license.

\section{Generic Mechanisms for Differential Privacy}
\label{sec:generic}

Fundamental to differential privacy is a privacy-sensitive
\term{data set} (or \term{database}) $D\in\cD^n$ on \term{domain} \cD.
In \pck a data set can be a \texttt{list}, \texttt{matrix},
\texttt{data.frame} or \texttt{vector}. We say that a pair of
databases $D,D'\in\cD^n$ is \term{neighboring} if they differ on
exactly one record. While individual records should not
be revealed, we aim to release aggregate information
on $D$ with a mechanism. A \term{mechanism} $M: \cD^n \to \cR$ is a
random-valued function of databases taking values in a response set \cR;
implemented in \pck as virtual (abstract) class \code{DPMech}.

\begin{definition}[\citealp{dwork2006calibrating}]
For $\epsilon>0$, mechanism $M: \cD^n \to \cR$ preserves
\term{$\epsilon$-differential privacy} if for all neighboring pairs
$D,D'\in\cD^n$, measurable response sets
$R\subseteq\cR$, $\Pr{M(D)\in R}\leq \exp(\epsilon)\cdot\Pr{M(D')\in R}$.
Alternatively for $\delta\in(0,1)$, relaxed
\term{$(\epsilon,\delta)$-differential privacy} holds if
$\Pr{M(D)\in R}\leq \exp(\epsilon)\cdot\Pr{M(D')\in R}+\delta$.
\end{definition}

A mechanism preserves DP if its response distributions are
close on neighboring pairs: an adversary cannot determine an
unknown record from responses, even with knowledge of the remaining
records.
Privacy parameters $\epsilon$ ($\epsilon,\delta$) are encapsulated in
class \code{DPParamsEps} (\code{DPParamsDel}).
\code{DPMech} generic method
\texttt{releaseResponse(mechanism, privacyParams, X)} takes privacy
parameters and a sensitive dataset, to private response. Most generic
mechanisms in
DP share a number of properties leveraged by the \pck package
as follows.

\textbf{Privatizing a target function.} Many mechanisms $M: \cD^n\to\cR$
seek to privatize a non-private \term{target} function $f: \cD^n\to\cB$, with
range \cB often coinciding with \cR. Accordingly
\code{DPMech} objects are initialized with \code{target} slot of type
\code{function}. A mechanism's \code{releaseResponse()} method calls
\code{target} to form privacy-preserving responses.

\textbf{Normed target range space.} Target $f$'s output space \cB
is typically imbued with a norm, denoted $\|\cdot\|_{\cB}$, needed
for measuring the sensitivity of $f$'s outputs to input perturbation. \pck flexibly
represents this norm within \code{DPMech} objects as described next.

\textbf{Sensitivity-induced privacy.} Many mechanisms achieve differential
privacy by calibrating randomization to the sensitivity of target
$f$. Insensitive targets need less response
randomization. On a pair of neighboring databases $D,D'\in\cD^n$ the
\term{sensitivity} of $f$ is measured as $\Delta(D,D')= \|f(D) - f(D')\|_{\cB}$.
\term{Global sensitivity} is the largest such value
$\overline{\Delta}=\sup_{D,D'}\|f(D)=f(D')\|_{\cB}$ over all possible
neighboring pairs.
As we discuss in \citep{rubinstein2017sampler}, a broad class of generic
mechanisms, taking sensitivity $\Delta$ as a parameter, are
\term{sensitivity-induced private}: for any neighboring pair $D,D'\in\cD^n$ if
$\Delta(D,D')\leq\Delta$ then the mechanism $M_{\Delta}$ run with parameter
$\Delta$ achieves
$\Pr{M_{\Delta}(D)\in R}\leq\exp(\epsilon)\cdot\Pr{M_{\Delta}(D')\in R}$
for all $R\subseteq\cR$. When run with $\Delta=\overline{\Delta}$,
this condition holds for all neighboring pairs: $M_{\overline{\Delta}}$
satisfies $\epsilon$-DP. Similarly for
$(\epsilon,\delta)$-DP. In fact this is how differential
privacy is typically proved for such generic mechanisms.
\code{DPMech} objects can be initialized with a \code{sensitivity} argument,
stored in an S4 slot of the same name. If the user provides a manually-derived
global sensitivity bound $\overline{\Delta}$, then \code{releaseResponse()}
responses preserve $\epsilon$- or ($\epsilon,\delta$)-DP (depending on
the specific mechanism). We now demonstrate this use case.

\subsection{Example: Laplace Mechanism Release of the Sample Mean}
\pck implements Laplace~\citep{dwork2006calibrating},
Gaussian~\citep{dwork2014algorithmic},
exponential~\citep{mcsherry2007mechanism}, and
Bernstein~\citep{alda2017bernstein} mechanisms as \code{DPMech} subclasses
\code{DPMechLaplace}, \code{DPMechGaussian}, \code{DPMechExponential}, and
\code{DPMechBernstein}. An exponential example is included below.
\code{DPMechLaplace} releases numeric vectors,
adopting the $L_1$ norm (sum of absolutes) for $\|\cdot\|_{\cB}$. The
mechanism releases vectors in $\cR=\cB=\reals^d$ by adding an i.i.d. sample of
$d$ Laplace-distributed random variables with means 0 and scales
$\overline{\Delta}/\epsilon$ to $f(D)$ to achieve $\epsilon$-DP.
\code{DPMechGaussian} also privatizes numeric responses but under
$L_2$-sensitivity and weaker $(\epsilon,\delta)$-DP; \code{DPMechBernstein}
leverages the Laplace mechanism to release multivariate real-valued functions.

We next demonstrate Laplace privatization of the sample mean on
bounded data $\cD^n=[0,1]^n$, for which \cB dimension \code{dims} is
one. Global sensitivity is readily bounded as $1/n$. %: For any
%neighboring pair $D,D'\in[0,1]^n$,
%$\Delta(D,D')=n^{-1}\left|\sum_{i=1}^n D_i - \sum_{i=1}^n D'_i\right|$.
%Since $n-1$ records are identical, and all records are in $[0,1]$, this is
%$|D_n - D'_n|/n \leq 1/n$.

<<genericLaplace, include=TRUE, echo=TRUE, results='markup'>>=
library(diffpriv)
f <- function(X) mean(X) ## target function
n <- 100 ## dataset size
mechanism <- DPMechLaplace(target = f, sensitivity = 1/n, dims = 1)
D <- runif(n, min = 0, max = 1) ## the sensitive database in [0,1]^n
pparams <- DPParamsEps(epsilon = 1) ## desired privacy budget
r <- releaseResponse(mechanism, privacyParams = pparams, X = D)
cat("Private response r$response:", r$response,
  "\nNon-private response f(D):  ", f(D))
@

\vspace{-1em}

\section{Sensitivity Sampling for Random Differential Privacy}
\label{sec:sampler}

When \code{target} global sensitivity is supplied as \code{sensitivity} within
\code{DPMech} construction, responses are differentially private.
Global sensitivity is known for \emph{idealizations} of \eg coefficients for
regularized logistic regression~\citep{chaudhuri2009logistic} and the
SVM~\citep{rubinstein2012svm,chaudhuri2011erm}. In complex applications
such as privatizing a software function, however, \code{target}'s global
sensitivity may not be readily available. For such cases, \pck implements the
sensitivity sampler of \citet{rubinstein2017sampler} which forms a
high-probability estimate of \code{target} sensitivity by repeated probing of
sensitivity on random neighboring database pairs, leveraging tools from
empirical process theory. Like sensitivity estimates, resulting privacy holds
with high probability.

\begin{definition}[\citealp{hall2012random}]
A mechanism $M$ preserves \term{$(\epsilon,\gamma)$-random differential privacy}
(with a corresponding form for $\epsilon,\delta,\gamma$) if
$\forall R\subseteq\cR, \Pr{M(D)\in R}\leq\exp(\epsilon)\cdot\Pr{M(D')\in R}$ holds
with probability at least $1-\gamma$ over random database pairs $D,D'$.
\end{definition}

While weaker than $\epsilon$-DP, RDP is arguably more natural than
$(\epsilon,\delta)$-DP: The later safeguards all databases but not
unlikely responses, while RDP protects against all responses but not
pathological databases (as defined by the database sampling distribution). The
sampling distribution can be anything meaningful \eg uniform, a Bayesian prior,
a density from data privately fit by the Bernstein
mechanism~\citep{alda2017bernstein}, etc.

The \code{DPMech} method \code{sensitivitySampler(object, oracle, n, m, gamma)}
requires: a mechanism \code{object}; a function \code{oracle} which outputs
i.i.d. random databases of given size \code{n} which should match the
size of input data supplied later in calls to \code{releaseResponse()}; a
sensitivity sample size \code{m}; and desired privacy confidence \code{gamma}.
Either (but not both) of \code{m}, \code{gamma} can be omitted: the omitted
resource will be optimized automatically. For example \code{m} taken small
(few hundred) reflects limited sampling time; small \code{gamma} (\eg 0.05)
prioritizes privacy. The sensitivity sampler calls appropriate \code{DPMech}
\code{sensitivityNorm()} which implements $\Delta(D,D')$ for the mechanism's
norm and stored \code{target}. New subclasses of \code{DPMech} need only
implement this method in order to take advantage of the sensitivity sampler.
After \code{sensitivitySampler()}, subsequent \code{releaseResponse()} results
have a privacy parameter slot of type \code{DPParamsGam} indicating response
RDP.

\subsection{Example: Frequent Characters with Sensitivity Sampling}
All \code{DPMech} subclasses are
sensitivity-induced private and can be sensitivity sampled. We demonstrate
the exponential mechanism, which privately optimizes a given
objective $s(r)$ of candidate
response $r\in\cR$. Its response distribution is proportional to
$\exp(\epsilon \cdot s(r) / (2\Delta))$. Typically $s$ depends on
input $D$, and so \code{DPMechExponential} is initialized with \code{target}
that takes $D$ and outputs score function $s(r)$. That is, $\cB=\reals^{\cR}$ is a
real-valued function space on \cR and the class's \code{sensitivityNorm()}
implements the sup-norm (\cf \citealp{rubinstein2017sampler}). In practice,
users supply \code{target} as an \R closure as demonstrated below. Given
\code{sensitivity} of \code{target}, the mechanism preserves $\epsilon$-DP;
if \code{sensitivitySampler()} estimates sensitivity with some \code{gamma},
then confidence $\gamma=$\code{gamma} RDP is preserved.

We can apply these ideas to find the most frequent a--z character within the
top-10 computer scientist names from Semantic Scholar, subject to
individual privacy. Exponential privately maximizes total frequency. But
without bounded name lengths, this function has unbounded global sensitivity. We
therefore use the sensitivity sampler for $(1,0.1)$-RDP, with an oracle that samples
representative U.S. names with package \code{randomNames}.

<<samplerExponential, include=TRUE, echo=TRUE, results='markup'>>=
library(randomNames) ## a package that generates representative random names
oracle <- function(n) randomNames(n)
D <- c("Michael Jordan", "Andrew Ng", "Andrew Zisserman","Christopher Manning",
       "Jitendra Malik", "Geoffrey Hinton", "Scott Shenker",
       "Bernhard Scholkopf", "Jon Kleinberg", "Judea Pearl")
n <- length(D)
f <- function(X) { function(r) sum(r == unlist(base::strsplit(X, ""))) }
rSet <- as.list(letters) ## the response set, letters a--z, must be a list
mechanism <- DPMechExponential(target = f, responseSet = rSet)
mechanism <- sensitivitySampler(mechanism, oracle = oracle, n = n, gamma = 0.1)
pparams <- DPParamsEps(epsilon = 1)
r <- releaseResponse(mechanism, privacyParams = pparams, X = D)
cat("Private response r$response: ", r$response,
  "\nNon-private f(D) maximizer:  ", letters[which.max(sapply(rSet, f(D)))])
@

\vspace{-2em}

%Since mechanisms, notably Laplace and exponential, tend to use the norm $\|\cdot\|_{\cB}$ only for computing sensitivity $\Delta(D,D')$, and tend to assume one particular norm, \pck represents a minimal interface to the norm via \code{DPMech} method \code{sensitivityNorm()}. In implemented generic mechanisms that require a fixed norm, this norm is built into the corresponding method. For (future) user-defined mechanisms, function-dependent norms can also be supplied at initialization by extended the class constructor.

%pressure phenotype, will consider just the \Sexpr{1+2} individuals with

%%<<summary_cross, fig.height=8>>=
%%hist(rnorm(100))
%%@

%\section{R and package versions used}
%
%<<sessionInfo, include=TRUE, echo=TRUE, results='markup'>>=
%sessionInfo()
%@

% Acknowledgements should go at the end, before appendices and references

\acks{B. Rubinstein and F. Ald\`a acknowledge the support
of the Australian Research Council (DE160100584) and the
DFG Research Training Group GRK 1817/1 respectively.}

\vskip 0.2in
\bibliography{diffpriv}

\end{document}