Scientific and Statistical Database Management, International Conference on (2006)
July 3, 2006 to July 5, 2006
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/SSDBM.2006.30
Stefania Bellavia, University of Florence, Italy
Stefano Lodi, University of Bologna, Italy
Benedetta Morini, University of Florence, Italy
Kernel density estimators are a popular family of nonparametric estimators with applications to exploratory statistics and data mining. Since kernel estimators must be constructed from the data, if the data are sensitive, only indirect representations of the estimate, such as graphs or tabulations, can be stored or transmitted. However, even such representations might contain enough information to allow for data reconstruction, yielding an inference problem for kernel estimates. The inference problem for kernel estimators can be described by a system of nonlinear equations that arises naturally from the kernel estimate of a multivariate dataset. The solution to the system is the set of data from which the kernel estimate was computed and, in practice, a good approximation to the solution is not available. A serious threat to data privacy is posed by publicly available solvers for nonlinear systems. This paper investigates the numerical solution of the nonlinear systems arising from the kernel estimate of a multivariate dataset and shows that this task is challenging. In fact, the Jacobian matrix of the system is numerically singular, and a large number of solvers for nonlinear equations fail because they have to solve linear systems whose coefficient matrix is the Jacobian. Further, up-to-date solvers for optimization problems that do not suffer from this drawback may also fail to solve the nonlinear system. To show this, we tested a subspace trust-region method, a BFGS method and a gradient projection method on both a synthetic and a real dataset. These methods are able to find a solution to the optimization problem even starting far from it. However, the experimental results on both the synthetic and the real dataset show that, if the initial guess is not very close to the solution, all three methods [...]
Consider, for instance, an investment bank database. Different customers have invested their savings in two funds in different amounts. If a bi-dimensional kernel estimate is displayed as a three-dimensional graph, then it might be possible to derive a set of (x, y, z) triplets describing the input/output relationship of the estimate at given points on the plane. When results are presented, the parameters of the estimate are usually communicated, so the analytical form of the estimate is entirely known except for the data points. This knowledge could be exploited to recover the data points by attempting to solve a system of equations having the data points as variables. The reconstructed dataset could then be compared with information leaked from other sources in order to assign the reconstructed points to individual customers.
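As a rough illustration of the system described above, the following sketch sets up the residuals whose common zero an attacker would look for. It is not the authors' formulation: the Gaussian product kernel, the bandwidth h, the three "secret" data points, the sampling locations and all function names are assumptions chosen only to make the example concrete. It also shows that, because the estimate is a symmetric sum over the data points, any reordering of the true data satisfies the system just as well.

```python
import math

def gaussian_kde_2d(point, data, h):
    """Gaussian kernel estimate at `point` for 2-D `data` with bandwidth h (assumed kernel)."""
    px, py = point
    norm = 1.0 / (len(data) * 2.0 * math.pi * h * h)
    total = 0.0
    for (x, y) in data:
        u2 = ((px - x) ** 2 + (py - y) ** 2) / (h * h)
        total += math.exp(-0.5 * u2)
    return norm * total

def residuals(candidate, triplets, h):
    """F_j(X) = f_hat(p_j; X) - z_j for each observed triplet (x_j, y_j, z_j)."""
    return [gaussian_kde_2d((x, y), candidate, h) - z for (x, y, z) in triplets]

def sum_sq(candidate, triplets, h):
    """Least-squares objective that an optimization-based attack would minimize."""
    return sum(r * r for r in residuals(candidate, triplets, h))

# Hypothetical amounts invested by three customers in two funds.
secret = [(1.0, 2.0), (2.5, 0.5), (3.0, 3.0)]
h = 0.8  # bandwidth, assumed to be published together with the estimate

# The attacker reads 2 * n = 6 values off the displayed surface:
# one equation per unknown coordinate.
grid = [(0.5, 0.5), (1.5, 2.0), (2.5, 1.0), (3.0, 2.5), (2.0, 3.0), (3.5, 0.5)]
triplets = [(x, y, gaussian_kde_2d((x, y), secret, h)) for (x, y) in grid]

at_truth = sum_sq(secret, triplets, h)                     # residuals vanish at the true data
at_permuted = sum_sq(list(reversed(secret)), triplets, h)  # and, up to rounding, at any reordering
at_poor_guess = sum_sq([(0.0, 0.0)] * 3, triplets, h)      # clearly positive for a poor guess

print(at_truth, at_permuted, at_poor_guess)
```

A solver would start from some initial candidate and try to drive sum_sq (or the residual vector itself) to zero; the abstract's point is that this step is numerically hard unless the initial guess is already very close to the true data.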
S. Bellavia, S. Lodi and B. Morini, "Inferences on Kernel Density Estimates by Solving Nonlinear Systems," 18th International Conference on Scientific and Statistical Database Management (SSDBM), Vienna, 2006, pp. 389-397.