^{1}Frequently, the engineer will change some aspect of a nominal design point and run a simulation to see how the change affects the objective function (for example, take-off gross weight, or TOGW). Or the design process is made configurable, so the engineer can concentrate on accurately modeling one aspect while replacing the remainder of the design with fixed boundary conditions surrounding the focal area. However, both these approaches are inadequate for exploring large high-dimensional design spaces, even at low fidelity. Ideally, the design engineer would like a high-level mining system to identify the pockets that contain good designs and merit further consideration. The engineer can then apply traditional tools from optimization and approximation theory to fine-tune preliminary analyses.

^{2}It serves a primary role in many domains and a complementary role in others by augmenting traditional techniques from numerical analysis, statistics, and machine learning.

^{3} which supports construction of data interpretation and control design applications for spatially distributed physical systems in a bottom-up manner. Used as a basis for describing data mining algorithms, SAL programs also help exploit knowledge of physical properties such as continuity and locality in data fields. In conjunction with this process, we introduce a top-down sampling strategy that focuses data collection on only those regions deemed most important to support a data mining objective. Together, these processes define a methodology for mining in data-scarce domains. We describe this methodology at a high level and devote the major part of the article to two applications that use it.^{4,5}

Here, however, we are interested in a general framework or language that expresses data mining operations on data sets and that can help us study the design of data collection and sampling strategies. SAL is such a framework.^{3,6,7} SAL programs employ vision-like routines to manipulate multilayer geometric and topological structures in spatially distributed data. SAL adopts a field ontology in which the input is a field mapping from one continuum to another. Multilayer structures arise from continuities in fields at multiple scales. Owing to continuity, fields exhibit regions of uniformity, which can be abstracted as higher-level structures that in turn exhibit their own continuities. Task-specific domain knowledge describes how to uncover such regions of uniformity, defining metrics for closeness of both field objects and their features. For example, isothermal contours are connected curves of nearby points with equal (or similar enough) temperature.

In SAL, the primitives are *spatial objects*, the compounds are collections of spatial objects, and the abstraction mechanisms connect collections at one level of abstraction with single objects at a higher level.

• (a) Establish a field that maps points (locations) to points (vector directions, assumed here to be normalized).

• (b) Localize computation with a neighborhood graph, so that only spatially proximate points are compared.

• (c-f) Use a series of local computations on this representation to find equivalence classes of neighboring vectors with respect to vector direction.

• (g) Redescribe equivalence classes of vectors into more abstract streamline curves.

• (h) Aggregate and classify these curves into groups with similar flow behavior, using the exact same operators but with different metrics (code not shown).

^{8}to analysis of diffusion-reaction morphogenesis.

^{9,10–12} Many of these approaches (including ours) rely on capturing a desirable design's properties in terms of a novel objective function. Our work's distinguishing feature is that it uses spatial information gleaned from a higher level of abstraction to focus data collection at the field or simulation code layer.^{13,14}

**de Boor's function** Carl de Boor invented a pocket function that exploits containment properties of the *n*-sphere of radius 1 centered at the origin with respect to the *n*-dimensional hypercube [-1, 1]^{n}. Although the sphere is embedded in the cube, the ratio of the volume of the cube (2^{n}) to that of the sphere grows unboundedly with *n*. This means that a high-dimensional cube's volume is concentrated in its corners (a counterintuitive notion at first). de Boor exploited this property to design a difficult-to-optimize function that assumes a pocket in each corner of the cube (see Figure 6), just outside the sphere. Formally, we can define it as

(1)

(2)

(3)

where **X** is the *n*-dimensional point (*x*_{1}, *x*_{2}, …, *x*_{n}) at which the pocket function *p* is evaluated, **I** is the identity *n*-vector, and ‖·‖ is the *L*_{2} norm.

Obviously, *p* has 2 ^{n} pockets (local minima). If *n* is large (say, 30, which means representing the corners of the *n*-cube will take more than 500,000 points), naive global optimization algorithms will need an unreasonable number of function evaluations to find the pockets. Our goal for data mining here is to obtain a qualitative indication of the existence, number, and locations of pockets, using low-fidelity models or as few data points as possible. Then, we can use the results to seed higher-fidelity calculations. This fundamentally differs from DACE (Design and Analysis of Computer Experiments), ^{12} polynomial response surface approximations, ^{13} and other approaches in geostatistics where the goal is accuracy of functional prediction at untested data points. Here, we trade accuracy of estimation for the ability to mine pockets.
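The volume argument is easy to check numerically. The sketch below uses the standard closed form for the unit *n*-ball volume, π^{n/2}/Γ(n/2 + 1) (a textbook formula, not taken from the article):

```python
import math

def cube_to_ball_ratio(n):
    """Volume of the cube [-1, 1]^n (that is, 2**n) divided by the
    volume of the unit n-ball, pi**(n/2) / Gamma(n/2 + 1)."""
    ball = math.pi ** (n / 2) / math.gamma(n / 2 + 1)
    return 2 ** n / ball
```

For *n* = 2 the ratio is 4/π ≈ 1.27, but by *n* = 30 it exceeds 10^{13}: essentially all of the cube's volume, and every one of the 2^{n} pockets, lies outside the sphere, in the corners.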

**Surrogate function** In this study, we use the SAL vector field bundling code presented earlier along with a surrogate model as the basis for generating a dense data field. Surrogate theory is an established area in engineering optimization, and we can build a surrogate in several ways. However, the local nature of SAL computations means that we can be selective about our choice of surrogate representation. For example, global, least-squares-type approximations are inappropriate because they weight measurements at all locations equally when uncovering trends and patterns in a particular region. We advocate the use of *kriging-type interpolators*,^{12} which are local modeling methods with roots in Bayesian statistics. Kriging can handle situations with multiple local extrema and can easily exploit anisotropies and trends. Given *k* observations, the interpolated model gives exact responses at these *k* sites and estimates values at other sites by minimizing the mean-squared error (MSE), assuming a random data process with zero mean and a known covariance function.

Formally (for two dimensions), we assume the true function *p* to be the realization of a random process such as

(4)

where β is typically a uniform random variate, estimated based on the known *k* values of *p*, and *Z* is a correlation function. Kriging then estimates a model *p'* of the same form, on the basis of the *k* observations,

(5)

and minimizing MSE between *p*' and *p*,

(6)

A typical choice for *Z* in *p'* is σ ^{2}*R*, where scalar σ ^{2} is the estimated variance, and correlation matrix *R* encodes domain-specific constraints and reflects the data's current fidelity. We use an exponential function for entries in *R*, with two parameters *C*_{1} and *C*_{2}:

(7)

Intuitively, values at closer points should be more highly correlated.

We get the MSE-minimizing estimator by multidimensional optimization (the derivation from Equations 6 and 7 is beyond this article's scope):

(8)

This expression satisfies the conditions that there is no error between the model and the true values at the chosen *k* sites and that all variability in the model arises from the design of *Z*. Gradient descent or pattern search methods often perform the multidimensional optimization. ^{12}
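As a rough illustration of Equations 4 through 8, here is a minimal kriging-style interpolator. It assumes a constant trend β estimated by generalized least squares and the two-parameter exponential correlation exp(-*C*_{1} *d*^{*C*_{2}}) on Euclidean distance *d*; the article's exact correlation family and estimator may differ:

```python
import numpy as np

def krige(X, y, C1=1.0, C2=2.0):
    """Minimal kriging-type interpolator (constant trend, exponential
    correlation). Returns a predictor that reproduces y exactly at X."""
    X, y = np.asarray(X, float), np.asarray(y, float)

    def corr(A, B):
        # pairwise correlation exp(-C1 * d**C2) for sites in A and B
        d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
        return np.exp(-C1 * d ** C2)

    R = corr(X, X) + 1e-10 * np.eye(len(X))  # tiny jitter for conditioning
    Rinv = np.linalg.inv(R)
    ones = np.ones(len(X))
    beta = (ones @ Rinv @ y) / (ones @ Rinv @ ones)  # GLS constant trend

    def predict(x):
        r = corr(np.atleast_2d(np.asarray(x, float)), X)[0]
        return beta + r @ Rinv @ (y - beta * ones)

    return predict
```

The predictor has the two properties the text emphasizes: zero error at the *k* observed sites, with all variability away from them coming from the design of the correlation model.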

**Data mining and sampling methodology** The bottom-up computation of SAL aggregates from the surrogate model's outputs will possibly lead to some ambiguous streamline classifications, as we discussed earlier. Ambiguity can reflect the desirability of acquiring data at or near a specified point—both to clarify the correct classification and to serve as a mathematical criterion of information content. We can use information about ambiguity to drive data collection in several ways. In this study, we express the ambiguities as a distribution describing the number of possible good neighbors. This ambiguity distribution provides a novel mechanism for including qualitative information; streamlines whose classifications agree contribute less information for data mining purposes. We thus define the information-theoretic measure *M* (see Figure 5) to be the ambiguity distribution.

We define the functional as the posterior entropy *E*(-log *d*), where *d* is the conditional density over the design space not covered by the current data values. By a reduction argument, minimizing this posterior entropy can be shown to maximize the prior entropy over the sampled design space.^{12} In turn, this maximizes the amount of information obtained from an experiment (additional data collection). Moreover, we also incorporate the ambiguity information as an indicator covariance term in our surrogate model, which is a conventional method for including qualitative information in an interpolatory model.^{11}
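The entropy side of this idea can be sketched in a few lines. The region names and ambiguity counts below are hypothetical, and the article's actual functional and densities are not shown; the sketch simply computes the Shannon entropy of each region's ambiguity distribution and samples next where it is largest:

```python
import math

# Hypothetical per-region ambiguity distributions: counts of streamlines by
# how many "good neighbor" classifications each one admits in that region.
region_dists = {
    'NE': {1: 12, 2: 1},       # classifications mostly agree
    'SW': {1: 4, 2: 5, 3: 4},  # highly ambiguous
}

def entropy(dist):
    """Shannon entropy of a discrete ambiguity distribution (natural log)."""
    total = sum(dist.values())
    return -sum(c / total * math.log(c / total) for c in dist.values() if c)

def most_informative(dists):
    """Region whose ambiguity distribution has maximum entropy: the next
    place to collect data, since agreement carries little information."""
    return max(dists, key=lambda r: entropy(dists[r]))
```

Here `most_informative(region_dists)` picks the ambiguous 'SW' region, mirroring the principle that streamlines that already agree contribute little to data mining.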

**Experimental results** Our initial experimental configuration used a face-centered design (four points, in the 2D case). A surrogate model by kriging interpolation then generated data on a 41^{n}-point grid. We used de Boor's function as the source for data values; we also employed pseudorandom perturbations of it that shift the pockets from the corners somewhat unpredictably.^{15} In total, we experimented with 200 perturbed variations of the 2D and 3D pocket functions. For each of these cases, we organized data collection in rounds of one extra sample each (to minimize the functional). We recorded the number of samples SAL needed to mine all the pockets and compared our results with those obtained from a pure DACE-kriging approach. In other words, we used the DACE methodology to suggest new locations for data collection and determined how those choices fared with respect to mining the pockets.

Figure 7 shows the distributions of the total number of data samples required to mine the four pockets for the 2D case. We mined the 2D pockets with three to 11 additional samples, whereas the conventional kriging approach required 13 to 19 additional samples. The results were more striking in the 3D case: at most 42 additional samples for focused sampling and up to 151 points for conventional kriging. This shows that our focused sampling methodology performs 40 to 75 percent better than sampling by conventional kriging.

Figure 8a describes a 2D design involving only seven total data points that can mine the four pockets. Counterintuitively, no additional sample is required in the lower-left quadrant. Although this will lead to a highly suboptimal design (from the traditional viewpoint of minimizing variance in predicted values), it is nevertheless an appropriate design for data mining purposes. In particular, this means that neighborhood calculations involving the other three quadrants are enough to uncover the pocket in the fourth quadrant. Because the kriging interpolator uses local modeling and because pockets in 2D effectively occupy the quadrants, obtaining measurements at ambiguous locations captures each dip's relatively narrow regime, which in turn helps distinguish the pocket in the neighboring quadrant. Achieving this effect is hard without exploiting knowledge of physical properties—in this case, locality of the dips.

**Jordan forms** A matrix (real or complex) that has *r* independent eigenvectors has a Jordan form consisting of *r* blocks. Each of these blocks is an upper triangular matrix associated with one of the matrix's eigenvectors, and its size describes the corresponding eigenvalue's multiplicity. For the given matrix, the diagonalization thus posits a nonsingular matrix such that

(9)

where

(10)

and the *i*th Jordan block reveals its associated eigenvalue. The Jordan form is most easily explained by looking at how eigenvectors are distributed for a given eigenvalue. Consider, for example, the matrix

(11)

which has eigenvalues at 1, 1, and 2. This matrix has only two eigenvectors, as revealed by the two-block structure of its Jordan form.

The Jordan form shows that there is one eigenvalue (1) of multiplicity 2 and one eigenvalue (2) of multiplicity 1. We say that the matrix has the Jordan structure given by (1) ^{2} (2) ^{1}. In contrast, the matrix

(13)

has the same eigenvalues but a three-block Jordan structure. This is because there are three independent eigenvectors (the unit vectors, actually). The diagonalizing matrix is thus the identity matrix, and the Jordan form consists of three 1 × 1 blocks. The Jordan structure is therefore given by (1) ^{1} (1) ^{1} (2) ^{1}. These two examples show that a given eigenvalue's multiplicity could be distributed across one, many, or all Jordan blocks. Correlating the eigenvalue with the block structure is an important problem in numerical analysis.
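The two cases can be checked numerically by computing geometric multiplicities as the nullity of (matrix − λ*I*). The 3 × 3 matrices below are hypothetical stand-ins with the same eigenvalue structures as Equations 11 and 13, since the article's matrices are not reproduced here:

```python
import numpy as np

A = np.array([[1., 1., 0.],
              [0., 1., 0.],
              [0., 0., 2.]])  # Jordan structure (1)^2 (2)^1: two eigenvectors
D = np.diag([1., 1., 2.])     # Jordan structure (1)^1 (1)^1 (2)^1: three

def geometric_multiplicity(M, lam, tol=1e-8):
    """Number of independent eigenvectors for lam: dim null(M - lam*I)."""
    n = M.shape[0]
    return n - np.linalg.matrix_rank(M - lam * np.eye(n), tol=tol)
```

Both matrices have eigenvalues {1, 1, 2}, but the eigenvalue 1 contributes one eigenvector to `A` (a single size-2 block) and two eigenvectors to `D` (two size-1 blocks).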

The typical approach to computing the Jordan form is to follow the structure's staircase pattern and perform rank determinations in conjunction with ascertaining the eigenvalues. One of the more serious caveats with such an approach involves mistaking an eigenvalue of multiplicity greater than 1 for multiple eigenvalues. ^{16} In Equation 11, this might lead to inferring that the Jordan form has three blocks. The extra care needed to safeguard staircase algorithms usually involves more complexity than the original computation to be performed. The ill-conditioned nature of this computation has thus traditionally prompted numerical analysts to favor other, more stable, decompositions.

**Qualitative assessment of Jordan forms** A recent development has been the acceptance of a qualitative approach to Jordan structure determination, proposed by Françoise Chaitin-Chatelin and Valérie Frayssé.^{17} This approach does not use the staircase idea. Instead, it exploits a semantics of eigenvalue perturbations to infer multiplicity, which leads to a geometrically intuitive algorithm that we can implement using SAL.

Consider a matrix that has eigenvalues with known multiplicities. Any attempt at finding the eigenvalues (for example, determining the roots of the characteristic polynomial) is intrinsically subject to the numerical analysis dogma: the problem being solved will actually be a perturbed version of the original problem. This allows the expression of the computed eigenvalues in terms of perturbations on the actual eigenvalues. The computed eigenvalue corresponding to any actual eigenvalue will be distributed on the complex plane as

(15)

where the phase of the perturbation ranges over one set of equally spaced angles if the perturbation is positive and over the interleaved set if it is negative. Chaitin-Chatelin and Frayssé superimposed numerous such perturbed calculations graphically so that the aggregate picture reveals the multiplicity of the eigenvalue.^{17} The phase variations imply that the computed eigenvalues will lie on a regular polygon's vertices—centered on the actual eigenvalue—where the number of sides is two times the multiplicity of the considered eigenvalue. This takes into account both positive and negative perturbations. Because the perturbation's magnitude influences the polygon's diameter, iterating this process over many perturbation levels will lead to a "sticks" depiction of the Jordan form.

To illustrate, we choose a matrix whose computations will be more prone to finite precision errors. Perturbations on the 8 × 8 Brunet matrix with Jordan structure (-1) ^{1} (-2) ^{1} (7) ^{3} (7) ^{3} induce the superimposed structures in Figure 9.^{17} Figure 9a depicts normwise relative perturbations in the scale of [2 ^{-50}, 2 ^{-40}]. The six sticks around the eigenvalue at 7 clearly reveal that its Jordan block is of size 3. The other Jordan block, also centered at 7, is revealed if we conduct our exploration at a finer perturbation level. Figure 9b reveals the second Jordan block using perturbations in the range [2 ^{-53}, 2 ^{-50}]. The noise in both pictures is a consequence of having two Jordan blocks with the same size and a "ring" phenomenon studied elsewhere.^{18} We do not attempt to capture these effects in this article.
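A quick numerical sketch of the perturbation picture, using a single 3 × 3 Jordan block rather than the Brunet matrix: a perturbation *t* placed in the bottom-left corner splits the triple eigenvalue exactly onto a circle of radius |t|^{1/3}, at the vertices of a triangle that rotates when *t* changes sign, which is where the six sticks come from.

```python
import numpy as np

def perturbed_eigs(lam=7.0, t=1e-6):
    """Eigenvalues of a 3x3 Jordan block for lam with perturbation t in the
    bottom-left corner; they satisfy (x - lam)**3 = t exactly."""
    J = np.array([[lam, 1.0, 0.0],
                  [0.0, lam, 1.0],
                  [t,   0.0, lam]])
    return np.linalg.eigvals(J)
```

Superimposing `perturbed_eigs` for many positive and negative `t` values reproduces, for this toy block, the sticks pattern of Figure 9.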

**Data mining and sampling methodology** For this study, we collect data by random normwise perturbations in a given region, and a SAL program determines multiplicity by detecting symmetry correspondence in the samples. The first aggregation level collects a given perturbation's samples into triangles. The second aggregation level finds congruent triangles via geometric hashing^{19} and uses congruence to establish a correspondence relation among triangle vertices. This relation is then abstracted into a rotation about a point (the eigenvalue) and evaluated for whether each point rotates onto another and whether matches define regular polygons. A third level then compares rotations across different perturbations, revisiting perturbations or choosing new ones to disambiguate (see Figure 10).
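The triangle-congruence step can be sketched by hashing each triangle on its sorted side lengths. This is a simple stand-in for the geometric hashing the article cites, and the tolerance handling (plain rounding) is an assumption:

```python
import itertools
import math
from collections import defaultdict

def congruence_classes(triangles, digits=6):
    """Group triangles (tuples of three (x, y) points) by congruence,
    hashing on sorted side lengths rounded to `digits` places."""
    groups = defaultdict(list)
    for tri in triangles:
        key = tuple(round(s, digits) for s in sorted(
            math.dist(p, q) for p, q in itertools.combinations(tri, 2)))
        groups[key].append(tri)
    return list(groups.values())
```

Triangles that are rotations or translations of each other share a key and land in the same class; vertex correspondences within a class are what the next aggregation level abstracts into rotations about the eigenvalue.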

The end result of this analysis is a confidence measure on models of possible Jordan forms. Each model is defined by its estimates of the eigenvalue and its multiplicity (we work in one region at a time). The measure *M* is the joint probability distribution over the space of eigenvalues and multiplicities.

**Experimental results** Because our Jordan form computation treats multiple perturbations irrespective of level as independent estimates of eigenstructure, the idea of sampling here is not where to collect, but how much to collect. The goal of data mining is hence to improve our confidence in model evaluation.

We organized data collection into rounds of six to eight samples each, varied a tolerance parameter for triangle congruence from 0.1 to 0.5 (effectively increasing the number of models posited), and recorded the number of rounds needed to determine the Jordan form. As test cases, we used the set of matrices Chaitin-Chatelin and Frayssé studied. ^{17} On average, our focused sampling approach required one round of data collection at a tolerance of 0.1 and up to 2.7 rounds at 0.5. Even with a large number of models posited, additional data quickly weeded out bad models. Figure 10 demonstrates this mechanism on the Brunet matrix discussed earlier for two sets of sample points. To the best of our knowledge, this is the only focused sampling methodology for this domain, so we are unable to present comparisons. However, by harnessing domain knowledge about correspondences, we have arrived at an intelligent sampling methodology that resembles what a human would obtain by visual inspection.

*M*. It also assumes that the problems the mining algorithm will encounter are the same as the problems for which it was designed. This is an inheritance from Bayesian inductive inference and leads to fundamental limitations on what we can do in such a setting. For instance, if new data does not help clarify an ambiguity, does the fault lie with the model or the data? We can summarize this problem by saying that the approach requires strong a priori information about what is possible and what is not.

*M* encapsulate knowledge about physical properties, which is what makes our methodology viable for data mining. In the future, we aim to characterize more formally the particular forms of domain knowledge that help overcome sparsity and noise in scientific data sets.

*active learning.*

#### References

**Naren Ramakrishnan**is an assistant professor of computer science at Virginia Tech. His research interests include problem-solving environments, mining scientific data, and personalization. He received his PhD in computer sciences from Purdue University. Contact him at the Dept. of Computer Science, 660 McBryde Hall, Virginia Tech, Blacksburg, VA 24061; naren@cs.vt.edu.

**Chris Bailey-Kellogg**is an assistant professor of computer sciences at Purdue. His research combines geometric, symbolic, and numeric approaches for data analysis and experiment planning in scientific and engineering domains. He received his BS and MS in electrical engineering and computer science from MIT and his PhD in computer and information science from Ohio State University. Contact him at the Dept. of Computer Sciences, 1398 Computer Science Bldg., Purdue Univ., West Lafayette, IN 47907; cbk@cs.purdue.edu.
