Issue No. 02 - Second (2012 vol. 5)
ISSN: 1939-1382
pp: 117-129
P. T. Wood , Dept. of Comput. Sci. & Inf. Syst., Birkbeck, Univ. of London, London, UK
A. Poulovassilis , London Knowledge Lab., Birkbeck, Univ. of London, London, UK
P. Selmer , London Knowledge Lab., Birkbeck, Univ. of London, London, UK
Introduction
Supporting the needs of lifelong learners has led to research into learner centered models of delivering learning resources and opportunities [ 13], [ 12] and into the role of online support in providing careers guidance [ 3]. In this direction, the L4All system aims to support lifelong learners in exploring learning opportunities and in planning and reflecting on their learning [ 4], [ 21]. The L4All system allows users to create and maintain a chronological record of their learning, work and personal episodes—their timelines. This approach is distinctive in that the timeline provides a record of Lifelong Learning, rather than learning at just one stage or period. Also, it provides a tool to understand social as well as educational factors that may influence career decisions and educational choices. In L4All, users' timelines are stored in the form of RDF/S, through the Jena framework, as is information about courses. Users can choose to make their timelines “public” and thus accessible by other users (within “public” timelines, individual episodes may however be marked as being “private,” for example, episodes of a personal nature that the user does not wish to share with other people). This sharing of timelines exposes future learning and work possibilities that may otherwise not have been considered, positioning successful learners as role models to inspire confidence and a sense of opportunity. The system's interface provides screens for the user to enter their personal details, to create and maintain their timeline, and to search over the timelines of other users, based on a variety of search criteria. However, the final evaluation of the L4All system [ 21] concluded that further work is needed from a technological perspective before it can be offered as an institutional service, particularly as relating to the timeline search. 
Specifically, problems were identified with the way in which users specify their search queries and with the ranking of the search results—we discuss these problems in Section 2. In this paper, we describe an alternative approach to supporting users' search over timeline data, based on query approximation and query relaxation techniques.
We begin with an overview of the L4All system in Section 2, to the level of detail necessary for this paper, and discuss its key features and limitations. Motivated by this discussion, we present in Section 3 a prototype system, called ApproxRelax, which allows users to construct their queries more precisely than with L4All, and which supports flexible matching of queries, returning answers ranked in order of their “distance” from the user's query. In Section 4, we compare this new approach with L4All. In Section 5, we discuss related work. We give our conclusions and discuss areas for further work in Section 6.
2. Overview of L4All
Fig. 1 (from [ 21]) shows the main screen of the L4All user interface. At its center is a visual representation of the user's timeline, and the system functionalities are organized around this. Each episode is displayed in chronological order, and is represented by an icon specific to its type (work, university, school, travel, etc.) and by a horizontal block representing its duration. Details of an episode can be viewed by clicking on the block representing it, causing a “balloon” to pop up containing more detailed information about the episode (dates, description), as well as access to edit and deletion functions.

Fig. 1. L4All main user interface screen.

There are some 20 types of episode supported by the system, each belonging to one of four categories: Educational, Occupational, Personal, and Other. Some types of episode are annotated by the user, when they create the episode, with a primary and possibly a secondary classification. These classifications are drawn from standard United Kingdom occupational and educational taxonomies. In particular, all Educational episodes are classified by a subject from the Labour Force Survey Subject of Degree (SBJ) classification and a qualification level from the National Qualifications Framework (NQF). All work and voluntary Occupational episodes are classified by an industry sector from the Standard Industrial Classification (SIC) and an occupation/position from the Standard Occupational Classification (SOC). We refer the reader to the Labour Force Survey User Guide for details of these standards. In the L4All system, each classification hierarchy is limited to at most four levels, so each classification annotation consists of between 1 and 4 identifiers, depending on the depth of the selected concept within the hierarchy.
A key aim of the L4All system is to allow learners to search over the timeline data, and to identify possible choices for their own future learning and professional development by seeing what others with a similar background have gone on to do. In particular, Van Labeke et al. [ 22], [ 20] describe a facility that is provided by the system for searching for “people like me.” This facility allows the user to specify which parts of their own timeline should be matched with other users' (public) timelines, by selecting which types of episodes should be matched. The user also selects the similarity metric that should be applied and the “depth” of episode classification that should be taken into account when their episode data are being matched with that of others (i.e., whether 0, 1, 2, 3, or 4 of the identifiers comprising each classification annotation should be taken into account). The similarity metric can be one of: Jaccard Similarity, Dice Similarity, Euclidean Distance, and Needleman-Wunsch Distance, each described in nontechnical terms for the user. In order for the system to be able to apply these similarity metrics, the users' timelines are encoded as token-based strings. In particular, each episode is encoded as a single token comprising a 2-letter unique identifier denoting the category of the episode, followed by up to two 4-digit codes classifying the episode according to the four levels of the taxonomies relevant for this type of episode (which may be 0, 1, or 2 taxonomies). The information about episodes' start and end dates is ignored and only the relative position of episodes is captured. Filters are applied to the string of tokens to remove those types of episode that should not be considered in the current search, and to limit the depth of classification that is considered in the matching process.
We refer the reader to [ 22] for more details of the timeline encoding and for a detailed comparison of the different similarity metrics considered for incorporation within the system.
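As a rough illustration of the token filtering and of two of the set-based metrics, the following Python sketch may help. It is not taken from L4All: the hyphen-separated token format, the example codes, the assumption that each digit of a 4-digit code corresponds to one taxonomy level, and the treatment of a timeline as a set of tokens are all hypothetical choices made for this example; [ 22] describes the actual encoding.

```python
def truncate(token, depth):
    """Keep the 2-letter category and the first `depth` digits of each code.

    Assumes a hypothetical token format such as "Un-1234-5600".
    """
    category, *codes = token.split("-")
    return "-".join([category] + [c[:depth] for c in codes]) if depth else category


def jaccard(a, b):
    """Jaccard similarity of two token collections, treated as sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0


def dice(a, b):
    """Dice similarity of two token collections, treated as sets."""
    a, b = set(a), set(b)
    return 2 * len(a & b) / (len(a) + len(b)) if a or b else 1.0


# Two made-up timelines; the codes are illustrative only.
mine = ["Un-1234-5600", "Wk-2130-6200"]
other = ["Un-1235-5600", "Wk-2130-6200", "Wk-2131-6200"]

for depth in (4, 3, 0):
    m = [truncate(tok, depth) for tok in mine]
    o = [truncate(tok, depth) for tok in other]
    print(depth, jaccard(m, o), dice(m, o))
```

In this made-up data the two timelines look quite different at depth 4 but identical at depth 3, which illustrates why the choice of depth changes the ranking of results.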
Once the user's definition of “people like me” has been specified, the system returns a list of all the candidate timelines, ranked by their normalized similarity. The user can then select one of these timelines to visualize in detail. This timeline is then displayed within the main interface screen as an extra strip below the user's own timeline. Episodes within the selected timeline that have been designated as “public” by its owner are visible, and the user can click on any of these to expose its details and to explore it further.
An evaluation of this search for “people like me” functionality was undertaken with a group of learners at Birkbeck College (which specializes in providing flexible learning opportunities for mature students) [ 20]. Although they could appreciate the potential of this functionality, participants reported difficulties in understanding the meaning of some of the search parameters, namely the “depth” of episode classification and the choice of search method (similarity metric). They also identified the need for a more contextualized usage of timeline similarity matching, which explicitly identifies possible future learning and professional possibilities for the user. In follow-on work, Van Labeke et al. [ 21] explored a more contextualized usage of timeline similarity matching which uses just one similarity metric (hence removing this element of choice, and potential difficulty, for the user) and which explicitly shows the episodes of the selected timeline that have no match within the user's timeline and thus represent episodes the user may be inspired to explore further for their own learning and career development. We briefly describe this “what next” facility below, in order to motivate our own approach described in Section 3.
2.1 What Next in L4All
The L4All “What Next” facility uses the Needleman-Wunsch similarity metric, because it is the only one of the metrics explored for adoption within the system that takes into consideration the positions of the tokens within the timeline encodings, and hence the only one able to generate an alignment between the tokens in the two timelines being matched (see [ 22]). Using this metric, the What Next facility considers the distance between two strings of tokens $s$ and $t$ to be the minimum cost of transforming $s$ to $t$ by a series of insert or delete operations (replacement operations are not considered). The system builds a cost matrix $C$ incrementally by constructing a cost value $C_{i,j}$ for each pair of tokens $s_i$ and $t_j$ from each string, as follows:

$$C_{i,j} = \min\bigl(C_{i-1,j} + g,\; C_{i,j-1} + g,\; C_{i-1,j-1} \text{ if } s_i = t_j\bigr)$$

where $g$ is the cost of a “gap” in one of the strings (set to 1 in practice). The final cost is the cost at the matrix entry $(m, n)$, where $m$ is the number of tokens in $s$ and $n$ the number of tokens in $t$.
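The incremental construction of the cost matrix can be sketched in Python as follows. This is a minimal illustration of an insert/delete-only distance with a uniform gap cost, not the L4All code; the function name, the boundary initialization, and the example tokens are assumptions made for the sketch.

```python
def indel_distance(s, t, gap=1):
    """Minimum cost of turning token list s into token list t using only
    insertions and deletions, each costing `gap` (no replacements)."""
    m, n = len(s), len(t)
    # cost[i][j] = distance between the first i tokens of s and the first j of t
    cost = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        cost[i][0] = i * gap          # delete the first i tokens of s
    for j in range(1, n + 1):
        cost[0][j] = j * gap          # insert the first j tokens of t
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best = min(cost[i - 1][j], cost[i][j - 1]) + gap
            if s[i - 1] == t[j - 1]:  # matching tokens cost nothing
                best = min(best, cost[i - 1][j - 1])
            cost[i][j] = best
    return cost[m][n]


# Hypothetical token strings: one deletion plus one insertion are needed.
print(indel_distance(["Ed-A", "Wk-B", "Ed-C"], ["Wk-B", "Ed-C", "Wk-D"]))
```

Because replacements are not allowed, two different single-token timelines are at distance 2 (one deletion plus one insertion) rather than 1.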
A summary of the relevant timelines found by the system is presented to the user, ordered by their similarity to the user's timeline with respect to the specified parameters and summarized by a short description. The user can now select one of these timelines and see it displayed in the main window, below their own timeline—see Fig. 2 (from [ 21]). The information arising from the token-based alignment between the two timelines is used to indicate, using different colors, the status of each episode in the selected timeline.

Blue is used for episodes that match episodes in the user's own timeline—what is termed the “common ground” between the two timelines; in Fig. 2, these are the episodes “Museum Curator” and “Foundation Degree (FD) IT.”

Orange is used for episodes in the target timeline that occur after all blue episodes; these are deemed by the system to be relevant as a potential source of inspiration for this user as they occur after the matching episodes, and thus represent episodes the user may be inspired to explore further for their future learning and career development; in Fig. 2, these are the episodes “System Network Engineer,” “System Support Engineer,” “Data Center Support Team Leader.”

Gray is used for episodes deemed to be irrelevant—these are episodes in the target timeline that occur earlier than all blue episodes and will mostly include earlier experiences that will be irrelevant for the user; in Fig. 2, this is the episode “A Level”; gray is also used for episodes that occur interspersed among blue episodes in the target timeline, but have no match with episodes in the user's own timeline; in Fig. 2, these are the episodes “Diploma in CS” and “Courier.”
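The three coloring rules above can be summarized in a short Python sketch. It is illustrative only: the function name is hypothetical, the alignment is assumed to be given as a set of matched positions, and the chronological order of the Fig. 2 episodes used in the example is an assumption (only their colors are stated in the text).

```python
def color_episodes(target, matched):
    """Assign a color to each episode of the selected timeline.

    target  -- episode names in chronological order
    matched -- set of positions in `target` aligned with the user's episodes
               (assumed non-empty in this sketch)
    """
    last_blue = max(matched)
    colors = []
    for position, _episode in enumerate(target):
        if position in matched:
            colors.append("blue")      # common ground
        elif position > last_blue:
            colors.append("orange")    # after all matches: possible next steps
        else:
            colors.append("gray")      # before or between matches
    return colors


# Episodes named in Fig. 2, in an assumed order.
target = ["A Level", "Museum Curator", "Diploma in CS", "Courier",
          "Foundation Degree (FD) IT", "System Network Engineer",
          "System Support Engineer", "Data Center Support Team Leader"]
print(list(zip(target, color_episodes(target, {1, 4}))))
```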

Van Labeke et al. [ 21] report on the results of two evaluation sessions that were held with mature learners at Birkbeck College and at the College of North East London assessing the L4All system. Overall, there was satisfaction with the main functionalities provided by the system. Participants could see its potential value in helping learners reflect on their current learning, gaining self-confidence, and identifying possibilities for their future work and learning episodes. However, the evaluation sessions used as their sample timelines those of two current students on a Foundation Degree in IT and those of five recent alumni from the same course. Evaluation of the What Next functionality involved each participant logging in as one of the current students and searching over the other timelines, which by definition provided useful matches with the user's timeline (since the FD course episode which was the last episode in the user's timeline matched the same FD course episode within the alumni's timelines). Van Labeke et al. [ 21] highlight that further work is needed from a technological perspective before L4All can be offered as an institutional service. In particular, they identify three issues that need to be explored more deeply.

The top-ranked timelines in the list returned by What Next will be timelines that are most similar to the user's own timeline. These timelines may in practice offer few suggestions of episodes for the user's future development. The study in [ 21] avoided this problem by using a small set of timelines from current students and a larger set of timelines of alumni from the same course.

What level of detail should be used in episode classification for the purpose of applying the similarity metric: selection of different classification levels by the user will give rise to different similarity values, and therefore different possible timeline alignments.

Using the distance matrix computed by the Needleman-Wunsch algorithm to generate the episode alignments may generate several possible alignments between two timelines. Determining the “best” one in a given context is not easy, as subjective factors relating to the user's own definition of “relevance” have to be taken into account. The current implementation in L4All always makes the same default choice of alignment, whereby the “common ground” of matching episodes in the two timelines is selected to be as late as possible within the user's timeline.

The above problems arise because the What Next facility is rather rigid: it uses the whole of the user's timeline; it offers just one similarity metric over the timeline data; it allows just a single level of detail to be applied to the classifications of the selected categories of episode for the similarity matching; and the similarity matching is applied to all episodes of these categories in the user's timeline. Thus, there is limited flexibility for users to formulate their precise requirements for the timeline search and to explore alternative formulations of selected parts of their query.

Fig. 2. What Next: the user's timeline is displayed in the top half and the selected timeline is shown below.

3. Flexible Query Processing
We now present a prototype system—called ApproxRelax— which supports more flexible user querying of timeline data.
The data model underlying the ApproxRelax prototype is a semistructured one, comprising a directed graph $G = (V_G, E_G)$ and an ontology $K = (V_K, E_K)$. $V_G$ contains nodes representing an entity instance or an entity class. $E_G$ contains edges representing relationships between members of $V_G$. Each node in $V_G$ is labeled with a distinct constant. Each edge in $E_G$ is labeled with the name of a relationship, drawn from a finite set of labels $\Sigma$, or with the label type. $V_K$ contains nodes representing an entity class or a property. Each node in $V_K$ is labeled with a distinct constant. The subset of nodes in $V_K$ that represent entity classes is contained in $V_G$. The set of labels of edges in $E_G$, except for the label type, is contained in the set of labels of property nodes in $V_K$. Each edge in $E_K$ is labeled with one of: sc, sp, dom, or range. We assume that $\Sigma \cap \{\mathit{type}, \mathit{sc}, \mathit{sp}, \mathit{dom}, \mathit{range}\} = \emptyset$. We note that this model encompasses RDF data, except that it does not allow for the representation of RDF's “blank” nodes. It also encompasses a fragment of the RDFS vocabulary: rdf:type, rdfs:subClassOf, rdfs:subPropertyOf, rdfs:domain, and rdfs:range, abbreviated here by type, sc, sp, dom, and range, respectively.
Fig. 3 illustrates a fragment of data and metadata relating to a user's timeline (Dan). The oval nodes and the nodes adjacent to them are contained in $V_G$. The edges between these nodes are contained in $E_G$. The rectangular nodes are contained in $V_K$ (class nodes). The labels of edges in $E_G$ (except type) are also contained in $V_K$ (property nodes). $E_K$ contains the edges labeled sc between class nodes, and also edges labeled dom and range linking each property node to a class node (not shown in the figure). There are several types of episode, e.g., UniversityEpisode and WorkEpisode. Associated with each type of episode are several properties, e.g., qualif[ication] and job. Episodes are ordered according to their start date, as indicated by edges labeled next (for simplicity, the episodes' start and end dates are not shown). If two episodes have the same start date, the one that ends earlier is considered to precede the other. If two episodes have identical start and end dates, an arbitrary one is chosen as being the earlier one. An edge labeled prereq from one episode to another is an annotation created by the timeline's owner indicating that they consider that undertaking an earlier episode was necessary in order for them to be able to proceed to or achieve a later episode.

Fig. 3. A fragment of Dan's timeline data and metadata.

The query language underlying the ApproxRelax prototype is that of Conjunctive Regular Path (CRP) queries [ 2]. A conjunctive regular path query, $Q$, consisting of $n$ conjuncts is of the form

$$(Z_1, \ldots, Z_m) \leftarrow (X_1, R_1, Y_1), \ldots, (X_n, R_n, Y_n)$$

where each $X_i$ and $Y_i$ is a variable or a constant, each $Z_i$ is a variable appearing in the right-hand side of $Q$, and each $R_i$ is a regular expression over the alphabet $\Sigma \cup \{\mathit{type}\}$ from which edge labels in the graph $G$ are drawn. In our context, a regular expression $R$ is defined as follows:

$$R := \epsilon \mid a \mid \_ \mid (R_1 \cdot R_2) \mid (R_1 \mid R_2) \mid R^* \mid R^+$$

where $\epsilon$ is the empty string, $a$ is any symbol in $\Sigma \cup \{\mathit{type}\}$, “_” denotes the disjunction of all constants in $\Sigma \cup \{\mathit{type}\}$, and the operators have their usual meaning.
The (exact) answer to a CRP query $Q$ on a graph $G$ can be obtained as follows: we first find, for each $1 \le i \le n$, a relation $r_i$ over the scheme $(X_i, Y_i)$ such that tuple $t \in r_i$ if there is a path from $t[X_i]$ to $t[Y_i]$ in $G$ that satisfies $R_i$, that is, whose concatenation of edge labels is in $L(R_i)$ (the language recognized by the regular expression $R_i$). We then form the natural join of relations $r_1, \ldots, r_n$ and project over $Z_1$ to $Z_m$.

Example 1. Suppose Gaby is studying on the Foundation Degree in Information Technology (FdIT) at Birkbeck College and she wishes to find out what possible future career choices there may be for her by seeing what other people with qualifications in Information Systems have gone on to do in their careers. This can be undertaken by the following CRP query, $Q_1$, over the timeline data (we stress that this is not what a user would enter; in the actual ApproxRelax system, queries such as this are automatically generated through users' interactions with the system's graphical user interface, which we describe in Section 3.1):


(?A,?B,?C) <-

(?A,type,UniversityEpisode),

(?A,qualif.type,InformationSystems),

(?A,prereq,?B),

(?B,type,WorkEpisode),

(?B,job.type,?C)

(Variables in a query have an initial question mark.) However, $Q_1$ returns no results relating to Dan's timeline in Fig. 3, even though this timeline contains information that could be relevant to Gaby (because Dan studied Information Systems at university and then undertook several IT-related work episodes). No results are returned because the above query specifies prereq as the edge connecting successive episodes, whereas users may or may not create full prereq metadata relating to their episodes. Although ?A can be instantiated to episode dan1 from Dan's timeline to satisfy the first two query conjuncts, there is no prereq edge from dan1 to satisfy the third conjunct.

In [ 9], we investigated approximate matching of CRP queries, allowing edit operations such as insertions, deletions, and substitutions of edge labels to be applied to the regular expressions $R_i$, each with some edit cost. The ApproxRelax prototype currently supports insertions and deletions, each with a cost of $\alpha$. The user can configure the system to apply either insertion or deletion edit operations, or both, and can also set the value of $\alpha$.

Example 2. Recall Example 1 where Gaby's original query returned no answers because users may not create full prereq metadata relating to their timelines. To allow for such irregularities in the timeline data, Gaby can instead submit a variant of $Q_1$ in which the conjunct (?A,prereq,?B) can be approximated:


(?A,?B,?C) <-

(?A,type,UniversityEpisode),

(?A,qualif.type,InformationSystems),

APPROX (?A,prereq,?B),

(?B,type,WorkEpisode),

(?B,job.type,?C)

Assuming that insertion and deletion edit operations are applied, the regular expression prereq can be approximated by regular expressions $\epsilon$ and next.prereq, both at edit distance $\alpha$ from prereq (among other approximations). There are still no answers from Dan's timeline returned at edit distance $\alpha$. Further approximation of the regular expression $\epsilon$ to next does allow an answer to be returned from Dan's timeline at edit distance $2\alpha$, namely (dan1, dan2, IT Operations Technicians). Approximation of the regular expression next.prereq to next.next.prereq also allows another answer to be returned at edit distance $2\alpha$, namely (dan1, dan4, IT User Support Technicians). Gaby may judge either of these results to be relevant to her and can now ask the system to return the whole of Dan's timeline for her to visualize and explore further.

In [ 16], we also investigated allowing ontology-based relaxation to be applied to the regular expressions $R_i$. This encompasses query relaxations that are entailed using information from the ontology $K$, such as replacing a class label by that of a superclass or a property label by that of a superproperty. The ApproxRelax prototype implements, for the first time, ontology-based relaxation of regular path queries.

Example 3. Suppose that Gaby decides that she is interested in jobs categorized under Software Professionals and similar categories. She also decides to broaden her search by allowing qualifications that are similar to Information Systems. She might submit the following query, $Q_2$:


(?A,?B) <-

(?A,type,UniversityEpisode),

RELAX(?A,qualif.type,InformationSystems),

APPROX(?A,next,?B),

(?B,type,WorkEpisode),

RELAX(?B,job.type,SoftwareProfessionals)

This query allows Information Systems to be relaxed to its parent concept Mathematical & Computer Sciences, matching qualifications such as Computer Science, etc. (see also the timelines in Figs. 4 and 5); and Software Professionals to be relaxed to its parent concept Information & Communication Technology Professionals, matching occupations such as IT Strategy & Planning Professionals, etc. In parallel with this query relaxation, the third query conjunct is approximated.

Query answers will be returned to Gaby in increasing overall distance from the nonapproximated, nonrelaxed version of her query, $Q_2'$. Query $Q_2'$ is the same as query $Q_2$ but without the APPROX and the two instances of RELAX. The overall distance of a query answer from $Q_2'$ is the sum of the costs of the relaxation and edit operations that were required to be applied to $Q_2'$ in order to find that answer. We will return to the evaluation of $Q_2$ in more detail in Sections 3.2-3.4.
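The relaxation of a class to its parent concept, as described in Example 3, can be sketched in Python as follows. This is a simplified illustration rather than the prototype's algorithm: the per-step relaxation cost `beta`, the dictionary-based ontology, and the node names `q1`, `q2`, `q3` are hypothetical, and only superclass relaxation of a single class label is covered.

```python
def ancestors(subclass_of, cls):
    """Return cls followed by its chain of superclasses."""
    chain = [cls]
    while chain[-1] in subclass_of:
        chain.append(subclass_of[chain[-1]])
    return chain


def relaxed_matches(subclass_of, instances, target, beta=1):
    """Nodes whose class matches `target` after 0, 1, 2, ... superclass
    relaxations, each relaxation step costing `beta` (an assumed parameter)."""
    results = {}
    for steps, relaxed in enumerate(ancestors(subclass_of, target)):
        for node, cls in instances.items():
            # A node matches if its class is the relaxed class or a subclass of it.
            if relaxed in ancestors(subclass_of, cls) and node not in results:
                results[node] = steps * beta
    return sorted(results.items(), key=lambda kv: (kv[1], kv[0]))


# Fragment of the subject hierarchy mentioned in Example 3 (sc edges).
subclass_of = {
    "InformationSystems": "MathematicalAndComputerSciences",
    "ComputerScience": "MathematicalAndComputerSciences",
}
# Hypothetical qualification nodes and their classes.
instances = {"q1": "InformationSystems", "q2": "ComputerScience", "q3": "History"}

print(relaxed_matches(subclass_of, instances, "InformationSystems"))
```

Answers found without relaxation come first, followed by those that need one relaxation step, which mirrors the ranking by increasing distance described above.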

3.1 The ApproxRelax Prototype
The ApproxRelax prototype provides users with a graphical user interface through which they can formulate their queries. We illustrate the process by considering how Gaby can formulate query $Q_2$. In order to facilitate ease of use, the more complex parts of the user interface are explained to the user by means of “tooltips” and hover-over text (indicated by boxes containing a “?”). ApproxRelax is a web application and to use it Gaby opens up her browser and proceeds to the screen shown in Fig. 6. This screen allows the user to start formulating a query by creating a query template for matching educational episodes or occupational episodes.

Fig. 4. A fragment of Liz's timeline data and metadata.

Fig. 5. A fragment of Al's timeline data and metadata.

Fig. 6. ApproxRelax query set-up.

Gaby clicks the “Create an educational episode” image and is presented with the screen shown in Fig. 7. From the “Type” drop-down menu she is able to make a choice from the Educational episode types, and she selects “University Episode.” From the “Subject” drop-down menu she is able to make a choice from different subject areas (as sourced from the SBJ taxonomy mentioned in Section 2). She selects “Information Systems” and ticks the “Fetch similar or related subjects?” checkbox.

Fig. 7. Constructing an educational episode query.

As she has not yet finished constructing her query, she clicks the “Next” button. At this point, the system generates internally these query conjuncts:
(?A,type,UniversityEpisode)

RELAX(?A,qualif.type,InformationSystems)
The first of these is generated from Gaby's selection of “University Episode” in the “Type” drop-down. The fact that Gaby designated this episode as an educational episode and selected “Information Systems” from the “Subject” drop-down gives rise to the second conjunct. Because she ticked the “Fetch similar or related subjects?” checkbox, this conjunct additionally has the RELAX keyword applied to it by the system.
Having clicked “Next,” Gaby is presented again with the screen in Fig. 6. Gaby now clicks the “Create an occupational episode” image and is presented with the screen shown in Fig. 8. As this is not the first episode of the query, there is a “Link from previous episode” drop-down, which allows the user to specify the way in which the previously specified episode is related to the one currently being specified. The possible choices here (in the current prototype) are next, next+, prereq, and prereq+, which are displayed in the drop-down using more user-friendly descriptions: “next episode,” “next or subsequent episode,” “direct prerequisite,” and “direct or indirect prerequisite.”

Fig. 8. Constructing an occupational episode query.

Gaby selects the “next episode” option and ticks the “Flexible matching of the link between this episode and the previous one?” checkbox. From the “Type” drop-down menu she is able to make a choice from the Occupational episode types, and she selects “Work Episode.” From the “Job” drop-down menu she is able to make a choice from different jobs (as sourced from the SOC taxonomy mentioned in Section 2). Gaby selects “Software Professionals” and ticks the “Fetch similar or related occupations?” checkbox. She has now finished constructing her query and clicks the “Done” button. At this point the system generates internally the following query conjuncts:
APPROX(?A,next,?B)                         [C3]

(?B,type,WorkEpisode)                      [C4]

RELAX(?B,job.type,SoftwareProfessionals)   [C5]
Conjunct C3 links the query episode set up previously (denoted by ?A) to this second episode (denoted by ?B). It contains the selection made by Gaby in the “Link from previous episode” drop-down. Additionally, as Gaby has ticked the “Flexible matching ...” checkbox, C3 has the APPROX keyword applied to it. Conjunct C4 is generated from Gaby's selection of “Work Episode” in the “Type” drop-down. The fact that Gaby designated this episode as an occupational episode, and selected “Software Professionals” from the “Job” drop-down, gives rise to conjunct C5. Since Gaby ticked the “Fetch similar or related occupations?” checkbox, C5 additionally has the RELAX keyword applied to it. The system provides the facility for the user to view the details of previously constructed query templates, as may be seen on the right-hand side of Fig. 8; more about this facility is detailed in the next paragraph.
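The mapping from form selections to query conjuncts described above could be sketched as follows. The function and its parameters are hypothetical and are not taken from the ApproxRelax code; the sketch only reproduces the conjuncts shown in the text for Gaby's two query templates.

```python
def episode_conjuncts(var, episode_type, prop, value, relax,
                      link=None, prev_var=None, approx=False):
    """Build the conjuncts for one query template.

    var          -- variable for this episode, e.g. "?B"
    episode_type -- selection in the "Type" drop-down
    prop, value  -- classified property ("qualif" or "job") and chosen concept
    relax        -- True if the "Fetch similar or related ..." box is ticked
    link         -- selection in "Link from previous episode" (None for the first episode)
    approx       -- True if the "Flexible matching ..." box is ticked
    """
    conjuncts = []
    if link is not None:
        linked = f"({prev_var},{link},{var})"
        conjuncts.append("APPROX" + linked if approx else linked)
    conjuncts.append(f"({var},type,{episode_type})")
    classified = f"({var},{prop}.type,{value})"
    conjuncts.append("RELAX" + classified if relax else classified)
    return conjuncts


first = episode_conjuncts("?A", "UniversityEpisode", "qualif",
                          "InformationSystems", relax=True)
second = episode_conjuncts("?B", "WorkEpisode", "job", "SoftwareProfessionals",
                           relax=True, link="next", prev_var="?A", approx=True)
print("(?A,?B) <- " + ", ".join(first + second))
```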
The next screen that Gaby is presented with is shown in Fig. 9. It allows the user to view at a glance the query templates making up their query (this is identical to the facility seen on the right-hand side of Fig. 8) and allows the user to view previously constructed query templates while in the process of creating their query. The type of each episode is immediately clear, as denoted by the image. In Fig. 9, the second image has been clicked (its number is highlighted in red) and the information pertaining to Gaby's second query template is displayed. Gaby can now click on the “cog” image (which has the relevant tooltip) to execute her query.

Fig. 9. Viewing query templates.

Gaby is now presented with the screen shown in Fig. 10, which displays query results ranked in order of increasing distance from the nonapproximated, nonrelaxed version of her query. The derivation of these query results is discussed in more detail in the sections that follow. For each result, an avatar representing the timeline's owner is displayed, as well as their name, the episode in their timeline which matches the last query template of the user's query, the distance at which this result has been retrieved, and an automatically generated summary of the timeline's owner and contents of their timeline. The latter description is displayed to give the user an overview of the matching timeline so that they can decide whether they wish to explore it in more detail.

Fig. 10. Viewing the query results.

At present, this is as far as the ApproxRelax prototype goes in terms of displaying query results. As part of future work, we intend that clicking on the timeline owner's name will take the user to a screen similar to the “What Next” visualization in Fig. 2 earlier. The selected timeline would be displayed in the bottom part of this new screen. The top part of the screen would display a visual representation of the user's query, showing one block for each of the query templates, and aligning each block above the episode that it matches in the timeline displayed below. Also missing from the current ApproxRelax prototype are abilities for querying additional classifications according to the NQF and SIC taxonomies mentioned in Section 2, and for formulating query templates for Personal and Other episode types.
3.2 Query Evaluation without Approximation or Relaxation
To see how queries such as $Q_2$ above are evaluated in the ApproxRelax prototype, we first consider the evaluation of exact CRP queries, without any approximation or relaxation, first comprising just one conjunct and then comprising multiple conjuncts.
A single-conjunct CRP query, $Q$, over a graph $G$ is of the form

$$\mathit{vars} \leftarrow (X, R, Y)$$

where $X$ and $Y$ are constants or variables, $R$ is a regular expression over $\Sigma \cup \{\mathit{type}\}$ as defined earlier, and $\mathit{vars}$ is the subset of $\{X, Y\}$ that are variables.
In order to compute the answer to a single-conjunct CRP query, we first construct a weighted NFA $M_R$ to recognize $L(R)$, using Thompson's construction (which makes use of $\epsilon$-transitions) [ 1]. Each transition of $M_R$ is labeled with a label from $\Sigma \cup \{\mathit{type}, \epsilon\}$ and has a weight (i.e., cost) which is zero. If $X$ in the query is a constant $c$, we annotate the initial state, $s_0$, of $M_R$ with $c$; otherwise we annotate $s_0$ with a wildcard symbol “$*$” that matches any constant. Likewise, depending on whether $Y$ in the query is a variable or a constant $c$, the final state, $s_f$, of $M_R$ is annotated with “$*$” or $c$.
We next form the weighted product automaton, $H$, of $M_R$ with the graph $G$, viewing each node of $G$ as both an initial and a final state. The states of $H$ are of the form $(s, n)$, with $s$ being a state of $M_R$ and $n \in V_G$. There is a transition in $H$ from state $(s, n)$ to state $(s', n')$ with label $a$ and weight $w$ if and only if there is a transition with label $a$ and weight $w$ from $s$ to $s'$ in $M_R$ and an edge labeled $a$ from $n$ to $n'$ in $G$. (Label $a$ can be $\epsilon$, in which case $n = n'$ in $H$.)
To evaluate query $Q$ above, if $X$ is a node $n$ of $G$, we perform a shortest path traversal of $H$ starting from the state $(s_0, n)$. Whenever we reach a state $(s_f, n')$ of $H$, we output $n'$ provided that $n'$ matches the annotation on $s_f$. If $X$ is a variable, we perform such a traversal of $H$ starting from the state $(s_0, n)$ for every node $n$ of $G$. All exact answers returned always have a cost of zero, of course.
Turning now to the evaluation of multiconjunct CRP queries, a query evaluation tree is constructed for such a query, consisting of inner nodes denoting join operators and leaf nodes representing individual query conjuncts. The query is evaluated by joining the answers arising from the evaluation of each of its conjuncts, each of which is computed as the query tree is traversed.

Example 4. Consider the query from Example 3 earlier, and in particular a variant of it containing no approximation or relaxation operations:


(?A,?B) <-
  (?A,type,University Episode),           [C1]
  (?A,qualif.type,Information Systems),   [C2]
  (?A,next,?B),                           [C3]
  (?B,type,Work Episode),                 [C4]
  (?B,job.type,Software Professionals)    [C5]

The answers produced for this variant are shown in Fig. 11. The five columns refer to the answers produced for each of its five conjuncts, showing in each case the instantiations of the conjunct's variables. The tuples that contribute to the final answer are displayed in bold. We see that the only answer for this variant is (liz1,liz2).

3.3 Query Evaluation with Approximation
We now consider the evaluation of queries in which conjuncts may be prefixed with the APPROX keyword. Consider such a single-conjunct CRP query, Q, over a graph G, whose conjunct is of the form

APPROX (X, R, Y)

The edit operations that are currently supported in the ApproxRelax prototype are insertions and deletions of edge labels, each with a fixed edit cost (whose value can be configured by the user). The edit distance from a path p in graph G to a path q is the minimum cost of any sequence of edit operations which transforms the sequence of edge labels of p to the sequence of edge labels of q. The edit distance of a path p to a regular expression R is the minimum edit distance from p to any path that conforms to R. Given a matching θ from the variables and constants of query Q to nodes in G, such that constants are matched to themselves, we say that the edit distance of the tuple (θ(X), θ(Y)) to Q is the minimum edit distance to R of any path from θ(X) to θ(Y) in G. Note that if some path from θ(X) to θ(Y) conforms to R, then the tuple has edit distance zero to Q; this is the exact matching case, as described in Section 3.2.

Fig. 11. Evaluation of query .

The approximate answer of a query Q on a graph G is a list of pairs, each comprising an answer tuple and its edit distance, ranked in order of nondecreasing edit distance. The approximate top-k answer of Q on G comprises the first k tuples in the approximate answer of Q on G.
In order to compute the answer to an approximated single-conjunct CRP query, we construct the approximate automaton corresponding to the weighted NFA for the conjunct's regular expression, whose construction was described in Section 3.2. The approximate automaton augments the NFA with additional transitions that capture the insertion and deletion edit operations. For insertions, each state s is augmented with self-loop transitions, one for each edge label, from s back to s. For deletions, each transition from a state s to a state t labeled with an edge label gives rise to a transition from s to t labeled with ε (the empty string). Each new transition has a cost equal to the edit cost. We form again the weighted product automaton, but this time of the approximate automaton with the graph. In the same manner as previously described, we perform one or several shortest path traversals of the product automaton (depending on whether the first argument of the conjunct is a constant) and, upon reaching a state whose automaton component is the final state, we output its graph node provided it matches the annotation on the final state.
In contrast to Section 3.2, the answers retrieved so far are now stored in a list answers, ordered by nondecreasing edit distance. A new answer is added to this list only if it is not already in the list; this avoids returning the same answer multiple times, at increasing distances from the original query. Answers which are an exact match will be returned at a distance of zero. Any approximate answer will be returned at some nonzero cost equal to its edit distance (which will be some multiple of the edit cost).
To illustrate this process, consider the single-conjunct query APPROX(liz1,next,?B) which refers to the timeline in Fig. 4, and suppose that the edit cost is 1. We see that the only answer for this conjunct at distance 0 is liz2. There is an answer of liz3 at distance 1, due to the regular expression, i.e., next, being approximated to next.next by an insertion operation. Similarly, liz4 and liz5 are answers at distances 2 and 3, respectively, as a result of further insertions of next.
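The illustration above can be reproduced with a short Python sketch of the approximate evaluation. This is an illustrative reading of the technique rather than the prototype's implementation: the function `evaluate_approx`, its parameters, and the five-episode timeline are assumptions, and ε-transitions of the original NFA are left out for brevity. Insertions and deletions are applied on the fly during a Dijkstra-style traversal of the product, instead of materializing the approximate automaton first.

```python
import heapq

def evaluate_approx(nfa, start, final, edges, source, alpha=1,
                    allow_insert=True, allow_delete=True):
    """Rank nodes reachable from `source` by edit distance to the NFA's language.

    nfa   : list of (state, label, state) transitions, all of weight zero
    edges : list of (node, label, node) graph edges
    alpha : cost of one insertion or deletion (assumed parameter name)
    """
    delta, out = {}, {}
    for s, label, t in nfa:
        delta.setdefault(s, []).append((label, t))
    for n, label, m in edges:
        out.setdefault(n, []).append((label, m))

    best = {(start, source): 0}        # cheapest known cost per product state
    heap = [(0, start, source)]
    answers, reported = [], set()
    while heap:
        d, s, n = heapq.heappop(heap)
        if d > best[(s, n)]:
            continue                   # stale heap entry
        if s == final and n not in reported:
            reported.add(n)            # keep only the cheapest occurrence
            answers.append((n, d))
        successors = []
        for label, t in delta.get(s, []):
            # Exact step: consume a matching graph edge at no cost.
            successors += [(d, t, m) for l, m in out.get(n, []) if l == label]
            if allow_delete:
                # Deletion: take the NFA transition without moving in the graph.
                successors.append((d + alpha, t, n))
        if allow_insert:
            # Insertion: follow any graph edge while staying in the same NFA state.
            successors += [(d + alpha, s, m) for _, m in out.get(n, [])]
        for nd, t, m in successors:
            if nd < best.get((t, m), float("inf")):
                best[(t, m)] = nd
                heapq.heappush(heap, (nd, t, m))
    return answers

# Hypothetical timeline shaped like the one discussed above.
timeline = [("liz1", "next", "liz2"), ("liz2", "next", "liz3"),
            ("liz3", "next", "liz4"), ("liz4", "next", "liz5")]
ranked = evaluate_approx([(0, "next", 1)], 0, 1, timeline, "liz1",
                         alpha=1, allow_delete=False)
print(ranked)   # [('liz2', 0), ('liz3', 1), ('liz4', 2), ('liz5', 3)]
```

The run switches deletions off so that the output lines up with the distances quoted in the text; with deletions enabled, this sketch would additionally report liz1 itself at distance 1, because deleting next leaves an empty path.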

Example 5. Consider now another variation of the query from Example 3, in which only the third conjunct is approximated, and again suppose that the edit cost is 1:


(?A,?B) <-
  (?A,type,University Episode),
  (?A,qualif.type,Information Systems),
  APPROX(?A,next,?B),
  (?B,type,Work Episode),
  (?B,job.type,Software Professionals)

Assuming that, for simplicity of presentation, delete operations have been switched off, the answers produced for this query are shown in Fig. 12, where the answers produced for each individual conjunct are shown in the first five columns. In the third column, the edit distance of the answers is shown as well as the value of the attribute D. The final column shows the answers to the overall query, in order of increasing distance. The tuples that contribute to the first answer are shown in bold and those that contribute to the second answer are shown in italics; tuples that contribute to both answers are shown in both bold and italics. We see that there is one more answer returned, (liz1,liz3) at distance 1, compared to the answer for the query of Example 4.

3.4 Query Evaluation with Approximation and Relaxation
The ApproxRelax prototype currently supports one form of relaxation of CRP queries, namely the replacement of a class by its immediate superclass. We term this a direct relaxation and assign a cost to it (which can be configured by the user). In order to support this kind of relaxation, we assume that the subgraph of the ontology induced by its subclass edges is acyclic. We also assume that all the class nodes from the ontology and all the type edges that are entailed by the ontology have been added to the graph (this can be done “offline” every time a type edge is inserted or deleted in the graph, or the class hierarchy is amended in the ontology).
Consider a relaxed single-conjunct CRP query,

Let be a matching from the variables and constants of to nodes in the graph , such that constants are matched to themselves. We denote by . We represent paths in and strings in the form of a set of triples comprising a source and target node and an edge label (see [ 16]). A path in r-conforms to if there is a string such that the triple form of relaxes to the triple form of . The relaxation distance from to is the minimum cost of any sequence of direct relaxations which yields the triple form of from that of . The relaxation distance from to is the minimum relaxation distance from to for any string . The relaxation distance of , denoted , is the minimum relaxation distance to from any path that r-conforms to .

Fig. 12. Evaluation of query .

The relaxed answer of a query Q on a graph G is a list of pairs, each comprising an answer tuple and its relaxation distance, ranked in order of nondecreasing relaxation distance. The relaxed top-k answer of Q on G comprises the first k tuples in the relaxed answer of Q on G.
In order to compute the answer to a relaxed single-conjunct CRP query, we construct the relaxed automaton corresponding to the weighted NFA for the conjunct's regular expression, whose construction was described in Section 3.2. For each transition of weight w from a state s to a state t, such that t is a final state annotated with a constant c, and for each immediate superclass c' of c in the ontology, we add: 1) a new final state t' annotated with c', 2) a copy of all of t's outgoing transitions, now leaving t', and 3) a new transition from s to t', carrying the same label, whose weight is w plus the relaxation cost. We repeat this until no more states and transitions can be inferred. The process terminates because of our assumption that the subgraph of the ontology induced by its subclass edges is acyclic.
We form again the weighted product automaton, but this time of the relaxed automaton with the graph. In the same manner as previously described, we obtain answers by traversing the product automaton, and store these in a list answers, ordered by nondecreasing relaxation distance.
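For the common case of a conjunct that tests an episode's class, the effect of this construction can be sketched directly in Python, without building the automaton. The sketch rests on several assumptions that go beyond the text: each class has at most one immediate superclass, the class names and episode identifiers are hypothetical, and `close_types`, `relaxed_type_answers` and `beta` are illustrative names. It first adds the entailed type edges to the data, then walks up the hierarchy from the query's class, charging one relaxation cost per step.

```python
def close_types(direct_types, superclass):
    """Add every entailed (node, class) pair by walking up the class hierarchy."""
    closed = set()
    for node, cls in direct_types:
        while cls is not None:
            closed.add((node, cls))
            cls = superclass.get(cls)   # terminates because the hierarchy is acyclic
    return closed

def relaxed_type_answers(class_const, superclass, type_edges, beta=1):
    """Answers to a relaxed conjunct testing membership of class_const,
    ranked by relaxation distance (number of superclass steps times beta)."""
    answers, seen = [], set()
    cls, cost = class_const, 0
    while cls is not None:
        for node in sorted(n for n, c in type_edges if c == cls):
            if node not in seen:        # report each node at its smallest distance
                seen.add(node)
                answers.append((node, cost))
        cls, cost = superclass.get(cls), cost + beta
    return answers

# Hypothetical class hierarchy and episode data.
superclass = {"UniversityEpisode": "EducationalEpisode",
              "CollegeEpisode": "EducationalEpisode",
              "EducationalEpisode": "Episode",
              "WorkEpisode": "Episode"}
direct = {("ep1", "UniversityEpisode"), ("ep2", "CollegeEpisode"),
          ("ep3", "WorkEpisode")}
closed = close_types(direct, superclass)
print(relaxed_type_answers("UniversityEpisode", superclass, closed, beta=2))
# [('ep1', 0), ('ep2', 2), ('ep3', 4)]
```

Precomputing the closure mirrors the assumption in the text that entailed edges are added to the graph offline, so that a relaxed class matches instances of all of its subclasses.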
The evaluation of general multiconjunct CRP queries by the ApproxRelax prototype is undertaken incrementally, outputting k results at a time (the value of k is configurable), and it proceeds as follows:

1. A query tree is constructed, whose leaf nodes are query conjuncts and inner nodes are join operators. This is achieved by first computing the (acyclic) hypergraph of the conjuncts [ 19]. The query tree supports an iterator-style interface comprising an initialization function and an answer-retrieval function. The evaluation of the query commences by invoking the query tree's initialization function, which initializes the internal structures needed for each inner node and each leaf node.

2. Exact automata, as in Section 3.2, are constructed for conjuncts that are not approximated or relaxed. The first batch of exact answers for these is computed in the first iteration.

3. Approximate and relaxed automata are constructed for the approximated and relaxed conjuncts, respectively. Incremental construction of the product automaton is undertaken for each conjunct, whereby its nodes and edges are computed only as far as the maximum distance required in order to retrieve the top results for the current iteration. Answers are computed and stored in nondecreasing distance order.

4. Evaluation of the query commences by the invocation of the answer-retrieval function on the root node. The call cascades further down the tree until a conjunct node is reached. On a conjunct node, it computes the answers for the conjunct in order of nondecreasing distance. The query tree is then traversed in a bottom-up fashion, during which the answers from each conjunct undergo a natural join operation with the answers of their sibling conjunct node upon invocation of their parent node's answer-retrieval function. For each pair of tuples that are joined, their individual distance values are added in order to obtain the distance value of the resulting tuple. The results from this join operation are then pipelined upward, providing input for the next level's join operation and so on, until the root of the tree is reached.
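The join step in the procedure above can be sketched as follows. This is a deliberately simplified, eager nested-loop join in Python, not the incremental, pipelined ranked join that the prototype performs; the function name `ranked_join`, the representation of an answer as a pair of variable bindings and distance, and the sample conjunct answers are all assumptions for illustration.

```python
def ranked_join(left, right):
    """Natural join of two answer lists.

    Each answer is (bindings, distance), where bindings maps variable
    names to graph nodes. Joined tuples carry the sum of both distances.
    """
    joined = []
    for lb, ld in left:
        for rb, rd in right:
            shared = lb.keys() & rb.keys()
            if all(lb[v] == rb[v] for v in shared):   # agree on shared variables
                joined.append(({**lb, **rb}, ld + rd))
    joined.sort(key=lambda answer: answer[1])          # nondecreasing total distance
    return joined

# Hypothetical conjunct answers, loosely modeled on Example 5.
c1 = [({"A": "liz1"}, 0)]                               # exact conjunct on ?A
c3 = [({"A": "liz1", "B": "liz2"}, 0),                  # approximated next conjunct
      ({"A": "liz1", "B": "liz3"}, 1),
      ({"A": "liz1", "B": "liz4"}, 2)]
c4 = [({"B": "liz2"}, 0), ({"B": "liz3"}, 0)]           # exact conjunct on ?B

result = ranked_join(ranked_join(c1, c3), c4)
for bindings, distance in result:
    print(bindings["A"], bindings["B"], distance)
# liz1 liz2 0
# liz1 liz3 1
```

An incremental version would instead pull a bounded number of answers from each child per iteration and emit a joined tuple only once no cheaper combination can still appear.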

To illustrate, we return to the evaluation of the original query from Example 3, which contains both approximation and relaxation operations. We assume that has been set to 1 and to 2. The answers produced for are shown in Fig. 13, where the answers produced for each individual conjunct in are shown in the first five columns. The final column shows the overall query answers, in order of nondecreasing total distance. The conjunct answer tuples that contribute to the final answer tuples are italicized and are subscripted with the ordering of the final answer tuple (i.e., subscript denotes the th final answer tuple). In this figure, episodes , , and are short for , , and , respectively. We see that returns seven more answers than : (al1,al2) at distance 2, (liz4,liz5) at distance 8, (dan1,dan2) at distance 8, etc.

Fig. 13. Evaluation of query .

3.5 System Architecture of ApproxRelax
Fig. 14 illustrates the architecture of the ApproxRelax prototype. The prototype was developed using the Microsoft .NET 3.5 framework, and follows the MVC design architecture.

Fig. 14. The ApproxRelax system architecture.

When Gaby created her episodes as shown in Figs. 7 and 8, the Episode creator module in the web user interface layer managed these requests. The SBJ and SOC data (mentioned in Section 2) are stored in the data store layer. These data are used to populate the “Subject” and “Occupation” drop-downs. Access to the data store is facilitated by the Jena bridge module in the system layer. Episode creator also manages other aspects of the episode creation process, such as marking an episode for relaxation, or a link from a previous episode for approximation. This module invokes the Conjunct builder, which creates the query conjuncts from the query templates. When Gaby ran her query as shown in Fig. 9, the Query submitter module was invoked. It made a call to the Episode creator module in order to obtain all of the query conjuncts. The Query submitter then invoked the Query manager module, passing the list of conjuncts, as well as parameters such as the values of the edit and relaxation costs, which edit operations to apply, and how many results to return. Once the Query manager module has obtained the results, it invokes the Result manager module, which manages the display of the results in ranked order.
The system layer constitutes the bulk of the processing functionality. Query evaluation commences once Query submitter invokes Query manager with the conjuncts and aforementioned parameters. Query manager invokes the Query Tree builder module with the conjuncts, which constructs the query tree and passes it back to Query manager. This next passes the query tree to the Query Tree initialiser module, which initializes the various structures needed for query evaluation. Whenever Query Tree initialiser encounters a conjunct within the query tree, the Conjunct initialiser module is invoked on the conjunct. This invokes the NFA builder module to construct the automaton corresponding to the conjunct's regular expression. If the conjunct is approximated or relaxed, NFA manager will be invoked to transform this automaton into its approximate or relaxed counterpart, respectively.
The initialized query evaluation tree is then passed back to Query manager, whereupon Query Tree evaluator is invoked. This traverses the tree, starting from the leftmost leaf node, and proceeding upward. If the node is a leaf, the ranked answers for the query conjunct are computed by Conjunct evaluator. This module forms the weighted product automaton of the conjunct's automaton with the graph whose nodes are sourced from the timeline data stored in the data store. The product automaton is constructed incrementally, and only includes nodes and edges relevant to the maximum distance required in order to retrieve the next batch of results. Conjunct evaluator traverses the product automaton to obtain the ranked answers. For nodes representing joins, Query Tree evaluator works in conjunction with Join manager to perform a ranked join of the answers obtained thus far. Once the root of the query tree has been reached, the processing terminates and the list answers now holds the next batch of results, ranked by increasing distance. Query manager passes this list to the Result manager.
4. Comparison with What Next
A fundamental difference between L4All's What Next and the ApproxRelax prototype is that with ApproxRelax the user can pose search queries that are different from their own timelines, e.g., where some episodes in their timeline are not included in the search query, or if included they are not approximated, or where there are episodes in the search query not related to their own timeline. Relaxation in ApproxRelax is also more flexible than in L4All: in L4All the information about each episode is encoded as a single token and the same depth of classification is applied to all types of episodes for similarity matching, whereas in ApproxRelax each query template results in several query conjuncts each of which can be individually approximated or relaxed. As a consequence of this finer level of representation, each classification can be relaxed independently (see Example 3) and answers resulting from fewer relaxations will automatically be ranked higher by the system.
Considering the three issues identified at the end of Section 2.1, the problem of the top-ranked timelines in L4All being very similar to the user's own timeline is avoided in ApproxRelax by not requiring that all of a user's timeline is matched against the timeline data. For example, consider the results from What Next shown in Fig. 15 (from [ 21]). The table shows for each of three users (al1, al3, al4) the similarity of five other timelines, in the order that these are ranked by L4All. Also shown in the fourth column is the number of suggestions of possible future episodes offered to the user (i.e., the episodes colored orange in the selected timeline). One can see here the problem of the top-ranked timelines being very similar to the user's timeline and offering few or no suggestions to the user. The fifth column shows how many of the suggestions offered are actually relevant for this user, as rated by a lifelong learning (LL) practitioner with domain expertise in careers in IT. This person, Pract-1, had worked with diverse groups of mature students and with other LL stakeholders in deriving the requirements of the original L4All system and in undertaking successive user evaluations of the system. Pract-1 was, therefore, intimately familiar with the target user groups for L4All and with their needs from a system that aims to support lifelong learning and career choices.

Fig. 15. Results from What Next in L4All.

In contrast, in collaboration with two LL practitioners, we used ApproxRelax on the same set of timelines, submitting three queries relevant to each of al1, al3, al4 (i.e., nine queries in all). The two LL practitioners were Pract-1, already mentioned, and Pract-2, who is director of the Foundation Degree in IT at Birkbeck, which targets mature students wishing to enter the IT profession or advance their careers in IT. Collectively therefore, they had much expertise and experience in the lifelong learning landscape and in the needs of lifelong learners and were able to provide authoritative feedback in the evaluation of ApproxRelax. For all nine queries, the “Link from previous episode” was “next episode” and the boxes “Fetch similar ” and “Flexible matching of the link ” were ticked. Only the “insert” edit operation was selected to be applied. For al1, the three queries comprised one query template representing their last educational episode and a second one representing either 1) an Information Systems university episode, or 2) an ICT Managers work episode, or 3) a Software Professionals work episode. For al3 and al4, the three queries comprised two query templates representing their last educational episode and last IT-related work episode, and a third one as in points 1-3 above. The same LL practitioner who rated the L4All query results above also rated the relevance of the ApproxRelax query results. For each user, if we add up the number of relevant results being returned ranked 1-5 across their three queries, we obtain the summary shown in Fig. 16. We also show from which timelines these relevant results are returned.

Fig. 16. Relevant results from ApproxRelax.

We see that more relevant results are being returned for all users by ApproxRelax compared with What Next, e.g., user al1 now has relevant results from al2, al3, al5; user al3 from al5; and user al4 from al1, al2, al5. The top-5 relevant results returned by ApproxRelax include all of the relevant suggestions from What Next, except for one relevant suggestion for users al1 and al3 from the timeline of al4 which is returned by ApproxRelax, but not in the top-5 results.
The two LL practitioners reported that they found it “much more useful” to be able to explicitly set up a search query in ApproxRelax rather than using the built-in similarity matching of L4All based on the user's whole timeline. They also found it helpful that ApproxRelax allows users to specify what kind of episode they are looking for, for inspiration, i.e., the episodes 1-3 above. Another advantage of ApproxRelax is a clearer causality between the users' requirements, as articulated in their search query, and the results returned by the system.
Returning to the other issues identified at the end of Section 2.1, the problem of the user having to decide on the level of classification for episode comparisons is avoided in ApproxRelax because relaxation of episode types is performed automatically by the system, with timelines containing episodes matching the user's query in more detail being ranked higher than those matching at higher levels of classification.
The problem of finding the “best” alignment will remain difficult. However, because each query template is represented by several query conjuncts in ApproxRelax, rather than as a single token in L4All, similarity is more finely measured.
5. Related Work
Research into life-course choices has highlighted two issues that contribute to lack of participation in HE: a lack of information about educational opportunities, and a perception that HE is “not for me” [ 23]. Social factors influence educational choices and career decisions (location, family, friends), and word-of-mouth is important in recommending educational choices [ 4]. Learners who receive more personalized and better targeted information may make a more successful entry to FE and HE [ 17]. The L4All system aims to allow potential learning and career pathways to be identified, exposing possibilities that learners may otherwise not have considered [ 4]. Here, we have shown how supporting query approximation and relaxation can provide greater flexibility in users' querying of timeline data than L4All's similarity metrics-based approach, and can provide more relevant answers.
In approximate querying, Kanza and Sagiv [ 10] considered querying semistructured data using flexible matchings in which the matched paths contain the labels in the query. Grahne and Thomo [ 6] used weighted regular transducers to transform regular path queries to match semistructured data approximately. Mandreoli et al. [ 15] allow edges in a query to match paths in a graph that have been semantically related (e.g., using RDFS). Approximate graph matching has also been extensively studied recently, e.g., [ 18].
There has been work on relaxing XML tree pattern queries, recently in [ 14]. Relaxation of conjunctive queries on RDF is considered in [ 5] and [ 8]. Huang et al. [ 7] develop a similarity measure for relaxed queries to improve the relevance of answers. Similarity-based querying was also the focus of iSPARQL [ 11].
In contrast to the above, the work in [ 16] combines within one framework both query approximation and query relaxation, and applies it to the more general query language of conjunctive regular path queries. The ApproxRelax prototype we have described here builds on the theoretical foundations of [ 16] by implementing for the first time ontology-based relaxation and query approximation for CRP queries. Moreover, ApproxRelax is the first system to provide a visual user interface that allows users to incrementally construct CRP queries and express their approximation and relaxation preferences (this was an area of future work identified in [ 16]).
6. Conclusions
Facilitating the collaborative formulation of learning goals and career aspirations has the potential to enhance learners' engagement with the lifelong learning process. We have described a prototype system called ApproxRelax which provides users with a graphical facility for incrementally constructing search queries over learners' timelines. We have described how the system can be used to construct conjunctive regular path queries over timeline data and metadata, and to allow approximation or relaxation to be applied to selected parts of the user's query. We have discussed how such queries are evaluated, showing how the system returns results ranked in order of their distance from the original query, and is able to provide more relevant answers than L4All's similarity metrics-based approach.
The work that we have presented here is novel both in its aim of supporting lifelong learners in reflecting on their learning and career choices, and also in its technical approach which implements for the first time query approximation and query relaxation techniques for CRP queries, also providing a visual interface for users to incrementally construct their queries and express their approximation and relaxation preferences. Future work includes evaluation of the usability of ApproxRelax with FE/HE learners, and its extension to encompass the missing Qualifications and Industrial classifications, and query templates for Personal and Other episode types. As part of the usability evaluation, we will investigate whether there is a need for the system to provide an explanation of how the overall distance of the query results has been calculated, and how this feedback should be presented to users. Empirical evaluation of our query processing algorithms with realistic volumes of timeline data is still needed, followed by development of appropriate query optimization techniques. A further phase of piloting of the system will then be carried out, prior to exploring the provision of a live service within our institution and more broadly.

A. Poulovassilis and P. Selmer are with the London Knowledge Lab, Birkbeck, University of London, 23-29 Emerald Street, WC1N 3QS London, United Kingdom. E-mail: {ap, lselm01}@dcs.bbk.ac.uk.

P.T. Wood is with the Department of Computer Science and Information Systems, Birkbeck, University of London, Malet Street, WC1E 7HX London, United Kingdom. E-mail: ptw@dcs.bbk.ac.uk.

Manuscript received 1 Apr. 2011; revised 28 Nov. 2011; accepted 6 Dec. 2011; published online 12 Dec. 2011.

For information on obtaining reprints of this article, please send e-mail to: lt@computer.org, and reference IEEECS Log Number TLTSI-2011-04-0041.

Digital Object Identifier no. 10.1109/TLT.2011.38.

2. See www.dcs.shef.ac.uk/~sam/stringmetrics.html.


Alexandra Poulovassilis received the MA degree in mathematics from Cambridge University and the MSc and PhD degrees in computer science from Birkbeck. Her research interests center on information management, integration, and personalization. Since 2003, she has been codirector of the London Knowledge Lab, a multidisciplinary research institution which aims to explore the future of knowledge and learning with digital technologies.

Petra Selmer received the BSc degree in computer science from the Rand Afrikaans University (South Africa) and the MSc degree in advanced information systems from Birkbeck. She has been working toward the part-time PhD degree (computer science) since 2008, researching the area of flexible querying of semistructured data. She has held the position of software architect at the Intensive Care National Audit and Research Center since 2004.

Peter T. Wood received the PhD degree in computer science from the University of Toronto in 1989, having previously received BSc and MSc degrees in computer science from the University of Cape Town. His research interests include query languages for various data models, query optimization, active and deductive rule languages, and graph algorithms.