The Community for Technology Leaders
RSS Icon
Subscribe
Issue No.05 - Sept.-Oct. (2013 vol.10)
pp: 1
Published by the IEEE Computer Society
Christian Jungreuthmayer , University of Natural Resources and Life Sciences, Vienna and Austrian Centre of Industrial Biotechnology (ACIB), Vienna
Marie Beurton-Aimar , University of Bordeaux, Talence
Jurgen Zanghellini , University of Natural Resources and Life Sciences, Vienna and Austrian Centre of Industrial Biotechnology (ACIB), Vienna
ABSTRACT
Minimal cut sets are a valuable tool for analyzing metabolic networks and for identifying optimal gene intervention strategies by eliminating unwanted metabolic functions and keeping desired functionality. Minimal cut sets rely on the concept of elementary flux modes which are sets of indivisible metabolic pathways under steady state condition. However, the computation of minimal cut sets is non-trivial, as even medium sized metabolic networks with just 100 reactions easily have several hundred million elementary flux modes. We developed a minimal cut set tool that implements the well known Berge algorithm and utilizes a novel approach to significantly reduce the program run time by using binary bit pattern trees. By using the introduced tree approach the size of metabolic models that can be analyzed and optimized by minimal cut sets is pushed to new and considerably higher limits.
Introduction
Elementary flux mode (EFM) analysis is a well-established method to unbiasedly decompose (metabolic) networks into unique, indecomposable steady-state pathways [ 1], [ 2]. The indecomposability of EFMs implies that the deletion of any single reaction contributing to an EFM completely disables it. A cut set (CS) is defined as a set of reaction deletions that eliminates all target EFMs to be killed. A CS is called a minimal cut set (MCS) if none of its subsets is a CS [ 3]. MCSs are of special interest for metabolic engineering, as they require the least effort when biologically implemented. It has been shown that CSs are especially useful for two types of tasks:

    1. The calculation of all MCSs of an unconstrained system, which means that every determined MCS kills all EFMs of the investigated systems. Numerous applications of this method have been suggested, for example, evaluating structural robustness and fragility, identifying targets in rational drug design, and predicting phenotypes [ 4].

    2. EFMs can be utilized to split the functional units of a metabolic system into two groups: a) desired functions and b) unwanted functions [ 5]. A constrained cut set (cCS) is defined as a set of reaction deletions that eliminates all unwanted EFMs and keeps—at least some of—the desired EFMs. Such a strategy can be used to optimize microbiological production hosts to efficiently produce substances of interest, for example, ethanol [ 6]. This is achieved by assigning all EFMs that are involved in efficient ethanol production to the set of desired EFMs. All other EFMs are assigned to the set of unwanted EFMs. Consequently, the biological implementation of the computed MCSs of such a system results in an optimized microorganism. Therefore, MCSs are a valuable tool to identify and realize optimal gene intervention strategies, as has been shown by Trinh et al. [ 7].

The MCS computation of small metabolic networks is simple and computationally not demanding. However, the computation of MCSs suffers from a major disadvantage. The number of EFMs of metabolic networks grows combinatorially with the systems size [ 8]. Even medium-sized networks with approximately 100 reactions can easily have several hundred million EFMs. The huge number of EFMs makes the computation of MCSs a challenge and results in very long runtimes that do not allow the use of MCSs for medium-scale or large-scale metabolic models. Recently, several methods have been developed to compute efficiently MCSs using hitting set algorithms [ 4], [ 9], [ 3], [ 5], [ 10], [ 11] and binary linear programming [ 12]. The performance evaluation of MCS computation algorithms is nontrivial and currently no methods are known that are output-polynomial [ 10]. In this study, we introduce an improved version of the hitting set approach that was originally reported by Haus et al. [ 3]. The approach of Haus et al. significantly increased the performance of MCS programs by utilizing Berge's algorithm [ 13].
The Berge algorithm mainly processes three data sets:

    1. the set of all EFMs to be killed,

    2. the set of preminimal cut sets (preMCS), which kill all EFMs that have already been processed, and

    3. the new preMCS candidates.

The Berge algorithm executes two nested loops (see Fig. 1). The outer loop iterates through all EFMs that have to be killed. The inner loop runs over all existing preMCSs. Inside the inner loop, it is tested if the current preMCS kills the EFM that is being analyzed. If the EFM is killed, then the preMCS is valid and the next preMCS is tested. However, if the EFM is not eliminated by the tested preMCS, then the preMCS is invalid. In this case, the preMCS is used to generate new preMCS candidates and, then, it is removed from the set of preMCSs. New candidates are simply created by taking the invalid preMCS and by adding a single element for each reaction that the processed EFM carries a flux. For example, if an insufficient preMCS contains two reactions ({R1, R7}) and the EFM that caused the preMCS to fail the test has four flux-carrying reactions ({R4, R6, R9, R11}), four new candidates are created with three reactions each ({R1, R7, R4}, {R1, R7, R6}, {R1, R7, R9}, {R1, R7, R11}). After the last EFM has been processed, all remaining preMCSs are MCSs that kill all EFMs that have to be deleted. However, the Berge algorithm suffers from a bottleneck, which is caused by the need to remove preMCS candidates that are supersets of already computed preMCS [ 3]. Consequently, a superset test is required to guarantee that all CSs are MCSs. Using a linear search approach to find subsets in the set of existing preMCS results in a search time of $O$$(n^2)$ . Various other subset search approaches that scale better than the linear approach have been reported, for example, in [ 14] and [ 15] methods were reported that can identify subsets in time $O$$(n^2/\log n)$ . To reduce the runtime spent in the subset search of Berge's algorithm, we implemented a binary bit pattern tree approach [ 16] that can be run in parallel on multiple CPU cores. The concept of binary bit pattern trees is not new, but we are not aware that they have been used in the context of MCS computation. By using our tree approach, the size of metabolic models that can be analyzed and genetically optimized by MCSs is pushed to new and significantly higher limits. Besides the importance of MCSs in metabolic engineering, the computation of MCSs is also an essential problem in discrete mathematics with many applications in computer science, artificial intelligence, and game theory [ 17]. These fields may also benefit from the idea of using binary bit pattern trees to speed up the computation of MCSs. In this study, we explain the main concept of our tree method and compare the performance of the tree method with a linear search algorithm.


Fig. 1. Flowchart of the essential parts of the Berge algorithm.




2. Methods
Our MCS calculator utilizes the Berge algorithm [ 13] as demonstrated by Haus et al. [ 3]. In Berge's algorithm, the majority of the computation time is spent on the procedure that checks whether or not a new MCS candidate is a superset of an already determined preMCS [ 3] (see Table 2). If this is the case, then the new preMCS candidate is dismissed. Otherwise, the candidate is added to the list of preMCS.
A simple approach to perform this superset test is to sequentially access each preMCS of the complete set of existing preMCS and to compare it with the new preMCS candidate. As soon as a subset is found for a new candidate, the search procedure is stopped and the next candidate is tested.
However, this approach scales badly with an increasing number of preMCS. In particular, this is true if there do not exist any subsets for a candidate, in which case the complete set of existing preMCSs is examined. Note that the number of preMCSs can be huge and, in general, is much larger than the final number of computed MCSs (see Fig. 2, which shows the number of preMCSs as a function of processed EFMs for the presented benchmark system).


Fig. 2. Number of preMCS over number of processed EFMs for the system used throughout this paper.




Typically, CSs are expressed in form of bit patterns, where “1” stands for a reaction that is knocked out by the CS and “0” means that the corresponding reactions is not affected. A new candidate is a superset of an existing preMCS, if the result of a Boolean AND-operation of the candidate and the preMCS is equal to the preMCS itself (see Table 1). This implies that a candidate is not a superset if the bit pattern of a preMCS contains a “1” at a bit position, where the candidate is “0.” This fact can be used to speed up the superset check of the Berge algorithm by organizing the set of existing preMCSs in a favorable way as a binary tree and by traversing through this tree in an intelligent fashion. Thereby, a significant amount of superset tests can be avoided that would have to be performed in the case of a linear search.

Table 1. Implementation of the Superset Test Utilizing a Boolean AND-Operation


As shown in Fig. 3, our implementation of the Berge algorithm uses a binary tree to structure the preMCSs. A leaf node of the tree is always the bit pattern of a preMCS and is represented by rectangles with rounded corners. Whether a preMCS is attached left or right to its parent node is determined by the bit value of the preMCS at a certain bit position. In Fig. 3 a, preMCS is attached to the left side if the bit at the inspected bit position is set and to the right side if it is not set. For example, the bit position of the root node is 3. Consequently, all preMCSs with a “1” at position 3 are attached to the subtree at the left side of the root node, whereas all preMCSs on the right side of the root node are “0” at bit position 3. The bit position that is used at each level of the tree (e.g., in Fig. 3 at level 0 bit number 3, at level 1 bit number 4, at level 2 bit number 2, and so on) is determined during a preprocessing step before the Berge algorithm is started and has a significant influence on the runtime of the program. The bit pattern value of a nonleaf node is created by Boolean AND-operations of the bit patterns of the node's subtrees. This AND-operation has the effect that “1”s that are set in all subnodes are propagated from the bottom of the tree (leaf nodes) to the top (root node). If a candidate is checked, the superset test is started on the root node. The left subtree of a node is always investigated first. If no subset could be found in the left subtree, the right subtree is explored. However, the left side of a subtree is only entered if the tested candidate does not contain a “0,” where all subnodes and, hence, all preMCS have a “1.” Such a case is shown in Fig. 3, where the candidate “100101” is tested (blue-colored dash-dotted line). As the bit pattern of the root nodes (“000000”) does not contain any “1”s, the left subtree of the root node is entered. The bit pattern of the left child node (“001000”) indicates that in all nodes and, hence, in all preMCSs attached to it the bit at position 3 is set. As bit number 3 of the tested candidate “100101” is not set, none of the attached preMCSs can be a subset of the candidate and this subtree is not further investigated. Therefore, the right subtree of the root node is entered next. In this subtree, all leaf nodes are accessed to find a subset. As no subset is found, the candidate is accepted and can be added to the set of preMCSs. A different situation is illustrated by the candidate “011101” (red-colored dashed line). Since the bit at position 3 is set, the left subtree of the nonleaf node “001000” must also be checked. However, this check does not return a valid subset. Hence, the right subtree is investigated. In this subtree, a subset is detected (“011001”). Consequently, the check procedure is immediately terminated and the candidate is dismissed.


Fig. 3. Binary bit pattern tree of all existing preMCS. Leaf nodes containing bit patterns of preMCS are represented by rectangles with rounded corners. The bit patterns of the nonleaf nodes are built by Boolean AND-operations of the node's subtrees.




Before the Berge algorithm is started, extensive preprocessing of the set of provided EFMs is performed. The preprocessing comprises the following steps:

    1. eliminating reactions that must not be knocked out, as this would result in the deletion of (too many) wanted modes,

    2. removing duplicate modes,

    3. removing modes that are supersets of other modes, and

    4. combining duplicate reactions.

The preprocessing step 4 is illustrated in Fig. 1 as the step “compress EFMs,” which is the reverse operation of the step “expand MCS.” For details, we refer to [ 18].
Our MCS calculator is written in C and supports multithreaded execution. Multithreading was implemented by utilizing the POSIX thread library. Each started thread gets an equally sized set of preMCS candidates, which it tests against all existing preMCSs. The existing preMCS are stored in a global memory region that can be accessed by all threads. Note that only read-access to the preMCS data is required by the threads during the superset test. The tree management is done by a single thread and, hence, does not benefit from multithreaded execution. A Makefile is provided with the source code to allow compilation of the program. The program was developed and tested only on Linux and Mac OS X platforms. Other operating systems are currently not supported. The software is available from the authors on request.
To benchmark the performance of our bit pattern tree algorithm, we used a system with 114,614 EFMs. The EFMs were calculated with the open source program regEfmtool [ 19], [ 20]. The used network has been described in [ 21]. It models the central carbon metabolism of generic plant cells and consists of 78 reactions (33 reversible and 45 irreversible) and 55 internal metabolites. The network describes the following main pathways: TCA cycle, glycolysis, pentose phosphate, respiration, sucrose, and starch synthesis. All of them belong to one of these four compartments: vacuole, mitochondria, plastid, and cytosol. Transporters have been added for intercompartmental exchange. For the benchmark, we computed MCSs of the unconstrained system, which means that all EFMs of the system have to be killed. An unconstrained system has the effect that the preprocessing steps 1 to 3 mentioned earlier cannot be applied and, hence, the size of the system is only marginally reduced during preprocessing. Consequently, using an unconstrained system results in a high computational workload. The total number of MCSs of the chosen system is 2,815,375, where the minimum and maximum number of knockouts is 4 and 18, respectively (see Fig. 5). However, because of preprocessing step 4, which combines duplicate reactions, the Berge algorithm only needs to compute 93,009 MCSs. In a postprocessing step these 93,009 MCSs are expanded to the final number of 2,815,375. The concept of CS expanding is illustrated in Fig. 4, which shows two networks. Network (a) depicts the original toy system that has two EFMs and seven MCSs that remove both EFMs: $\{R1\}$ , $\{R4\}$ , $\{R5\}$ , $\{R2,R6\}$ , $\{R3,R6\}$ , $\{R2,R7\}$ , and $\{R3,R7\}$ . Network (b) is the toy system after combining duplicate reactions, which only has three MCS: $\{R1\}$ , $\{R4\}$ , and $\{R23,R67\}$ . However, in a postprocessing step the three MCSs can be expanded to obtain the full set of MCSs of the original network. Using the compressed network instead of the full network results in a strong reduction of the program's total execution time [ 18].


Fig. 4. Network (a) is the original system and network (b) the system after combining duplicate reactions (R2 and R3 R23; R4 and R5 R45; R6 and R7 R67). Both networks have two EFMs. However, network (a) has seven MCSs, whereas network (b) has only three MCSs. In a postprocessing step, the three MCSs of the compressed network can be expanded to the seven MCSs of the original network. Using the compressed network to do the MCS computation results in strong reduction of the total runtime.






Fig. 5. Number of MCSs as a function of number of knockouts for the metabolic system used in the presented benchmark.




3. Results
The results of the benchmark are listed in Table 2. The table clearly shows that the tree approach is superior to the linear search algorithm, for example, in single-thread mode the tree algorithm is approximately 30 times faster than the linear search algorithm. The performance gain is even higher if the MCSs of even bigger systems are computed (see Table 3). The benchmark also shows that most of the time is spent on the superset test (at least 84 percent), which is consistent with the observations of others [ 3]. The time lost managing the bit pattern tree accumulated to approximately 40 seconds. A part of these 40 seconds is spent in the candidate generation procedure, where invalid preMCSs are eliminated and, hence, must be removed from the tree. The other task of the tree management is to add valid candidates to the preMCS tree.

Table 2. Comparison of MCS Computation Runs with (a) Linear Search, (b) Tree Search with Random Bitorder, and (c) Tree Search with a Bit Order Derived from EFM Properties for a System with 114,614 EFMs


The results of test (b) were obtained by averaging over five individual runs each with varying order of bit position. The results of (a) and (b) were averaged of three identical runs. Note that the tree management mainly comprises two tasks: 1) removing insufficient preMCS and 2) adding new valid candidates to the preMCS tree. The tree management time listed in this table shows the sum of both tasks. However, the time spent for removing insufficient preMCS also shows up in the candidate generation time.

As shown in Table 2, the bit position used to create the tree (see Fig. 3) has a strong influence on the total runtime (compare test case (b) and test case (c)). In our benchmark, the best performance was achieved when the order of the bit position was determined by the frequency of active reactions in the set of EFMs to be killed. The more often a certain reaction was carrying a flux in the EFMs, the higher up in the tree the reaction was used to split the set of preMCSs. For example, given in Fig. 3, this means that the reaction represented by bit position 2 occurs most frequently in the EFMs to be killed. The results obtained by using this EFM statistics approach to determine the bit order are shown in test case (c)  tree—EFM. Test case (b) tree—random used a random bit order. For test case (b), five runs with varying bit orders were used and the results averaged. To obtain comparable results, the same random bit order was used for each set of runs with varying numbers of threads. Using the bit order derived from the reaction frequency resulted in a performance gain of approximately a factor 3 compared to the random bit order.
The achieved performance improvement can be explained by the number of comparison events that have to be performed (see Table 2). In the linear case $20.4\cdot 10^{12}$ subset tests were done in total, whereas only $30.9\cdot 10^{9}$ were required for the tree approach shown in test case (c). Although, in test case (a) 660 times more comparison events occurred than in test case (c), the runtime improvement was much lower than the factor 660. This is mainly caused by the additional complexity of the tree code and by unavoidable effects such as an increased number of cache mismatches, which occur more frequently if memory is accessed randomly instead of linearly.
Note that Table 2 also illustrates that the linear search scales much better with an increasing number of threads than the tree approach, as the total runtime for the linear search was reduced by a factor of 7.1 when 10 threads were used, whereas the tree implementation only gained a factor of 1.7. Analyzing and improving the multithreaded tree code implementation will be scope for future work.
Results of MCS computations for five metabolic networks are listed in Table 3. The table clearly shows that the performance gain grows with the number of preMCSs that have to be determined during the computation. The number of EFMs is not necessarily indicative of the number of MCSs. Our P. pastoris model has 15 million EFMs and roughly 1.8 million MCS, while our A. thaliana contains only 1.7 million EFMs but has 3.7 billion MCSs. Note that we terminated the linear search runs of the two large systems, as the execution time was extraordinarily high. Based on the average time spent to search for a single candidate in the set of preMCSs, an extrapolated runtime was estimated. Table 3 shows that the tree approach outperforms the linear search algorithm. Moreover, the tree approach allows to study metabolic systems that could not be analyzed with a linear search.

Table 3. Comparison of MCS Computation with Linear Search and with Tree Search for Different Metabolic Networks Using 10 Parallel Threads


The model $P.\; pastoris^a$ contains the core metabolism and pathways for amino acid, nucleotides, and riboflavin production. The network $P.\; pastoris^b$ also contains the core and amino acid metabolism, but any cofactor coupling has been removed from the model. Consequently, $P.\; pastoris^b$ is only a numerical test network and not a proper biological model. $\dagger$ extrapolated data, case terminated after 77 h at 14,900 of 1,406,983 iterations. $\ddagger$ extrapolated data, case terminated after 121 h at 98 of 1,720,557 iterations.

4. Conclusion
We presented a program that computes MCSs by utilizing a hitting set algorithm originally invented by Berge [ 3]. The bottleneck of the Berge algorithm is the search procedure that is required to guarantee that all computed MCSs are truly minimal. This is done by performing a test that verifies that a new MCS candidate is not a superset of any of the already determined preMCS [ 5]. A simple but slow approach to test a new candidate against all already existing preMCSs is to linearly iterate through the complete set of existing preMCSs and check if this set contains a subset of the candidate. To speed up the superset test, we stored the existing preMCSs in a binary tree structure. This tree structure can be used to skip a significant amount of the necessary comparison events between a candidate and the existing preMCSs. Consequently, the total runtime for the computation of the complete set of MCSs is strongly reduced, and hence, numerical analyses and gene intervention strategies of much bigger systems can be calculated by our approach. Note that the determination of the MCSs requires the calculation of the EFMs first. Even though significant progress has been made in recent years on the efficient computation of EFMs [ 16], [ 22], [ 23], in many situations the EFM computation is the limiting factor of MCS analyses. However, recently alternative approaches have been reported which are based on the duality of EFMs and MCSs [ 3], [ 24]. These approaches allow the computation of MCSs directly from the stoichiometric matrix without having to calculate the EFMs beforehand.
The Berge algorithm is well suited to compute MCSs of systems with a moderate number of columns/reactions (up to several hundred) and a high number of rows/modes (several million). However, systems with other characteristics (e.g., a high number of columns and a low number of rows) cannot be investigated easily with the Berge algorithm—even if a tree approach is used—as the number of preMCSs grows dramatically with each processed row/mode. The huge number of preMCSs results in a tremendous memory consumption and a very slow execution that make the Berge algorithm unsuitable for these type of systems.
Besides the importance of MCSs in metabolic engineering, the calculation of MCSs is also an essential problem in discrete mathematics with numerous applications in computer science, artificial intelligence, and game theory [ 17]. These fields might also benefit from the idea of using binary bit pattern trees to speed up the computation of MCSs.

Acknowledgments

This work was supported by the Federal Ministry of Economy, Family and Youth (bmwfj), the Federal Ministry of Traffic, Innovation and Technology (bmvit), the Styrian Business Promotion Agency SFG, the Standortagentur Tirol, and ZIT— Technology Agency of the City of Vienna through the COMET-Funding Program managed by the Austrian Research Promotion Agency FFG.

    C. Jungreuthmayer and J. Zanghellini are with the Department of Biotechnology, University of Natural Resources and Life Sciences, Vienna, Austria, and with the Metabolic Modeling Group, Austrian Centre of Industrial Biotechnology (ACIB), Muthgasse 11/DG, 1190 Vienna, Austria. E-mail: christian.jungreuthmayer@acib.at.

    M. Beurton-Aimar is with the Laboratoire Bordelais de Recherche en Informatique (LABRI), University of Bordeaux, 351, cours de la Liberation, Batiment A30, Bureau 310, Talence 33400, France.

Manuscript received 3 Apr. 2013; revised 9 July 2013; accepted 4 Sept. 2013; published online 18 Sept. 2013.

For information on obtaining reprints of this article, please send e-mail to: tcbb@computer.org, and reference IEEECS Log Number TCBB-2013-04-0108.

Digital Object Identifier no. 10.1109/TCBB.2013.116.

References

24 ms
(Ver 2.0)

Marketing Automation Platform Marketing Automation Tool