, Robert Bosch LLC
, Robert Bosch LLC
, Robert Bosch Start-up GmbH
Abstract—Over the last two decades, manufacturing across the globe has evolved to be more intelligent and data driven. In the age of the industrial Internet of Things, a smart production unit can be perceived as a large connected industrial system of materials, parts, machines, tools, inventory, and logistics that can relay data and communicate with each other. While, traditionally, the focus has been on machine health and predictive maintenance, the manufacturing industry has also started focusing on analyzing data from the entire production line. These applications bring a new set of analytics challenges. Unlike traditional data mining analysis, which consists of lean datasets (that is, datasets with few features), manufacturing has fat datasets. In addition, previous approaches to manufacturing analytics restricted themselves to small time periods of data. The latest advances in big data analytics allow researchers to do a deep dive into years of data. Bosch collects and utilizes all available information about its products to increase its understanding of complex linear and nonlinear relationships between parts, machines, and assembly lines. This helps in use cases such as the discovery of the root causes of internal defects. This article presents a case study and provides details about challenges and approaches in data extraction, modeling, and visualization.
Keywords—big data; analytics; industrial Internet of Things; manufacturing
Ever since the Industrial Revolution, increased efficiency in manufacturing has been a constant endeavor. What started as the dawn of mechanization and transformed into electricity-powered operations has subsequently experienced the power of digitalization and power electronics (such as programmable logic controllers and distributed supervisory control and data-acquisition systems). However, much of the data collected was used only for direct feedback control, in real time, and for forensic purposes, when archived. Recently, the manufacturing industry has embarked upon yet another transformation, sparked by connectivity and advanced analytics. This is referred to as advanced manufacturing in North America and Industry 4.0 in Europe.
Given the trend toward highly personalized goods, shortened delivery times, and manufacturers’ increased exposure to liability issues, connectivity and analytics are seen as key enablers. On one hand, this transformation facilitates flexibility and agility, particularly with respect to mass-produced discrete goods. On the other hand, it also permits the delivery of products with improved quality at the same or lower cost by leveraging the data collected by the various elements of a connected assembly line.
The data collected brings transparency about the machines’ operations, the materials utilized, the facility logistics, and even the human operators. This transparency is brought about by the application of data analytics, which refers to the use of statistics and machine learning methods to discover distinct data characteristics and patterns. Machine learning techniques are increasingly used in various manufacturing applications,1–6 such as warranty claim and internal defect reduction, predictive maintenance, test time reduction, supply chain optimization, and process flow optimization.7
In this article, we describe opportunities for the use of analytics in manufacturing and present a case study of a successful application at one of Bosch's large-scale manufacturing facilities. Building on our experience in more than two score such applications, we present our recommendations for replication in other operations and share our preferred technology stack. Finally, we provide a view of future developments in this area, including the coevolution of technologies and solutions in the broader Internet of Things (IoT) space.
The overarching goal of using analytics in manufacturing is to improve productivity by reducing costs without compromising quality. This, in turn, makes the manufacturing process efficient. Figure 1 highlights some key performance indices that advanced manufacturing can help improve. Figure 2 lists just a few of the many opportunities where analytics can be used.
Figure 1. Some key performance indices that advanced analytics can help improve in manufacturing (OEE: overall equipment efficiency).
Figure 2. Opportunities for the use of analytics in advanced manufacturing.
Broadly speaking, we see a recurrent need for analytics in five categories:
Please note that although use cases such as inventory management and logistics optimization might be considered part of advanced manufacturing, they are beyond this article's scope.
Traditional quality improvement programs include Six Sigma, the Deming cycle, total quality management (TQM), and Dorian Shainin's statistical engineering (SE).8 Although these programs use statistics to some extent, most methods developed in the 1980s and 1990s are typically applied to small amounts of data and find univariate relationships between participating factors. The simplification of data processing in large clusters using the MapReduce paradigm9 and further developments thereon led to the mainstream diffusion of big data analytics. Advances in big data analysis along with machine learning techniques have offered a wide array of new tools that can potentially be applied in manufacturing analytics. These include the ability to analyze terabytes of data in both batch and streaming modes, the ability to find complex multivariate nonlinear relationships between many variables, and machine learning algorithms that help differentiate causation from correlation.
With millions of parts being produced on manufacturing lines, and with thousands of process and quality measurements collected for each of them, improving quality and reducing cost is nontrivial. Design of experiments (DoE), which iteratively explores thousands of causes via controlled experiments, is often too time-consuming and cost prohibitive. Most often, manufacturing experts depend on their domain knowledge to detect the likely key factors affecting quality and then run DoE on these few factors. Advances in big data analytics and machine learning enable efficient detection of the key factors affecting quality and yield. This, coupled with domain knowledge, enables quick detection of root causes of failures. However, there are some unique data science challenges in manufacturing.
First is the unequal costs of false alarms and false negatives. When computing the accuracy rate, one must account for these unequal costs. Suppose a false negative is a bad part incorrectly predicted as good, and a false alarm is a good part incorrectly predicted as bad. If the part being manufactured is safety critical, a false negative could put someone's life in danger, whereas a false alarm might only mean a slight decrease in the overall yield. Hence, the cost associated with false negatives might be much higher than that associated with false alarms. Such tradeoffs need to be considered when converting business goals into technical goals and when evaluating candidate approaches.
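This tradeoff can be made concrete by scoring candidate models on expected cost rather than raw accuracy. The following sketch is illustrative only; the cost figures and confusion matrices are invented for the example, not drawn from the case study.

```python
# Sketch: comparing models by expected misclassification cost instead of
# accuracy. All numbers below are illustrative assumptions.

def expected_cost(tp, fp, fn, tn, cost_fn, cost_fp):
    """Average cost per part, given a confusion matrix and unit costs."""
    total = tp + fp + fn + tn
    return (fn * cost_fn + fp * cost_fp) / total

# A safety-critical line: a missed defect (false negative) is assumed to be
# 1,000x more costly than scrapping a good part (false alarm).
COST_FN, COST_FP = 1000.0, 1.0

# Two hypothetical models with identical accuracy (990 of 1,000 correct):
model_a = dict(tp=5, fp=5, fn=5, tn=985)   # misses 5 defects
model_b = dict(tp=9, fp=9, fn=1, tn=981)   # misses only 1 defect

cost_a = expected_cost(cost_fn=COST_FN, cost_fp=COST_FP, **model_a)
cost_b = expected_cost(cost_fn=COST_FN, cost_fp=COST_FP, **model_b)
# Despite equal accuracy, model B is far cheaper to operate.
print(cost_a, cost_b)
```

Under these assumed costs, accuracy alone would rank the two models as equivalent, whereas the expected-cost criterion correctly prefers the model with fewer false negatives.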
The second challenge involves data collection and traceability, and such issues occur often. Many assembly lines lack "end-to-end traceability": often there is no unique identifier that links all the processing steps and components of a part being produced. One workaround is to use timestamps as a substitute for identifiers. Another common scenario is an incomplete dataset. In this case, parts or instances with incomplete information are either omitted from predictions and analytics or imputed using a method chosen in consultation with the manufacturing experts.
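The timestamp workaround can be sketched as a nearest-earlier-timestamp join between two stations. The station names, data, and the two-second tolerance below are illustrative assumptions, not values from the case study.

```python
# Sketch: joining records from two stations that lack a shared part ID,
# using timestamps as a stand-in identifier.
import bisect

def match_by_timestamp(upstream, downstream, tolerance=2.0):
    """Pair each downstream record with the latest upstream record at or
    before its timestamp, if that record lies within `tolerance` seconds."""
    times = [t for t, _ in upstream]           # assumed sorted ascending
    pairs = []
    for t, meas in downstream:
        i = bisect.bisect_right(times, t) - 1  # latest upstream time <= t
        if i >= 0 and t - times[i] <= tolerance:
            pairs.append((upstream[i][1], meas))
    return pairs

press = [(100.0, "part-A"), (103.5, "part-B"), (107.0, "part-C")]
tester = [(101.2, 0.91), (104.9, 0.88), (120.0, 0.95)]  # last one too late
print(match_by_timestamp(press, tester))
# [('part-A', 0.91), ('part-B', 0.88)]
```

The tolerance bound is what keeps this heuristic honest: a downstream measurement with no plausible upstream partner is left unmatched rather than paired incorrectly.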
The large number of features is another challenge. Unlike traditional datasets in data mining, which have few features but many instances, features observed in manufacturing analytics might number in the thousands. Therefore, care must be taken to avoid machine learning algorithms that can only work with lean datasets (that is, datasets with few features).
A fourth challenge is multicollinearity. Often as a product passes through an assembly line, different measurements are taken in the different stations of the production stream. Some of these measurements can be highly correlated. However, many machine learning and data mining algorithms assume that the features are independent of each other. The issue of multicollinearity should be carefully addressed for the analytics methods proposed.
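One simple way to surface such redundancy before modeling is a pairwise correlation screen over the features. The sketch below uses invented sensor names and an illustrative 0.95 threshold; in practice the threshold and the choice of which correlated feature to drop would be set with the process experts.

```python
# Sketch: flagging highly correlated feature pairs before modeling.
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def collinear_pairs(features, threshold=0.95):
    """Return feature-name pairs whose |correlation| meets the threshold."""
    names = list(features)
    flagged = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            r = pearson(features[names[i]], features[names[j]])
            if abs(r) >= threshold:
                flagged.append((names[i], names[j]))
    return flagged

# Two press-force sensors on the same station move in near lockstep:
data = {
    "press_force_1": [10.0, 11.0, 12.0, 13.0, 14.0],
    "press_force_2": [20.1, 22.0, 24.0, 26.1, 28.0],  # ~2x sensor 1
    "oven_temp":     [180.0, 179.5, 181.0, 180.2, 179.8],
}
print(collinear_pairs(data))  # drop one feature of each flagged pair
```

For fat datasets with thousands of features this all-pairs scan would be done in a distributed fashion (for example, with Spark), but the screening logic is the same.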
The class imbalance problem is another challenge. Usually, there is a great imbalance between the good and bad parts (or scrap, that is, parts that do not pass quality control tests). The ratio might range from 9:1 to as extreme as 99 million:1. Thus, applying standard classification techniques to differentiate good parts from scrap is difficult. Several methods have been proposed to deal with class imbalance and should be used for manufacturing analytics.10
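One of the simplest such methods is random undersampling of the majority class before training. The 5:1 target ratio and record layout below are illustrative assumptions; oversampling the minority class or using class weights are common alternatives.

```python
# Sketch: random undersampling of the majority (good-part) class.
import random

def undersample(records, label_key="scrap", ratio=5, seed=42):
    """Keep all scrap parts and at most `ratio` good parts per scrap part."""
    bad = [r for r in records if r[label_key]]
    good = [r for r in records if not r[label_key]]
    rng = random.Random(seed)          # seeded for reproducible training sets
    keep = min(len(good), ratio * len(bad))
    return bad + rng.sample(good, keep)

# 10,000 good parts and 10 scrap parts -> 60 records after balancing.
data = [{"scrap": False}] * 10_000 + [{"scrap": True}] * 10
balanced = undersample(data)
print(len(balanced))  # 60
```

Note that any rebalancing is applied to the training data only; evaluation must still use the original imbalanced distribution (with the unequal costs discussed earlier) to reflect production reality.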
A sixth challenge is nonstationary data. The underlying manufacturing process might change due to various factors, such as change of suppliers or operators and calibration drift in machines. Hence, methods that are more robust to the data's nonstationary nature need to be applied.11
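A minimal illustration of the problem is a rolling-window check for a shift in a measurement's mean, for example after a supplier change. The window size, threshold, and simulated drift below are illustrative assumptions; dedicated drift detectors such as ADWIN or Page-Hinkley are more robust in practice.

```python
# Sketch: detecting a mean shift in a process measurement with a
# rolling window compared against a reference window.
import random
import statistics

def mean_shift_alarms(values, window=50, n_sigmas=3.0):
    """Flag window start indices whose mean departs from the reference
    (first) window's mean by more than n_sigmas standard errors."""
    ref = values[:window]
    ref_mean = statistics.mean(ref)
    ref_se = statistics.stdev(ref) / window ** 0.5
    alarms = []
    for i in range(window, len(values) - window + 1):
        cur_mean = statistics.mean(values[i:i + window])
        if abs(cur_mean - ref_mean) > n_sigmas * ref_se:
            alarms.append(i)
    return alarms

# Stable process whose mean drifts upward at index 100 (simulated
# calibration drift in a machine):
rng = random.Random(0)
series = [5.0 + rng.gauss(0, 0.1) for _ in range(100)]
series += [5.3 + rng.gauss(0, 0.1) for _ in range(100)]
print(mean_shift_alarms(series)[:1])  # first alarm, once the window overlaps the drift
```

A model retrained only on pre-drift data would silently degrade after index 100; a check like this tells the pipeline when retraining or recalibration is needed.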
Finally, models can be difficult to interpret. The analytics solutions that inform process or design changes need to be understood by the manufacturing and quality control engineers. Otherwise, the generated recommendations and decisions might be ignored.
Once data is collected from different devices and stored in databases, a framework for manufacturing data analytics is needed. Figure 3 shows a framework we have adopted. The data is first extracted, transformed, and loaded (ETL) from different databases into a distributed file system such as the Hadoop distributed file system (HDFS) or a NoSQL database like MongoDB. Next, machine learning and analytics tools perform predictive modeling or descriptive analytics. To deploy predictive models, these tools export the model trained on historical data into Predictive Model Markup Language (PMML), an open, encapsulated representation of statistical and data mining models and associated metadata, and store it in a scoring engine. Any new data coming from the machines is evaluated using the model stored in the scoring engine.
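The train-once, score-anywhere pattern behind this can be sketched in a few lines. For brevity, the model document below is JSON rather than actual PMML (which is XML), and the rule thresholds and field names are invented for illustration.

```python
# Sketch of the deploy-and-score pattern: a trained model is exported as a
# portable document (JSON here, as a simplified stand-in for PMML) and a
# scoring engine evaluates incoming records against it.
import json

# "Training" output: decision rules exported by the modeling environment.
model_doc = json.dumps({
    "model": "scrap_risk_rules",
    "rules": [
        {"field": "weld_temp", "op": ">", "value": 310.0, "label": "bad"},
        {"field": "torque",    "op": "<", "value": 1.2,   "label": "bad"},
    ],
    "default": "good",
})

OPS = {">": lambda a, b: a > b, "<": lambda a, b: a < b}

class ScoringEngine:
    """Loads an exported model document and scores new machine records."""
    def __init__(self, doc):
        self.model = json.loads(doc)

    def score(self, record):
        for rule in self.model["rules"]:
            if OPS[rule["op"]](record[rule["field"]], rule["value"]):
                return rule["label"]
        return self.model["default"]

engine = ScoringEngine(model_doc)
print(engine.score({"weld_temp": 320.0, "torque": 1.5}))  # bad
print(engine.score({"weld_temp": 300.0, "torque": 1.5}))  # good
```

The key property is the decoupling: the modeling environment can be rebuilt or retrained without touching the scoring engine, which only ever consumes the exported document.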
Figure 3. Environment for analyzing data coming from different machines and databases. Note that the devices providing data can include Internet of Things (IoT) devices.
The big data software stack used for manufacturing analytics can be a mixture of open source, commercial, and proprietary tools, such as the stack in Figure 4.
Figure 4. An example of a software stack for manufacturing analytics.
Our key takeaway from the completed projects is that full-stack vendors do not currently offer a complete solution. Although the technology landscape is evolving rapidly, the best current option is a modular, best-of-breed approach. The focus is on truly distributed components, and the central idea for success is to hybridize open source and commercial components.
Besides the best-of-breed architecture presented here, various commercial IoT platforms are also available. These include GE's Predix (www.predix.com), Bosch's IoT Suite (www.bosch-iot-suite.com), IBM's Bluemix (www.ibm.com/cloud-computing/bluemix), ABB's IoT Services and People platform based on Microsoft Azure (https://azure.microsoft.com), and Amazon's IoT cloud (https://aws.amazon.com/iot). Such platforms offer many features for IoT and analytics as standard services, including identity management and data security, that are not addressed in the case study presented here. On the other hand, the best-of-breed approach provides flexibility and tailored functionalities that make the implementation more efficient than a standard commercial solution. However, implementing such a solution might require the availability of a capable data science team at the implementation site. Hence, the choice boils down to several factors: nonfunctional requirements, cost, and IoT and analytics expertise.
Next we describe a use case that exemplifies these points.
Any product that is assembled or produced in a plant undergoes a series of quality tests that determine whether it must be scrapped. High scrap rates are costly because of the opportunity cost of not delivering products to customers in a timely manner, the loss of personnel time, the waste of nonreusable components, and facilities management costs. Because scrap rate reduction is one of the primary problems manufacturers need to solve, we focus our discussion on this topic. A method to reduce scrap involves identifying the root causes of the low quality of products.
Root-cause analysis starts with consolidating all available data from the production line. Industrial production units comprise several assembly lines, stations, and machines that can be viewed as an IoT sensor network. During the manufacturing process, information about the process status and the states of the machines, tools, and parts produced is continuously relayed and stored. The volume, scale, and frequency of production at the facility considered in this case study is so high that it necessitates the use of a big data tool stack, similar to the one shown in Figure 4, to stream, store, preprocess, and join the data. This data pipeline helps us build machine learning models on batch historic data as well as streaming live data. While batch data analytics helps us identify problems in manufacturing retrospectively, streaming data analytics gives plant engineers access to the most recent problems and their root causes at regular time intervals. We use Kafka (https://kafka.apache.org) and Spark streaming (http://spark.apache.org/streaming) to stream live data from different data sources; Hadoop (http://hadoop.apache.org) and HBase (https://hbase.apache.org) to efficiently store the data; and Spark (http://spark.apache.org) and MapReduce frameworks to analyze the data in a distributed fashion.
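The core of the streaming path is a windowed aggregation of quality events per station. Reduced to pure Python for illustration (a Kafka/Spark pipeline would perform the same aggregation at scale), with invented station names and an illustrative one-minute window:

```python
# Sketch: per-window, per-station scrap rates, as a streaming job
# would emit them at regular intervals.
from collections import defaultdict

def scrap_rate_by_window(events, window_s=60):
    """Aggregate (timestamp, station, passed) events into scrap rates
    keyed by (window start, station)."""
    counts = defaultdict(lambda: [0, 0])  # key -> [scrap count, total count]
    for ts, station, passed in events:
        key = (int(ts // window_s) * window_s, station)
        counts[key][0] += 0 if passed else 1
        counts[key][1] += 1
    return {k: round(scrap / total, 3)
            for k, (scrap, total) in sorted(counts.items())}

events = [
    (12.0, "welding", True), (30.0, "welding", False),
    (45.0, "welding", True), (70.0, "welding", False),
    (80.0, "welding", False), (95.0, "welding", True),
]
print(scrap_rate_by_window(events))
# {(0, 'welding'): 0.333, (60, 'welding'): 0.667}
```

A jump between consecutive windows, like the one above, is exactly the kind of signal that is surfaced to plant engineers for immediate root-cause investigation.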
Two primary reasons for using these tools are their availability as open source offerings and their large and active developer networks that continuously update these tools.
Figure 3 gives an overview of the entire analytics platform where the data feeds into a machine learning stack.
With the increase in distributed computing tools such as Spark MLLib (http://spark.apache.org/mllib) and SparkR (http://spark.apache.org/docs/latest/index.html), it has become easier to implement distributed and online machine learning models such as support vector machines, gradient boosted trees, and decision trees on large amounts of data. For root-cause analysis, we test the influence of different machine parameters and process measurements on the overall product quality. A plethora of methods, from correlation analysis to ANOVA and chi-squared hypothesis tests, help identify the influence of individual measurements on the product quality. However, to find the true underlying relationships, which are often complex and nonlinear, we train several classification and regression models that can distinguish between parts that pass quality control and those that do not. We can use the trained models to infer decision rules that are indicative of the causes of high scrap rates. We focus on rules with the highest purity, defined as Nb/N, where N is the number of products that satisfy the rule and Nb is the number of defective (bad) parts among them.
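The purity criterion can be illustrated directly. The rules and data below are invented for the example (in practice the candidate rules would come from trained tree models over thousands of features):

```python
# Sketch: ranking candidate decision rules by purity N_b / N, where N is
# the number of parts satisfying the rule and N_b the bad parts among them.

def purity(rule, parts):
    """Return (purity, N) for a rule over a list of part records."""
    matched = [p for p in parts if rule(p)]
    if not matched:
        return 0.0, 0
    n_bad = sum(1 for p in matched if p["bad"])
    return n_bad / len(matched), len(matched)

parts = [
    {"oven_temp": 305, "bad": True},
    {"oven_temp": 312, "bad": True},
    {"oven_temp": 298, "bad": False},
    {"oven_temp": 301, "bad": False},
    {"oven_temp": 310, "bad": True},
    {"oven_temp": 295, "bad": False},
]

rules = {
    "oven_temp > 302": lambda p: p["oven_temp"] > 302,
    "oven_temp > 296": lambda p: p["oven_temp"] > 296,
}
for name, rule in rules.items():
    pur, n = purity(rule, parts)
    print(f"{name}: purity={pur:.2f} over N={n} parts")
```

High-purity rules like the first one are the ones handed to process engineers as candidate root causes; a rule's support N also matters, since a pure rule matching only a handful of parts may be noise.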
Although these models can identify linear and nonlinear relationships between variables, they might not indicate a causal relationship. However, causality is essential in identifying the true root causes. We use Bayesian causal models12 to infer causality from all the data and verify and validate the inferred relationships with input from domain experts.
A visualization platform for the big data that is gathered is essential. A major challenge engineers face is that they do not have a lucid yet comprehensive overview of the complete manufacturing setup. Such an overview will help them make decisions and assess the status live, before any undesirable event occurs. Descriptive analysis, as shown in Figure 3, helps achieve this using tools such as Tableau (www.tableau.com) and Microsoft BI (https://powerbi.microsoft.com/en-us). Descriptive analysis comprises many views, including histograms, dual variable plots, and correlations.
Apart from visualizing the descriptive statistics, we also provide a clean visual interface to all the trained predictive models. For example, all the measurements that influence a particular quality parameter can be visualized, with the back-end data filterable over any time range up to the present.
Connected manufacturing is experiencing a technology revolution that, in turn, is expected to revolutionize the rest of the industry. For example, it will compel suppliers to make similar technological upgrades and share end-product upgrades. At the other end of the spectrum, users will demand increased personalization and many consumer electronics features in all products. This will close the loop between design, manufacturing, marketing, sales, and postmarket tracking/surveillance. Big data and the associated analytics will be the key technology to draw out the required knowledge and provide intelligence along the continuum of engineering processes.
In future work, we plan to use machine learning as a collaborative tool for process engineers. This will also highlight the need for engineers to pursue lifelong education and expansion of skills.
We thank the editor and the anonymous reviewer for their valuable feedback. We also thank the Bosch manufacturing plant that provided the data for its deep insights and collaboration.