Advanced Data Analytics
Guest Editors’ Introduction • Giri Kumar Tayi and P. Radha Krishna • October 2017
Translations by Osvaldo Perez and Tiejun Huang
Organizations and companies have been using basic data analytics for years to uncover simple insights and trends. The appetite for more data and better analytics has grown over the years, and now most modern organizations track and record nearly all types of data: transactional, clickstream, social media, audio, video, sensor, text, image, and so on. This ever-increasing volume of data, along with the diversity of data sources, makes the process of extracting useful information and insights an ever more challenging and complex endeavor.
To meet this challenge, organizations and companies have turned to advanced data analytics as an overarching approach to finding the value hidden in the mountains of data that they are rapidly accumulating. Gartner defines advanced data analytics as autonomous or semi-autonomous data and content examination using sophisticated quantitative and qualitative techniques and tools with the goal of discovering deep insights and subtle patterns and of making predictions and recommendations. These techniques tend to be interdisciplinary and span fields such as
- data mining,
- machine learning,
- pattern matching,
- visualization and simulation,
- semantic analysis,
- sentiment analysis,
- network and cluster analysis,
- multivariate statistics,
- graph analysis,
- complex event processing, and
- neural networks.
This October 2017 Computing Now theme issue presents six papers that cover the latest advances across the spectrum of data analytics tools, techniques, and applications. The two videos provide insights into data analytics as an emerging discipline, the benefits and challenges of using data analytics in industry, and what the field’s future might hold.
Most standard learning algorithms assume or expect datasets to have balanced class distributions or equal misclassification costs. However, in “Learning from Imbalanced Data,” Edwardo A. Garcia and Haibo He argue that standard learning algorithms fail to properly represent the distributive characteristics of datasets in some fields (for example, biomedicine) that exhibit unequal distribution between classes. The article surveys current research on the imbalanced learning problem and reviews state-of-the-art solutions. It also highlights opportunities and challenges for learning from imbalanced data.
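To make the imbalance problem concrete, the following minimal sketch shows why plain accuracy can mislead on a skewed dataset, and how random oversampling (one basic rebalancing technique) evens out the class counts. It is an illustration only and is not taken from the article; the 95/5 class split, the function names, and the toy data are all assumptions.

```python
import random
from collections import Counter


def majority_baseline_metrics(labels, minority=1):
    """Accuracy and minority-class recall of a classifier that
    always predicts the most frequent class."""
    counts = Counter(labels)
    majority = counts.most_common(1)[0][0]
    predictions = [majority] * len(labels)
    accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
    minority_total = counts[minority]
    minority_hits = sum(p == y == minority for p, y in zip(predictions, labels))
    recall = minority_hits / minority_total if minority_total else 0.0
    return accuracy, recall


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen examples of each smaller class
    until every class is as large as the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for cls, n in counts.items():
        pool = [s for s, y in zip(samples, labels) if y == cls]
        for _ in range(target - n):
            out_samples.append(rng.choice(pool))
            out_labels.append(cls)
    return out_samples, out_labels


# Hypothetical toy dataset: 95 majority examples (0) and 5 minority examples (1).
labels = [0] * 95 + [1] * 5
samples = list(range(100))

accuracy, recall = majority_baseline_metrics(labels)
balanced_samples, balanced_labels = random_oversample(samples, labels)
```

Here the trivial baseline scores high accuracy while never detecting a minority example, which is why work on imbalanced data usually reports per-class measures. Oversampling only duplicates existing examples, so it can encourage overfitting; it stands in here for the broader family of sampling and cost-sensitive methods.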
Extracting valuable information from petabytes of data requires new clustering algorithms that are scalable, less computationally intensive, and ready for implementation on optimized, large-scale interactive computational frameworks. In “Fuzzy Based Scalable Clustering Algorithms for Handling Big Data Using Apache Spark,” Neha Bharill, Aruna Tiwari, and Aayushi Malviya develop an algorithm for implementation on an Apache Spark cluster to address the challenges associated with big data clustering. The authors report that their work significantly reduces the runtime for clustering huge amounts of data without compromising the quality of the clustering results. Optimization techniques eliminate the need to store large membership data matrices during the execution of the proposed algorithm, resulting in shorter runtime.
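For readers unfamiliar with fuzzy clustering, the sketch below shows the textbook fuzzy c-means update in plain Python. It is not the authors' Spark algorithm. To echo the idea of avoiding a stored membership matrix, it computes each point's memberships on the fly and folds them into running sums; that design, the fuzzifier value m = 2, and the toy points are assumptions made for illustration.

```python
import math


def memberships(point, centers, m=2.0):
    """Standard fuzzy c-means membership of one point in each cluster."""
    dists = [math.dist(point, c) for c in centers]
    # A point that coincides with a center belongs fully to that center.
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        total = sum(hits)
        return [h / total for h in hits]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((d_j / d_k) ** exponent for d_k in dists) for d_j in dists]


def update_centers(points, centers, m=2.0):
    """One center update that accumulates weighted sums point by point,
    so the full (points x clusters) membership matrix is never stored."""
    dim = len(centers[0])
    numer = [[0.0] * dim for _ in centers]
    denom = [0.0] * len(centers)
    for p in points:
        for j, u in enumerate(memberships(p, centers, m)):
            w = u ** m
            denom[j] += w
            for d in range(dim):
                numer[j][d] += w * p[d]
    new_centers = []
    for j, old in enumerate(centers):
        if denom[j] == 0.0:
            new_centers.append(tuple(old))  # no weight assigned: keep old center
        else:
            new_centers.append(tuple(n / denom[j] for n in numer[j]))
    return new_centers


# Hypothetical toy data: two well-separated groups in the plane.
points = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9)]
centers = [(1.0, 1.0), (4.0, 4.0)]
for _ in range(10):
    centers = update_centers(points, centers)
```

Because each point contributes only to per-cluster running sums, the loop body maps naturally onto a distributed map-and-reduce step, which is the general shape a Spark implementation might take.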
“Massive Social Network Analysis: Mining Twitter for Social Good” analyzes Twitter’s vast quantity of unstructured data. David Ediger and his colleagues present GraphCT, a Graph Characterization Toolkit for massive graphs representing social network data. GraphCT analyzed graphs representing Twitter’s public crisis data stream and revealed interesting characteristics of Twitter users’ interactions. This allows for identifying influential sources and ranking conversations, thus enabling analysts to focus on a manageable number of conversations.
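As a rough illustration of ranking sources in a mention graph, the sketch below counts how many distinct users mention each account and sorts by that count. This simple in-degree measure is a stand-in chosen for brevity and is not presented as GraphCT's own metric; the account names and edge list are hypothetical.

```python
from collections import Counter


def rank_by_mentions(mentions, top=3):
    """Rank accounts by the number of distinct users who mention them."""
    distinct_edges = set(mentions)  # repeated mentions by the same user count once
    in_degree = Counter(target for _source, target in distinct_edges)
    # Sort by in-degree (descending), then by name for a deterministic order.
    return sorted(in_degree.items(), key=lambda kv: (-kv[1], kv[0]))[:top]


# Hypothetical (source, target) pairs: source mentions target in a post.
mentions = [
    ("alice", "health_agency"),
    ("bob", "health_agency"),
    ("carol", "health_agency"),
    ("carol", "news_desk"),
    ("dave", "news_desk"),
    ("alice", "bob"),
    ("bob", "health_agency"),  # repeated mention, counted once
]
ranking = rank_by_mentions(mentions)
```

On massive graphs, analysts typically prefer richer centrality measures that account for a node's position in the whole network, but the workflow is the same: score the nodes, sort, and review only the top of the list.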
Zeqian Shen, Kwan-Liu Ma, and Tina Eliassi-Rad present a visual analytics tool called OntoVis for analyzing large, heterogeneous social networks in “Visual Analysis of Large Heterogeneous Social Networks by Semantic and Structural Abstraction.” An auxiliary graph called an ontology graph, which describes the relationships between actors in a network and is generally much smaller than the social network, guides the analysis. Case studies illustrate several of OntoVis’s unique features and capabilities.
Healthcare policy is one of the most prominent data analytics application areas. In “Improving Healthcare with Interactive Visualization,” Bradford W. Hesse, Ben Shneiderman, and Catherine Plaisant map healthcare information into three domains (personal health, clinical health, and public health) and highlight the central role of information visualization and visual analytics in enabling patients, clinicians, and public health policymakers to make better decisions. The article outlines seven practical challenges in the three health domains and highlights the opportunities for information visualization tools, techniques, and analytics that could help mitigate these challenges and lead to improved healthcare.
Alexander Brodsky and his colleagues treat smart manufacturing as a data analytics application area in “Analysis and Optimization in Smart Manufacturing based on a Reusable Knowledge Base for Process Performance Models.” They posit that analyzing the performance of complex production lines (such as car production lines) requires several types of analysis and optimization capabilities: descriptive, diagnostic, predictive, and prescriptive analytics. Each of these capabilities is based on a variety of data that is filtered and aggregated over time and space; for example, descriptive analytics uses temporal sensor data that include line speeds, CO₂ emissions, and water consumption. The article proposes an architectural design and framework for fast development of software solutions for descriptive, diagnostic, predictive, and prescriptive analytics of dynamic production processes.
In the first video, Mukesh Mohania of IBM Research Labs, Australia, outlines the evolution from descriptive and predictive analytics to cognitive and prescriptive analytics. Traditionally, data analytics answers simple questions from structured data, such as how many customers cancelled their accounts (descriptive) and which customers are likely to cancel their accounts next month (predictive). Ideally, however, businesses would like to know why these customers are cancelling their accounts (cognitive) and what can prevent cancellations (prescriptive). The answers to these complex questions lie in analyzing unstructured and unconventional data.
In the second video, Sitarama B. Gunturi of Tata Consultancy Services describes how increasing digitization is resulting in huge volumes of unstructured data in the form of text, images, audio, and video. This has led to a paradigm shift in the way analytics is practiced: from traditional statistics to machine learning and artificial intelligence. Additionally, the availability of open source tools is contributing to analytics’ increasing popularity in industry and academia.
As the world moves rapidly into the digital age, individuals, organizations, and companies are being inundated with data. Advanced data analytics offers a plethora of opportunities for researchers, policy analysts, and business managers to innovate and develop tools, techniques, strategies, policies, and software products to extract valuable insights from data. We hope this Computing Now theme issue inspires more research in this rich field.
Giri Kumar Tayi is a professor of management science and information systems at the State University of New York at Albany. He has a PhD from Carnegie Mellon University, and his research interests span information systems, operations management, and operations research. Tayi serves on the editorial boards of several top-tier publications, including Computing Now, and has co-guest edited nine special issues for various academic journals. Contact him at firstname.lastname@example.org.
P. Radha Krishna is a principal research scientist in the Big Data and Analytics unit of Infosys Limited, India, as well as an adjunct faculty member at the National Institute of Technology, Warangal, India. He earned double PhDs from Osmania University and the International Institute of Information Technology, Hyderabad. His research interests include data science and analytics, data mining, machine learning, e-contracts, databases, and workflow systems. Contact him at email@example.com.