Data Engineering for HPC With Python

By IEEE Computer Society Team on April 28, 2023

Researchers from the Luddy School of Informatics, Computing and Engineering and the Digital Science Center at Indiana University, Bloomington, IN, and from the Department of Computer Science and Engineering at the University of Moratuwa, Sri Lanka, have highlighted the importance of data engineering for scientific discoveries as deep learning and machine learning are adopted.

A Basic Background on Data Engineering

Data engineering aims to transform raw data into the vector, matrix, and tensor formats that deep learning and machine learning applications consume. It spans data formats, transformation, extraction, storage, and movement. The researchers' paper presents a Python API that uses a table abstraction to represent and process data, backed by high-speed compute kernels written in C++.
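To make the table-to-matrix step concrete, here is a minimal sketch in plain Python. It is not the paper's API: the function name `table_to_matrix`, the column names, and the one-hot encoding of the categorical column are all illustrative assumptions.

```python
from typing import Dict, List, Sequence, Tuple


def table_to_matrix(
    rows: Sequence[Dict[str, object]],
    numeric_cols: Sequence[str],
    categorical_col: str,
) -> Tuple[List[List[float]], List[str]]:
    """Turn table rows into a dense numeric matrix (hypothetical helper).

    Numeric columns are cast to float; the categorical column is
    one-hot encoded using the sorted set of values seen in the data.
    """
    categories = sorted({str(row[categorical_col]) for row in rows})
    position = {name: i for i, name in enumerate(categories)}

    matrix: List[List[float]] = []
    for row in rows:
        features = [float(row[col]) for col in numeric_cols]
        one_hot = [0.0] * len(categories)
        one_hot[position[str(row[categorical_col])]] = 1.0
        matrix.append(features + one_hot)
    return matrix, categories


# Illustrative input: values arrive as mixed strings and numbers.
sample_rows = [
    {"temp": "21.5", "pressure": 1012, "site": "B"},
    {"temp": "19.0", "pressure": 1008, "site": "A"},
]
sample_matrix, sample_categories = table_to_matrix(
    sample_rows, ["temp", "pressure"], "site"
)
```

Each output row is a fixed-length list of floats, which is the shape a machine learning library expects before the data is handed over as a matrix or tensor.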

The paper compares the proposed solution with existing data engineering libraries from the Python and big data ecosystems. The core system uses MPI for distributed-memory computation and takes a data-parallel approach to processing large datasets on HPC clusters.
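The data-parallel pattern can be sketched without an MPI installation by simulating ranks inside one process. The example below is an assumption-laden illustration, not the paper's implementation: `split_rows`, `local_groupby_sum`, and `merge_partials` are hypothetical names, and the final merge stands in for what a real MPI program would do with a collective reduction.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Row = Tuple[str, float]


def split_rows(rows: Sequence[Row], world_size: int) -> List[List[Row]]:
    """Block-partition rows so each simulated rank owns one contiguous chunk."""
    chunk = (len(rows) + world_size - 1) // world_size
    return [list(rows[r * chunk:(r + 1) * chunk]) for r in range(world_size)]


def local_groupby_sum(rows: Sequence[Row]) -> Dict[str, float]:
    """Work done independently on one rank: sum values per key."""
    partial: Dict[str, float] = defaultdict(float)
    for key, value in rows:
        partial[key] += value
    return dict(partial)


def merge_partials(partials: Sequence[Dict[str, float]]) -> Dict[str, float]:
    """Combine per-rank results; stands in for a collective reduction."""
    total: Dict[str, float] = defaultdict(float)
    for partial in partials:
        for key, value in partial.items():
            total[key] += value
    return dict(total)


def data_parallel_groupby_sum(rows: Sequence[Row], world_size: int) -> Dict[str, float]:
    """Partition, compute locally on every simulated rank, then merge."""
    partitions = split_rows(rows, world_size)
    partials = [local_groupby_sum(part) for part in partitions]
    return merge_partials(partials)


sample = [("a", 1.0), ("b", 2.0), ("a", 3.0), ("c", 4.0), ("b", 5.0)]
result = data_parallel_groupby_sum(sample, world_size=2)
```

The result does not depend on the number of ranks, which is the property that lets the same operator scale from a laptop to a cluster.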

Challenges and Approaches When Dealing With Big Data Systems

The paper also discusses the challenges and approaches to implementing big data systems, high-performance computing (HPC) for data engineering, and Python for data engineering. Big data systems adopt the dataflow model with functional programming and run on commodity cloud environments. Attempts to improve performance include Dask, Modin, and Mars, but they do not support high-performance computing kernels. Python-based data engineering frameworks, such as Pandas and PySpark, have been developed but suffer from performance bottlenecks. The researchers propose bridging the gap between data engineering and ML/DL execution by introducing HPC for data engineering, specifically for GPU resources using CUDA.
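For readers unfamiliar with the dataflow model mentioned above, the following sketch chains lazy map and filter stages into a final reduction using only the Python standard library. The stage names and the temperature example are illustrative assumptions, and the code does not reflect the API of any framework named in the article.

```python
from functools import reduce
from typing import Callable, Iterable, Iterator, Tuple


def map_stage(fn: Callable[[float], float], upstream: Iterable[float]) -> Iterator[float]:
    """Lazily apply fn to every element flowing through the stage."""
    return (fn(item) for item in upstream)


def filter_stage(pred: Callable[[float], bool], upstream: Iterable[float]) -> Iterator[float]:
    """Lazily pass along only the elements that satisfy pred."""
    return (item for item in upstream if pred(item))


def run_pipeline(fahrenheit_readings: Iterable[float]) -> Tuple[float, int]:
    """Convert to Celsius, keep warm readings, and return (sum, count)."""
    celsius = map_stage(lambda f: (f - 32.0) * 5.0 / 9.0, fahrenheit_readings)
    warm = filter_stage(lambda c: c > 20.0, celsius)
    # Nothing is computed until the reduction pulls items through the stages.
    return reduce(lambda acc, c: (acc[0] + c, acc[1] + 1), warm, (0.0, 0))


warm_sum, warm_count = run_pipeline([32.0, 212.0, 77.0])
```

Because each stage is a pure function over a stream, a runtime is free to partition the stream and schedule the stages across workers, which is the property big data systems build on.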


How To Enable High-Performance Data Engineering

The Cylon and PyCylon libraries are the researchers' answer to high-performance data engineering. Cylon employs distributed-memory execution for data parallelism and provides a set of communication and relational algebra operators. PyCylon is a Python API written on top of Cylon's high-performance kernels. It includes a DataTable API whose endpoints hide the parallelism, so data scientists can prototype a model without dealing with complex parallel computing concepts. Cylon uses a columnar in-memory representation based on the Apache Arrow format and targets OLAP-style workloads.
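The sketch below illustrates what relational algebra operators over a columnar table look like, using plain Python dictionaries of lists as a stand-in for a columnar format. It is emphatically not the PyCylon or Apache Arrow API: `select` and `inner_join` are hypothetical helpers, and the join assumes the two tables share no column names other than the join key.

```python
from typing import Callable, Dict, List

Table = Dict[str, List[object]]  # column name -> column values


def select(table: Table, column: str, predicate: Callable[[object], bool]) -> Table:
    """Relational selection: keep the rows whose value in `column` passes."""
    keep = [i for i, value in enumerate(table[column]) if predicate(value)]
    return {name: [values[i] for i in keep] for name, values in table.items()}


def inner_join(left: Table, right: Table, on: str) -> Table:
    """Hash inner join on one key column.

    Assumes the only column name shared by both tables is `on`.
    """
    # Build phase: index the right table's row positions by key.
    index: Dict[object, List[int]] = {}
    for j, key in enumerate(right[on]):
        index.setdefault(key, []).append(j)

    right_cols = [name for name in right if name != on]
    out: Table = {name: [] for name in list(left) + right_cols}

    # Probe phase: look up each left key and emit one row per match.
    for i, key in enumerate(left[on]):
        for j in index.get(key, []):
            for name in left:
                out[name].append(left[name][i])
            for name in right_cols:
                out[name].append(right[name][j])
    return out


readings: Table = {"id": [1, 2, 3], "temp": [20.5, 18.0, 25.1]}
sites: Table = {"id": [2, 3, 3], "site": ["A", "B", "C"]}

joined = inner_join(readings, sites, on="id")
warm = select(readings, "temp", lambda t: t > 19.0)
```

Storing each column contiguously means an operator such as `select` scans only the column it needs, which is one reason columnar layouts suit analytical workloads.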

A Range of Potentially Effective Solutions

Existing data engineering solutions include Jupyter Notebooks, PyCOMPS, Numba, Pandas, PySpark, CuDF, Modin, and Dislib. The researchers propose Cylon, which pairs a Python API with high-performance compute kernels written in C++ to reduce the overhead of crossing language runtimes and to improve scalability. The authors also suggest expanding support for diverse data formats, improving memory and network utilization, and integrating with communication technologies like UCX. Future work includes developing a DataFrame API based on Modin and supporting the GPFS and Lustre file systems.

Data engineering can play a critical role in scientific discoveries through the adoption of machine learning and deep learning. The researchers propose the use of Cylon and PyCylon libraries for high-performance data engineering and suggest expanding support for diverse data formats, integrating with communication technologies, and improving memory and network utilization. The findings of this study provide valuable insights into the challenges and approaches to implementing big data systems and offer a range of potentially effective solutions for data engineering. These contributions will help advance the field of data engineering and enable more efficient and effective scientific discoveries.

Check out the full article to dig a little deeper into the implications of machine and deep learning for the advancement of science.