


Boa Infrastructure and Limitations: Exploring a Dataset of Mature Python Projects

By IEEE Computer Society Team on April 17, 2023

Sumon Biswas, Md Johirul Islam, Yijia Huang, and Hridesh Rajan, all from the Department of Computer Science at Iowa State University, created a dataset of 1,558 mature GitHub projects written in Python for data science tasks. The dataset is available through the Boa infrastructure. It includes metadata and code, as well as a diverse set of machine learning libraries managed by different users and organizations.

The dataset is intended to support mining software repositories (MSR) research, improvements to language design, and library enhancements. The researchers also explain various aspects of the Boa infrastructure and its domain-specific language for program analysis queries.

Collecting and Preprocessing Data

The researchers describe how a dataset of Python projects from GitHub can be collected and preprocessed to suit Mining Software Repositories (MSR) research on data science software. The resulting dataset includes 1,558 repositories developed by 9,839 developers, and its projects use at least 33 data science libraries.

The researchers then cover the data source, data collection and preprocessing, data generation, the mapping from the Python abstract syntax tree (AST) to the Boa AST, and data storage. They also report metrics for the dataset, including the top-rated projects and the number of developers and data science libraries it covers.
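To make the AST mapping step more concrete, here is a minimal sketch of the general idea: walking a Python AST and converting each node into a language-neutral tree. It uses only Python's standard `ast` module. The dict shape (`kind`, `name`, `children`) and the function name `to_generic` are hypothetical and chosen for illustration; they are not the Boa AST schema or the researchers' implementation.

```python
import ast

def to_generic(node):
    """Convert a Python ast node into a plain dict (hypothetical generic shape)."""
    # Recurse over every child AST node, in field order.
    children = [to_generic(child) for child in ast.iter_child_nodes(node)]
    generic = {"kind": type(node).__name__, "children": children}
    # Keep a name where the Python node carries one (functions, classes, identifiers).
    name = getattr(node, "name", None) or getattr(node, "id", None)
    if isinstance(name, str):
        generic["name"] = name
    return generic

source = "def area(r):\n    return 3.14 * r * r\n"
tree = to_generic(ast.parse(source))
```

Here `tree` is a nested dict rooted at a `Module` node whose first child is the `FunctionDef` named `area`. A real mapping would need to decide how each Python construct corresponds to a node type in the target AST, which this sketch does not attempt.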


Convenient Access and Analysis

The dataset the researchers built contains Python code from data science projects. It can be accessed either through a web interface or outside the Boa infrastructure, and it has several potential applications. For example, you can use it to analyze how APIs are used in the code.
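As a rough illustration of what an API-usage analysis might look like outside the Boa infrastructure, the sketch below counts how many files import each top-level library. It assumes the source files are already available as strings; the function names and the `sample_files` contents are hypothetical, and this is not a Boa query.

```python
import ast
from collections import Counter

def imported_libraries(source):
    """Return the top-level module names imported by one Python source string."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # "import a.b as c" contributes the top-level package "a".
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # Absolute "from a.b import c" also contributes "a"; relative imports are skipped.
            names.add(node.module.split(".")[0])
    return names

def library_usage(files):
    """Count how many files import each library. `files` maps path -> source text."""
    counts = Counter()
    for source in files.values():
        counts.update(imported_libraries(source))
    return counts

# Hypothetical sample input standing in for files drawn from the dataset.
sample_files = {
    "train.py": "import numpy as np\nfrom sklearn.linear_model import LinearRegression\n",
    "plot.py": "import numpy\nimport matplotlib.pyplot as plt\n",
}
usage = library_usage(sample_files)
```

With this input, `usage` records `numpy` in two files and `sklearn` and `matplotlib` in one file each. Counting imports is only a coarse proxy for API usage; a finer analysis would look at which functions and classes are actually called.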

Limitations of the Dataset

The researchers also discuss the limitations of the dataset, including the following:

  1. The collected projects may not be representative of all data science projects.
  2. The selected repositories may not be mature enough.
  3. The accuracy of the dataset can be affected by the reliability of the Python grammar used to parse the programs.
  4. The dataset contains 1,558 repositories, which may not be sufficient for comprehensive analysis.
  5. The dataset focuses on data science libraries and may not be suitable for analyzing other aspects of software development.
  6. To analyze the dataset, users must either use the Boa infrastructure or write MapReduce tasks from scratch.
  7. The dataset lacks contextual information, such as the purpose of each repository, the problem it solves, and the intended audience.
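The third limitation, parser reliability, can be illustrated with a small sketch. It assumes, purely for illustration, that files are parsed with Python 3's built-in `ast` module; the article does not say which grammar the researchers used. Any file the grammar rejects drops out of the analysis, which is one way a parser can affect accuracy. The file names and contents are hypothetical.

```python
import ast

def parse_report(files):
    """Split files into those that parse and those that fail. `files` maps path -> source."""
    parsed, failed = [], []
    for path, source in files.items():
        try:
            ast.parse(source)
            parsed.append(path)
        except SyntaxError:
            # The grammar rejects this file, so it would be missing from any analysis.
            failed.append(path)
    return parsed, failed

# Hypothetical sample: one Python 3 file and one using the Python 2 print statement.
sample = {
    "modern.py": "print('hello')\n",
    "legacy.py": "print 'hello'\n",
}
parsed, failed = parse_report(sample)
```

Under a Python 3 interpreter, `legacy.py` ends up in `failed` because the Python 2 print statement is not valid Python 3 syntax. Reporting the failure count alongside results is one way to make this kind of gap visible.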

The dataset of 1,558 mature GitHub projects written in Python for data science tasks, created by researchers at Iowa State University, is a valuable resource, particularly for mining software repositories research, language design, and library enhancements. It is available through the Boa infrastructure, which provides a convenient way to access and analyze it.

However, the researchers acknowledge the dataset's limitations, including questions about its representativeness and accuracy and its lack of contextual information. Despite these limitations, the dataset can be used to analyze how APIs are used in code, and it provides a starting point for future research in data science software development.

To learn more, read the full article today.
