Boa Infrastructure and Limitations: Exploring a Dataset of Mature Python Projects

By IEEE Computer Society Team on April 17, 2023

Sumon Biswas, Md Johirul Islam, Yijia Huang, and Hridesh Rajan, all from the Department of Computer Science at Iowa State University, dove into the creation of a dataset of 1,558 mature GitHub projects written in Python for data science tasks. The dataset is made available through the Boa infrastructure, and it includes metadata and code, as well as a diverse set of machine learning libraries managed by different users and organizations.

The dataset is intended to support mining software repositories research, inform language design, and guide library enhancements. The researchers also explain various aspects of the Boa infrastructure and its domain-specific language for program analysis queries.

Collecting and Preprocessing Data

The researchers collected and preprocessed Python projects from GitHub with research on Mining Software Repositories (MSR) for data science software in mind. As mentioned above, the dataset includes 1,558 repositories, which were developed by 9,839 developers and contain projects that use at least 33 data science libraries.

The researchers go on to explain the data source, data collection and preprocessing, data generation, mapping Python AST to Boa AST, as well as data storage. In addition, they highlight the metrics of the dataset, including the top-rated projects and the number of developers and data science libraries in the dataset.
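
The idea of mapping a Python AST onto a language-neutral AST can be sketched with Python's standard `ast` module. This is only an illustration: the node kinds (`METHOD`, `CLASS`, `IMPORT`, `CALL`, `OTHER`) and the helper names `to_generic` and `count_kind` are hypothetical and do not reflect the actual Boa AST schema or the researchers' implementation.

```python
import ast

# Hypothetical, simplified node kinds; NOT the actual Boa AST schema.
KIND_BY_TYPE = {
    ast.FunctionDef: "METHOD",
    ast.ClassDef: "CLASS",
    ast.Import: "IMPORT",
    ast.ImportFrom: "IMPORT",
    ast.Call: "CALL",
}


def to_generic(node):
    """Convert a Python AST node into a nested dict of generic node kinds."""
    kind = KIND_BY_TYPE.get(type(node), "OTHER")
    children = [to_generic(child) for child in ast.iter_child_nodes(node)]
    # Only some Python nodes (classes, functions, import aliases) carry a name.
    return {"kind": kind, "name": getattr(node, "name", None), "children": children}


def count_kind(tree, kind):
    """Count how many nodes of the given generic kind appear in the tree."""
    own = 1 if tree["kind"] == kind else 0
    return own + sum(count_kind(child, kind) for child in tree["children"])


SOURCE = """
import numpy as np

class Model:
    def fit(self, x):
        return np.mean(x)
"""

generic_tree = to_generic(ast.parse(SOURCE))
print(count_kind(generic_tree, "METHOD"), count_kind(generic_tree, "CALL"))
```

A real mapping would need to cover every Python construct and preserve far more detail; the sketch only shows the general shape of a tree-to-tree conversion.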


Convenient Access and Analysis

The researchers built a dataset that contains Python code for data science projects. It can be accessed through Boa's web interface or used outside the Boa infrastructure, and it has several potential applications. For example, you can use it to analyze how APIs are used in the code.
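
As a rough illustration of what an API usage analysis might look like, the sketch below counts imports of data science libraries in a single Python source string using the standard `ast` module. It does not use Boa or its query language. The library set `DATA_SCIENCE_LIBS` is an illustrative assumption, not the list of 33 libraries from the dataset, and the function name `count_library_imports` is hypothetical.

```python
import ast
from collections import Counter

# Illustrative subset chosen for this example; not the dataset's library list.
DATA_SCIENCE_LIBS = {"numpy", "pandas", "sklearn", "tensorflow", "keras"}


def count_library_imports(source):
    """Count absolute imports of the tracked libraries in one source string."""
    counts = Counter()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level == 0 means an absolute import, not a package-relative one.
            modules = [node.module]
        else:
            continue
        for module in modules:
            top_level = module.split(".")[0]
            if top_level in DATA_SCIENCE_LIBS:
                counts[top_level] += 1
    return counts


EXAMPLE = """
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
import os
"""

usage = count_library_imports(EXAMPLE)
print(dict(usage))
```

Applied across many repositories, the same kind of count would indicate which libraries are used most often, which is the sort of question the dataset is meant to help answer.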

Limitations of the Dataset

The researchers also discuss the dataset's limitations, including the following:

  1. The collected projects may not be representative of all data science projects.
  2. The selected repositories may not be mature enough.
  3. The accuracy of the dataset can be affected by the reliability of the Python grammar used to parse the programs.
  4. The dataset contains 1,558 repositories, which may not be sufficient for comprehensive analysis.
  5. The dataset focuses on data science libraries and may not be suitable for analyzing other aspects of software development.
  6. To access and analyze the dataset, users must either use the Boa infrastructure or write MapReduce tasks from scratch.
  7. The dataset lacks contextual information, such as the purpose of each repository, the problem it solves, and the intended audience.
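
The third limitation, dependence on the grammar used for parsing, can be illustrated with a small sketch. It assumes a Python 3 interpreter and uses a Python 2 style `print` statement as one example of a grammar mismatch; this is not necessarily the case the researchers had in mind, and the helper name `try_parse` is hypothetical.

```python
import ast


def try_parse(source):
    """Return True if the running interpreter's grammar accepts the source."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


# Assumption: this runs on Python 3, where the print statement is invalid.
PY2_SNIPPET = 'print "hello"'
PY3_SNIPPET = 'print("hello")'

results = {"py2": try_parse(PY2_SNIPPET), "py3": try_parse(PY3_SNIPPET)}
print(results)
```

A file that fails to parse contributes nothing to an analysis, so any gap between the grammar and the code in the repositories can quietly skew the results.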

The dataset of 1,558 mature GitHub projects written in Python for data science tasks, created by researchers at Iowa State University, is a valuable resource, especially for mining software repositories research, language design, and library enhancements. The dataset is made available through the Boa infrastructure, which provides a convenient way to access and analyze it.

However, the researchers acknowledge the limitations of the dataset, including its representativeness, accuracy, and lack of contextual information. Despite these limitations, the dataset offers potential applications for analyzing the use of APIs in the code, and it provides a starting point for future research in data science software development.

To learn more, read the full article today.
