Sumon Biswas, Md Johirul Islam, Yijia Huang, and Hridesh Rajan, all of the Department of Computer Science at Iowa State University, created a dataset of 1,558 mature GitHub projects written in Python for data science tasks. The dataset is available through the Boa infrastructure and includes both metadata and code. Its projects use a diverse set of machine learning libraries and are managed by different users and organizations.
The dataset is intended to support mining software repositories research, inform language design, and guide library enhancements. The researchers also explain aspects of the Boa infrastructure and its domain-specific language for program analysis queries.
Collecting and Preprocessing Data
The researchers collected and preprocessed Python projects from GitHub to build a dataset suited to Mining Software Repositories (MSR) research on data science software. As mentioned above, the dataset includes 1,558 repositories, which were developed by 9,839 developers and contain projects that use at least 33 data science libraries.
The researchers go on to describe the data source, data collection and preprocessing, data generation, the mapping from Python AST to Boa AST, and data storage. They also report dataset metrics, including the top-rated projects and the number of developers and data science libraries represented.
Convenient Access and Analysis
The dataset contains Python code from data science projects and can be accessed through a web interface or outside the Boa infrastructure. It has several potential applications. For example, it can be used to analyze how APIs are used in the code.
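As a rough illustration of what an API-usage analysis might look like, the sketch below counts imports of data science libraries in a single Python source file. It uses Python's standard `ast` module rather than Boa's domain-specific language, so it is not the researchers' method. The library watch list, the `count_library_imports` function, and the sample source are all hypothetical, since the article does not enumerate the 33 libraries.

```python
import ast
from collections import Counter

# Hypothetical watch list: the article does not name the 33 libraries.
DATA_SCIENCE_LIBS = {"numpy", "pandas", "sklearn", "tensorflow", "keras", "torch"}


def count_library_imports(source: str) -> Counter:
    """Count import statements per top-level library in one Python file."""
    counts = Counter()
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            # "import a.b, c" yields one alias per imported module.
            roots = [alias.name.split(".")[0] for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # Absolute "from a.b import c"; relative imports are skipped.
            roots = [node.module.split(".")[0]]
        else:
            continue
        for root in roots:
            if root in DATA_SCIENCE_LIBS:
                counts[root] += 1
    return counts


# Made-up sample file contents, for illustration only.
sample = """
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn import metrics
import os
"""

usage = count_library_imports(sample)
print(usage.most_common())
```

Running the same function over every file in a repository and summing the counters would give a per-project view of library usage. Files that fail to parse raise `SyntaxError` and would need to be handled separately.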
Limitations of the Dataset
The researchers also discuss the limitations of the dataset, including the following:
- The collected projects may not be representative of all data science projects.
- The selected repositories may not be mature enough.
- The accuracy of the dataset can be affected by the reliability of the Python grammar used to parse the programs.
- The dataset contains 1,558 repositories, which may not be sufficient for comprehensive analysis.
- The dataset focuses on data science libraries and may not be suitable for analyzing other aspects of software development.
- To analyze the dataset, users must either use the Boa infrastructure or write MapReduce tasks from scratch.
- The dataset lacks contextual information, such as the purpose of each repository, the problem it solves, and the intended audience.
The dataset of 1,558 mature GitHub projects written in Python for data science tasks, created by researchers at Iowa State University, is a valuable resource, especially for mining software repositories research, language design, and library enhancement. It is available through the Boa infrastructure, which provides a convenient way to access and analyze it.
However, the researchers acknowledge the limitations of the dataset, including its representativeness, accuracy, and lack of contextual information. Despite these limitations, the dataset offers potential applications for analyzing the use of APIs in the code, and it provides a starting point for future research in data science software development.
To learn more, read the full article today.