BioConductor is an R package "framework" (or sub-environment) for bioinformatics. It has a large-scale peer-review system for packages, and any submitted package must use and "blend" with specific structural components. More importantly, BioConductor has a discussion forum (akin to StackOverflow) that is specific to BioConductor. The main difference between BioConductor and CRAN (the general R package repository) is that all BioConductor packages are somewhat related, must be installed using specific methods, and must leverage the BioConductor structure. In that sense, it is somewhat similar to PyTorch. However, while PyTorch is for generic deep learning, BioConductor targets only bioinformatics (the analysis of biological data).

In this project you will use advanced, state-of-the-art natural language processing (NLP) algorithms to analyse BioConductor's discussion forum in relation to its software releases; that is, you will link your topic modelling to software version releases. After modelling, you will be expected to complete manual perusals of the data, draw conclusions, and provide visualisations and a report.
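As a rough illustration of what linking forum content to version releases could involve, the sketch below assigns each post to the most recent release on or before its posting date, so that topics can later be compared per release window. The release labels and dates, and the `release_for` and `group_posts_by_release` helpers, are hypothetical placeholders rather than real BioConductor data or the method prescribed for the project.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical release calendar: placeholder labels and dates,
# NOT real BioConductor release data.
RELEASES = [
    (date(2021, 1, 1), "release-A"),
    (date(2021, 7, 1), "release-B"),
    (date(2022, 1, 1), "release-C"),
]
RELEASE_DATES = [released_on for released_on, _ in RELEASES]


def release_for(post_date):
    """Return the label of the latest release on or before post_date, or None."""
    idx = bisect_right(RELEASE_DATES, post_date)
    return RELEASES[idx - 1][1] if idx > 0 else None


def group_posts_by_release(posts):
    """Group (post_date, text) pairs by the release window they fall in."""
    groups = {}
    for post_date, text in posts:
        groups.setdefault(release_for(post_date), []).append(text)
    return groups
```

Each group of posts could then be passed to whichever topic model is chosen, giving one set of topics per release window.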
Note:
- Knowledge of bioinformatics is not required. You may need to read the discussion forums to label the resulting topics, but that's it.
- Knowledge of R programming is not required. If you know your programming theory, you can translate that knowledge to any programming language, and that is enough for this study.
- You will need to follow instructions and be systematic in your methodology. You will not be using traditional topic-modelling algorithms.
The project will be co-supervised with my colleague Dr Fatemeh Fard (from the University of British Columbia, Canada), whose work in NLP applied to software engineering and data science is quite novel. Meetings are held mostly with me, but at least once per month with her.
This project was completed in S2, 2022.