Advanced Topic Modelling for BioInformatic Software Platform



External Member

Dr Fatemeh Fard (University of British Columbia, Canada)


BioConductor is an R package "framework" (or sub-environment) for bioinformatics; it has a large-scale peer-review system for packages, and any package that is submitted must use and "blend" specific structural components. More importantly, BioConductor has a Discussion Forum (akin to StackOverflow) that is specific to BioConductor. The main difference between BioConductor and CRAN (in R programming) is that all packages contributing to BioConductor are somewhat related, must be installed using specific methods, and must leverage the BioConductor structure. In that sense, it's somewhat similar to PyTorch. However, while PyTorch is for generic deep learning, BioConductor works only on bioinformatics (to analyse biological data). In this project you will use advanced, state-of-the-art natural language processing (NLP) algorithms to analyse BioConductor's discussion forum according to the software releases; that is, you will link your topic modelling to software version releases. After modelling, you will be expected to complete manual perusals of the data, draw conclusions, and provide visualisations and a report.
  • Knowledge of bioinformatics is not required. You may need to read the discussion forums to label the resulting topics, but that's it.
  • Knowledge of R programming is not required. If you know your programming theory, you can translate that knowledge to any programming language, and that is enough for this study.
  • You will need to follow instructions and be systematic in your methodology. You will not be using traditional topic-modelling algorithms.
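To give a rough feel for what "linking topic modelling to software version releases" could involve, here is a minimal sketch in Python. It assumes each forum post already carries a date and a topic label from some upstream topic model, and it buckets posts by the most recent release at the time of posting. The release labels and dates, and the function names `release_for` and `topics_per_release`, are hypothetical placeholders for illustration only; they are not real BioConductor data and not the project's prescribed method.

```python
from bisect import bisect_right
from collections import Counter, defaultdict
from datetime import date

# Hypothetical release calendar: placeholder labels and dates,
# NOT real BioConductor release data. Must be sorted by date.
RELEASES = [
    ("release-A", date(2020, 4, 1)),
    ("release-B", date(2020, 10, 1)),
    ("release-C", date(2021, 4, 1)),
]
RELEASE_DATES = [d for _, d in RELEASES]


def release_for(post_date):
    """Return the label of the latest release on or before post_date, or None."""
    idx = bisect_right(RELEASE_DATES, post_date) - 1
    return RELEASES[idx][0] if idx >= 0 else None


def topics_per_release(posts):
    """Count topic labels within each release window.

    posts: iterable of (post_date, topic_label) pairs. The topic labels are
    assumed to come from whichever topic model is used upstream.
    """
    counts = defaultdict(Counter)
    for post_date, topic in posts:
        counts[release_for(post_date)][topic] += 1
    return dict(counts)
```

Calling `topics_per_release` on a list of `(date, topic)` pairs yields one `Counter` per release label, which could then feed the visualisations and manual inspection mentioned above. Posts dated before the first listed release are grouped under `None`.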
The project will be co-supervised with my colleague Dr Fatemeh Fard (from the University of British Columbia, Canada), whose work in NLP applied to software engineering and data science is quite novel. Meetings are held mostly with me, but at least once per month with her.
This project has been completed in S2, 2022.


  • Programming knowledge. Python is a must. You may be required to read R code (reading only, which is achievable with general programming theory).
  • Knowledge (or willingness to learn quickly) about using APIs to download data.
  • Demonstrated academic writing/speaking skills.
  • Excellent attention to detail.
Please contact me via email with a detailed resume and your comments (1 page only) on why you are interested in this project.

Background Literature

You can find Bioconductor here:


  • Empirical Software Engineering
  • Scientific Software / Data Science Software
  • Mining Software Repositories
  • Advanced Topic Modelling

Updated:  10 August 2021/Responsible Officer:  Dean, CECS/Page Contact:  CECS Marketing