1. Best practices for data sharing: GitHub, Zenodo, NCBI
Sharing the raw data and the code used to analyze it has become a standard requirement for publication in high-profile scientific journals. Still, not all data can be shared publicly, and there is an ongoing debate on what proportion of data should be published and how. Raw sequencing data is conventionally deposited in dedicated archives such as NCBI's SRA and GEO, while code is typically hosted on GitHub. More recently, platforms such as Zenodo and FigShare have become popular venues for publishing both data and code, and some journals now even require posting to such platforms.
2. Pipeline development: Nextflow, Snakemake, or bash?
Analysis of omics datasets usually involves several consecutive processing steps; such a series of steps is called a bioinformatic pipeline. Over the past several years, multiple languages have been developed for constructing and executing such pipelines, with Nextflow, Common Workflow Language (CWL), and Snakemake among the most popular. There is, however, no universal standard for pipeline development, and many researchers still rely on plain bash scripts for their purposes.
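The trade-off can be illustrated with a plain bash version of a two-step pipeline: everything is explicit, but the bookkeeping that workflow managers provide for free (fail-fast execution, skipping steps whose outputs already exist) has to be written by hand. A minimal sketch on toy data; the file names and the two "steps" are illustrative stand-ins for real tools:

```shell
#!/usr/bin/env bash
# Minimal bash pipeline sketch: two chained steps with crude
# skip-if-output-exists logic, mimicking what Snakemake/Nextflow automate.
set -euo pipefail   # stop immediately if any step fails

# toy input: a FASTQ file with two records (4 lines per record)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGG\n+\nIIII\n' > reads.fastq

mkdir -p results

# Step 1: "QC" -- record the total line count of the input
if [ ! -f results/qc.txt ]; then
    wc -l < reads.fastq > results/qc.txt
fi

# Step 2: "filtering" -- keep only the sequence lines (line 2 of each record)
if [ ! -f results/filtered.txt ]; then
    awk 'NR % 4 == 2' reads.fastq > results/filtered.txt
fi
```

Re-running the script skips both steps because the output files already exist; a workflow manager would additionally track whether inputs changed, run independent steps in parallel, and isolate each step's software environment.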
3. Avoiding software dependency hell: virtual environments and conda
Virtually every person with experience in bioinformatic data analysis has faced software dependency conflicts. The problem is more than a simple inconvenience during installation and execution of a particular tool, as different versions of dependencies may affect the behavior of the software and hence the reproducibility of results. Several years ago, conda gained significant popularity in the bioinformatics community for package management and the creation of isolated virtual environments.
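The usual conda workflow is to pin tools and their versions in an environment file and recreate the environment from that file on any machine. A minimal sketch; the environment name, channels, and version pins below are illustrative examples, not recommendations:

```yaml
# environment.yml -- illustrative pins
name: rnaseq
channels:
  - conda-forge
  - bioconda
dependencies:
  - python=3.11
  - samtools=1.17
```

The environment is then created with `conda env create -f environment.yml` and entered with `conda activate rnaseq`; committing `environment.yml` alongside the analysis code lets collaborators reproduce the same software stack.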
4. Downstream analysis with R Markdown and Jupyter
While accurate processing of raw high-throughput datasets is important, it is the downstream analysis that is crucial for drawing correct conclusions. R Markdown and Jupyter notebooks are the two main tools for making downstream analysis in R or Python clean and reproducible. However, it can be challenging to keep R Markdown files or Jupyter notebooks organized, with clean code and a proper file structure.
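One common convention for organizing such a project (a sketch of one possible layout, not a standard) is to separate immutable inputs, generated outputs, and the notebooks themselves:

```
project/
├── data/             # raw inputs, never edited by hand
├── results/          # files and figures produced by the notebooks
├── notebooks/        # or analysis/*.Rmd; one notebook per question,
│   └── 01-qc.ipynb   # numbered in execution order
├── environment.yml   # pinned software environment
└── README.md         # how to rerun the analysis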
5. Cloud computing for computational biology
One of the main recent trends in computational biology and bioinformatics is the rising popularity of cloud-based solutions. On the one hand, more and more people use cloud computing resources, such as Google Cloud, Yandex Cloud, or Amazon Web Services, both for data analysis and for the development and deployment of bioinformatic tools. On the other hand, there are now multiple platforms that provide interactive user interfaces and software collections for different types of analysis.
6. Virtualization and containerization
Management of computational resources, whether on local or remote infrastructure, is an important topic for any bioinformatician. The pipelining languages and other technologies discussed at our meetings greatly aid in this task. However, it is also worth discussing the benefits of other tools, including virtualization and containerization, for more efficient and reproducible data analysis in computational biology.
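Containerization typically starts from a short recipe that fixes the operating system and the installed tools. A minimal Dockerfile sketch; the base image choice and the pinned tool version are illustrative assumptions:

```dockerfile
# Illustrative Dockerfile: base image and version pins are examples only
FROM condaforge/miniforge3:latest

# install a pinned tool into the base conda environment
RUN conda install -y -c conda-forge -c bioconda samtools=1.17 \
    && conda clean -afy

ENTRYPOINT ["samtools"]
```

Building the image (`docker build -t samtools-demo .`) and running it (`docker run --rm samtools-demo --version`) gives every collaborator, and every node of a cluster or cloud deployment, byte-identical software.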