Streamlining Bioinformatics Data Pipelines with Omnipy
Date: 12 December 2024 @ 13:00 - 16:00
Timezone: Brussels
Duration: 3 hours
Language of instruction: English
The time is ripe to challenge fundamental assumptions about managing bioinformatics data. With good reason, the notion of piping together command line tools as workflows has long dominated bioinformatics. However, returning to flat files on the command line at each junction adds unnecessary parsing and serialisation steps, and these often change the data in subtle (or less subtle) ways that are hard to fix.
With Omnipy, we instead follow the lead of ETL (Extract, Transform and Load) solutions for Big Data, streamlined as an elegant and powerful Python library. Omnipy offers a systematic and scalable approach to building pipelines where the focus is on the data rather than the tools. Through the specification of data models/parsers, Omnipy allows researchers to import data in various formats and wrangle the data through stepwise transformations. For automation with large data, Omnipy seamlessly scales up for deployment on remote infrastructures.
The workshop will introduce the Omnipy library and its main concepts through a mix of presentations and hands-on exercises. While Omnipy is designed for cross-domain applicability, the primary use cases are bioinformatics-related. We will thus make use of real-world examples that should feel relevant to many of the attendees.
We have held similar workshops before, e.g. at Oslo Bioinformatics Week 2023 and at Digital Scholarship Days 2024 (parts 1 and 2) at UiO. Continuously improved over several years, not least through feedback from workshop attendees, Omnipy is finally getting ready for its v1.0 release!
Contact: [email protected]
Venue: Ole-Johan Dahl's House, 23B Gaustadalléen
City: Oslo
Region: Oslo kommune
Country: Norway
Postcode: 0373
Prerequisites:
The participants should have some experience with Python programming/scripting. We will not spend time explaining basic syntax and concepts, other than what is related to type hints. Experience with type hints in Python is useful, but not required.
Learning objectives:
- Introduction to Python type hints and Pydantic models
- How to use type hints to define models, datasets, tasks and flows in Omnipy
- How to write a simple parser for a tabular file format
- How to set up an executable mapping of data from one metadata schema to another
- How to automate an Omnipy data pipeline by deploying it to the Prefect orchestrator on NIRD (National Infrastructure for Research Data)
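
To give a flavour of the first three objectives, here is a minimal sketch of a type-hinted record model, a small parser for a tab-separated format, and a mapping to a second metadata schema. It uses only the Python standard library and does not show Omnipy's or Pydantic's actual API. The names `SampleRecord`, `parse_tsv` and `to_target_schema`, the column names and the sample data are all hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SampleRecord:
    """One row of a hypothetical tab-separated sample sheet."""
    sample_id: str
    organism: str
    read_count: int


def parse_tsv(text: str) -> list[SampleRecord]:
    """Parse tab-separated text with a header row into typed records."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split("\t")
    records = []
    for line in lines[1:]:
        row = dict(zip(header, line.split("\t")))
        records.append(
            SampleRecord(
                sample_id=row["sample_id"],
                organism=row["organism"],
                # int() raises ValueError if the column is not a whole number
                read_count=int(row["read_count"]),
            )
        )
    return records


def to_target_schema(record: SampleRecord) -> dict[str, object]:
    """Map a record onto a second, equally hypothetical metadata schema."""
    return {
        "id": record.sample_id,
        "species": record.organism,
        "reads": record.read_count,
    }


# Hypothetical input data, made up for this example
RAW = (
    "sample_id\torganism\tread_count\n"
    "S1\tHomo sapiens\t1200\n"
    "S2\tMus musculus\t950\n"
)

records = parse_tsv(RAW)
mapped = [to_target_schema(r) for r in records]
```

The point of the sketch is that the data stays in typed Python objects between the parsing and mapping steps instead of being written back to a flat file; how Omnipy itself expresses models, tasks and flows is covered in the workshop.
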
Organizer: The workshop is provided by the Oslo node of ELIXIR Norway as part of an extended event organised by the Student Committee of the Centre for Bioinformatics at the University of Oslo in collaboration with the ISCB Regional Student group in Norway
Host institutions: University of Oslo
Eligibility:
- First come, first served
Target audience: PhD, Postdoctoral Fellows, Technical personnel
Capacity: 20
Tech requirements:
Laptop, with an account set up for Google Colab
Cost basis: Free to all
Sponsors: UiO:Life Science, the Group for temporary employees at the Division of Laboratory Medicine (KLM TempAware), NCMM - Norsk senter for molekylærmedisin
Scientific topics: Data curation and archival, Data identity and mapping, Data quality management, Data governance, Workflows
Operations: Data handling