Commit cd0dc078 authored by Toby Hodges

added notes and preliminary course plan from meetings so far.

### Data Management course meeting
###### Jean-Karim's slides/content from a previous course from a couple of years ago
- what do you do with data over the lifetime of an experiment and afterwards
- from raw data to paper, and beyond
- preserve data integrity and effective comms
- some sort of tracking mechanism for data through intermediate steps to the final product of an analysis
- keep things generic, not just for people who code/work on command line
- funding bodies are starting to ask for data management plan in proposals/applications
- need to make sure that anything in a paper lab book is also available in a computer-readable format
- documentation in general
- README for every project
- good practices in data processing
- examples
- filenames
- consistent, descriptive, no spaces
- judicious use of directory hierarchy
- discuss workflow with collaborators beforehand, to make sure that all can work with file formats that you plan to use
- tabular data, spreadsheets
- data storage
- backups backups backups
- plan adequate storage in advance
- no one ever modifies primary data
- check file integrity regularly
- relational databases for data management and workflow tracking/documentation
- links, info on large datafiles in database, rather than data itself, to prevent unmanageable expansion of database size
- option of a full LIMS
- can be developed in-house
- has the advantage of being "aware" of specifics of the project
- use of browser-based database management systems
- version control/tracking
- for result tracking (changes between different analysis parameters etc)
- data management checklist
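
One way to act on "check file integrity regularly" and "no one ever modifies primary data" is to keep a checksum manifest for the primary data and compare against it later. The following is only a minimal sketch, not something from the original slides: the function names (`sha256_of`, `build_manifest`, `changed_files`) and the choice of SHA-256 are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir):
    """Map each file's path (relative to data_dir) to its checksum."""
    root = Path(data_dir)
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def changed_files(data_dir, manifest):
    """List manifest entries that are now missing or have a different checksum."""
    current = build_manifest(data_dir)
    return sorted(
        name for name, digest in manifest.items() if current.get(name) != digest
    )
```

A manifest built when the primary data first arrives could be stored alongside the project README; running `changed_files` on a schedule would then flag any file that has been altered or lost. Files added after the manifest was built are not reported by this sketch.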
###### Charles
- data duplication is a big problem here, with people not keeping track of which version of a file they used
- another big problem is people not keeping track of analysis steps (software versions, parameters, etc.); this may be a separate issue of experiment documentation
- Galaxy
- for command line, what does it mean to have a pipeline?
- snakemake
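
The problem of untracked analysis steps (software versions, parameters) could be illustrated in the course with a small provenance log written next to the results. This is a rough sketch under stated assumptions: the record fields, the JSON Lines log format, and the names `provenance_record` and `append_record` are all hypothetical, and tool versions are passed in by the caller rather than detected automatically.

```python
import json
import platform
import sys
from datetime import datetime, timezone


def provenance_record(step, parameters, inputs, tools=None):
    """Describe one analysis step: what ran, on which inputs, with which settings."""
    return {
        "step": step,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "python_version": platform.python_version(),
        "command_line": sys.argv,
        "tools": tools or {},  # e.g. {"some_tool": "1.2.3"}, supplied by the caller
        "parameters": parameters,
        "inputs": inputs,
    }


def append_record(log_path, record):
    """Append the record as one JSON line, so earlier entries are never rewritten."""
    with open(log_path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, sort_keys=True) + "\n")
```

Calling `append_record("analysis_log.jsonl", provenance_record(...))` at the end of each step would leave a plain-text history that can be read without any coding tools, which fits the aim of keeping things generic.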
###### Preliminary Course Structure - Updated 2016-03-04
##### Day 1
- initial ice-breaker (Toby) _30 mins_
- ask people what "data management" means to them
- why worry about data management anyway?
- who can you talk to at EMBL for more info about this stuff? (a bit of Bio-IT promo)
- what is data?
- what are the different roles (in terms of data) in a lab?
- divide up into groups
- based on those roles?
- with at least one rep of each role?
- what problems does a lab face in terms of data management? (if we have no PIs in the course, we ask someone in each group to pretend to be a PI)
- summarise/formalise (__blends with intro to Charles' section__)
- general problems of data organisation (Charles) _2 hrs_
- what is important for good data organisation?
- Charles' data warehouse
- storage/backing up/versioning (Jean-Karim) _2 hrs_
- what do you do to take care of this?
- ask the audience again: what do you do to back up your data? does anyone have a backup horror story?
- people should plan for the costs of all of this in advance
- post-publication storage/data lifecycle
- documentation of data processing (Charles) _2 hrs_
- what do we mean by a pipeline?
- reproducibility
- how do you share, compare, etc
- example of live documentation of a command?
##### Day 2
- final wrap-up (Jean-Karim)
- tabular data/spreadsheets
- relational databases
- search strategies: how do you find what you're looking for?
- case studies @ EMBL
- backup strategies
- NGS projects at EMBL
- Mitocheck project/Chromosome condensation project/cellbase/high-throughput screen