User requirements elicitation study for ALPHA network data provenance documentation browsing software - User Guide

Permanent Identifier:

Data Description

The ALPHA network is an innovative secondary data analysis program aimed at improving our understanding of HIV epidemiology. ALPHA is coordinated by its secretariat in the Department of Population Health (DPH) under the Faculty of Epidemiology and Population Health at the London School of Hygiene and Tropical Medicine. It comprises 10 autonomous research institutions sharing similar interests in HIV epidemiology. Each institution has its own research agenda and data management system. All partners pre-date the formation of the network. They all have population/community-based longitudinal demographic and HIV surveillance data.

ALPHA leverages the benefits of data pooling: greater statistical power gained by bringing together data from a number of research institutions, and a wider perspective than is possible with a single research institution.

ALPHA data and “modus operandi”

ALPHA assembles datasets on various topics related to demographic and HIV surveillance. These data are referred to as ALPHA data specifications or data specs and are described on the ALPHA metadata page. The ALPHA data specs have a well-defined structure to which each partner of the network has to transform their data. ALPHA is organised around data analysis and HIV research capacity strengthening workshops. At the workshops, partners bring their data and are involved in data analysis training addressing research questions of interest for the particular workshop.

Data harmonisation in ALPHA

With funding from the Wellcome Trust, ALPHA is working on a project to produce a shareable set of harmonised data that combines both population-based and clinic data from the partner studies.

Whilst community-based cohorts and demographic surveillance systems provide a rich source of data, use of the data is often limited because successful analysis requires detailed knowledge of the study's contemporary and historical procedures and of data management practices. To date, the ALPHA Network has successfully extracted and harmonised 10 standard data tables from the partner studies. However, these data are still complex and require considerable prior knowledge to use effectively, which in practice means the data can only be used in collaboration with one of the ALPHA staff.

The main project combines a number of activities, among them:

  1. Using industry-standard data integration methods and a bespoke data appliance, Centre in a Box, to develop a robust process for deriving the ALPHA datasets.
  2. High-quality documentation of both the data and the processes used to derive the data.

This data collection resulted from a study relating to the second activity, on data documentation. It contains qualitative data collected as part of scoping work to establish domain experts’ perspectives on the functionality that a user-friendly metadata browser for ALPHA datasets should provide. It contains transcripts of 10 semi-structured Skype interviews conducted with individual researchers and data managers affiliated to the ALPHA network and the Cohort & Longitudinal Studies Enhancement Resources (CLOSER) project. Interviews explored proposed features of the metadata browser, including: provision for viewing all tasks performed in the process of creating a dataset; browsing the steps in each task; task purpose; related concepts; related code scripts; the association between a sub-task and its input data and outputs; and provision for viewing data structure.

Data Collection Methods

A convenience sample of 10 participants was drawn from data managers and researchers affiliated to the ALPHA and CLOSER projects. These two groups of users were considered suitable for identifying the requirements of both internal and external users. All but one interviewee had at least a master’s degree and 5 years’ work experience.

The data collection consisted of background material reading and a recorded Skype interview. An information pack was emailed to the study participants prior to the interview. The pack contained: (1) a study background document, (2) an information sheet, (3) a consent form, and (4) a question guide comprising 6 mock-up diagrams of the proposed features and accompanying questions.

Skype interviews

Each participant was interviewed over Skype on the features in the mock-up diagrams using the semi-structured question guide. The participants graded each feature’s importance on a provided scale and gave the rationale for their grading. Further, they listed any desired features not included in the mock-ups.

Data Analysis and Preparation

All interviews were recorded and transcribed verbatim, then checked and cleaned by the lead researcher.

Geographic regions

Southern and Eastern Africa, and the United Kingdom.

Key dates

Quality Controls

The lead researcher transcribed the audio files from the interviews, checked the transcripts and cleaned them as needed with the support of software developers in the research team.


Human population


Names of participants and other identifying information such as place names were removed and replaced with pseudonyms.

All participants gave their permission for the transcripts to be archived in an anonymised form for use in future research.


LSHTM ethics ref: 16429


Keywords

ALPHA, provenance, metadata, data harmonisation, requirements elicitation

Language of written material


Project title

PhD Thesis: Provenance of “after the fact” harmonised community-based demographic and HIV surveillance data from ALPHA cohorts.

Additional Information

Interview transcripts produced during this study are embargoed until 30/06/2020 to enable sufficient anonymisation of the data. Subsequent access may be granted for secondary analysis for other purposes.

All other accompanying files are public access.

Data Creators

Forename  Surname     Faculty / Dept     Institution   Role
Chifundo  Kanjala     Population Health  LSHTM         Data Creator
Arofan    Gregory                        DDI Alliance  Co-Investigator
Jay       Greenfield                     DDI Alliance  Co-Investigator
Emma      Slaymaker   Population Health  LSHTM         PhD Supervisor
Jim       Todd        Population Health  LSHTM         PhD Supervisor

Associated Roles

Forename  Surname   Faculty / Dept  Institution                                Role
Gareth    Knight                    LSHTM                                      PhD Advisor
Tito      Castillo                  Guy's and St Thomas' NHS Foundation Trust  PhD Advisor
David     Beckles                   Independent IT Consultant                  PhD Advisor

File Description

Filename:
Description: Compressed archive contains: 10 anonymised interview transcripts provided in MS Word
Access status: Request access
Licence: Data Sharing Agreement
Embargo period: 2020-07-01

Filename: CiBDoS_Research_Protocol
Description: CiBDoS_Research_Protocol
Access status: Open
Licence: Creative Commons Attribution (CC-BY)
Embargo period:

Filename: CiBDoS_Requirements_Background
Description: Centre in a Box data documentation (CiBDoS) software requirements elicitation study - Background information
Access status: Open
Licence: Creative Commons Attribution (CC-BY)
Embargo period:

Filename: CiBDoS_Requirements_Questionnaire
Description: Question guide template for the Centre in a Box software requirements elicitation study
Access status: Open
Licence: Creative Commons Attribution (CC-BY)
Embargo period: