Artificial Intelligence Powered Data Interpretation Supports Decision Making and Improves Health Programmes

Using AI tools such as ChatGPT could help us move towards Equitable Health and Development for all

Some examples are shown below: 

Unlocking the Power of Data to Improve Health


Healthcare systems in low- and middle-income countries (LMICs) struggle to make sense of vast amounts of data due to scarcity of analytical skills. By using new Artificial Intelligence (AI) technology like ChatGPT, we aim to bridge this data interpretation gap, thereby supporting decision-making and strengthening health care systems.

We used ChatGPT-4 to analyse and interpret routine health programme data, comparing AI-generated outputs with traditional analysis and statistical software. Working closely with the South African Department of Health, we developed an AI-supported data-analysis ‘cheat sheet’ (available to download below) to scale up this skill.

Embracing ChatGPT as a tool to assist with complex data analysis is a big step forward in improving how data is understood and utilised, especially in LMICs. The AI-toolkit can be adapted and used for any data interpretation in any sector. While individuals using the toolkit must have an in-depth understanding of the programme, they do not need to be specialists in data analytics. 

Hardware and software requirements:

  1. A functional computer, either laptop or desktop.
  2. A reliable and uninterrupted internet connection.
  3. A web browser compatible with the ChatGPT platform.

Subscription to ChatGPT-4:

  1. Secure a subscription to ChatGPT-4. As of the last update, the subscription rate is USD 20 per month.
  2. To subscribe, visit

Configuration and activation:

  1. Once logged in, navigate to Settings.
  2. Proceed to Settings and Beta.
  3. Under Beta features, select and activate the Advanced data analysis option to enable the code interpreter.

Small data sets or simple tables:

  1. Copy and paste the dataset directly into the ChatGPT chat window.

Large data sets:

  1. For large data sets or multiple files, use the “upload file (+)” button within the chat interface.
  2. The system allows for the simultaneous upload of multiple datasets or multiple workbooks within a single dataset.

Supported file formats:

  1. ChatGPT-4 supports a range of file formats, including but not limited to CSV, Excel, PDF and Stata (.dta).

Using the prompt sheet:

Refer to the provided prompt sheet in the toolkit for specific data analysis instructions. This sheet is curated to offer guidance tailored to ChatGPT-4’s capabilities.

Data description and cleaning:

Use the prompts specified in the prompt sheet for data description and cleaning.

Data analysis:

The prompt sheet offers tailored prompts for both patient-level and cumulative dataset analysis.

Merging datasets:

Refer to the prompt sheet for instructions on merging various data sets.

Data visualization:

Utilize the visualization prompts from the prompt sheet to generate insightful charts and graphs.

Generating reports and insights:

The prompt sheet provides prompts for generating analysis insights, comprehensive reports, summaries, and presentation materials.


  1. Ensure that sensitive data is anonymized or encrypted before uploading for analysis. ChatGPT-4 respects user privacy, but it’s always a good practice to maintain data confidentiality.

  2. Regularly check the OpenAI platform for updates or changes in features and functionalities.
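
One way to follow the first point is to replace patient names with random identifiers on your own computer before anything is uploaded. The sketch below is illustrative only and is not part of the toolkit: the `patientname` and `random_number` field names are borrowed from the merging prompts, while the `pseudonymise` helper, the sample records and the name normalisation (trimming spaces, ignoring case) are assumptions.

```python
import secrets

def pseudonymise(records, key="patientname"):
    """Replace each unique name with a random ID.

    Returns the cleaned records and the name-to-ID lookup.
    The lookup re-identifies patients, so it must stay on the local machine.
    """
    lookup = {}
    cleaned = []
    for record in records:
        # Assumption: names differing only in case or outer spaces are the same person.
        name = record[key].strip().lower()
        if name not in lookup:
            lookup[name] = secrets.token_hex(8)  # 16 random hex characters
        row = {k: v for k, v in record.items() if k != key}
        row["random_number"] = lookup[name]
        cleaned.append(row)
    return cleaned, lookup

# Hypothetical records, for illustration only.
records = [
    {"patientname": "Jane Doe", "district": "District A"},
    {"patientname": "jane doe ", "district": "District A"},
    {"patientname": "John Roe", "district": "District B"},
]
cleaned, lookup = pseudonymise(records)
for row in cleaned:
    print(row)
```

Only the `cleaned` records would then be uploaded. Replacing names in this way reduces the risk of exposing identities, but other columns (dates of birth, addresses, small facilities) can still identify patients and may need the same treatment.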

Rank headcount column by district name, display results in a table.

Calculate proportions for the following: death rate, loss to follow-up rate and successful completion. Display results in a table, showing both raw numbers and proportions (as %).

Calculate the proportion of patients screened for TB by district name. Display results in a table.

Rank the number of patients screened for TB column by district name. Please display results in a table.

Calculate proportion of patients screened for TB from head count. Tabulate head count, number of patients screened for TB and TB screening proportion as a percentage in one table
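
To check the table that ChatGPT returns for a prompt like this one, it helps to know the arithmetic behind it. The following is a minimal sketch in plain Python, assuming one row per district; the district names, counts and field names are invented for illustration.

```python
# Hypothetical district-level counts; the field names are assumptions.
rows = [
    {"district": "District A", "head_count": 1200, "screened_for_tb": 900},
    {"district": "District B", "head_count": 800, "screened_for_tb": 720},
    {"district": "District C", "head_count": 0, "screened_for_tb": 0},
]

def screening_proportion(row):
    """Return screened / head count as a percentage, or None if head count is 0."""
    if row["head_count"] == 0:
        return None  # avoid dividing by zero when a district reports no visits
    return round(100 * row["screened_for_tb"] / row["head_count"], 1)

# One table holding head count, number screened and the screening percentage.
table = [{**row, "screened_pct": screening_proportion(row)} for row in rows]

for row in table:
    print(row)
```

A percentage above 100 in such a table usually points to a data-capture problem (for example, screenings recorded without a matching head count) and is worth querying before the result is reported.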

Plot a bar graph for head count and patients screened for TB. Add TB screening proportion on a secondary axis as a line graph.

Tabulate the number of patients screened for TB, number TB symptomatic and the proportion of symptomatic patients.

What is the percentage breakdown of the XXXX column?

Can you segment data and create a table?

What are the trends shown in the dataset?

Calculate percentage values for the following variables: Screen for TB using Head count as a denominator. For subsequent variables including Screen for TB, TB symptomatic, TB investigation done, DS-TB confirmed and DS-TB treatment start use preceding variables as denominators
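
The cascade calculation requested above, in which each step is divided by the step before it, can be sketched as follows. The step names come from the prompt; the counts and the `cascade_percentages` helper are hypothetical.

```python
# Hypothetical totals for one reporting period, listed in cascade order.
cascade = [
    ("Head count", 10000),
    ("Screen for TB", 8000),
    ("TB symptomatic", 400),
    ("TB investigation done", 300),
    ("DS-TB confirmed", 30),
    ("DS-TB treatment start", 27),
]

def cascade_percentages(steps):
    """Express each step as a percentage of the step immediately before it."""
    results = []
    # Pair every step with its predecessor: (Head count, Screen for TB), ...
    for (_, previous), (name, count) in zip(steps, steps[1:]):
        pct = round(100 * count / previous, 1) if previous else None
        results.append((name, count, pct))
    return results

cascade_table = cascade_percentages(cascade)
for name, count, pct in cascade_table:
    print(f"{name}: {count} ({pct}%)")
```

Because the denominator changes at every step, the percentages cannot be compared with each other directly; each one describes the drop-off between two neighbouring steps only.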

Analyze the following columns: On BDQ, On short regimen, Previous Drug History. Display results in a table for each variable

Is there a significant difference in age at treatment start by gender?

Compare On short regimen by gender and indicate if there is a statistically significant difference. Display results in a table.

Provide HIV status distribution by gender and district, include proportions and a total row. Display results in a table for each variable.

Create a variable called “treatment success” by adding cured and treatment completed

Display treatment success by: gender, district and HIV status in a graph for each variable

Calculate number of days between current treatment start date and outcome date

Display the median, 25th percentile and 75th percentile of the “Days Between Treatment” column for: overall, gender, HIV status, On BDQ, On short regimen. Create a box plot and table for each variable.

Create a new column called finaltreatment_outcomes2, which contains the values died and not died. For those not died, include only: treatment failure, treatment completed, transfer out, shared care, moved out, loss to follow-up, cured. Include a total row.
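
The two prompts above (days between dates, then median and quartiles by group) correspond to a calculation like the one sketched here. The records and field names are invented, and the "inclusive" quartile method is an assumption: statistical packages use slightly different quartile definitions, so small differences from ChatGPT or Stata output are possible.

```python
from datetime import date
from statistics import median, quantiles

# Hypothetical patient-level records; the field names are assumptions.
patients = [
    {"gender": "F", "start": date(2022, 3, 1), "outcome": date(2022, 3, 11)},
    {"gender": "M", "start": date(2022, 3, 1), "outcome": date(2022, 3, 21)},
    {"gender": "F", "start": date(2022, 3, 1), "outcome": date(2022, 3, 31)},
    {"gender": "M", "start": date(2022, 3, 1), "outcome": date(2022, 4, 10)},
    {"gender": "F", "start": date(2022, 3, 1), "outcome": date(2022, 4, 20)},
]

def summarise(days):
    """Return the count, median and 25th/75th percentiles of a list of durations."""
    q1, _, q3 = quantiles(days, n=4, method="inclusive")
    return {"n": len(days), "p25": q1, "median": median(days), "p75": q3}

# Days between treatment start and outcome, grouped by gender.
days_by_gender = {}
for p in patients:
    days = (p["outcome"] - p["start"]).days
    days_by_gender.setdefault(p["gender"], []).append(days)

all_days = [d for group in days_by_gender.values() for d in group]
summary = {"overall": summarise(all_days)}
summary.update({g: summarise(d) for g, d in days_by_gender.items()})

for group, stats in summary.items():
    print(group, stats)
```

Negative durations (an outcome date before the start date) are a common data-entry error and are worth checking for before the summary is interpreted.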

Calculate odds ratio for gender and finaltreatment_outcomes2. Show results in a table
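
For readers who want to verify an odds ratio reported by ChatGPT, the calculation from a 2×2 table is short enough to do by hand. The sketch below uses invented counts and adds a 95% confidence interval using the standard log (Woolf) method; the `odds_ratio` helper and the choice of interval method are assumptions, and ChatGPT or Stata may use a different method.

```python
import math

def odds_ratio(a, b, c, d, z=1.96):
    """Odds ratio and approximate 95% CI for a 2x2 table.

                 died   not died
    exposed       a        b
    unexposed     c        d

    Uses the log (Woolf) method; every cell must be greater than zero.
    """
    estimate = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(estimate) - z * se_log)
    upper = math.exp(math.log(estimate) + z * se_log)
    return estimate, lower, upper

# Hypothetical counts: 20 of 100 men died, 10 of 100 women died.
or_estimate, ci_lower, ci_upper = odds_ratio(a=20, b=80, c=10, d=90)
print(f"OR = {or_estimate:.2f} (95% CI {ci_lower:.2f} to {ci_upper:.2f})")
```

In this invented example the interval includes 1, so the data would not show a statistically significant difference in the odds of death by gender at the 5% level, even though the point estimate is above 2.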

Calculate and add odds ratios and p-values for each variable and display all results in a table, show individual contingency tables.

Interpret results and provide a simple explanation

Create a multivariate logistic regression model using death as the dependent variable. Use the following as independent variables: previous drug history and HIV status. Calculate odds ratios and p-values for each variable, display all results in a table and show individual contingency tables.

Please add RR and 95% CI, display results in a table.

Please display a forest graph for the results above

Determine the predictors of death. Show all results in a table. Add RR and 95% CI.

Provide a detailed STATA profile for all the work done


Merge two datasets namely, 22Q1-23Q1 and 22Q2-Q4. Column patientname is common to both datasets. Please assign a random number to each unique patientname.

Do not merge yet, I will indicate when to do that

Please indicate number of entries in each dataset. Display results in a table.

Please describe each dataset

Clean both data sets

Please put results onto the previous table

Display table of identified duplicates using patientname or random_number.

Please provide only the overall number of duplicates identified using the following: patientname only, random_number only, patientname and random_number. Display results in a table and include the total number of entries.

Drop patientname column in both datasets. Keep a single copy of unique random_number on both datasets. Display results in a table, number of duplicates and unique random_number for each dataset

Save datasets as: 22Q1-23Q1 (D) and 22Q2-Q4 (D)

Please merge Line list dataset onto EDRWeb dataset using the random_number column. Keep only the merged dataset. Display overall entries and the number merged for both datasets in a table
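
The de-duplication and merge steps in the prompts above amount to keeping one row per `random_number` in each dataset and then keeping only the identifiers found in both. A minimal sketch, assuming each dataset is a list of rows keyed by `random_number`; the sample rows and helper names are hypothetical.

```python
def deduplicate(rows, key="random_number"):
    """Keep the first row seen for each identifier."""
    seen = {}
    for row in rows:
        seen.setdefault(row[key], row)
    return list(seen.values())

def inner_merge(left, right, key="random_number"):
    """Combine rows whose identifier appears in both datasets.

    Where both datasets share a column name, the value from `right` wins.
    """
    right_by_key = {row[key]: row for row in right}
    return [
        {**row, **right_by_key[row[key]]}
        for row in left
        if row[key] in right_by_key
    ]

# Hypothetical rows, for illustration only.
line_list = [
    {"random_number": 1, "district": "District A"},
    {"random_number": 2, "district": "District B"},
    {"random_number": 2, "district": "District B"},  # duplicate entry
    {"random_number": 3, "district": "District C"},
]
edrweb = [
    {"random_number": 2, "outcome": "cured"},
    {"random_number": 3, "outcome": "treatment completed"},
    {"random_number": 4, "outcome": "died"},
]

line_list_dedup = deduplicate(line_list)
edrweb_dedup = deduplicate(edrweb)
merged = inner_merge(line_list_dedup, edrweb_dedup)

print(len(line_list), len(line_list_dedup), len(edrweb_dedup), len(merged))
```

Comparing the row counts before and after each step, as the prompts request, is the simplest check that no records were silently dropped or duplicated by the merge.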

Please keep only the 796 with entries from both datasets and save as Merged_dataset. Ta!

Display the following columns: district, facility, gender and age in a single table for the EDRWeb and Line list deduplicated datasets side-by-side. Display all results in a table.

Provide a detailed STATA profile for all the work done