DATA FOR GOVERNANCE HACKATHON – TEAM OCTAVE

The Data for Governance Hackathon challenged participants to build an integrated analytics platform that consolidates disparate data sources into a unified ‘analytical view’. With this view, independent entities could be analysed for patterns, relationships and commonalities, and insights triangulated to drive data-driven strategic decision-making on governance at various levels.

Our platform was designed to serve as a ‘single version of the truth’ for data across all sectors, public and private – health, education, financial services, agriculture, etc. However, given the size and scale of such an overarching data project, we decided to create a proof of concept in the form of a minimum viable product (MVP) using the data sets captured from the GRID3 platform.

Our objective is to drive the combined use of data sets by expanding the visualization component of GRID3. This will help policymakers understand the distribution, patterns and peculiarities among different segments of the populace at national, state and local government levels.

Our submission implements standardized analytics dashboards that provide visualizations and visibility, especially with regard to how advanced data analytics can enrich data governance at various levels. We explored the possibility of integrating a diverse set of public data, such as population estimates from the GRID3 (Geo-Referenced Infrastructure and Demographic Data for Development) project and the locations of hospitals, schools and market places, in a bid to understand how these distributions affect governance. The granularity of these data sets from disparate sources was provided at local government area (LGA) level. This LGA-level granularity enhanced the ability to correlate the data sources and enforce referential integrity across them.
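
The consolidation described above can be sketched as follows. This is a minimal illustration, not the platform's actual implementation: the field name lga_code follows the GRID3 extracts, while the sample rows and code values are invented for the example.

```python
# Sketch: consolidating the separate facility data sets into one
# analytical view keyed on LGA code, as the MVP does.
from collections import defaultdict

# Illustrative records only; real extracts carry many more fields.
schools = [{"name": "School A", "lga_code": "NGA025015"}]
hospitals = [{"name": "Clinic B", "lga_code": "NGA025015"}]
markets = [{"name": "Market C", "lga_code": "NGA025017"}]

def facility_counts_by_lga(*datasets):
    """Count facilities of each type per LGA to build the unified view."""
    view = defaultdict(lambda: defaultdict(int))
    for label, rows in datasets:
        for row in rows:
            view[row["lga_code"]][label] += 1
    return {lga: dict(counts) for lga, counts in view.items()}

view = facility_counts_by_lga(
    ("schools", schools), ("medical", hospitals), ("markets", markets)
)
print(view["NGA025015"])  # {'schools': 1, 'medical': 1}
```

Because every source carries the same lga_code key, joining on it is what makes the referential-integrity checks mentioned above possible.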

The overall aim was to draw deeper contextual insights and use the insights generated to offer bespoke solutions. The platform made data more discoverable, accessible, interoperable and reusable for industry stakeholders and interested parties. It was envisioned that the platform would grow to include additional domains such as agriculture, education, health, transportation and logistics in later years. This document explains the objective of the MVP and demonstrates a new way of using dynamic analytic tools to analyse multiple data sets and present insights. It describes the nature of the data sets provided, the system architecture, the tools and technologies, the use cases, and the data governance framework adopted.

DATA

Sources and Data Types

Data sets for the MVP originated from several different sources, including GRID3, urban planning, flood incidence and pipeline vandalization data.

GRID3 data provided population estimates at LGA level by combining micro-census/survey data with satellite imagery to predict population numbers, including in hard-to-survey areas. Model-based gridded estimates of Nigeria’s population density, along with other information, were used to determine a metric’s penetration at LGA level. Aggregating the penetration at LGA, state, regional and national levels provided a hierarchical view and deeper insight into the populations at these locations. The GRID3 hybrid census consists mainly of estimates based on survey and geo-spatial data. The GRID3 tables contained state, LGA, age band, gender and population estimate.
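
The hierarchical roll-up described above can be sketched as below. The records mirror the GRID3 table shape (state, LGA, age band, gender, population estimate), but the figures themselves are invented for illustration.

```python
# Sketch: rolling LGA-level GRID3 population estimates up to state and
# national totals to produce the hierarchical view.
from collections import Counter

grid3 = [
    {"state": "Lagos", "lga": "Surulere", "age_band": "18-35", "gender": "F", "pop": 120},
    {"state": "Lagos", "lga": "Surulere", "age_band": "18-35", "gender": "M", "pop": 110},
    {"state": "Lagos", "lga": "Ikeja", "age_band": "18-35", "gender": "F", "pop": 90},
    {"state": "Imo", "lga": "Owerri", "age_band": "18-35", "gender": "F", "pop": 80},
]

def rollup(records, level):
    """Aggregate population estimates at 'lga', 'state' or 'national' level."""
    totals = Counter()
    for r in records:
        if level == "lga":
            key = (r["state"], r["lga"])
        elif level == "state":
            key = r["state"]
        else:
            key = "Nigeria"
        totals[key] += r["pop"]
    return dict(totals)

print(rollup(grid3, "state"))     # {'Lagos': 320, 'Imo': 80}
print(rollup(grid3, "national"))  # {'Nigeria': 400}
```

In the MVP itself this aggregation was done in SQL Server with GROUP BY over the same keys; the sketch just makes the hierarchy explicit.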

The GRID3 team provided additional data sets that enriched the insights generated from the supply-side data. These include:

  1. All the schools (primary, secondary and tertiary institutions) in Nigeria
  2. All the registered medical facilities (Public and Private) in Nigeria
  3. All the open market places in Nigeria

The data provided included the GPS coordinates, LGA, state, etc. of these facilities. The fields provided for each data set are listed below.

Schools: id, geometry.type, geometry.coordinates1, geometry.coordinates2, geometry_name, geometry_type, latitude, longitude, global_id, name, category, sub_type, management, education, poi_type, ward_code, source, ward_name, lga_code, lga_name, state_code, state_name, number_of_teachers, number_of_students

Medical Facilities: id, geometry.type, geometry.coordinates1, geometry.coordinates2, geometry_name, latitude, longitude, ri_service_status, timestamp, cce_available, category, global_id, name, functional_status, type, ownership, ward_code, source, ward_name, lga_code, lga_name, state_code, state_name, alternate_name, contact_phone, contact_name, cce_quantity, cce_lastupdated, accessibility

Markets: id, geometry.type, geometry.coordinates1, geometry.coordinates2, geometry_name, geometry_type, latitude, longitude, global_id, name, settlement_name, market_days_monday, market_days_tuesday, market_days_wednesday, market_days_thursday, market_days_friday, market_days_saturday, market_days_sunday, total_market_days, market_type_goods, ward_code, source, ward_name, lga_code, lga_name, state_code, state_name, other_market_type_goods, product_description, settlement_type

Data Quality

Largely, the data provided was clean, especially the transactional data sets. However, there were cases of data mismatch, where a local government expected to be in Lagos (say, Surulere) was found in Imo state. These cases constituted less than 2% of the entire data set. We curated the data by first conducting a preliminary data quality assessment to ascertain the level of inconsistency in the data. Curation efforts included creating a lookup table that served as a reference point for the correct mapping of local governments to states in the Federation.
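
The lookup-table fix above can be sketched as follows. The mapping shown is a tiny illustrative subset, not the full federation reference table.

```python
# Sketch: repairing records whose state disagrees with the reference
# LGA -> state lookup table (e.g. Surulere recorded under Imo).
LGA_TO_STATE = {"Surulere": "Lagos", "Owerri Municipal": "Imo"}  # subset

def correct_state(record, lookup=LGA_TO_STATE):
    """Return the record with its state corrected against the reference table."""
    expected = lookup.get(record["lga"])
    if expected and record["state"] != expected:
        return {**record, "state": expected}
    return record

fixed = correct_state({"lga": "Surulere", "state": "Imo"})
print(fixed["state"])  # Lagos
```

Records whose LGA is absent from the lookup are left untouched rather than guessed at, which keeps the correction auditable.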

To maintain high data quality, data cleansing was performed to address incorrectness, inconsistency and incompleteness in the data set. ‘Aggressive’ data cleansing steps were followed to assure accuracy and consistency:

  • Conducted preliminary data quality assessment
  • Determined the relevant and important fields for the purpose of data cleansing
  • Pulled all relevant fields into a separate temporary space 
  • Examined the health of the pulled fields in terms of correctness, completeness, comprehensiveness and consistency of default values, and verified data integrity through online research and visual inspection
  • Cleaned and treated default values, incorrect formats and patterned numbers, and segmented each variable’s terms into four segments: ‘VALID’, ‘INVALID’, ‘DEFAULT’, ‘MISSING’
  • Checked the distribution and term frequencies of the ‘VALID’ segment of each variable to further detect any inconsistent patterns
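
The four-way segmentation step above can be sketched as below. The default sentinels and the validity rule (a non-empty digit string, as for a numeric field) are assumptions chosen for illustration; the real rules varied per field.

```python
# Sketch: classify each field value as VALID, INVALID, DEFAULT or MISSING.
DEFAULT_SENTINELS = {"0000000000", "N/A", "UNKNOWN"}  # assumed sentinels

def segment(value):
    """Assign one of the four cleansing segments to a raw field value."""
    text = "" if value is None else str(value).strip()
    if text == "":
        return "MISSING"
    if text.upper() in DEFAULT_SENTINELS:
        return "DEFAULT"
    if text.isdigit():  # assumed rule for a numeric field
        return "VALID"
    return "INVALID"

values = ["120", "", "N/A", "12b0"]
print([segment(v) for v in values])  # ['VALID', 'MISSING', 'DEFAULT', 'INVALID']
```

Once values are segmented, the frequency check in the last bullet is simply a count over the ‘VALID’ bucket.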

The data quality process ensured that key steps were fulfilled prior to processing, analysis and visualisation of results based on a number of quality criteria:

  • Completeness: Ensured all required data was collected. The data volume collected was monitored and compared to historical collection volumes to capture anomalies within set tolerance thresholds. In addition, completeness was verified along dimensions such as location, period and product plan.
  • Integrity: Ensured only valid data was collected. This criterion verified referential integrity, i.e. checked whether orphan data (data without a reference) existed in the collected data, and checked for abnormal distributions of values.
  • Consistency: Ensured new data had the same characteristics as already-processed data. Data values were verified to ensure their behaviour was in line with previous loads; for instance, checksums on specific fields should fall within the same range as past history.
  • Latency: Ensured data was made available for processing within set time limits. Verified that production delays were respected and that data was made available for collection within a specific threshold.
  • Traceability: Ensured that procedures could be traced back for audit purposes.
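
The completeness criterion above can be sketched as a simple volume check against history. The 20% tolerance is an assumed figure, not the threshold actually used.

```python
# Sketch: flag a load whose row count deviates too far from the
# historical average, per the completeness criterion.
def volume_ok(current_count, historical_counts, tolerance=0.20):
    """True when the current volume is within tolerance of the historical mean."""
    baseline = sum(historical_counts) / len(historical_counts)
    return abs(current_count - baseline) <= tolerance * baseline

print(volume_ok(950, [1000, 1020, 980]))  # True  (within 20% of ~1000)
print(volume_ok(400, [1000, 1020, 980]))  # False (anomalous shortfall)
```

Loads failing the check would be held back for the integrity and consistency checks rather than passed straight to visualisation.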

TOOLS, TECHNIQUES AND TECHNOLOGIES

TOOLS & TECHNOLOGIES

  • Structured Query Language (SQL) is a domain-specific language that enabled us to create and operate on relational databases. It was used to explore and manipulate the data provided in the project.
  • Microsoft SQL Server: Microsoft SQL Server was the relational database management system used for the project, adopted primarily to store and retrieve data as required. SQL Server provides a wide variety of transaction processing, business intelligence and analytics functions and procedures.
  • Tableau: Tableau was used to design all the dashboards.

USE CASES: Governance Map

Use Case 1: Comparison of Population Distribution across Regions, States and LGAs

The analysis covered the distribution of agents against the adult population by region, state and LGA, as well as the distribution of fraud/theft incidents, the amounts involved in fraud/theft, and complaints against the adult population at the same levels. Additionally, measures of central tendency were computed for the distribution of agents, fraud/theft incidents, amounts involved and complaints per state at LGA level.

Use Case 2: Multi-Dimension Layers of Measure over Base Population

Identifying social amenities, the child population per school, the population per medical facility, and agents per market, distributed across regions, states and LGAs.

  1. Population Distribution Against Social Amenities: Distribution of the total population against schools, medical facilities and market places.
  2. Academic Facility Utility for Child Population: This gave the distribution of schools and child population across the country as well as the estimated proportion of the child population that can be serviced by the available primary, secondary, and tertiary schools.
  3. Medical Facility for Total Population:  This focused on medical facility utility with respect to the total population estimate in Nigeria. It also provided an estimate on the proportion of the population that can be serviced by the available primary, secondary and tertiary medical facilities.
  4. Market Utility for Agents and Total Population:  This focused on market utility with respect to the total population estimate and agents in Nigeria. It also provided the distribution of markets and population across the country as well as the estimated proportion of agents that can service the markets across local governments.
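
The facility-utility measures in items 2–4 above reduce to a single ratio. The sketch below shows that calculation; the sample counts are illustrative, not drawn from the actual data.

```python
# Sketch: "population per facility" figure shown on the utility dashboards.
def population_per_facility(population, facility_count):
    """People each facility would need to serve; None when there are no facilities."""
    if facility_count == 0:
        return None  # avoid division by zero for LGAs with no such facility
    return round(population / facility_count, 1)

# e.g. child population served per school in a hypothetical LGA
print(population_per_facility(45000, 30))  # 1500.0
```

Computing the ratio at LGA level and then averaging upward to state and region preserves the hierarchical view used throughout the dashboards.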

Use Case 3: Correlation Between Identifiable Individuals and Social Amenities at LGA Granularity

Comparing population categories across the country’s divisions, with further analysis identifying the correlation between population categories and social amenities, as well as measuring the utility of social amenities across those divisions. Two-dimensional layers of measures and utility were placed over the base population.

  1. Comparison of Population Category across School Divisions: The dashboard showed banked individuals, unbanked individuals and Inter-bank senders against primary, mixed and tertiary schools per region. Results were filterable by regions, states & LGAs.
  2. Layered Banked over Unbanked State Map: The map views showed the distribution of banked individuals over unbanked individuals across different states and local governments. Results were filterable by regions, states & LGAs.
  3. Distribution of Utilities and Population Category: The dashboard focused on schools, medical facilities, markets and agents with respect to population categories such as banked, unbanked, underbanked individuals, adult, child, total population, agents and inter-bank senders. This gave the distribution of utilities and population across the country as well as the estimated proportion of the population that can be serviced by the available social amenities.
  4. Distribution of Utilities over Population Categories: The map views showed the distribution of utilities (population per social amenity) layered over population categories (for maximum interactivity) across different states and local governments in Nigeria. The LGA map can be filtered by clicking on the state map, providing in-depth state information at the LGA level.

[Dashboard visual: distribution of the poor]

Click here for Visualization