The Data for Governance Hackathon was envisioned to challenge participants to build an integrated analytics platform that will consolidate various disparate data sources to create a unified ‘analytical view’. With this view, independent entities will be analysed for patterns, relationships, commonalities, and triangulated insights generation to drive data-driven strategic decision-making on governance at various levels.
Our platform was designed to serve as a ‘single version of the truth’ for data across all different (public and private) sectors – health, education, financial services, agriculture, etc. However, given the size and scale of an overarching such data project, we decided to create a proof of concept demonstrated in form of a minimum viable product (MVP) using all the data set captured from the GRID platform.
Our objective is to drive the synthetic use of data by expanding the visualization component of the GRID3. This will help the policymakers understand the distribution, patterns, and peculiarities among different segments of the populace at National, state and local government levels.
Our submission involves the implementation of standardized analytics dashboards that will provide visualizations and visibility, especially in regards to how advanced data analytics can enrich data governance at various levels. We explored the possibilities of integrating a diverse set of public data such as population estimate from GRID3 (Geo-Referenced Infrastructure and Demographic Data for Development) project, locations of hospitals, schools and market places in a bid to understand how these distributions affect governance. The granularity of these datasets from disparate sources were provided at local government level. The LGA granular detail enhanced the ability to correlate and implement referential integrity within the data sources.
The overall aim was to draw deeper contextual insights and use the insights generated to offer bespoke solutions. The platform created, made data more discoverable, accessible, interoperable and reusable for industry stakeholders and interested parties. It was envisioned that the platform would grow to include additional domains such as agriculture, education, health, transportation, logistics, etc. in the later years. This document provided comprehensive documentation to explain the objective of the MVP and to demonstrate a new way of using dynamic analytic tools to analyse multiple data to present insights. This document explained the nature of the data sets provided, system architecture, tools and technologies, use-cases, and data governance framework adopted.
DATA
Sources and Data Types
Data sets for the MVP originated from three (5) different sources. These include GRID3, Urban planning, Flood incidences, pipeline vandalization, etc
GRID3 data provided population estimate at LGA level by combining micro-census / survey data with satellite imagery to predict population numbers in accessible areas. Model-based gridded estimates of Nigeria population density along with other information was used to determine a metric’s penetration at LGA level. Aggregating the penetration at LGA, state, regional and national levels provided a hierarchical view and deeper insights to the extent at which individuals at these locations. GRID3 hybrid census was mainly estimates based on survey and geo-spatial data. GRID3 tables had states, LGA, age-band, gender and population estimate.
The GRID3 team provided additional data set that enriched the insights generated from the supply-side data. These include:
- All the schools (primary, secondary and tertiary institutions) in Nigeria
- All the registered medical facilities (Public and Private) in Nigeria
- All the Open Market places in Nigeria
The data provided included the GPS coordinates, LGA, States etc. of these facilities. The list of fields provided is as listed below.
Schools | Medical Facilities | Markets | ||
Id | id | id | ||
geometry.type | geometry.type | geometry.type | ||
geometry.coordinates1 | geometry.coordinates1 | geometry.coordinates1 | ||
geometry.coordinates2 | geometry.coordinates2 | geometry.coordinates2 | ||
geometry_name | geometry_name | geometry_name | ||
geometry_type | latitude | geometry_type | ||
latitude | longitude | latitude | ||
longitude | ri_service_status | longitude | ||
global_id | timestamp | global_id | ||
name | cce_available | name | ||
category | category | settlement_name | ||
sub_type | global_id | market_days_monday | ||
management | name | market_days_tuesday | ||
education | functional_status | market_days_wednesday | ||
poi_type | type | market_days_thursday | ||
ward_code | ownership | market_days_friday | ||
source | ward_code | market_days_saturday | ||
ward_name | source | market_days_sunday | ||
lga_code | ward_name | total_market_days | ||
lga_name | lga_code | market_type_goods | ||
state_code | lga_name | ward_code | ||
state_name | state_code | source | ||
number_of_teachers | state_name | ward_name | ||
number_of_students | alternate_name | lga_code | ||
contact_phone | lga_name | |||
contact_name | state_code | |||
cce_quantity | state_name | |||
cce_lastupdated | other_market_type_goods | |||
accessibility | product_description | |||
settlement_type |
Data Quality
Largely, the data provided was clean, especially, the transactional data sets. However, there were cases of data mismatch, where a local government expected to be in Lagos (say Surulere) was found in Imo state. These cases constituted less than 2% of the entire data set. We curated the data by conducting a preliminary data quality assessment to ascertain the level of inconsistencies in the data. Some efforts taken to curate the data involved creating a lookup table which served as a reference point for the correct mappings of local government and states in the Federation.
To maintain high data quality, data cleansing was performed to address incorrectness, inconsistency & incompleteness in the data set. In order to enhance the data quality, ‘aggressive’ data cleansing steps were followed to assure the accuracy and consistency of the performance by taking these steps:
- Conducted preliminary data quality assessment
- Determined the relevant and important fields for the purpose of data cleansing
- Pulled all relevant fields into a separate temporary space
- Examined the health of the pulled fields in terms of their correctness, completeness, comprehensiveness and consistency of default values, and verified data integrity through either online research and eye-balling
- Cleaned and treated the default values, incorrect Formats, patterned numbers, and segmented each variable term into 4 segments: ‘VALID’ ‘INVALID’ “DEFAULT’ ‘MISSING’
- Checked the distribution and term frequencies of ‘Valid’ segments of each variable to further determine any inconsistent patterns
The data quality process ensured that key steps were fulfilled prior to processing, analysis and visualisation of results based on a number of quality criteria:
- Completeness: Ensured all required data was collected. Data volume collected was monitored and compared to historical collections data volumes, this was to capture the anomalies within set threshold levels of tolerance. In addition, the completeness was verified from dimensions such as location, periods, product plans etc.
- Integrity: Ensured only valid data was collected. This criterion verified the referential integrity i.e. orphan data or data without reference existed in the collected data and checked for abnormal distributions of values
- Consistency: Ensured new data had the same characteristics as already processed data. Data values were verified to ensure their behaviour was in line with previous loads. For instance, checksums on specific fields should be within the same range as the past history.
- Latency: Ensured data was made available in for processing within set limits of acceptance of time. Verify that production delays were respected and data is made available for collection within a specific threshold
- Traceability: Ensured that procedures could be traced back for audit purposes.
TOOLS, TECHNIQUES AND TECHNOLOGIES
- Structured Query Language (SQL) is a domain-specific language that enabled us to create and operate on relational databases. It was used to explore and manipulate the data provided in the project.
- The SQL server: Microsoft SQL Server was the relational database management system used for the project. We adopted SQL Server primary to store and retrieve data as required for this project. SQL Server provided a wide variety of transaction processing, business intelligence, functions and procedures for analytics applications.
- Tableau: Tableau was used to design all the dashboards
USE CASES: Governance Map
Use Case 1: Comparison of Population Distribution across
The analysis was done on the distribution of agents against the adult population by region, state, & LGAs, fraud/theft, the distribution of amount involved in fraud/theft and complaint against the adult population by region, state, & LGA. Additionally, there was also a measure of central tendency distribution of agents, fraud/theft, the amount involved and complaint per state at LGA level
Use Case 2: Multi-Dimension Layers of Measure over Base Population
Identifying the social amenities, child population per school, population per medical facilities & agents per market distribution across regions, states & LGAs.
- Population Distribution Against Socio-amenities: Distribution of total population, schools, medical facilities and markets against schools, medical facilities & market places.
- Academic Facility Utility for Child Population: This gave the distribution of schools and child population across the country as well as the estimated proportion of the child population that can be serviced by the available primary, secondary, and tertiary schools.
- Medical Facility for Total Population: This focused on medical facility utility with respect to the total population estimate in Nigeria. It also provided an estimate on the proportion of the population that can be serviced by the available primary, secondary and tertiary medical facilities.
- Market Utility for Agents and Total Population: This focused on market utility with respect to the total population estimate and agents in Nigeria. It also provided the distribution of markets and population across the country as well as the estimated proportion of agents that can service the markets across local governments.
Use Case 3: Correlation Between Identifiable Individuals, Social Amenities at LGA Granulariry
Comparing population category across the country division with further analysis on identifying the correlation between the population category and social amenities as well as the measure of the utility of social amenities across the country division. 2-dimensional layers of measures and utility were placed on the base population.
- Comparison of Population Category across School Divisions: The dashboard showed banked individuals, unbanked individuals and Inter-bank senders against primary, mixed and tertiary schools per region. Results were filterable by regions, states & LGAs.
- Layered Banked over Unbanked State Map: The map views showed the distribution of banked individuals over unbanked individuals across different states and local governments. Results were filterable by regions, states & LGAs.
- Distribution of Utilities and Population Category: The dashboard focused on schools, medical facilities, markets and agents with respect to population categories such as banked, unbanked, underbanked individuals, adult, child, total population, agents and inter-bank senders. This gave the distribution of utilities and population across the country as well as the estimated proportion of the population that can be serviced by the available social amenities.
- Distribution of utilities over population categories: The map views showed the distribution of utilities (population per social amenities) layered over population categories (for maximum interactivity) across different states and local governments in Nigeria. The LGA map can be filtered when the state map is clicked on providing in-depth state information on the LGA level.