{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Forest-based Classification: Predict asthma rates\n", "\n", "> * 👟 Ready To Run!\n", "* 📝 Requires Notebook Server Advanced License\n", "* 📥 Requires ArcPy\n", "* 🔍 Data Science\n", "* 💻 Predictive Modeling" ] }, { "cell_type": "markdown", "metadata": { "toc": true }, "source": [ "

Table of Contents

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This example notebook examines childhood asthma hospitalization rates in the state of Connecticut and then builds a prediction model to use for other data. The asthma data we have is only available at the census tract level, and as is the case with much of the available public health data, some data values are missing. This notebook will show how to use the Forest-based Classification and Regression tool to use known values from hospitalization rates for census tracts and build a model to predict asthma hospitalization rates for the finer-grained census block group level." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this tutorial we will be covering the following topics:\n", " * How to import desired packages\n", " * How to set an ArcPy workspace and build a predictive model\n", " * How to work with ArcPy functions to perform a predictive analysis with the model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Import packages" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, let's import all the packages we'll need through the notebook\n", "\n", "ArcGIS API for Python is a Python library for working with maps and geospatial data, powered by web GIS. It provides simple and efficient tools for sophisticated vector and raster analysis, geocoding, map making, routing and directions, as well as for organizing and managing a GIS with users, groups and information items. In addition to working with your own data, the library enables access to ready to use maps and curated geographic data from Esri and other authoritative sources. \n", "\n", "ArcPy is a Python site package that provides a useful and productive way to perform geographic data analysis, data conversion, data management, and map automation with Python. Through ArcPy, we can gain access to different tools from within the ArcGIS Desktop and ArcGIS Pro platforms along with reference documentation for each function, module, and class.\n", "\n", "Let's import both the ArcGIS API for Python and ArcPy, and the additional Python packages we'll use throughout this notebook:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import os\n", "import shutil\n", "import zipfile\n", "\n", "import pandas as pd\n", "import seaborn as sns\n", "\n", "from arcgis.gis import GIS\n", "import arcpy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's check the version of `arcpy` installed in our environment:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'10.7.0'" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "arcpy.GetInstallInfo()['Version']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Automatically authenticate to your WebGIS using your credentials and the `home` keyword:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "gis = GIS(\"home\", verify_cert=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Model building" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this notebook, we will build a predictive model based on children's hospitalization rates from asthma for census tract data in Connecticut to then subsequently help us forecast the hospitalization rates from asthma for children in Connecticut census block groups.\n", "\n", "With the results of this model, an organization can better allocate resources so hospitals have the staff, treatments, and educational programs necessary to effectively help families with children suffering from asthma-related complications." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To begin, we will copy our analysis data from the `samplesdata` directory into our `home` directory and unzip it. Then we will set our environment’s workspace to this file geodatabase so we can access it for our analysis. We can use the Python `os`, `shutil`, and `zip` packages to manage the file system functions to move our data and unzip it. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Prepare data and set ArcPy workspace" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "GDB successfully copied to /arcgis/home/Asthma_InitialAnalysis.gdb/\n" ] } ], "source": [ "arcgis_dir = os.path.abspath(os.path.join(os.sep, 'arcgis', ''))\n", "home_dir = os.path.join(arcgis_dir, 'home', '')\n", "\n", "def copy_sample_gdb_to_home(gdb_zip_name):\n", " \"\"\"Given the full filename (with extensions) of a gdb zip file in \n", " /arcgis/samplesdata/, will copy and unzip that gdb to /arcgis/home/\n", " Will return the full path to the unzipped gdb in home\"\"\"\n", "\n", " # Get the full paths of all the source and destination files to copy\n", " gdb_dir_name = gdb_zip_name.split(\".zip\")[0]\n", " gdb_zip_path_src = os.path.join(arcgis_dir, 'samplesdata', \n", " gdb_zip_name)\n", " gdb_dir_path_src = os.path.join(arcgis_dir, 'samplesdata', \n", " gdb_dir_name)\n", " gdb_zip_path_dst = os.path.join(home_dir, gdb_zip_name)\n", " gdb_dir_path_dst = os.path.join(home_dir, gdb_dir_name)\n", "\n", " # If the gdb has been copied/unzipped to home dir before, delete it\n", " if os.path.exists(gdb_zip_path_dst):\n", " os.remove(gdb_zip_path_dst)\n", " if os.path.exists(gdb_dir_path_dst):\n", " shutil.rmtree(gdb_dir_path_dst)\n", "\n", " # Copy the zip file to home, unzip it\n", " shutil.copy(gdb_zip_path_src, gdb_zip_path_dst)\n", " zip_ref = zipfile.ZipFile(gdb_zip_path_dst, 'r')\n", " zip_ref.extractall(home_dir)\n", " zip_ref.close()\n", " \n", " # Return the output full path to /arcgis/home/unzipped_gdb/\n", " return os.path.join(gdb_dir_path_dst, '') # Adds trailing slash\n", "\n", "gdb_path = copy_sample_gdb_to_home('Asthma_InitialAnalysis.gdb.zip')\n", "print(f\"GDB succesfully copied to {gdb_path}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### List the feature classes" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/arcgis/home/Asthma_InitialAnalysis.gdb/Asthma_InitialAnalysis.gdb\\Connecticut_Tracts\n", "/arcgis/home/Asthma_InitialAnalysis.gdb/Asthma_InitialAnalysis.gdb\\ConnecticutTracts_WithAsthma\n", "/arcgis/home/Asthma_InitialAnalysis.gdb/Asthma_InitialAnalysis.gdb\\annual_conc_by_monitor_2010_\n", "/arcgis/home/Asthma_InitialAnalysis.gdb/Asthma_InitialAnalysis.gdb\\annual_conc_by_monitor_2010_Conn\n", "/arcgis/home/Asthma_InitialAnalysis.gdb/Asthma_InitialAnalysis.gdb\\ConnecticutTracts_WithAsthma_Enriched\n", "/arcgis/home/Asthma_InitialAnalysis.gdb/Asthma_InitialAnalysis.gdb\\ConnecticutBGs\n", "/arcgis/home/Asthma_InitialAnalysis.gdb/Asthma_InitialAnalysis.gdb\\ConnecticutBGs_WithAsthma_Enriched\n", "/arcgis/home/Asthma_InitialAnalysis.gdb/Asthma_InitialAnalysis.gdb\\BlockGroups_PredictedAsthma\n", "/arcgis/home/Asthma_InitialAnalysis.gdb/Asthma_InitialAnalysis.gdb\\ConnecticutTracts_Enriched2\n", "/arcgis/home/Asthma_InitialAnalysis.gdb/Asthma_InitialAnalysis.gdb\\TRI_2010_XYTableToPoint\n", "/arcgis/home/Asthma_InitialAnalysis.gdb/Asthma_InitialAnalysis.gdb\\TRI_2010_wTotalRelease\n", "/arcgis/home/Asthma_InitialAnalysis.gdb/Asthma_InitialAnalysis.gdb\\TRI_2010_Air\n", "/arcgis/home/Asthma_InitialAnalysis.gdb/Asthma_InitialAnalysis.gdb\\TRI_2010_Air_wTotalRelease\n", "/arcgis/home/Asthma_InitialAnalysis.gdb/Asthma_InitialAnalysis.gdb\\ConnecticutTracts_WithAsthma_Enriched2\n", "/arcgis/home/Asthma_InitialAnalysis.gdb/Asthma_InitialAnalysis.gdb\\BGs_PredictedAsthmaV2\n", "/arcgis/home/Asthma_InitialAnalysis.gdb/Asthma_InitialAnalysis.gdb\\TRI_Air\n", "/arcgis/home/Asthma_InitialAnalysis.gdb/Asthma_InitialAnalysis.gdb\\TRI_Air_HDBSCAN_10\n", "/arcgis/home/Asthma_InitialAnalysis.gdb/Asthma_InitialAnalysis.gdb\\TRI_Air_HDBSCAN_25\n", "/arcgis/home/Asthma_InitialAnalysis.gdb/Asthma_InitialAnalysis.gdb\\TRI_Air_OPTICS_25\n", "/arcgis/home/Asthma_InitialAnalysis.gdb/Asthma_InitialAnalysis.gdb\\TRI_Air_OPTICS_15\n", "/arcgis/home/Asthma_InitialAnalysis.gdb/Asthma_InitialAnalysis.gdb\\PittburghAreaBusinesses_Rest\n", "/arcgis/home/Asthma_InitialAnalysis.gdb/Asthma_InitialAnalysis.gdb\\PittRestaurants_HDBSCAN10\n", "/arcgis/home/Asthma_InitialAnalysis.gdb/Asthma_InitialAnalysis.gdb\\PittRestaurants_Optics10_Sens90\n", "/arcgis/home/Asthma_InitialAnalysis.gdb/Asthma_InitialAnalysis.gdb\\PittRestaurants_Optics10_Sens90_Ellipses\n", "/arcgis/home/Asthma_InitialAnalysis.gdb/Asthma_InitialAnalysis.gdb\\PittRestaurants_Ellipses_Enriched\n", "/arcgis/home/Asthma_InitialAnalysis.gdb/Asthma_InitialAnalysis.gdb\\PittRestaurants_Ellipses_Enriched_Index\n", "/arcgis/home/Asthma_InitialAnalysis.gdb/Asthma_InitialAnalysis.gdb\\PittRestaurants_Ellipses_Enriched_Business\n", "/arcgis/home/Asthma_InitialAnalysis.gdb/Asthma_InitialAnalysis.gdb\\PittRestaurants_Ellipses_Enriched_BusinessPlus\n", "/arcgis/home/Asthma_InitialAnalysis.gdb/Asthma_InitialAnalysis.gdb\\TractswithAsthma_Projected\n", "/arcgis/home/Asthma_InitialAnalysis.gdb/Asthma_InitialAnalysis.gdb\\PredictedBGs_test1\n", "/arcgis/home/Asthma_InitialAnalysis.gdb/Asthma_InitialAnalysis.gdb\\PittsburghRestaurants\n", "/arcgis/home/Asthma_InitialAnalysis.gdb/Asthma_InitialAnalysis.gdb\\PittsburghRestaurants_HDBSCAN\n", "/arcgis/home/Asthma_InitialAnalysis.gdb/Asthma_InitialAnalysis.gdb\\PittsburghRestaurants_ConcaveHull_1\n", "/arcgis/home/Asthma_InitialAnalysis.gdb/Asthma_InitialAnalysis.gdb\\PittsburghRestaurants_HDBSCAN_Ellipses\n", "/arcgis/home/Asthma_InitialAnalysis.gdb/Asthma_InitialAnalysis.gdb\\HDBSCAN_Ellipses_Enriched_Business\n", "/arcgis/home/Asthma_InitialAnalysis.gdb/Asthma_InitialAnalysis.gdb\\RestaurantClusters_DensityBa\n", "/arcgis/home/Asthma_InitialAnalysis.gdb/Asthma_InitialAnalysis.gdb\\PittsburghRestaurants_Densit1\n", "/arcgis/home/Asthma_InitialAnalysis.gdb/Asthma_InitialAnalysis.gdb\\Connecticut_StateBoundary\n", "/arcgis/home/Asthma_InitialAnalysis.gdb/Asthma_InitialAnalysis.gdb\\PittsburghRestaurants_Densit\n", "/arcgis/home/Asthma_InitialAnalysis.gdb/Asthma_InitialAnalysis.gdb\\PittsburghRestaurants_Densit2\n", "/arcgis/home/Asthma_InitialAnalysis.gdb/Asthma_InitialAnalysis.gdb\\PittsburghRestaurants_Densit3\n", "/arcgis/home/Asthma_InitialAnalysis.gdb/Asthma_InitialAnalysis.gdb\\PittsburghRestaurants_Densit4\n", "/arcgis/home/Asthma_InitialAnalysis.gdb/Asthma_InitialAnalysis.gdb\\PittsburghRestaurants_Densit5\n", "/arcgis/home/Asthma_InitialAnalysis.gdb/Asthma_InitialAnalysis.gdb\\PittsburghRestaurants_Densit6\n", "/arcgis/home/Asthma_InitialAnalysis.gdb/Asthma_InitialAnalysis.gdb\\PittsburghRestaurants_Densit7\n", "/arcgis/home/Asthma_InitialAnalysis.gdb/Asthma_InitialAnalysis.gdb\\ConnecticutTracts_WithAsthma1\n", "/arcgis/home/Asthma_InitialAnalysis.gdb/Asthma_InitialAnalysis.gdb\\ConnecticutTracts_WithAsthma_Enrich_SmokingOnly\n", "/arcgis/home/Asthma_InitialAnalysis.gdb/Asthma_InitialAnalysis.gdb\\PittsburghRestaurants_Centra\n", "/arcgis/home/Asthma_InitialAnalysis.gdb/Asthma_InitialAnalysis.gdb\\OutputBlockGroupTest\n" ] } ], "source": [ "arcpy.env.workspace = gdb_path\n", "\n", "datasets = arcpy.ListDatasets()\n", "datasets = [''] + datasets if datasets is not None else []\n", "\n", "for ds in datasets:\n", " if ds:\n", " if arcpy.Describe(ds).dataType == \"Feature Dataset\":\n", " for fc in arcpy.ListFeatureClasses(feature_dataset=ds):\n", " path = os.path.join(arcpy.env.workspace, ds, fc)\n", " print(path)\n", " else:\n", " for fc in arcpy.ListFeatureClasses():\n", " path = os.path.join(arcpy.env.workspace, fc)\n", " print(path)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### List the raster datasets" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/arcgis/home/Asthma_InitialAnalysis.gdb/Asthma_InitialAnalysis.gdb\\GALayerToRas1\n", "/arcgis/home/Asthma_InitialAnalysis.gdb/Asthma_InitialAnalysis.gdb\\RoadDensity\n", "/arcgis/home/Asthma_InitialAnalysis.gdb/Asthma_InitialAnalysis.gdb\\DistanceToAirToxicReleases\n", "/arcgis/home/Asthma_InitialAnalysis.gdb/Asthma_InitialAnalysis.gdb\\Kriging_AirQuality\n", "/arcgis/home/Asthma_InitialAnalysis.gdb/Asthma_InitialAnalysis.gdb\\Idw_AirQuality\n", "/arcgis/home/Asthma_InitialAnalysis.gdb/Asthma_InitialAnalysis.gdb\\AirQualityRaster\n", "/arcgis/home/Asthma_InitialAnalysis.gdb/Asthma_InitialAnalysis.gdb\\AirQualityEBK\n", "/arcgis/home/Asthma_InitialAnalysis.gdb/Asthma_InitialAnalysis.gdb\\DistanceToPrimarySecondaryRoads\n", "/arcgis/home/Asthma_InitialAnalysis.gdb/Asthma_InitialAnalysis.gdb\\RoadDensity_ForViz\n" ] } ], "source": [ "for ds in datasets:\n", " if ds:\n", " if arcpy.Describe(ds).datatype == \"RasterDataset\":\n", " path = os.path.join(arcpy.env.workspace, ds)\n", " print(path)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Visualize the asthma data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have census tract data for the state of Connecticut to begin training our model. We used the [`Enrich`](https://esri.github.io/arcgis-python-api/apidoc/html/arcgis.geoenrichment.html#enrich) tool to add a variety of attributes thought to influence childhood asthma hospitalization rates to this base census data. We added socioeconomic and behavioral variables such as median income, smoking rates, and insurance spending as well as environmental variables like road density, toxic releases, and air quality. (See [Esri Demographics](http://doc.arcgis.com/en/esri-demographics/) for detailed information on the variables avaiable to add to your data, and the [GeoEnrichment service](https://developers.arcgis.com/rest/geoenrichment/api-reference/geoenrichment-service-overview.htm) documentation to learn more on enriching your data. Also see the Analyze Patterns in Construction Permits sample notebook for an example.)\n", "\n", "we can use the Python `pandas` package and the `spatial` property to load the census feature class into a [`Spatially Enabled DataFrame`](https://esri.github.io/arcgis-python-api/apidoc/html/arcgis.features.toc.html#geoaccessor) to get a better sense of the data and what it looks like: " ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [], "source": [ "asthma_path = os.path.join(gdb_path, 'TractswithAsthma_Projected', '')" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
OBJECTID_1ObjectIDSTATE_FIPSCNTY_FIPSSTCOFIPSTRACTFIPSPOP2010POP10_SQMIPOP2012...industry_indmanu_cy_pindustry_indcons_cy_pindustry_indmin_cy_pindustry_indagri_cy_pindustry_indtran_cy_pindustry_unemprt_cyhealth_x8002_x_ahouseholds_acshhbpov_phouseholds_acshhapov_pSHAPE
01128470900109001010101090010101014476351.14454...2.526.570.00.550.004.110213.6012.0287.98{\"rings\": [[[-8198749.507912934, 5021805.12917...
12128480900109001010102090010101024330330.04328...3.594.570.30.000.734.512896.887.1992.81{\"rings\": [[[-8193156.48533997, 5027165.476578...
23128490900109001010201090010102013421775.73489...5.012.430.00.001.882.613360.914.4495.56{\"rings\": [[[-8196838.721007202, 5023675.21989...
341285009001090010102020900101020253591402.95269...2.106.830.00.001.404.39921.791.2498.76{\"rings\": [[[-8190039.435066799, 5020903.23340...
451285109001090010103000900101030040101078.04032...7.370.290.00.001.665.510646.402.7297.28{\"rings\": [[[-8197306.487002994, 5019559.45284...
\n", "

5 rows × 83 columns

\n", "
" ], "text/plain": [ " OBJECTID_1 ObjectID STATE_FIPS CNTY_FIPS STCOFIPS TRACT FIPS \\\n", "0 1 12847 09 001 09001 010101 09001010101 \n", "1 2 12848 09 001 09001 010102 09001010102 \n", "2 3 12849 09 001 09001 010201 09001010201 \n", "3 4 12850 09 001 09001 010202 09001010202 \n", "4 5 12851 09 001 09001 010300 09001010300 \n", "\n", " POP2010 POP10_SQMI POP2012 \\\n", "0 4476 351.1 4454 \n", "1 4330 330.0 4328 \n", "2 3421 775.7 3489 \n", "3 5359 1402.9 5269 \n", "4 4010 1078.0 4032 \n", "\n", " ... industry_indmanu_cy_p \\\n", "0 ... 2.52 \n", "1 ... 3.59 \n", "2 ... 5.01 \n", "3 ... 2.10 \n", "4 ... 7.37 \n", "\n", " industry_indcons_cy_p industry_indmin_cy_p industry_indagri_cy_p \\\n", "0 6.57 0.0 0.55 \n", "1 4.57 0.3 0.00 \n", "2 2.43 0.0 0.00 \n", "3 6.83 0.0 0.00 \n", "4 0.29 0.0 0.00 \n", "\n", " industry_indtran_cy_p industry_unemprt_cy health_x8002_x_a \\\n", "0 0.00 4.1 10213.60 \n", "1 0.73 4.5 12896.88 \n", "2 1.88 2.6 13360.91 \n", "3 1.40 4.3 9921.79 \n", "4 1.66 5.5 10646.40 \n", "\n", " households_acshhbpov_p households_acshhapov_p \\\n", "0 12.02 87.98 \n", "1 7.19 92.81 \n", "2 4.44 95.56 \n", "3 1.24 98.76 \n", "4 2.72 97.28 \n", "\n", " SHAPE \n", "0 {\"rings\": [[[-8198749.507912934, 5021805.12917... \n", "1 {\"rings\": [[[-8193156.48533997, 5027165.476578... \n", "2 {\"rings\": [[[-8196838.721007202, 5023675.21989... \n", "3 {\"rings\": [[[-8190039.435066799, 5020903.23340... \n", "4 {\"rings\": [[[-8197306.487002994, 5019559.45284... \n", "\n", "[5 rows x 83 columns]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tract_sedf = pd.DataFrame.spatial.from_featureclass(asthma_path)\n", "tract_sedf.spatial.project(spatial_reference = {'wkid': 3857})\n", "tract_sedf.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once we have our data in a pandas data frame, we can begin to explore it by performing some initial exploratory analyses. We begin by plotting the variable 'U19Rate', which is childhood asthma hospitalization rate, to get a sense for how its values change over space across the census tracts of Connecticut. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "asthma_map = gis.map(\"Connecticut\")\n", "asthma_map" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tract_sedf.spatial.plot(asthma_map, \n", " col='U19Rate',\n", " renderer_type='c', \n", " method='esriClassifyNaturalBreaks',\n", " class_count = 5,\n", " cmap='OrRd', \n", " alpha=0.75, \n", " line_width = 0.5) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Variables influencing hospitalization rates" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As mentioned earlier, we enriched the base census tract data with variables thought to influence the hospitalization rates of children under 19 affected by asthma. The Geoerichment service added each variable as an attribute in the original feature class data, and we can now see these values in the dataframe we created.\n", "\n", "Let's create a list of the variables we added alongside some descriptive information about what the values of the attributes mean." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "variables = ['healthpersonalcare_mp32001a_b_p',\n", " 'householdincome_pci_cy',\n", " 'householdincome_medhinc_cy',\n", " 'educationalattainment_hsgrad_cy_p',\n", " 'educationalattainment_bachdeg_cy_p',\n", " 'industry_unemprt_cy',\n", " 'health_x8002_x_a',\n", " 'households_acshhbpov_p',\n", " 'households_acshhapov_p']" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "healthpersonalcare_mp32001a_b_p Smoked cigarettes in last 12 months: Percent\n", "householdincome_pci_cy 2018 Per Capita Income\n", "householdincome_medhinc_cy 2018 Median Household Income\n", "educationalattainment_hsgrad_cy_p 2018 Education: High School Diploma: Percent\n", "educationalattainment_bachdeg_cy_p 2018 Education: Bachelor's Degree: Percent\n", "industry_unemprt_cy 2018 Unemployment Rate\n", "health_x8002_x_a Health Insurance: Average\n", "households_acshhbpov_p ACS HHs: Inc Below Poverty Level: Percent\n", "households_acshhapov_p ACS HHs:Inc at/Above Poverty Level: Percent\n" ] } ], "source": [ "asthma_desc = arcpy.Describe(asthma_path)\n", "for field in asthma_desc.fields:\n", " if field.name in variables:\n", " print(f\"{field.name:40}{field.aliasName}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also interact with functions from other python packages to peform a cursory examination of the relationship each of these variables have to each other. Using [`seaborn`](https://seaborn.pydata.org/), we will create a scatterplot matrix of all the variables within our data frame. This provides a sense for the relationship between these different possible explanatory variables. \n", "\n", "After we have explored our data, we can begin to prepare for training an initial model to see how well it predicts childhood asthma hospitialization rates based on the information we have so far for it to learn from. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sns.pairplot(tract_sedf[variables])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Setting environment variables" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before we begin training our model, we will define several environmental variables in our workspace. The tool we use these variables during its execution.\n", "\n", "First, we opt to set a random seed so that the results from this notebook can be easily duplicated. \n", "\n", "Second, we elect to overwrite any output if we choose to save something with the same name as a previous run's output. " ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "arcpy.env.randomGenerator = \"540977 ACM599\"\n", "arcpy.env.overwriteOutput = True" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Training the model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For our model, we will use the [Forest-based Classification and Regression tool](https://pro.arcgis.com/en/pro-app/tool-reference/spatial-statistics/forestbasedclassificationregression.htm) which uses the random forest algorithm to build decision trees for predicting values for unknown commodities. Random forests are versatile in the types of data they can work with and powerful in predicting outcomes. We will attempt to find a good model to predict childhood asthma hospitalization rates for census block groups. \n", "\n", "The process of building a predictive model is iterative. It often requires multiple model runs to find the correct combination of explanatory variables and model settings that produce the best fit. The first step is to train the model for prediction. We will take advantage of the Forest-based tool's 'TRAIN' mode to try different models, settings, and options before making predictions. Training builds a forest that establishes a relationship between explanatory variables and the unknown variable we are trying to predict. The train mode allows us to save time so we don't actually run the model on unknown data until we have found a good model. \n", "\n", "The Forest-based tool not only contains explanatory variables from our input point or polygon feature class, but can also factor in distance features or explanatory raster layers to help increase prediction accuracy. For any input features, distance can be calculated to the nearest distance feature and used as an explanatory variable. Additionally, the tool will resample input raster layers and automatically extract values from the raster for each input feature. The tool performs this behind the scenes with just a few lines of code." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, let's set certain variables the tool to will use. See [tool documentation](https://pro.arcgis.com/en/pro-app/tool-reference/spatial-statistics/forestbasedclassificationregression.htm#GUID-089174E4-DE5F-4E8E-BBE9-084A52BB1508) for details:\n", " * Mode - train or train+predict\n", " * Input Features - the input dataset\n", " * Prediction Variable - the variable to predict\n", " * Categorical Variable - whether or not the variable to predict is categorical or continuous" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "prediction_type = \"TRAIN\"\n", "input_fc = asthma_path\n", "predict_var = \"U19Rate\"\n", "categorical_var = None" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, in order to run the Forest-based Classification and Regression tool, we need to specify explanatory variables. These could be tabular variables such as attributes of our input features...\n", "\n", "" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "explanatory_var0 = \"educationalattainment_bachdeg_cy_p\"\n", "explanatory_var1 = \"educationalattainment_hsgrad_cy_p\"\n", "explanatory_var2 = \"householdincome_medhinc_cy\"\n", "explanatory_var3 = \"householdincome_pci_cy\"\n", "explanatory_var4 = \"industry_unemprt_cy\"\n", "explanatory_var5 = \"households_acshhbpov_p\"\n", "explanatory_var6 = \"households_acshhapov_p\"\n", "explanatory_var7 = \"health_x8002_x_a\"\n", "explanatory_var8 = \"healthpersonalcare_mp32001a_b_p\"\n", "explanatory_vars = \"{} false;{} false;{} false;{} false;{} false;{} false;{} false;{} false;{} false\".format(explanatory_var0, \n", " explanatory_var1, explanatory_var2, explanatory_var3, \n", " explanatory_var4, explanatory_var5, explanatory_var6, \n", " explanatory_var7, explanatory_var8)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "... or distance features calculating how far input features are from distance features... (we don't use any here, so let's specify \"None\")\n", "\n", "" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "distance_fc = None" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "... or finally, explanatory rasters!\n", "\n", "" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "explanatory_rst0 = os.path.join(gdb_path, 'AirQualityEBK')\n", "explanatory_rst1 = os.path.join(gdb_path, 'DistanceToAirToxicReleases')\n", "explanatory_rst2 = os.path.join(gdb_path, 'DistanceToPrimarySecondaryRoads')\n", "explanatory_rst3 = os.path.join(gdb_path, 'RoadDensity')\n", "explanatory_raster = \"{} false;{} false;{} false;{} false\".format(explanatory_rst0, \n", " explanatory_rst1, explanatory_rst2, explanatory_rst3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we've set all these variables so we can use them to train the initial model. There are a number of other parameters necessary to run the tool. Please see the reference for the [tool](https://pro.arcgis.com/en/pro-app/tool-reference/spatial-statistics/forestbasedclassificationregression.htm#S_GUID-F98AF1AD-4466-4805-899E-52D60C9BC1E9) for detailed explanations on each. We'll set the final variables and then run the model." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "fc_to_predict = input_fc # or None\n", "out_fc = os.path.join(gdb_path, 'TractswithAsthma_Projected_out') # or None\n", "out_raster = None\n", "match_exp_var = None\n", "match_dist_fc = None\n", "match_exp_raster = None\n", "out_trained_fc = None\n", "out_var_importance_table = os.path.join(gdb_path, 'Asthma_Train_Variable_Importance_Table')\n", "raster_values=True\n", "num_trees = 100\n", "min_leaf_size = None\n", "max_level = None\n", "sample_size = 100\n", "fields_to_try = None\n", "perc_training = 10\n", "\n", "randomforest_toolrun = arcpy.stats.Forest(prediction_type, input_fc, predict_var, \n", " categorical_var, explanatory_vars, distance_fc, explanatory_raster, \n", " fc_to_predict, out_fc, out_raster, match_exp_var, match_dist_fc, \n", " match_exp_raster, out_trained_fc, out_var_importance_table, \n", " raster_values, num_trees, min_leaf_size, max_level, sample_size, \n", " fields_to_try, perc_training)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Model evaluation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once we have run a training operation on the model, we can view output messages to inspect aspects of how the parameters we chose performed. These messages contain detailed diagnostic information. Not only can we see model characteristics pertaining to our forest, but also which explanatory variables were most important in influencing hospitalization rates. The tool will also automatically split your input data into training and testing data sets and perform diagnostics on both. This allows us to compare how the model performed to predict data it was previously trained on as well as additional data it had not been exposed to. This provides a comparision between actual observed values in the training data to values predicted by the model. This gives us a good idea on how our model will hold up when we try to predict for completely new features. See [here](https://pro.arcgis.com/en/pro-app/tool-reference/spatial-statistics/how-forest-works.htm#ESRI_SECTION1_E010566C782444E2B2364442355FC50D) for details. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "info = randomforest_toolrun.getMessages().split('\\n')[6:-2]\n", "print(*info, sep = \"\\n\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Overall, our model looks promising! Our R-squared value for our testing data set was 0.869. This means that about 87% of our response variable's variation was explained by the explanatory variables we selected in our model. \n", "\n", "If we are happy with this result, we can switch the tool's mode from 'TRAIN' to 'PREDICT FEATURES' to see how the model performs on the new data we wish to predict to. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "While initially we only had data at the census tract level, if we wish to predict to a finer scale, we can supply the tool with prediction locations at the block group level. As long as the data we wish to predict to is enriched with the same attribute values, our model can be applied and predictions can be given to each location. The explanatory variables (and distance features and explanatory rasters) must match between the trianing data set and the predicting data set. For our example, we matched the explanatory variables (as attribute fields) we used from our census tract data for Connecticut to the corresponding explanatory variables in our block group data set, providing attribute field names as well as descriptions to match to so the tool knows which columns are the same even if names or path locations have changed. We did the same for the explanatory rasters we used to train the model, providing the full path to the same rasters. " ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "prediction_type = \"PREDICT_FEATURES\"\n", "input_fc = asthma_path\n", "predict_var = \"U19Rate\"\n", "categorical_var = None\n", "explanatory_var0 = \"educationalattainment_bachdeg_cy_p\"\n", "explanatory_var1 = \"educationalattainment_hsgrad_cy_p\"\n", "explanatory_var2 = \"householdincome_medhinc_cy\"\n", "explanatory_var3 = \"householdincome_pci_cy\"\n", "explanatory_var4 = \"industry_unemprt_cy\"\n", "explanatory_var5 = \"households_acshhbpov_p\"\n", "explanatory_var6 = \"households_acshhapov_p\"\n", "explanatory_var7 = \"health_x8002_x_a\"\n", "explanatory_var8 = \"healthpersonalcare_mp32001a_b_p\"\n", "explanatory_vars = \"{} false;{} false;{} false;{} false;{} false;{} false;{} false;{} false;{} false\".format(explanatory_var0, explanatory_var1, \n", " explanatory_var2, explanatory_var3, explanatory_var4, \n", " explanatory_var5, explanatory_var6, explanatory_var7, explanatory_var8)\n", "distance_fc = None\n", "\"\"\" # previously defined\n", "explanatory_rst0 = \"/arcgis/home/Asthma_InitialAnalysis.gdb/AirQualityEBK\"\n", "explanatory_rst1 = \"/arcgis/home/Asthma_InitialAnalysis.gdb/DistanceToAirToxicReleases\"\n", "explanatory_rst2 = \"/arcgis/home/Asthma_InitialAnalysis.gdb/DistanceToPrimarySecondaryRoads\"\n", "explanatory_rst3 = \"/arcgis/home/Asthma_InitialAnalysis.gdb/RoadDensity\"\n", "explanatory_raster = \"{} false;{} false;{} false;{} false\".format(explanatory_rst0, \n", " explanatory_rst1, explanatory_rst2, explanatory_rst3)\n", "\"\"\"\n", "fc_to_predict = os.path.join(gdb_path, 'ConnecticutBGs_WithAsthma_Enriched')\n", "out_fc = os.path.join(gdb_path, 'Asthma_Predictions_Block_Groups')\n", "out_raster = None\n", "match_exp_var0A = 'educationalattainment_bachdeg_cy_p'\n", "match_exp_var0B = '\\\"2018 Education: Bachelor\\'s Degree: Percent\\\"'\n", "match_exp_var1A = 'educationalattainment_hsgrad_cy_p'\n", "match_exp_var1B = '\\'2018 Education: High School Diploma: Percent\\''\n", "match_exp_var2A = 'householdincome_medhinc_cy'\n", "match_exp_var2B = '\\'2018 Median Household Income\\''\n", "match_exp_var3A = 'householdincome_pci_cy'\n", "match_exp_var3B = '\\'2018 Per Capita Income\\''\n", "match_exp_var4A = 'industry_unemprt_cy'\n", "match_exp_var4B = '\\'2018 Unemployment Rate\\''\n", "match_exp_var5A = 'households_acshhbpov_p'\n", "match_exp_var5B = '\\'ACS HHs: Inc Below Poverty Level: Percent\\''\n", "match_exp_var6A = 'households_acshhapov_p'\n", "match_exp_var6B = '\\'ACS HHs:Inc at/Above Poverty Level: Percent\\''\n", "match_exp_var7A = 'health_x8002_x_a'\n", "match_exp_var7B = '\\'Health Insurance: Average\\''\n", "match_exp_var8A = 'healthpersonalcare_mp32001a_b_p'\n", "match_exp_var8B = '\\'Smoked cigarettes in last 12 months: Percent\\''\n", "match_exp_var = \"{} {};{} {};{} {};{} {};{} {};{} {};{} {};{} {};{} {}\".format(match_exp_var0A, match_exp_var0B, match_exp_var1A, match_exp_var1B, \n", " match_exp_var2A, match_exp_var2B, match_exp_var3A, match_exp_var3B, \n", " match_exp_var4A, match_exp_var4B, match_exp_var5A, match_exp_var5B, \n", " match_exp_var6A, match_exp_var6B, match_exp_var7A, match_exp_var7B, \n", " match_exp_var8A, match_exp_var8B)\n", "match_dist_fc = None\n", "match_raster0A = explanatory_rst0\n", "match_raster0B = explanatory_rst0\n", "match_raster1A = explanatory_rst1\n", "match_raster1B = explanatory_rst1\n", "match_raster2A = explanatory_rst2\n", "match_raster2B = explanatory_rst2\n", "match_raster3A = explanatory_rst3\n", "match_raster3B = explanatory_rst3\n", "match_exp_raster = \"{} {};{} {};{} {};{} {}\".format(match_raster0A,match_raster0B,\n", " match_raster1A,match_raster1B,match_raster2A,match_raster2B,\n", " match_raster3A,match_raster3B)\n", "out_trained_fc = None\n", "out_var_importance_table = os.path.join(gdb_path, 'Asthma_Prediction_Variable_Importance_Table')\n", "raster_values = True\n", "num_trees = 100\n", "min_leaf_size = None\n", "max_level = None\n", "sample_size = 100\n", "fields_to_try = None\n", "perc_training = 10\n", "\n", "randomforest_predict_toolrun = arcpy.stats.Forest(prediction_type, input_fc, predict_var, \n", " categorical_var,explanatory_vars, distance_fc, explanatory_raster, \n", " fc_to_predict, out_fc, out_raster, match_exp_var,match_dist_fc, \n", " match_exp_raster, out_trained_fc, out_var_importance_table, \n", " raster_values, num_trees, min_leaf_size, max_level, sample_size, \n", " fields_to_try, perc_training)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once again, we can inspect our output messages to see how our model performed. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "info = randomforest_predict_toolrun.getMessages().split('\\n')[8:-2]\n", "print(*info, sep = \"\\n\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Visualizing prediction output" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll create a spatially-enabled dataframe to visualize the feature class output from the prediction run of the tool:" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [], "source": [ "out_sdef =pd.DataFrame.spatial.from_featureclass(out_fc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Publish the dataframe as a feature layer to the Organization." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "asthma_item = out_sdef.spatial.to_featurelayer(title=\"ct_asthma_predicted_block_group\",\n", " gis=gis,\n", " tags=['data science', 'predictive', 'asthma'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Visualize the results on a map. Use the Smart Mapping class breaks renderer to symbolize the features." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "predicted_map = gis.map(\"Connecticut\")\n", "predicted_map" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Add the item to the map, using Smart Mapping to render the feature layer with a classed breaks renderer" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "predicted_map.add_layer(asthma_item, {\"renderer\":\"ClassedColorRenderer\",\n", " \"field_name\":\"predicted\"})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Immediately, we can see that we have filled in the holes of missing data all over Connecticut and we have produced much more detailed spatial data. Using census tract data with asthma hospitalization rates, we created a more comprehensive picture of hospitalization rates by modeling those rates to the census block group level. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Additionally, we can create a scatterplot matrix similar to what we ran above to examine the relationships that now exist between all our explanatory variables and rasters. First we'll create a list of the explanatory variable values (from the column headings of our dataframe), then input those to the seaborn pairplot tool to create the matrix: " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "columns = list(out_sdef)[2:-1]\n", "sns.pairplot(out_sdef[columns])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusion\n", "\n", "In this sample notebook, we perform a machine learning analysis using ArcGIS Platform statistical tools to predict asthma hospitalization rates at the census block group level. \n", "\n", "We begin with state of Connecticut census tract data contaninng attributes about asthma hospitalization rates and enrich it with a set of variables thought to influence those hospitalization rates. From there, using a combination of attributes and rasters, we used the ArcPy Forest-based Classifiction and Regression tool to train a model to predict values covering the same spatial extent (state of Connecticut) but at the census block group level. \n", "\n", "This notebook shows a seamles fusing of machine learning power from ArcPy with the data and mapping power from the ArcGIS Python API in an easy to run and share Jupyter notebook format. " ] } ], "metadata": { "esriNotebookRuntime": { "notebookRuntimeName": "ArcGIS Notebook Python 3 Advanced", "notebookRuntimeVersion": "10.7.1" }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.8" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": false, "sideBar": true, "skip_h1_title": true, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": true, "toc_position": {}, "toc_section_display": true, "toc_window_display": true } }, "nbformat": 4, "nbformat_minor": 2 }