ArcGIS Regression Analysis Tutorial: Analyzing 911 Response Data Using Regression

This tutorial demonstrates how regression analysis has been implemented in ArcGIS, and explores some of the special considerations you'll want to think about whenever you use regression with spatial data.

Regression analysis allows you to model, examine, and explore spatial relationships, to better understand the factors behind observed spatial patterns, and to predict outcomes based on that understanding. Ordinary Least Squares regression (OLS) is a global regression method. Geographically Weighted Regression (GWR) is a local, spatial regression method that allows the relationships you are modeling to vary across the study area. Both of these are located in the Spatial Statistics Tools -> Modeling Spatial Relationships toolset.

Before executing the tools and examining the results, let's review some terminology:

Dependent variable (Y): what you are trying to model or predict (residential burglary incidents, for example).
Explanatory variables (X): variables you believe influence or help explain the dependent variable (like income, the number of vandalism incidents, or households).
Coefficients (β): values, computed by the regression tool, reflecting the relationship and strength of each explanatory variable to the dependent variable.
Residuals (ε): the portion of the dependent variable that isn't explained by the model; the model under and over predictions.

The sign (+/-) associated with a coefficient (one for each explanatory variable) tells you whether the relationship is positive or negative. If you were modeling residential burglary and obtained a negative coefficient for the Income variable, for example, it would mean that as median incomes in a neighborhood go up, the number of residential burglaries goes down.
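Putting these terms together: the model that OLS estimates has the familiar linear form (standard regression notation, shown here only as a reference, not as ArcGIS-specific output):

y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_k x_{ki} + \varepsilon_i

where y_i is the dependent variable for feature i, the x terms are the explanatory variables, the β terms are the coefficients the tool estimates, and ε_i is the residual for feature i.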
Output from regression analysis can be a little overwhelming at first. It includes diagnostics and model performance indicators. All of these numbers should seem much less daunting once you complete the tutorial.

Important notes:
1. The steps in this tutorial document assume the data is stored at C:\SpatialStats. If a different location is used, substitute "C:\SpatialStats" with the alternate location when entering data and environment paths.
2. This tutorial was developed using ArcGIS 10.0. If you are using a different version of the software, the screenshots, and how you access results, may be a bit different.

Total estimated time: 1.5 hours

Introduction:
In order to demonstrate how the regression tools work, you will be doing an analysis of 911 emergency call data for a portion of the Portland, Oregon metropolitan area.
Suppose we have a community that is spending a large portion of its public resources responding to 911 emergency calls. Projections are telling them that their community's population is going to double in size over the next 10 years. If they can better understand some of the factors contributing to high call volumes now, perhaps they can implement strategies to help reduce 911 calls in the future.

Step 1: Getting started
Open C: (the path may be different on your machine).
In this map document you will notice several data frames containing layers of data for the Portland, Oregon metropolitan area.
Notice that the Hot Spot Analysis data frame is active.
In the map, each point represents a single call into a 911 emergency call center. This is real data representing over 2000 calls.

Step 2: Examine Hot Spot Analysis results
Expand the data frame and click the + sign to the right of the Hot Spot Analysis grouped layer. Ensure that the ResponseStations layer is checked on.
Results from running the Hot Spot Analysis tool show us where the community is getting lots of 911 calls. We can use these results to assess whether or not the stations (fire/police/emergency medical) are optimally located.
Areas with high call volumes are shown in red (hot spots); areas getting very few calls are shown in blue (cold spots). The green crosses are the existing locations for the police and fire units tasked with responding to these 911 calls.
Notice that the 2 stations to the right of the map appear to be located right over, or very near, call hot spots. The station in the lower left, however, is actually located over a cold spot; we may want to investigate further whether this station is in the best location. The community can use hot spot analysis to decide if adding new stations or relocating existing stations might improve 911 response.

Step 3: Exploring OLS regression
The next question our community is probably asking is, "Why are call volumes so high in those hot spot areas?" and "What are the factors that contribute to high volumes of 911 calls?" To help answer these questions, we'll use the regression tools.
Activate the Regression Analysis data frame by right clicking it and choosing Activate.
Expand the Spatial Statistics Tools toolbox.
Right click in an open space in ArcToolbox and set your environment as follows:
Disable background processing (Geoprocessing > Geoprocessing Options). With ArcGIS 10, geoprocessing tools can run in the background and all results are available through the Results window. By disabling background processing, we will see tool results in a progress window; this is often best when you are using the regression tools.
In the data frame, check off the Data911Calls layer.
Instead of looking at individual 911 calls as points, we have aggregated the calls to census tracts and now have a count variable (Calls) representing the number of calls in each tract.
Right click the ObsData911Calls layer and choose Open Attribute Table.
The reason we are using census tract level data is that it gives us access to a rich set of variables that might help explain 911 call volumes. Notice that the table has fields such as educational status (LowEd), unemployment levels (Unemploy), and so on. When you are done exploring the fields, close the table.
Can you think of anything… any variable… that might help explain the call volume pattern we see in the hot spot map? What about population? Would we expect more calls in places with more people? Let's test the hypothesis that call volume is simply a function of population. If it is, our community can use Census population projections to estimate future 911 emergency call volumes.
Run the OLS tool with the following parameters. Note: once the tool starts running, make sure the "Close this dialog when completed successfully" box is NOT checked.
o Input Feature Class -> ObsData911Calls
o Unique ID Field -> UniqID
o Output Feature Class -> C:
o Dependent Variable -> Calls
o Explanatory Variables -> Pop
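If you prefer to script this run rather than use the tool dialog, here is a minimal arcpy sketch of the same single-variable OLS model. This is a sketch only: it assumes the Spatial Statistics toolbox that ships with ArcGIS 10.x, and the workspace and output path are hypothetical.

# Minimal arcpy sketch of the single-variable OLS run described above.
# The workspace and output path are hypothetical; adjust them to your own setup.
import arcpy

arcpy.env.workspace = r"C:\SpatialStats"    # assumed data location from the notes above
arcpy.env.overwriteOutput = True

arcpy.OrdinaryLeastSquares_stats(
    "ObsData911Calls",                      # Input Feature Class
    "UniqID",                               # Unique ID Field
    r"C:\SpatialStats\OLS911Calls.shp",     # Output Feature Class (hypothetical path)
    "Calls",                                # Dependent Variable
    "Pop")                                  # Explanatory Variables

The tool's diagnostic report appears in the geoprocessing messages, just as it does in the progress window.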
Move the progress window to the side so you can examine the OLS911calls layer in the map. The OLS default output is a map showing us how well the model performed, using only the population variable to explain 911 call volumes. The red areas are under predictions (where the actual number of calls is higher than the model predicted); the blue areas are over predictions (actual call volumes are lower than predicted). When a model is performing well, the over/under predictions reflect random noise… the model is a little high here, but a little low there… you don't see any structure at all in the over/under predictions. Do the over and under predictions in the output feature class appear to be random noise, or do you see clustering? When the over (blue) and under (red) predictions cluster together spatially, you know that your model is missing one or more key explanatory variables.
The OLS tool also produces a lot of numeric output. Expand and enlarge the progress window so you can read this output.
Notice that the Adjusted R-Squared value is 0.393460, or 39%. This indicates that, using population alone, the model is explaining 39% of the call volume story.
So, looking back at our original hypothesis, is call volume simply a function of population? Might our community be able to predict future 911 call volumes from population projections alone? Probably not; if the relationship between population and 911 call volumes had been higher, say 80%, our community might not need regression at all. But with only 39% of the story, it seems other factors, and other variables, are needed to effectively model 911 calls.
The next question that follows is: what are these other variables? This, actually, is the hardest part of the regression model building process: finding all of the key variables that explain what we are trying to model.
Close the progress window.

Step 4: Finding key variables
The scatterplot matrix graph can help us here by allowing us to examine the relationships between call volumes and a variety of other variables. We might guess, for example, that the number of apartment complexes, unemployment rates, income, or education are also important predictors of 911 call volumes.
Experiment with the scatterplot matrix graph to explore the relationships between call volumes and other candidate explanatory variables. If you enter the Calls variable either first or last, it will appear as either the bottom row or the first column in the matrix. Here is an example of scatterplot matrix parameter settings:
Once you finish creating the scatterplot matrix, select features in the focus graph and notice how those features are highlighted in each scatterplot and on the map.
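Outside ArcGIS, you can get a similar quick look at the pairwise relationships with pandas. The sketch below is only an illustration: it assumes you have exported the ObsData911Calls attribute table to a CSV file, and the file name and field list shown are hypothetical examples.

# Quick scatterplot matrix of Calls against a few candidate predictors.
# Assumes the attribute table was exported to CSV; file name and fields are hypothetical.
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

tracts = pd.read_csv("ObsData911Calls.csv")
fields = ["Calls", "Pop", "Jobs", "LowEduc", "Unemploy"]   # example candidate variables
scatter_matrix(tracts[fields], figsize=(10, 10), diagonal="hist")
plt.show()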
Step 5: A properly specified model
Now let's try a model with 4 explanatory variables: Pop, Jobs, LowEduc, and Dst2UrbCen. The explanatory variables in this model were found by using the scatterplot matrix and trying a number of candidate models. Finding a properly specified OLS model is often an iterative process.
Run OLS with the following parameters set:
o Input Feature Class -> ObsData911Calls
o Unique ID Field -> UniqID
o Output Feature Class -> C:
o Dependent Variable -> Calls
o Explanatory Variables -> Pop;Jobs;LowEduc;Dst2UrbCen
Notice that the Adjusted R2 value is much higher for this new model, 0.831080, indicating this model explains 83% of the 911 call volume story. This is a big improvement over the model that only used the Population variable.
Close the progress window.
Notice, too, that the residuals (the model over/under predictions) appear to be less clustered than they were using only the Population variable. We can check whether or not the residuals exhibit a random spatial pattern using the Spatial Autocorrelation tool.
Run the Spatial Autocorrelation tool (in the Analyzing Patterns toolset) using the following parameters:
o Input Feature Class -> Data911CallsOLS
o Input Field -> StdResid
o Generate Report -> checked ON
o Conceptualization of Spatial Relationships -> Inverse Distance
o Distance Method -> Euclidean Distance
o Standardization -> ROW (with polygons you will almost always want to row standardize)
Close the progress window, then open the Results window and expand the entry for Spatial Autocorrelation (if you don't see the Results window, select Geoprocessing from the menu, then Results). Double click the HTML Report File.
Results from running the Spatial Autocorrelation tool on the regression residuals indicate they are randomly distributed; the z-score is not statistically significant, so we accept the null hypothesis of complete spatial randomness. This is good news! Anytime there is structure (clustering or dispersion) in the under/over predictions, it means that your model is still missing key explanatory variables and you cannot trust your results. When you run the Spatial Autocorrelation tool on the model residuals and find a random spatial pattern (as we did here), you are on your way to a properly specified model.
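This residual check can also be scripted. Below is a minimal arcpy sketch mirroring the dialog settings above; the input feature class name is whatever your OLS output was called, so treat the name used here as an assumption.

# Global Moran's I on the standardized OLS residuals (same settings as the dialog above).
# The input feature class name is assumed; use the name of your own OLS output.
import arcpy

morans = arcpy.SpatialAutocorrelation_stats(
    "Data911CallsOLS",        # Input Feature Class (the OLS output)
    "StdResid",               # Input Field: standardized residuals
    "GENERATE_REPORT",        # Generate Report
    "INVERSE_DISTANCE",       # Conceptualization of Spatial Relationships
    "EUCLIDEAN_DISTANCE",     # Distance Method
    "ROW")                    # Standardization
print(morans.getMessages())   # Moran's Index, z-score, and p-value are reported in the messages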
Step 6: The 6 things you gotta check!
There are 6 things you need to check before you can be sure you have a properly specified model – a model you can trust.
1. First, check to see that each coefficient has the "expected" sign. A positive coefficient means the relationship is positive; a negative coefficient means the relationship is negative. Notice that the coefficient for the Pop variable is positive. This means that as the number of people goes up, the number of 911 calls also goes up. We are expecting a positive coefficient; if the coefficient for the Population variable were negative, we would not trust our model. Checking the other coefficients, their signs do seem reasonable. Self check: the sign for Jobs (the number of job positions in a tract) is positive; this means that as the number of jobs goes (?), the number of 911 calls also goes (?).
2. Next, check for redundancy among your explanatory variables. If the VIF (variance inflation factor) value for any of your variables is larger than about 7.5 (smaller is definitely better), it means you have one or more variables telling the same story. This leads to an over-count type of bias. You should remove the variables associated with large VIF values one by one until none of your variables have large VIF values (the standard VIF formula is given after these six checks). Self check: Which variable has the highest VIF value?
3. Next, check to see that all of the explanatory variables have statistically significant coefficients. Two columns, Probability and Robust Probability, measure coefficient statistical significance. An asterisk next to the probability tells you the coefficient is significant. If a variable is not significant, it is not helping the model, and unless theory tells us that a particular variable is critical, we should remove it. When the Koenker (BP) statistic is statistically significant, you can only trust the Robust Probability column to determine whether a coefficient is significant or not. Small probabilities are "better" (more significant) than large probabilities. Self check: Which variables have the "best" statistical significance? Did you consult the Probability or Robust_Pr column? Why? Note: an asterisk indicates statistical significance.
4. Make sure the Jarque-Bera test is NOT statistically significant. The residuals (over/under predictions) from a properly specified model will reflect random noise. Random noise has a random spatial pattern (no clustering of over/under predictions). It also has a normal histogram if you plot the residuals. The Jarque-Bera test measures whether or not the residuals from a regression model are normally distributed (think bell curve; the formula appears after these checks). This is the one test you do NOT want to be statistically significant! When it IS statistically significant, your model is biased. This often means you are missing one or more key explanatory variables. Self check: how do you know that the Jarque-Bera statistic is NOT statistically significant in this case?
5. Next, you want to check model performance. The Adjusted R-Squared value ranges from 0 to 1.0 and tells you how much of the variation in your dependent variable has been explained by the model (its definition also appears after these checks). Generally we are looking for values of 0.5 or higher, but a "good" R2 value depends on what we are modeling. Self check: go back to the screenshot of the OLS model that only used Population to explain call volume. What was the Adjusted R2 value? Does the Adjusted R2 value for our new model (4 variables) indicate model performance has improved? The AIC value can also be used to measure model performance. When we have several candidate models (all models must have the same dependent variable), we can assess which model is best by looking for the lowest AIC value. Self check: go back to the screenshot of the OLS model that only used Population. What was the AIC value? Does the AIC value for our new model (4 variables) indicate we improved model performance?
6. Lastly (but certainly NOT least important), you want to make sure your model residuals are free from spatial autocorrelation (spatial clustering of over and under predictions). We used the Spatial Autocorrelation tool above and found that our model passes this check too. This will not always be the case when you build your own regression models. Open the Regression Analysis Basics online documentation and look for the table called "How Regression Models Go Bad". In this table there are some strategies for how to deal with spatially autocorrelated regression residuals.
Self check: run OLS on alternate models. Use Calls for your dependent variable, with other variables in the ObsData911Calls feature class as your explanatory variables (you might select Jobs, Renters, and MedIncome, for example). For each model, go through the 6 checks above to determine if the model is properly specified. If a model fails one of the checks, look at the "Common Regression Problems, Consequences, and Solutions" table in the Regression Analysis Basics document mentioned above to determine the implications and possible solutions.
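For reference, the standard textbook definitions behind checks 2, 4, and 5 are shown below (generic statistics formulas, not something the tutorial asks you to compute by hand):

\mathrm{VIF}_j = \frac{1}{1 - R_j^2}   (R_j^2 is the R-squared from regressing explanatory variable j on the other explanatory variables)

JB = \frac{n}{6}\left( S^2 + \frac{(K - 3)^2}{4} \right)   (n = number of observations, S = skewness and K = kurtosis of the residuals)

R^2_{adj} = 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1}   (k = number of explanatory variables)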
Step 7: Running GWR
One OLS diagnostic we didn't say very much about is the Koenker statistic. When the Koenker test is statistically significant, as it is here, it indicates that the relationships between some or all of your explanatory variables and your dependent variable are non-stationary. This means, for example, that the population variable might be an important predictor of 911 call volumes in some locations of your study area, but perhaps a weak predictor in other locations. Whenever you notice that the Koenker test is statistically significant, it indicates you will likely improve model results by moving to Geographically Weighted Regression.
The good news is that once you've found your key explanatory variables using OLS, running GWR is actually quite simple. In most cases, GWR will use the same dependent and explanatory variables you used in OLS.
Run the Geographically Weighted Regression tool with the following parameters (open the side panel help and review the parameter descriptions):
o Input feature class: ObsData911Calls
o Dependent variable: Calls
o Explanatory variables: Pop, Jobs, LowEduc, Dst2UrbCen
o Output feature class: C:
o Kernel type: ADAPTIVE
o Bandwidth method: AICc (you will let the tool find the optimal number of neighbors)
Notice the output from GWR:
Neighbors : 50
ResidualSquares : 7326.2793171502362
EffectiveNumber : 19.863531396247254
Sigma : 10.44629989196762
AICc : 674.65
R2 : 0.89572753438054042
R2Adjusted : 0.86642979248431506
GWR found, applying the AICc method, that using 50 neighbors to calibrate each local regression equation yields optimal results (minimized bias and maximized model fit). Notice that the Adjusted R2 value is higher for GWR than it was for our best OLS model (OLS was 83%; GWR is almost 86.6%). The AICc value is lower for the GWR model; a decrease of even 3 or more points indicates a real improvement in model performance (OLS was 680; GWR is 674).
Close the progress window. Notice that, like the OLS tool, the GWR default output is a map of model residuals. Do the over and under predictions appear random? It's a bit difficult to tell. Run the Spatial Autocorrelation tool on the standardized residuals in the output feature class.
Close the progress window, then double click the HTML report in the Results window to see that the residuals do, in fact, reflect a random spatial pattern.
Open the table for the GWR output feature class and notice several fields with names beginning with "C". These are the coefficient values for each explanatory variable, for each feature.
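As with OLS, the GWR run can be scripted. A minimal arcpy sketch with the same settings is shown below; the output path is hypothetical, and note that the explanatory variables are passed as a single semicolon-delimited string.

# GWR with the same four explanatory variables; kernel and bandwidth settings mirror the dialog above.
# The output path is hypothetical; adjust it to your own workspace.
import arcpy

arcpy.GeographicallyWeightedRegression_stats(
    "ObsData911Calls",                  # Input feature class
    "Calls",                            # Dependent variable
    "Pop;Jobs;LowEduc;Dst2UrbCen",      # Explanatory variables (semicolon-delimited)
    r"C:\SpatialStats\GWR911Calls.shp", # Output feature class (hypothetical path)
    "ADAPTIVE",                         # Kernel type
    "AICc")                             # Bandwidth method (tool finds the optimal number of neighbors)

The diagnostic values listed above (Neighbors, AICc, R2Adjusted, and so on) appear in the geoprocessing messages.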
