Tuesday, April 28, 2015

Regression Analysis

Part 1

The null hypothesis would be that there is no linear association between free lunches and number of crimes. The alternative would be that there is a linear association between free lunches and crime.
According to the regression analysis in SPSS, I would reject the null hypothesis because the significance value is .005, which is less than .05 (Figure 1). There is therefore a linear association between free lunches and the number of crimes. That association is not very strong, however: the R square value is only .173, meaning the regression explains only 17.3% of the variation in the number of crimes. A linear association between two variables also does not mean that one causes the other. Association does not prove causation.
To estimate what percent of people will receive a free lunch if the crime rate is 79.7, the following equation is used, with the crime rate as y and the free lunch percentage as x. All of the numbers are taken from the regression chart below (Figure 1).
Step 1.  y=21.819+1.685x 

Step 2. 79.7=21.819+1.685x 

Step 3. x = (79.7 - 21.819)/1.685 = 34.4%

There is little confidence that this number is very accurate because the R squared value is so small.
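The arithmetic in Steps 1 through 3 can be checked with a short script. This is only a sketch: the intercept and slope are the values quoted from Figure 1, and the helper name `solve_for_x` is made up for illustration, not something SPSS produces.

```python
# Coefficients quoted from the SPSS output (Figure 1): y = intercept + slope * x
INTERCEPT = 21.819
SLOPE = 1.685


def solve_for_x(y, intercept=INTERCEPT, slope=SLOPE):
    """Rearrange y = intercept + slope * x to get x for a given y."""
    return (y - intercept) / slope


# Crime rate of 79.7 plugged in for y, as in Step 2
free_lunch_pct = solve_for_x(79.7)
print(round(free_lunch_pct, 1))  # 34.4
```

Because the R square value is low, the result is best read as a rough estimate rather than a reliable prediction.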
Figure 1 Regression analysis chart used to test whether there is a linear association between the number of free lunches and the number of crimes.

Part 2

Introduction

The UW System wants to analyze enrollment numbers for all the UW schools and try to figure out if there are specific reasons students are choosing particular schools. This can't be determined exactly, but they want to know if there are any trends. We are given the enrollment numbers of all the schools by county as well as data about the percent of residents with BS degrees, income, and distance from each county's center (Figure 2). Using linear regression, this data will be explored to see what trends, if any, appear in students choosing one school over another.
Figure 2 is a sample of the data provided to do this project

Methods

The first step in completing the project was some data manipulation and rearranging to get the data into a more user-friendly format. One of these manipulations was to normalize the county population data by the distance from the university. To do this, the population of each county was divided by its distance from the school. This was done in MS Excel and added to the data in Figure 2. Figure 3 is the new data created by this normalization process.
Figure 3 is the normalized data set
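The normalization step was done in a spreadsheet, but the same calculation can be sketched in a few lines. The county names, populations, and distances below are made up for illustration, and the function name is hypothetical.

```python
def normalize_population(population, distance):
    """Return population divided by distance (the pop/dist value)."""
    if distance <= 0:
        raise ValueError("distance must be positive")
    return population / distance


# Hypothetical counties: (name, population, distance to campus)
counties = [
    ("County A", 100_000, 50.0),
    ("County B", 20_000, 200.0),
]

for name, population, distance in counties:
    print(name, normalize_population(population, distance))
```

A large, nearby county gets a high pop/dist value, while a small, distant county gets a low one.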
After that new data was created, the next step was to run regression analysis for two schools. I chose UW-Eau Claire and UW-Green Bay. This analysis is done in a statistical analysis program called SPSS, which saves an incredible amount of time and stress. Running regression analysis is easy to do: all that is needed is a dependent and an independent variable. In this case the enrollment number for each school is the dependent variable and the other variables are the independent variables. For example, the dependent variable would be students enrolled and the independent variable would be median household income. After the dependent and independent variables are placed into SPSS linear regression, it produces charts like those below in Figures 4-10.
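To show what a simple linear regression calculates, here is a minimal ordinary least squares sketch that returns the intercept, slope, and R square value. It is not the SPSS procedure and it does not compute the significance value; the function name and the numbers are made up and are not the UW System data.

```python
def simple_linear_regression(x, y):
    """Fit y = a + b*x by ordinary least squares; return (a, b, r_squared)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R square = 1 - (unexplained variation / total variation)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return intercept, slope, 1 - ss_res / ss_tot


# Made-up values: independent variable (pop/dist) and dependent (students enrolled)
pop_dist = [1.0, 2.0, 3.0, 4.0, 5.0]
enrolled = [12.0, 19.0, 31.0, 42.0, 48.0]

a, b, r2 = simple_linear_regression(pop_dist, enrolled)
print(f"enrolled = {a:.2f} + {b:.2f} * pop_dist, R square = {r2:.3f}")
```

An R square value near 1 means the independent variable accounts for most of the variation in the dependent variable, and a value near 0 means it accounts for very little.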
Figure 4 is the regression analysis for the UWEC enrollment and the pop/dist data. The chart shows a significant linear association between these variables because the significance is .000, which is less than .05. The next thing to look at is the R square value, which tells us how strong that linear association is, and in this case it is very strong. It is .945, which is very close to 1 and means the pop/dist variable explains about 94.5% of the variation in enrollment. This is an important variable to look at.

Figure 5 is the regression analysis for the UWEC enrollment and the BS degree percentage. The chart shows a significant linear association between these variables because the significance is .003, which is less than .05. The R square value, which tells us how strong that linear association is, is quite weak in this case. It is .121, which is close to 0 and means BS degree percentage explains only about 12.1% of the variation in enrollment. This is still an important variable to look at even though the R square value is small.

Figure 5 is the regression analysis for the UWEC enrollment and the median household income. The chart shows no significant linear association between these variables because the significance is .104, which is more than .05, so I fail to reject the null hypothesis. For the purpose of this study this variable is no longer considered.

Figure 6 is the regression analysis for the UWGB enrollment and the pop/dist data. The chart shows a significant linear association between these variables because the significance is .000, which is less than .05. The next thing to look at is the R square value, which tells us how strong that linear association is, and in this case it is very strong. It is .961, which is very close to 1 and means the pop/dist variable explains about 96.1% of the variation in enrollment. This is an important variable to look at.
Figure 7 is the regression analysis for the UWGB enrollment and the BS degree percentage. The chart shows no significant linear association between these variables because the significance is .085, which is more than .05, so I fail to reject the null hypothesis. For the purpose of this study this variable is no longer considered.

Figure 8 is the regression analysis for the UWGB enrollment and the median household income. Here the linear association between the variables is not as clear as in the charts above. The significance is .044, which is less than .05 but only barely. There is an association, but it isn't as clearly defined. The R square value is only .057, which is very low, so income explains only about 5.7% of the variation in enrollment and our confidence in this association is also low. This variable is still important because there is a linear association, but it is rather weak compared to the previous variables.
Once it was determined which variables were still relevant to the study, the next step was to run the regression on these variables again, but this time have SPSS create a new column in the MS Excel sheet where it places the residuals for each regression by county. This information is important because without it mapping the results would not be possible. Figure 9 shows the new table after these residual columns are added. The numbers in these columns are standardized residuals: they measure how far each county's actual enrollment is from the enrollment the regression predicts, expressed in standard deviations. Values below -1 mean the county is coming in below what is expected, values between -1 and 1 are about average, and values above 1 mean the county is coming in above what is expected.
Figure 9 showing the standard deviations by county for the important variables
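The sketch below shows one common way to compute standardized residuals for a simple regression: each residual is divided by the standard error of the estimate. Whether this matches the exact column SPSS saved for this project is an assumption, and the function names and data are made up for illustration.

```python
import math


def standardized_residuals(x, y):
    """Residuals from a least squares line, divided by the standard error of the estimate."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum(
        (xi - mean_x) ** 2 for xi in x
    )
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    # n - 2 degrees of freedom: one intercept and one slope were estimated
    std_error = math.sqrt(sum(r ** 2 for r in residuals) / (n - 2))
    return [r / std_error for r in residuals]


def classify(z):
    """Label a county using the -1 / +1 cutoffs described above."""
    if z < -1:
        return "below expected"
    if z > 1:
        return "above expected"
    return "about average"


# Made-up data, not the UW System table
pop_dist = [1.0, 2.0, 3.0, 4.0, 5.0]
enrolled = [12.0, 19.0, 31.0, 42.0, 48.0]

z_values = standardized_residuals(pop_dist, enrolled)
for z in z_values:
    print(round(z, 2), classify(z))
```

These labels are the kind of classes that get mapped by county once the table is joined to the county shapefile.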

Once we have this information, the next step is to map the results. To do this, the final data table is imported into ArcMap, where it is joined with a shapefile of the counties in Wisconsin. Once these files are joined, you can change the symbology to display the values in Figure 9 for each variable. Figures 10, 11, 12 and 13 are the resulting maps of the important variables.

Results

Figure 10 shows the comparison of UWEC enrollment and the county populations and distance. The areas in dark red are counties that are sending many more students to UWEC than the average of Wisconsin counties. This could be for two separate reasons. The first is that, for the distance the county is from campus, it is sending more students than other counties at a similar distance. The other reason is that, based on the population of the county, it is again sending more students to UWEC than other counties. The light colored counties are the other extreme, where for the same two reasons there are many fewer students coming from these counties. The two middle shades represent counties that are sending an average or slightly above average number of students to UWEC based on distance and population. There is a pretty clear pattern in this map that suggests counties close to Eau Claire are sending more students than average. This makes sense because many students stay relatively close to home for at least the first few years of college, so Eau Claire County and others very nearby would send more students. The counties sending fewer may have a college of their own, like Dunn County, which has UW-Stout. The counties sending more seem to have more heavily populated cities in them, so naturally there is a better chance of more students coming from there.
Figure 10 is a map of the UWEC enrollment compared to the county populations and distances

Figure 11 shows the comparison of UWEC enrollment and BS degree percentage. Looking at the map, the dark blue counties are sending more than the average number of students to UWEC. The next two lighter shades are close to average, and the counties lighter than that are sending fewer than average. A pattern I notice in this map is that counties that are more focused on industry and are more urban are sending more students to UWEC. This makes sense because these counties probably also have more people who have college degrees and who then send their kids to college to do the same. Counties that are mainly agricultural will most likely have fewer people with college degrees living there and in turn have less of a chance of sending their kids to college.
Figure 11 is a map of the UWEC enrollment compared to the BS degree percentage

Figure 12 shows the comparison of UWGB enrollment and the county populations and distance. Like the map of population and distance compared to enrollment at UWEC (Figure 10), there is a distinct pattern in this map. The counties closest to Green Bay are sending far more students to UWGB than counties in the rest of the state. This makes sense again because many students stay close to home. Another reason is that UWGB offers some more trade-type courses and degrees that are very applicable to the industries in Green Bay and surrounding counties. There are also fewer large colleges in this part of the state, so the options for students who want to stay close to home are more limited. In western Wisconsin there are 4 colleges within an hour to an hour and a half of each other, so students have more options but can still be close to home. With all those choices in western Wisconsin it is much less likely that students there will go to UWGB, which is clearly represented by the map.
Figure 12 is a map of the UWGB enrollment compared to the county populations and distances

Figure 13 shows the comparison of UWGB enrollment and the median household income of the counties. From this map I get the impression that counties of low to moderate household income send more students to UWGB. You can see that the counties surrounding the Milwaukee area and close to the Twin Cities send very few students to UWGB. I think this is because people in these counties can afford to send their children to more prestigious and expensive colleges elsewhere. UWGB is a fairly reasonably priced school as far as large universities go, and because the majority of people in Wisconsin are middle class, with quite a few below that, they go to the most affordable schools. I think you would see a very similar pattern if this was done with UWEC enrollment.
Figure 13 is a map of UWGB enrollment compared to the median household income of the counties

Conclusion

Based on the data given in this project, the regression analysis, and the resulting maps, I would say that distance from a school, income, and BS degree percentage are all pretty good factors to consider when looking for patterns in why students go to one school instead of another. Like I said earlier, a lot of students like to stay close to home, so colleges in their area seem to be more desirable. The more people a county has, the better the chance of more students attending a particular college; that is not really a pattern, it is just common sense. Household income definitely plays a role in choosing where to go to college. In Wisconsin, where people aren't overly rich, UWGB and UWEC are both good options because they are affordable schools with a good quality education. The rich people will send their kids out of state or to the more prestigious schools such as UW Milwaukee or UW Madison. History has a way of repeating itself, and this applies to people who get college degrees. If they went to school and got a degree, there is a much better chance their kids will also go to school and get a degree. Agricultural and more rural areas usually have fewer people with degrees, so you would expect fewer students coming from these counties. Centers of industry and urban areas have doctors, PhDs and others with degrees, so we expect these counties to send more students to the universities to get degrees. Overall I think that all these variables point to patterns in why students choose a particular school, but many more variables should be considered before saying that there definitely is a pattern.

Thursday, April 9, 2015

Correlation and Spatial Autocorrelation

Part I: Correlation

Figure 1

Figure 2

Looking at the scatter plot (Figure 1) showing the correlation between distance and sound level, there is a clear pattern. Before even looking at the correlation chart it is easy to see a strong negative correlation in the scatter plot, and the trend line makes this relationship easier to see. Interpreting the scatter plot, there is a clear reduction in noise the farther away you are from the source. The negative correlation is that as distance increases, noise levels decrease. If the pattern wasn't so easy to see on the scatter plot, then looking at the correlation chart would be very helpful. A Pearson correlation test was run, and its value shows the relationship between the two variables. The value resulting from the test is -.896, which shows both the strength and direction of the correlation. The negative sign represents the negative relationship I described earlier as seen in the scatter plot: the larger the distance, the smaller the noise value. The number -.896 is very close to -1, which means that there is a strong to very strong negative linear correlation between the variables. The null hypothesis would be that there is no linear correlation between the sound level and distance from the source. The alternative would be that there is a linear correlation between sound level and distance from the source. The significance value of .000 is less than .05, which means the null hypothesis should be rejected: there is a linear correlation between the two variables.
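For reference, the Pearson correlation coefficient can be calculated by hand as in the sketch below. The distance and sound values are made up to show a negative relationship; they are not the assignment data and will not reproduce the -.896 from the SPSS chart. The function name is hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation: co-variation of x and y scaled by their spreads."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    spread_x = math.sqrt(sum((xi - mean_x) ** 2 for xi in x))
    spread_y = math.sqrt(sum((yi - mean_y) ** 2 for yi in y))
    return cov / (spread_x * spread_y)


# Made-up readings: sound level drops as distance from the source grows
distance = [10, 20, 30, 40, 50]
sound_level = [95, 88, 80, 79, 70]

r = pearson_r(distance, sound_level)
print(round(r, 3))
```

The sign of r gives the direction of the relationship, and how close it is to -1 or 1 gives the strength.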


Figure 3
 
According to this chart, the white population is very likely to make more money and not be below the poverty line. It shows there is a strong negative linear correlation between white population and being below the poverty level. The significance value is below .05, which means that the null hypothesis, that there is no linear correlation between white population and the poverty line, should be rejected. That means there is a linear correlation between white population and the poverty line. As for general trends, it appears that the white population is better off in basically every category in this chart. They go to high school and graduate, many of them get a college education, and most of them are above the poverty line. For the Black and Hispanic populations, the chart shows that both are less likely to go to high school, graduate high school, go to college and be above the poverty line. The overall trend is that the white population has a better education and more money.

Part II: Spatial Autocorrelation 

Introduction

For this portion of the assignment we were given data from the Texas Election Commission (TEC) for the 1980 and 2008 presidential elections. The data included the percent Democratic vote for both elections as well as the voter turnout for each election. Data was also needed from the U.S. Census, where we downloaded the Hispanic populations at the county level. With this data we are supposed to look at the patterns of the elections and determine if there is clustering of voting patterns as well as voter turnout. The TEC is interested in whether or not the election patterns changed between the 1980 and 2008 elections. The analysis will be done using GeoDa and SPSS, two statistical computing programs.

Methods

The first step in the analysis was to bring a shapefile of the counties in Texas, downloaded from the U.S. Census website, into ArcMap. The Hispanic census data was also downloaded. Once this was done, the voting data had to be joined with the shapefile in ArcMap for mapping purposes. Then, in order to look at patterns in voter turnout and changes over time, spatial weights had to be set so that spatial autocorrelation could be calculated. This made it possible to create LISA cluster maps in GeoDa as visual representations of the voting data. Before the maps were made, Moran's I was used to create scatter plots of each of the voting data sets. These scatter plots allow us to visualize the patterns or correlation, if there are any, in the different data sets. Below are the results from the spatial autocorrelation and LISA cluster maps.
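To make the Moran's I statistic less of a black box, here is a minimal sketch of the global Moran's I formula. It assumes a simple binary weights matrix (1 for neighbors, 0 otherwise) on four made-up areas in a row, which may differ from the weights GeoDa builds from the county shapefile. The function name and values are hypothetical.

```python
def morans_i(values, weights):
    """Global Moran's I for a list of values and a square spatial weights matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    total_weight = sum(sum(row) for row in weights)
    # Weighted sum of cross-products between each area and its neighbors
    cross = sum(
        weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n)
    )
    sum_sq = sum(d * d for d in dev)
    return (n / total_weight) * (cross / sum_sq)


# Four made-up areas in a row: each area neighbors the one next to it
W = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

clustered = morans_i([1, 2, 3, 4], W)    # similar values sit next to each other
alternating = morans_i([1, 4, 1, 4], W)  # high values sit next to low values
print(round(clustered, 3), round(alternating, 3))
```

Positive values indicate clustering of similar values (High High or Low Low), while negative values indicate that high and low values tend to be neighbors.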

Results

Figure 4 Scatter plot of Hispanic Population in 2010
Figure 5 LISA map of Hispanic population 2010
Looking at the Moran's I (Figure 4) for the Hispanic population in 2010, there is definitely a cluster pattern. The value of .77 is getting close to 1, which means that there are many areas, in this case counties in Texas, where you have High High or Low Low situations. In other words, there are many counties either high or low in Hispanic population next to other counties with the same high or low population. On the map (Figure 5) there are areas of high Hispanic population clustering (red) in the southern part of Texas along the border, and there are areas of low Hispanic population clustering (blue) in the northeastern part of the state. The white areas of Texas are counties that don't have an especially large or small Hispanic population; they are the more mixed counties.
Figure 6 Scatter plot of percentage of Democratic vote in 2008
Figure 7 LISA map of percentage of Democratic vote 2008
Looking at the percentage of Democratic vote in 2008, again there is a pretty high Moran's I value, which means there is a decent amount of clustering going on. In this case it would be areas of high Democratic vote (red) next to others with high votes and areas of low Democratic vote (blue) next to other areas of low vote. There are a few counties in the lighter colors, which means a county of high Democratic vote is next to a county of low vote (light red) or the other way around (light blue). The white areas are counties with a mixed vote. We see that the southern part of the map votes primarily Democratic while the northern section tends to be low in the Democratic vote.

Figure 8 Scatter plot of voter turnout in 2008

Figure 9 LISA map of voter turnout 2008
The statistic being considered here is the voter turnout by county. Looking first at the Moran's I number, it shows that there is less clustering in this data than in the previous two data sets. There is an interesting pattern between this map (Figure 9) and the previous map (Figure 7). In the southern tip of Texas we see a large cluster of low voter turnout on this map, and on the previous map that very same area had a high percentage of Democratic votes. One possible explanation for this pattern is that this is an area of agriculture and the farmers do not want to drive the distance to a voting location, so there is a low turnout, but the people who do vote tend to vote Democratic, possibly in support of immigration for employees to work on their farms.


Figure 10 Scatter plot of percentage of Democratic vote in 1980

Figure 11 LISA map of percentage of Democratic vote 1980
The next data set is the percentage of Democratic vote in 1980. According to the Moran's I test, there was more clustering and spatial autocorrelation in the Democratic vote for the 1980 election than there was for the 2008 election. One possible reason for this is the changing view of people on who they want in office and what values they should have. This change over time leads to more counties that are mixed in their votes, with neither a high nor a low Democratic vote. Comparing Figure 11 and Figure 7, we see that the areas of clustering are pretty similar. The high Democratic votes are in the south, especially the tip of Texas. The low Democratic votes are in the northern part of the state, although they have moved eastward a bit. The areas of high Democratic vote have moved out of the eastern part of the state since 1980, toward the west and the border by 2008.

Figure 12 Scatter plot of voter turnout in 1980
Figure 12 LISA map of voter turnout 1980

Finally, the results of the 1980 voter turnout are as shown in figures 11 and 12. Comparing figure 12 in 1980 to figure 9 in 2008, there is more clustering happening in 1980 than in 2008. The southern tip stays consistent with very low turnout values, and the northern part of the state has a high turnout in both elections but less clustering in 2008. The central part of the state has areas of high turnout in 1980 and even more so in 2008.

Conclusion

Based on the LISA maps and the scatter plots created to explore the above variables, a couple of patterns were observed. First, looking at the Hispanic population in 2010 (Figure 5), it is very obvious that in both elections, 1980 (Figure 11) and 2008 (Figure 7), areas of high Hispanic population match up very well with areas of high Democratic vote. This overlap is mainly in the southern tip of Texas and could be fueled by agriculture. The more farms there are, the more jobs are available, which attracts the Hispanic population who need to support their families, so you get a cluster of high Hispanic population. The farmers in these areas tend to vote Democratic, possibly to keep their workers at the farms and keep their cheap labor. Another pattern that makes sense is a low voter turnout in this same area for both elections. These counties are full of migrant workers who cannot vote or choose not to. So even though these counties may have lots of people in them, the voter turnout will be low. Overall it could be said that the Hispanic population in Texas is much more concentrated in the southern part of the state than in the north.
The clustering patterns for both elections are fairly similar, with low voter turnout in the south and higher turnouts in the middle and northern parts of the state. The Democratic clustering is mostly in the south as well, while the north has low Democratic clustering. Looking at all three of the variables together, there seems to be an overall pattern emerging. That pattern is that the higher the Hispanic population in an area, the more Democratic the area is and the lower the voter turnout will be (Figure 13). So voter turnout and Hispanic population have a negative correlation: as the Hispanic population increases, voter turnout decreases (Figure 14). With the Democratic vote we see a positive correlation where, as the Hispanic population increases, so does the percent Democratic vote (Figure 15). In Figures 13 through 15 we would reject the null hypothesis and state that there is a linear correlation between the variables. Overall the clustering patterns did not change very much between the two elections. They moved slightly and had slightly higher or lower Moran's I numbers, but for the most part the two elections show similar clustering patterns of voter turnout and Democratic vote.
Figure 13 Democratic vote vs voter turnout 2008


Figure 14 Hispanic population vs voter turnout 2008

Figure 15 Hispanic population vs Democratic vote 2008

Monday, March 16, 2015

T and Z tests and Chi-Squared Testing

Question 1
2a.
Z = (3.2 - 4)/(.73/sqrt(50))  Z = -7.75  CV = 1.96
Z = (11.7 - 10)/(1.3/sqrt(50))  Z = 9.24  CV = 1.96
Z = (77 - 75)/(5.71/sqrt(50))  Z = 2.47  CV = 1.96
b.
The null hypothesis is that there is no difference between the average number of Asian beetles from the county to the state level. The alternative hypothesis is that there is a difference between the average number of Asian beetles from the county to the state level.
I reject the null hypothesis, so there is a difference between the number of Asian beetles at the county and state level. I say this because I got -7.75 for my z score, which falls outside the range of -1.96 to 1.96 set by the critical value.
The null hypothesis is that there is no difference between the average number of emerald ash borer beetles from the county to the state level. The alternative hypothesis is that there is a difference between the average number of emerald ash borer beetles from the county to the state level.
I reject the null hypothesis, so there is a difference between the number of emerald ash borer beetles at the county and state level. I say this because I got 9.24 for my z score, which falls outside the range of -1.96 to 1.96 set by the critical value.

The null hypothesis is that there is no difference between the average number of golden nematode from the county to the state level. The alternative hypothesis is that there is a difference between the average number of golden nematode from the county to the state level.
I reject the null hypothesis, so there is a difference between the number of golden nematode at the county and state level. I say this because I got 2.47 for my z score, which falls outside the range of -1.96 to 1.96 set by the critical value.
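The three z scores in 2a can be recomputed with the sketch below. The inputs are the numbers from the formulas above; which number is the sample mean and which is the hypothesized mean, and the matching of each row to a species, are assumptions based on the order of the formulas and of the answers in part b. The function name is hypothetical.

```python
import math

CRITICAL_VALUE = 1.96  # two-tailed test at the .05 significance level


def z_score(sample_mean, hypothesized_mean, std_dev, n):
    """z = (sample mean - hypothesized mean) / (standard deviation / sqrt(n))."""
    return (sample_mean - hypothesized_mean) / (std_dev / math.sqrt(n))


# (first mean, second mean, standard deviation, n) in the order used in 2a;
# the species labels are assumed from the order of the answers in part b
tests = {
    "Asian beetle": (3.2, 4, 0.73, 50),
    "Emerald ash borer": (11.7, 10, 1.3, 50),
    "Golden nematode": (77, 75, 5.71, 50),
}

results = {}
for name, args in tests.items():
    z = z_score(*args)
    reject_null = abs(z) > CRITICAL_VALUE
    results[name] = (z, reject_null)
    print(f"{name}: z = {z:.3f}, reject null: {reject_null}")
```

The null hypothesis is rejected whenever the absolute value of z is larger than the critical value.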
3. 
The null hypothesis is that there is no difference between the number of people per party in intervening years. The alternative hypothesis is that there is a difference between the number of people per party in intervening years.
I reject the null hypothesis, so there is a difference between the number of people per party in intervening years. I say this because I got 4.92 for my t score, which is beyond the 1.711 critical value at the .05 significance level.

Introduction

This assignment was all about visually and statistically comparing "Northern" and "Southern" Wisconsin. I placed them in quotation marks because there is no exact measure of what the north part and south part of Wisconsin are. We were presented with a theoretical situation as follows. The tourism board of Wisconsin has asked you to conduct a bit of research regarding the concept of "Up-North." We were provided with a large variety of data from which we were to choose 3 variables to explore. On those 3 variables they want us to conduct a Chi-Squared test. The Chi-Squared test helps us to statistically compare counties north of highway 29 and counties south of highway 29. We will also compare the counties through maps based on the 3 variables we chose.

Methods

Part 1

For the first portion of this assignment we created a variety of maps. The first map we were asked to create is one that divides the counties in the state into two groups: counties north of highway 29 and counties south of highway 29. In order to do this I brought in a street map and zoomed into the state of Wisconsin to locate highway 29. I then brought in a shapefile of the counties in Wisconsin and laid it over the street map. I turned the transparency up to 70% so I could see the street map through the counties. Then, looking at each county's position in relation to highway 29, I assigned a 1 to counties to the north and a 2 to counties to the south. To do this I added a new column in the Excel spreadsheet of all the county data. (Figure 1)

Figure 1
The next step was to bring this Excel table into ArcMap. To do this I just hit the add data button and selected my file. Once my Excel sheet was in ArcMap, the next step was to join it with the county shapefile. By doing this I am able to map the data in my Excel sheet. After I joined the Excel sheet and the shapefile, the next step was to choose my 3 variables from the data provided for us. I chose gun deer and bow deer licenses sold and miles of ATV trails per county. Having chosen these variables, I created 3 new fields in the attribute table of the county shapefile. (Figure 2)

Figure 2
Once the fields were created, I began assigning a ranking of 1-4 to each field based on its quantities. To do this I went to the provided data column, like tr_ATV, which has the ATV trail miles by county, and hit the statistics button. This gives information about the field like the sum and the max or min values. I am interested in the max value. I took this number, divided it by 4, and then subtracted that amount from the max value 3 times to get the break values for my 4 rankings. (Figure 3) I then took my 4 numbers and, in the select by attributes option in the county attribute table, entered each with a < symbol in front of it. Then in the fields I created I entered a 1-4 based on the number range selected in the select by attributes. When I was done my table looked like this. (Figure 4)
Figure 4


Figure 3
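The ranking described above amounts to four equal-width classes between 0 and the max value. The sketch below shows that logic outside of ArcMap. The function name is made up, and treating a value that sits exactly on a break as belonging to the higher rank is an assumption based on the < selections.

```python
def equal_interval_rank(value, max_value, classes=4):
    """Rank a value from 1 to `classes` using breaks at max/4, max/2, 3*max/4, max."""
    if max_value <= 0:
        raise ValueError("max_value must be positive")
    if value < 0 or value > max_value:
        raise ValueError("value must be between 0 and max_value")
    width = max_value / classes
    # A value below the first break gets rank 1, below the second gets 2, and so on
    rank = int(value // width) + 1
    # The max value itself belongs to the top class
    return min(rank, classes)


# Hypothetical ATV trail miles with a max of 100 (breaks at 25, 50, 75, 100)
for miles in [0, 24.9, 25, 74, 100]:
    print(miles, equal_interval_rank(miles, 100))
```

With a max of 100, for example, a county with 74 miles of trail lands in rank 3 and a county with the full 100 miles lands in rank 4.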

Once these ranks were assigned I was ready to map my results. This is easy to do. In the symbology tab for the county shapefile, just select the field you want to map and assign a color scheme to it. In the legend the ranks of 1-4 are still there, but I changed the labels so that the actual numerical values are displayed instead of the ranks. The maps below are my results for the 3 chosen variables and the north south map.

Results

This is a map showing the northern and southern counties of Wisconsin based on their location relative to highway 29. Anything north was north, anything south was south, and the counties that 29 goes through were assigned by looking at whether more of the county was north or south of 29. This is my version; others who did this could have different counties in the north or south based on their interpretation. From the map we can see that 29 does a pretty good job of dividing the state in half, top to bottom.

Figure 5

This next map is a choropleth map displaying the number of gun deer licenses bought per county. We can see that the majority of counties have fairly low license purchases. The northern part of the state especially seems to have low values. Geographically speaking, I think this is caused by the amount of wilderness and forest in the area and the less densely populated towns. The fewer people there are, the fewer licenses are sold. I don't think this represents the number of hunters in these areas, however, because many people buy their licenses in a different county and then drive to these areas to hunt. Overall I would say there is a higher number of licenses bought in the southern part of the state, but I think that is because there are more people there to buy them.


Figure 6

Looking at bow deer license purchases, we see a similar pattern to the gun deer purchases. Again the northern part of the state has fewer in general, but there is a slight increase in the number of counties with higher purchase rates. The southern part of the state is pretty much the same.


Figure 7
This final map is looking at the number of miles of ATV trails per county. As I expected, there are more ATV trails in the northern part of the state. I think this has a lot to do with the fact that there are more snowmobile trails up north as well, because there is usually more snow. These trails get used for snowmobiling in winter and ATV riding in summer. There are fewer people as well, which is necessary for ATV trails because you can't put them through cities or on paved roads. The point of an ATV is to go off-road on all terrains, and you need space to be able to do that. There is also more rugged terrain in this part of the state, which people enjoy riding more than the flat corn fields in the south.

Figure 8
Part 2
After I mapped out the data and made it visually appealing and easy to understand, the next part of the assignment was to do statistical analysis. The analysis we were supposed to use is called Chi-Squared. The point of this test is to compare two areas based on a variable. In this case the two areas are the previously determined northern and southern Wisconsin counties. To perform the test we used a program called SPSS. Once the program is open, we bring in the table that we exported from ArcMap containing all the county data. Then we open the crosstabs window, choose the Chi-Squared statistic, and bring in the North/South counties for the rows and one of my 3 variables for the columns. You hit OK and Figure 9 is the result. I did this test once for each variable, so I got 3 different charts (Figures 9-11).

ATV Chi-Squared Figure 9
Looking at the ATV map, we would state the null hypothesis that there is no difference between the amount of ATV trail miles in northern Wisconsin compared to southern. The alternative hypothesis is that there is a difference between the amount of ATV trails. In this case I would reject the null hypothesis because we can see from the map and the Chi-Squared value that there is a difference in miles of ATV trails between northern and southern Wisconsin.
Bow Deer Chi-Squared Figure 10

Looking at the bow deer map, we would state the null hypothesis that there is no difference between the number of bow deer licenses sold in northern Wisconsin compared to southern. The alternative hypothesis is that there is a difference between the number of licenses sold. In this case I would fail to reject the null hypothesis because, looking at the map and the Chi-Squared value, there is no statistically significant difference between the number of bow deer licenses sold in northern and southern Wisconsin.
Gun Deer Chi-Squared Figure 11
Looking at the gun deer map, we would state the null hypothesis that there is no difference between the number of gun deer licenses sold in northern Wisconsin compared to southern. The alternative hypothesis is that there is a difference between the number of licenses sold. In this case I would fail to reject the null hypothesis because, looking at the map and the Chi-Squared value, there is no statistically significant difference between the number of gun deer licenses sold in northern and southern Wisconsin.
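
The crosstab test used for all three variables can be sketched in a few lines of Python. This is only an illustration of the Pearson Chi-Squared calculation, not the SPSS output itself: the counts in the table are made up, and I am assuming each variable was grouped into three classes (low, medium, high) before the crosstab was run.

```python
def chi_square(table):
    """Return the Pearson chi-square statistic and degrees of freedom
    for a contingency table given as a list of rows of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if region and class were unrelated
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


# Hypothetical county counts (NOT the real data).
# Rows: North, South. Columns: low, medium, high class of the variable.
table = [
    [5, 10, 20],
    [20, 12, 5],
]

# Standard chi-square critical values at the .05 significance level
CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}

stat, dof = chi_square(table)
reject_null = stat > CRITICAL_05[dof]
print(f"chi-square = {stat:.2f}, df = {dof}, reject null: {reject_null}")
```

If the statistic is larger than the critical value for its degrees of freedom, the null hypothesis of no north/south difference is rejected; otherwise we fail to reject it.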




Conclusion

From my results I don't think that we can clearly say what defines northern and southern Wisconsin. For one of my variables (ATV trail miles) there was a difference between the two parts of the state, but for the other 2 variables the statistics showed no difference. More variables would have to be tested to get a better idea of what separates northern and southern Wisconsin.

Thursday, February 26, 2015

Z Scores, Mean Center, and Standard Distance

Introduction

For this activity we were given tornado width data for the states of Oklahoma and Kansas. One set of data is for 1995 to 2006 and the other for 2007 to 2012. Looking at the spatial distribution and width of the tornados, we were given the task of determining whether or not tornado shelters should be installed and, if so, in which areas. People in the states are questioning whether these shelters are really necessary or a waste of money. This will be determined through the use of mean center, weighted mean center, standard distance, and weighted standard distance.

Methodology

We used several methods to solve this problem. Listed below are those methods and what they mean in this analysis.

Mean Center
Mean center is the average location of a set of points. To find it you calculate the average of the x values and the average of the y values of your data points, then write those two averages as a coordinate pair like (5,4). This point shows the average location of all of your features.

Weighted Mean Center
Weighted mean center follows the same procedure as mean center, except that you specify a weight for each point. The weight puts more importance on some points than others, which moves the center closer to the heavily weighted points. The points no longer all count equally, as they do when figuring the mean center.
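
As a rough sketch of the two calculations above, here is how mean center and weighted mean center could be computed in Python. The coordinates and widths are invented for illustration and are not the tornado data; the function names are my own.

```python
def mean_center(points):
    """Average x and average y of a list of (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    return (mean_x, mean_y)


def weighted_mean_center(points, weights):
    """Average x and y where each point counts in proportion to its weight."""
    total = sum(weights)
    mean_x = sum(w * x for (x, _), w in zip(points, weights)) / total
    mean_y = sum(w * y for (_, y), w in zip(points, weights)) / total
    return (mean_x, mean_y)


# Hypothetical tornado locations and widths (NOT the real data)
points = [(2, 6), (4, 4), (6, 2), (8, 4)]
widths = [10, 10, 50, 30]

print(mean_center(points))                   # (5.0, 4.0)
print(weighted_mean_center(points, widths))  # about (6.0, 3.2)
```

In this made-up example the widest tornados sit to the south and east, so the weighted center is pulled from (5, 4) toward them.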

Standard Distance
Standard distance measures how spread out features are around a center point, usually the mean center. It is very similar to the way a standard deviation measures the spread of data values around the statistical mean, but it is used for spatial analysis.

Weighted Standard Distance
This method is very similar to standard distance, only you add weights to the features in the same way you add weights to points for the weighted mean center. In this activity, for example, the larger the tornado width, the more that feature pulls the standard distance circle toward it.
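
A minimal sketch of standard distance and weighted standard distance is shown below. It assumes the usual definition (the square root of the average squared distance from the center, with each point counted in proportion to its weight in the weighted version). The points and widths are hypothetical, not the tornado data.

```python
import math


def standard_distance(points, weights=None):
    """Standard distance of (x, y) points around their (weighted) mean center.

    With weights=None every point counts equally, which gives the plain
    standard distance around the plain mean center.
    """
    if weights is None:
        weights = [1.0] * len(points)
    total = sum(weights)
    # Center used for the distances: the (weighted) mean center
    cx = sum(w * x for (x, _), w in zip(points, weights)) / total
    cy = sum(w * y for (_, y), w in zip(points, weights)) / total
    # Weighted average of squared distances from that center
    variance = sum(
        w * ((x - cx) ** 2 + (y - cy) ** 2)
        for (x, y), w in zip(points, weights)
    ) / total
    return math.sqrt(variance)


# Hypothetical tornado locations and widths (NOT the real data)
points = [(2, 6), (4, 4), (6, 2), (8, 4)]
widths = [10, 10, 50, 30]

plain = standard_distance(points)
weighted = standard_distance(points, widths)
print(f"standard distance: {plain:.2f}, weighted: {weighted:.2f}")
```

Here the weighted circle comes out smaller than the unweighted one because most of the weight sits on two points that are close together.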

Data
All of the above tools were used on the data we were provided with to get our results. The data we received were point feature classes showing tornado locations in Oklahoma and Kansas. They came in two time periods, 1995 to 2006 and 2007 to 2012. Each record includes not only the location of the tornado but also its width. We were also given a shapefile of Oklahoma and Kansas including county data.

Results

To answer the questions of whether the tornado shelters are important, are in the right place, or should be moved, we used the methods described above. My results are shown in Figures 1-7.

Figure 1
Figure 1 shows the location of tornados from 1995 to 2006. The data is displayed in a format called graduated circles. This allows us to see the relative size of each tornado, as you can see from the map legend. The smaller the circle, the narrower the tornado; widths are measured in feet. I also calculated the mean center and weighted mean center for this data set. As you can see, the mean center ends up being pretty close to the middle of the study area. The weighted mean center moves slightly south because it takes the size of each tornado into consideration, not just its location. There are more large tornados in the southern part, pulling the weighted center in that direction.


Figure 2
This map also shows tornado locations as graduated circles, this time for the 2007 to 2012 data. Like the map above, the mean center occurs near the center of the study area, and again the weighted mean center moves to the south. This suggests that the southern part of the study area had more large tornados during this time period as well.



Figure 3

This map is simply a combination of the graduated circle, mean center and weighted mean center data for both time periods. As you can see, the weighted mean centers are both slightly south of the mean centers. This is to be expected based on the first two maps. Based on the data we received, the map shows that larger tornados are more likely to take place in the southern part of the study area. That higher probability makes me agree with having tornado shelters in this area, even more so than in the northern part. Having data that covers this long a time span helps to illustrate the trend that is occurring: tornados are bigger in the southern part of the study area.


Figure 4
The next method we look at is the standard distance. This calculates the spatial standard deviation of features around a given point. In these maps that point is the weighted mean center. As you can see from this map of the 1995 to 2006 data, the majority of the tornados fall within this distance, and the circle is pretty well centered on the study area, following the location of the weighted mean center. When you weight the values by the width of the tornados, we see that again the circle moves southward toward the higher concentration of tornado occurrences. Each of these circles contains approximately 68% of the total number of tornados.

Figure 5
Looking at the more recent set of data, we see that again the standard distance circle is in the middle of the study area, but it expands more to the north to include 68% of the tornados. When the weight is applied, we can see that the circle moves southward but also gets smaller because there is a higher concentration of tornados there.


Figure 6
Just as before with the mean centers and graduated circles, we combine the two maps of standard distances and graduated circles for the two time periods. From this map we can see that the south central part of the study area has the highest concentration of tornados.

Figure 7

We also calculated the standard deviation of tornado occurrences per county in the two states. This choropleth map shows the results. The counties with values of -.50 to .50 are closest to the mean. The counties above this range have many more tornados than the average for the two states. As you can see, the central part of the study area has the highest number of tornados and falls well above the average. This agrees with the results of Figures 1-6 as well.

The z-score is the number of standard deviations a particular observation lies above or below the mean. In this exercise we chose 3 counties to find z-scores for: Russell, Caddo and Alfalfa. We found the standard deviation by creating a choropleth map and looking at the statistics related to it. The standard deviation was 4.3 while the mean was 4. Below are the results for the 3 counties.

Russell = 25 tornados with a z score of 4.88
Caddo = 13 tornados with a z score of 2.09
Alfalfa = 4 tornados with a z score of .23

Looking at the z scores, we can see that Russell County has many more tornados than the other counties because it is almost 5 standard deviations above the average. Alfalfa is right about average, with a z score of .23 that is barely above the mean.
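
The z-score calculation can be sketched as below, using the mean of 4 and standard deviation of 4.3 reported above. Note that these are rounded values, so the output will not necessarily match every figure quoted in the list: with a mean of exactly 4, a county with 4 tornados gets a z score of 0, so the .23 reported for Alfalfa presumably comes from unrounded statistics or a slightly different count (that is my assumption, the post does not say).

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations that value lies from the mean."""
    return (value - mean) / std_dev


MEAN = 4.0     # mean tornados per county, as reported (rounded)
STD_DEV = 4.3  # standard deviation, as reported (rounded)

tornado_counts = {"Russell": 25, "Caddo": 13, "Alfalfa": 4}

z_scores = {
    county: round(z_score(count, MEAN, STD_DEV), 2)
    for county, count in tornado_counts.items()
}
print(z_scores)  # {'Russell': 4.88, 'Caddo': 2.09, 'Alfalfa': 0.0}
```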

Next I found the number of tornados that will be exceeded roughly 70% of the time. To do this you find 70% on the z chart, which corresponds to a z score of .52. This value falls on the negative side of the curve, so we place a - in front of the .52. The calculation is 4 + (-.52 * 4.3) = 1.76, so a county should see more than 1.76 tornados about 70% of the time.
Next I found the number of tornados that will be exceeded roughly 20% of the time. As before we look at the z chart, but instead of 20% we find 80%, which has a z score of .84. The calculation is 4 + (.84 * 4.3) = 7.6, so a county should see more than 7.6 tornados only about 20% of the time.
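
The same lookup can be sketched in Python. This assumes the tornado counts follow a normal distribution (which is what using the z chart assumes) and uses the rounded mean of 4 and standard deviation of 4.3. The function name is my own. Passing the z chart values directly reproduces the numbers above, while letting Python's NormalDist compute z gives slightly different answers because the chart values are rounded to two decimals.

```python
from statistics import NormalDist

MEAN = 4.0     # mean tornados per county, as reported (rounded)
STD_DEV = 4.3  # standard deviation, as reported (rounded)


def count_exceeded(probability, z=None, mean=MEAN, std_dev=STD_DEV):
    """Tornado count that is exceeded with the given probability.

    If z is not supplied, it is looked up from the standard normal
    distribution instead of a printed z chart.
    """
    if z is None:
        # P(X > x) = probability  means  P(X <= x) = 1 - probability
        z = NormalDist().inv_cdf(1 - probability)
    return mean + z * std_dev


# Using the z values read from the chart
print(round(count_exceeded(0.70, z=-0.52), 2))  # 1.76
print(round(count_exceeded(0.20, z=0.84), 1))   # 7.6

# Letting Python compute z (unrounded), results differ slightly
print(round(count_exceeded(0.70), 2))
print(round(count_exceeded(0.20), 2))
```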

Conclusions

Based on the maps and results of this study, I recommend that tornado shelters be placed and kept in the central to south central part of the two states. This area has the biggest and most concentrated tornado occurrences compared to the rest of the study area. If the trend seen over the time period explored here continues, shelters will also need to be added farther south, where the highest threat of tornados is shifting.