Tuesday, April 28, 2015

Regression Analysis

Part 1

The null hypothesis would be that there is no linear association between free lunches and number of crimes. The alternative would be that there is a linear association between free lunches and crime.
According to the regression analysis in SPSS, I would reject the null hypothesis because the significance value is .005, which is less than .05 (Figure 1). So there is a linear association between free lunches and the number of crimes. That association isn't very strong, however, because the R square value is only .173, meaning free lunches explain only 17.3% of the variation in crime. Just because there is a linear association between these two variables doesn't mean one causes the other. Association does not prove causation.
If we want to figure out what percent of people will receive a free lunch when the crime rate is 79.7, the following equation is used, with the crime rate as y and the percent receiving free lunch as x. All of the numbers are taken from the regression chart below (Figure 1).
Step 1. y = 21.819 + 1.685x

Step 2. 79.7 = 21.819 + 1.685x

Step 3. x = (79.7 - 21.819) / 1.685 ≈ 34.4%

I have little confidence that this estimate is accurate because the R square value is so small.
Figure 1 Regression analysis chart used to check whether there is a linear association between the number of free lunches and number of crimes.
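The algebra in Steps 1 to 3 can be double checked with a few lines of Python. This is only a sketch: the function name is made up for illustration, and the intercept and slope are the values read from Figure 1.

```python
def solve_for_x(y, intercept, slope):
    """Invert y = intercept + slope * x to recover x."""
    return (y - intercept) / slope

# Intercept and slope come from the SPSS output in Figure 1.
percent_free_lunch = solve_for_x(79.7, intercept=21.819, slope=1.685)
print(round(percent_free_lunch, 1))  # prints 34.4
```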

Part 2

Introduction

The UW System wants to analyze enrollment numbers for all the UW schools and try to figure out if there are specific reasons students are choosing particular schools. This can't be determined exactly, but they want to know if there are any trends. We are given the attendance numbers of all the schools by county, as well as data about the percent with BS degrees, income, and distance from each county's center (Figure 2). Using linear regression, this data will be explored to see what trends, if any, appear in students choosing one school over another.
Figure 2 is a sample of the data provided to do this project

Methods

The first step in completing the project was some data manipulation and rearranging to get the data into a more user friendly format. One of these manipulations was to take the county population data and the distance from the university data and normalize them. To do this, the population of each county was divided by its distance from the school. This was done in Microsoft Excel and added to the data in Figure 2. Figure 3 is the new data created by this normalization process.
Figure 3 is the normalized data set
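The same normalization can be sketched outside of Excel. The rows below are made up purely for illustration and are not values from the project data, and the column names are assumptions. The check for a distance of zero is also my own assumption, since the post does not say how a school's home county was handled.

```python
def normalize_population(population, distance):
    """Return population divided by distance (the pop/dist variable)."""
    if distance <= 0:
        raise ValueError("distance must be positive")
    return population / distance

# Hypothetical rows, not taken from the project data.
counties = [
    {"county": "A", "population": 50000, "distance": 25.0},
    {"county": "B", "population": 12000, "distance": 120.0},
]

for row in counties:
    row["pop_dist"] = normalize_population(row["population"], row["distance"])
    print(row["county"], row["pop_dist"])
```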
After that new data was created, the next step was to run regression analysis for two schools. I chose UW-Eau Claire and UW-Green Bay. This analysis is done in a statistical analysis program called SPSS. This program saves an incredible amount of time and stress. Running regression analysis is easy to do; all that is needed is a dependent and an independent variable. In this case the enrollment number for each school is the dependent variable and the other variables are the independent variables. For example, the dependent variable would be students enrolled and the independent variable would be median household income. After the dependent and independent variables are placed into SPSS linear regression, it spits out charts like those below in Figures 4-8.
Figure 4 is the regression analysis for the UWEC enrollment and the pop/dist data. Looking at the chart, there is a statistically significant linear association between these variables because the significance is .000, which is less than .05. The next thing to look at is the R square value, which tells us how strong that linear association is, and in this case it is very strong. It is .945, which is very close to 1, meaning about 94.5% of the variation in enrollment is explained by this variable. This is an important variable to look at.
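For anyone who wants to see what a regression with one dependent and one independent variable is doing under the hood, here is a minimal sketch in plain Python. It is not what SPSS runs internally, only the standard least squares formulas, and the numbers are invented for illustration. It returns the intercept, the slope and the R square value, but not the significance value, which needs a t distribution that I left out to keep it short.

```python
def simple_linear_regression(x, y):
    """Fit y = intercept + slope * x by least squares; also return R square."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R square = 1 - (unexplained variation / total variation)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return intercept, slope, 1 - ss_res / ss_tot

# Hypothetical data: x is an independent variable, y is students enrolled.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
intercept, slope, r_squared = simple_linear_regression(x, y)
print(intercept, slope, r_squared)
```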

Figure 5 is the regression analysis for the UWEC enrollment and the BS degree percentage. Looking at the chart, there is a statistically significant linear association between these variables because the significance is .003, which is less than .05. Looking again at the R square value, which tells us how strong that linear association is, in this case it is quite weak. It is .121, which is close to 0, meaning only about 12.1% of the variation in enrollment is explained by this variable. This is still an important variable to look at even though the R square value is small.

Figure 5 is the regression analysis for the UWEC enrollment and the median household income. Looking at the chart, there is no statistically significant linear association between these variables because the significance is .104, which is more than .05. For the purpose of this study this is no longer an important variable.

Figure 6 is the regression analysis for the UWGB enrollment and the pop/dist data. Looking at the chart, there is a statistically significant linear association between these variables because the significance is .000, which is less than .05. The next thing to look at is the R square value, which tells us how strong that linear association is, and in this case it is very strong. It is .961, which is very close to 1, meaning about 96.1% of the variation in enrollment is explained by this variable. This is an important variable to look at.
Figure 7 is the regression analysis for the UWGB enrollment and the BS degree percentage. Looking at the chart, there is no statistically significant linear association between these variables because the significance is .085, which is more than .05. For the purpose of this study this is no longer an important variable.

Figure 8 is the regression analysis for the UWGB enrollment and the median household income. Looking at the chart, the linear association between these variables is not as clear cut as the others above. The significance is .044, which is less than .05 but just barely. There is an association, but it isn't as clearly defined. The R square value is only .057, which is very low, so the strength of this association is very low and our confidence in it is also very low. This variable is still important because there is a linear association, but it is rather weak compared to the previous variables.
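The decision rule used for all six regressions above is the same, so it can be summed up in a short sketch. The significance values are the ones quoted from the SPSS charts; the dictionary layout and names are just my own way of organizing them. Keep in mind that SPSS prints .000 for any value that rounds below .001, so those two are not literally zero.

```python
ALPHA = 0.05  # cutoff used throughout this post

# Significance values as reported in the SPSS charts above.
significance = {
    ("UWEC", "pop/dist"): 0.000,
    ("UWEC", "BS degree percentage"): 0.003,
    ("UWEC", "median household income"): 0.104,
    ("UWGB", "pop/dist"): 0.000,
    ("UWGB", "BS degree percentage"): 0.085,
    ("UWGB", "median household income"): 0.044,
}

# Keep only the regressions where the null hypothesis is rejected.
kept = [pair for pair, p in significance.items() if p < ALPHA]
for school, variable in kept:
    print(school, variable)
```

This leaves four school and variable pairs to carry forward.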
Once it was determined which variables were still relevant to the study, the next step was to run the regression on these variables again, but this time have SPSS create a new column in the Microsoft Excel sheet where it places the residuals for each regression, separated by county. This information is important because without it mapping the results would not be possible. Figure 9 shows the new table after these residual columns are added. The numbers in these columns measure how far each county's actual enrollment is from the enrollment the regression predicts for it. They are expressed in standard deviations, so another way to think about it is how many standard deviations above or below the expected value each county is for the particular variable. Numbers less than -1 mean the county is coming in below what is expected, numbers between -1 and 1 are about average, and numbers above 1 are above what is expected.
Figure 9 shows the standardized residuals by county for the important variables
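To make the residual columns a little less mysterious, here is a rough sketch of one common way to compute them: take each county's actual value minus the value predicted by the regression line, then divide by the standard error of the estimate. I am assuming this matches what SPSS saved, and the data is invented for illustration.

```python
import math

def standardized_residuals(x, y):
    """Residuals from a least squares line, divided by the standard error of the estimate."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
             / sum((xi - mean_x) ** 2 for xi in x))
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    # Standard error of the estimate uses n - 2 degrees of freedom.
    std_error = math.sqrt(sum(r ** 2 for r in residuals) / (n - 2))
    return [r / std_error for r in residuals]

# Hypothetical values, one entry per county.
pop_dist = [1, 2, 3, 4, 5]
enrollment = [2, 4, 5, 4, 5]
z_scores = standardized_residuals(pop_dist, enrollment)
print([round(z, 2) for z in z_scores])
```

A positive number means the county sends more students than the line predicts, and a negative number means it sends fewer.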

Once we have this information, the next step is to map the results. To do this, the final data table is imported into ArcMap, where it is joined with a shapefile of the counties in Wisconsin. Once these files are joined, you can change the symbology to display the values in Figure 9 for each variable. Figures 10, 11, 12 and 13 are the resulting maps of the important variables.
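The join and the symbology are done through the ArcMap interface, but the logic behind them can be sketched in plain Python. This is not ArcMap's API, only an illustration of matching a table to county shapes by name and then sorting the values into classes. The residual values and the three class breaks are assumptions for the example, and the actual maps may use more classes.

```python
def classify(z):
    """Place a standardized residual into one of three map classes."""
    if z < -1:
        return "below expected"
    if z > 1:
        return "above expected"
    return "near expected"

# Hypothetical residual table keyed by county name.
residual_table = {"Eau Claire": 2.3, "Dunn": -1.4, "Dane": 0.2}

# Stand-in for the county shapefile: one record per county.
county_shapes = [{"name": "Eau Claire"}, {"name": "Dunn"},
                 {"name": "Dane"}, {"name": "Brown"}]

# Join on county name; counties missing from the table get "no data".
for shape in county_shapes:
    z = residual_table.get(shape["name"])
    shape["map_class"] = classify(z) if z is not None else "no data"
    print(shape["name"], shape["map_class"])
```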

Results

Figure 10 shows the comparison of UWEC enrollment and the county populations and distance. The areas in dark red are counties that are sending many more students to UWEC than the average of Wisconsin counties. This could be for two separate reasons. The first is that, for the distance the county is from campus, it is sending more students than other counties at a similar distance. The other reason is that, based on the population of the county, it is again sending more students to UWEC than other counties. The light colored counties are the other extreme, where for the same two reasons there are many fewer students coming from these counties. The two middle shades represent counties that are sending an average or slightly above average number of students to UWEC based on distance and population. There is a pretty clear pattern in this map that suggests counties close to Eau Claire are sending more students than average. This makes sense because many students stay relatively close to home for at least the first few years of college, so Eau Claire County and others very nearby would send more students. The counties sending fewer may have a college of their own, like Dunn County, which has UW-Stout. The counties sending more seem to have more heavily populated cities in them, so naturally there is a better chance of more students coming from there.
Figure 10 is a map of the UWEC enrollment compared to the county populations and distances

Figure 11 shows the comparison of UWEC enrollment and BS degree percentage. Looking at the map, the dark blue counties are sending more than the average number of students to UWEC. The next two lighter shades are close to average, and the counties lighter than that are sending fewer than average. A pattern I notice in this map is that counties that are more focused on industry and that are more urban are sending more students to UWEC. This makes sense because these counties probably also have more people who have college degrees and who then send their kids to college to do the same. Counties that are mainly agricultural will most likely have fewer people with college degrees living there and in turn have less of a chance of sending their kids to college.
Figure 11 is a map of the UWEC enrollment compared to the BS degree percentage

Figure 12 shows the comparison of UWGB enrollment and the county populations and distance. Like the map of population and distance compared to enrollment at UWEC (Figure 10), there is a distinct pattern in this map. The counties closest to Green Bay are sending many more students to UWGB than counties in the rest of the state. This makes sense, again, because many students stay close to home. Another reason is that UWGB offers some more trade type courses and degrees that are very applicable to the industries in Green Bay and the surrounding counties. There are fewer large colleges in this part of the state as well, so the options for students who want to stay close to home are more limited. In western Wisconsin you have four colleges within an hour to an hour and a half of each other, so students have more options but can still be close to home. With all those choices in western Wisconsin it is much less likely that students there will go to UWGB, which is clearly represented by the map.
Figure 12 is a map of the UWGB enrollment compared to the county populations and distances

Figure 13 shows the comparison of UWGB enrollment and the median household income of the counties. From this map I get the impression that counties of low to moderate household income send more students to UWGB. You can see that the counties surrounding the Milwaukee area and close to the Twin Cities send very few students to UWGB. I think this is because people in these counties can afford to send their children to more prestigious and expensive colleges elsewhere. UWGB is a fairly reasonably priced school as far as large universities go, and because the majority of people in Wisconsin are middle class, with quite a few lower than that, they go to the most affordable schools. I think you would see a very similar pattern if this was done with UWEC enrollment.
Figure 13 is a map of UWGB enrollment compared to the median household income of the counties

Conclusion

Based on the data given in this project, the regression analysis and the resulting maps, I would say that distance from a school, income, and BS degree percentage are all pretty good variables to consider when looking for patterns in why students go to one school instead of another. Like I said earlier, a lot of students like to stay close to home, so colleges in their area seem to be more desirable. The more people a county has, the better the chance of more students attending a particular college, which is not really a pattern, just common sense. Household income definitely plays a role in choosing where to go to college. In Wisconsin, where people aren't overly rich, UWGB and UWEC are both good options because they are affordable schools with a good quality education. The rich people will send their kids out of state or to the more prestigious schools such as UW Milwaukee or UW Madison. History has a way of repeating itself, and this applies to people who get college degrees. If they went to school and got a degree, there is a much better chance their kids will also go to school and get a degree. Agricultural and more rural areas usually have fewer people with degrees, so you would expect fewer students coming from these counties. Centers of industry and urban areas have doctors, PhDs and others with degrees, so we expect these counties to send more students to the universities to get degrees. Overall I think that all these variables point to patterns in why students choose a particular school, but many more variables should be considered before saying that there definitely is a pattern.
