Report – Project 3
This blog examines the seasonal patterns in the housing market and tourism industry based on a review of median home prices and hotel occupancy rates.
Comprehending these seasonal patterns is not just a theoretical endeavour. These are important insights for tourism-related businesses to have when planning and strategizing. In a similar vein, real estate agents and prospective homeowners can make more informed choices if they are aware of when the market is likely to peak or cool.
This analysis serves as a helpful reminder that seasonal patterns dictate both the housing market and the tourism industry. It provides a nuanced perspective on how various seasons of the year can influence economic activity and emphasizes the significance of timing in both industries.
Our most recent research focuses on trends in foreclosures, an important but sometimes disregarded area of the economy. We may learn more about the stability of the housing market and the state of the economy as a whole by looking at the variations in foreclosure petitions and deeds.
Other economic factors are closely associated with these trends. For example, increased joblessness may result in a greater number of foreclosures, and fluctuations in property values may impact the financial choices made by homeowners.
High foreclosure rates have an impact on more than just numbers: they affect community stability and represent real difficulties that people and families face.
This analysis provides insight into the housing market as well as a distinct viewpoint on the state of the economy as a whole. As we investigate and interpret these economic trends, stay tuned.
We are examining economic indicators today, which form a fascinating web of interconnectedness between different components. For example, I’m examining how the volume of passengers at Logan Airport can provide information about hotel occupancy rates and the overall health of business and tourism travel. Another interesting area of study is how the housing and employment markets interact. While a slow job market can cause a decline in real estate activity, a robust job market frequently drives strong demand for housing. Major development projects also have a notable impact on local economies, demonstrating the way in which these endeavors can stimulate the housing market and create jobs. This article aims to unravel these economic strands and demonstrate how changes in one industry can spread to other areas, presenting a complete picture of our financial environment.
This blog examines the housing market with a particular emphasis on the evolution of median home prices. This trip is a reflection of the economy and involves more than just pricing.
The median home price graph that we examined is comparable to a road map, illustrating the highs and lows of the market. Rising prices frequently indicate a robust economy with confident buyers and a strong demand for homes. Conversely, price declines or plateaus may indicate a cooling of the market, perhaps as a result of shifting consumer attitudes or economic difficulties.
However, these tendencies are not isolated. They are entwined with other economic strands such as interest rates, employment rates, and the general state of the economy. For example, a strong job market may increase people’s ability to purchase homes, which would raise prices. Likewise, fluctuations in interest rates can influence prices by either motivating or deterring buyers.
It’s interesting to note that we also observed possible seasonal fluctuations in the housing market. Prices may be slightly impacted by periods of increased activity during the year.
It is essential to comprehend these subtleties in housing prices. It provides information on both the real estate market and the overall state of the economy. Buyers, sellers, investors, and policymakers can all benefit greatly from this analysis, which will help them make well-informed decisions in a constantly changing market.
Using a trend analysis of important economic indicators, we’re going to examine the Boston economy in more detail today. It’s similar to being an economic investigator in that we put together hints to figure out the overall picture.
The unemployment rate, hotel occupancy rates, and median home prices were our three primary points of interest. These all provide us with different insights. The unemployment rate tells us how many people are unemployed, much like a thermometer does for the labor market. When this number declines, more people typically have jobs, which is excellent news!
We then examined the hotel occupancy rate, or how full hotels are. This rate gives us a glimpse of tourism and business travel. High occupancy frequently indicates more guests and active business, while low occupancy might imply the opposite.
Finally, we investigated the median price of a home. This indicator functions somewhat as a window into the housing market. Rising prices may indicate a robust economy and high demand for homes; conversely, a decline or stagnation in prices may indicate a cooling of the market.
We can gauge the state of the economy by examining these patterns.
Various economic statistics are included in the collection, arranged by month and year. An overview of what each column stands for is given below:
In the field of time series analysis, the SARIMA model serves as a foundation. SARIMA (Seasonal Autoregressive Integrated Moving Average), an extension of the ARIMA model, adds another level of complexity to forecasting and is especially helpful when handling seasonal data.
A statistical model called SARIMA forecasts subsequent points in a time series. It excels at processing data with seasonal patterns, such as monthly sales data that peaks around holidays or daily temperature variations from season to season. By incorporating seasonality, the model expands on ARIMA and gains greater adaptability.
Parts:
The components of the SARIMA model are as follows: moving average (MA), autoregressive (AR), integrated (I), and seasonal (S).
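As a minimal, hedged sketch of how these pieces come together in practice, the statsmodels library exposes them through the SARIMAX class; the file and column names below are assumptions for illustration, not part of our actual dataset.

```python
# A minimal sketch, assuming a monthly CSV with a date column and a
# hotel-occupancy column (both file and column names are hypothetical).
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

df = pd.read_csv("economic_indicators.csv", parse_dates=["date"], index_col="date")
y = df["hotel_occupancy_rate"]

# order=(p, d, q) are the non-seasonal AR / differencing / MA terms;
# seasonal_order=(P, D, Q, s) adds the seasonal terms, with s = 12 for monthly data.
model = SARIMAX(y, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))
result = model.fit(disp=False)

print(result.summary())
print(result.forecast(steps=12))   # forecast the next 12 months
```

The orders shown here are placeholders; in practice they would be chosen by inspecting the data or comparing information criteria.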
A crucial component of data science is time series analysis, which looks at collections of data points accumulated throughout time in sequence. This technique is essential for forecasting future trends based on historical data in various sectors, including meteorology and economics. This blog aims to make time series analysis more approachable for novices while maintaining its technical foundation.
The study of data points gathered at various times is the focus of time series analysis. It forecasts future trends, finds patterns, and extracts useful statistics. Numerous fields, including weather forecasting, market trends prediction, and strategic business planning, depend on this study.
Relevant Ideas:
Key concepts include trend (long-term movement), seasonality (repeating patterns or cycles), noise (random variability), and stationarity (the assumption that statistical properties stay constant over time).
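To make these concepts concrete, here is a hedged sketch of splitting a series into trend, seasonal, and noise components with statsmodels; it reuses the hypothetical file and column names from the SARIMA sketch above.

```python
# A minimal sketch, assuming the same hypothetical monthly CSV as above.
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

df = pd.read_csv("economic_indicators.csv", parse_dates=["date"], index_col="date")
series = df["median_home_price"]

# period=12 because the data are monthly and we expect a yearly cycle
decomposition = seasonal_decompose(series, model="additive", period=12)

print(decomposition.trend.dropna().head())   # long-term movement (trend)
print(decomposition.seasonal.head())         # repeating yearly pattern (seasonality)
print(decomposition.resid.dropna().head())   # leftover random variability (noise)
```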
As we move forward with Project 3, we have an abundance of choices because the Analyze Boston website has 246 datasets available. Right now, our team is working to determine which of these options best fits the goals of our project. This selection procedure is essential since it establishes the framework for our analysis that follows. Once a dataset has been chosen, our attention will turn to carefully going over each of its details in order to find a compelling and obvious question that arises from the data. Our analysis will be built around this question, which will help us discover fresh perspectives. Our project is in an exciting phase right now, full of possibilities for exploration as well as obstacles to overcome.
I have gained knowledge about decision trees in today’s class. In essence, decision trees are graphical depictions of decision-making procedures. Consider them as a sequence of inquiries and decisions that culminate in a decision. You start with the first question on the tree, and as you respond to each one, you move down the branches until you reach the final choice.
Choosing the most instructive questions to pose at each decision tree node is a necessary step in the construction process. Based on different characteristics of the data, these questions are chosen using statistical measures such as entropy, Gini impurity, and information gain. The objective is to choose the most pertinent attributes at each node in order to optimize the decision-making process.
Decision trees do have certain drawbacks, though, particularly in situations where the data shows a significant spread or departure from the mean. In our most recent Project 2, we came across a dataset where the mean was significantly off from the majority of data points, which reduced the effectiveness of the decision tree method. This emphasizes how crucial it is to take the distribution and features of the data into account when selecting the best statistical method for analysis. Although decision trees are a useful tool, their effectiveness depends on the type of data they are used on. In certain cases, other statistical techniques may be more appropriate for handling these kinds of scenarios.
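As a minimal, hedged sketch of how a decision tree is built in practice, scikit-learn's `criterion` argument selects the split measure mentioned above; the data here is synthetic, not any dataset from our projects.

```python
# A minimal sketch of building a decision tree with scikit-learn on synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# criterion="gini" uses Gini impurity; criterion="entropy" uses information gain
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X_train, y_train)

print("test accuracy:", tree.score(X_test, y_test))
```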
A clustering technique called K-means seeks to divide a set of data points into a predetermined number of groups, or “clusters.” The first step of the procedure is to choose “k” beginning points, or “centroids,” at random. The closest centroid is then allocated to each data point, and new centroids are recalculated using the cluster average of all the points. Until the centroids no longer vary noticeably, recalculating the centroids and allocating points to the nearest centroid is repeated. As a result, there exist “k” clusters, or groups of data points closer to one another than they are to points in other clusters. The number “k,” which denotes the desired number of clusters, must be entered by the user beforehand.
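A minimal sketch of this procedure with scikit-learn, run on synthetic 2-D points rather than any course data, might look like the following; note that k must be supplied up front, exactly as described above.

```python
# A minimal sketch of k-means clustering on synthetic points.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)  # k = 3, chosen in advance
labels = kmeans.fit_predict(X)

print("cluster centers:\n", kmeans.cluster_centers_)
print("first ten labels:", labels[:10])
```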
Data points are grouped using the DBSCAN clustering method according to their density and proximity. Unlike k-means, which require the user to choose the number of clusters ahead of time, DBSCAN analyzes the data to identify high-density zones and distinguishes them from sparse areas. Each data point is given a neighbourhood, and if a sufficient number of points are close to one another (signalling high density), they are regarded as belonging to the same cluster. Low-density zones are considered noise since the data points there are not part of any cluster. Because of this, DBSCAN is particularly helpful for handling noisy data and finding clusters of different sizes and shapes.
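For comparison, here is a hedged sketch of DBSCAN on synthetic data: `eps` defines the neighbourhood radius, `min_samples` the density threshold, and points labelled -1 are treated as noise.

```python
# A minimal sketch of DBSCAN on synthetic, non-spherical clusters.
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.08, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5)
labels = db.fit_predict(X)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)   # -1 marks noise points
print("clusters found:", n_clusters)
print("noise points:", list(labels).count(-1))
```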
Possible dangers:
DBSCAN –
K-values:
We have explored the connections between factors such as age, ethnicity, and perceived danger levels in relation to signs of mental illness in our investigation of fatal police shootings.
Age and Mental Health: The investigation has shown a significant correlation between a person’s age and mental health markers’ existence. A substantial age difference was found using a t-test between participants who showed indicators of mental health and those who did not; the t-statistic was 8.51 and the p-value was nearly zero. This result emphasizes how age and mental health problems are strongly correlated in these cases.
Ethnicity and Mental Health: We first encountered data issues while examining ethnicity, but they were later resolved, and a chi-square test was carried out. The results showed a significant relationship between mental health symptoms and ethnicity, with a tiny p-value of 3.98×10^-35 and a chi-square value of 171.23.
Danger perception and mental health: A chi-square statistic of 24.48 and a p-value of 4.82×10^-6 indicate that there is a significant association between the perceived danger level and mental health indicators in our study.
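As a hedged sketch of how these two tests can be run with scipy, assume a DataFrame with columns "age", "race", and a boolean "signs_of_mental_illness" (these names follow the public Washington Post dataset, but here they are assumptions, and the file name is hypothetical).

```python
# A minimal sketch of the t-test and chi-square test described above.
import pandas as pd
from scipy import stats

df = pd.read_csv("fatal_police_shootings.csv")          # hypothetical file name
mental = df["signs_of_mental_illness"].astype(bool)     # assumed boolean column

# t-test: does mean age differ between the two mental-illness groups?
age_yes = df.loc[mental, "age"].dropna()
age_no = df.loc[~mental, "age"].dropna()
t_stat, p_val = stats.ttest_ind(age_yes, age_no, equal_var=False)
print("t =", t_stat, "p =", p_val)

# chi-square: is ethnicity associated with mental-illness indicators?
table = pd.crosstab(df["race"], mental)
chi2, p, dof, expected = stats.chi2_contingency(table)
print("chi2 =", chi2, "p =", p)
```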
In incidents of lethal police contact, the research has illuminated the strong relationships between age, ethnicity, perceived danger level, and mental health markers. These results pave the way for more thorough studies and improve our understanding of these pivotal moments. The distribution of threat levels and their interactions with other variables will be the main focus of our upcoming study phase.
Gaining insights into this pressing issue requires a comprehensive analysis of data spanning from 2015 to 2023. The information indicates a generally stable pattern in the monthly occurrence of events, albeit with minor fluctuations. This demonstrates the enduring nature of the problem over time.
White individuals, who constitute a significant portion of the U.S. population, are involved in approximately 50.89% of fatal police shootings. The data about the black community is particularly alarming. Despite representing only about 13% of the U.S. population, they account for a disproportionately high 27.23% of fatal police shootings. Hispanics follow, making up around 17.98% of such incidents, while other racial groups, including Asians, Native Americans, and others, are represented in smaller percentages, at 1.99%, 1.62%, and 0.29%, respectively.
This information underscores the critical need for deeper exploration and potential enhancements in policing practices, especially considering the stark disparities in how different racial groups are affected.
When I dug further into the data, I tried to answer a question I had from the beginning about the age difference between blacks and whites.
The age difference between blacks and whites is approximately 7 years.
Now the question is: can it really be possible, or is this just a fluke?
Now, when we plot a histogram of the ages of blacks and whites from the data, we can see that the mean of both graphs is different.
We can see that each graph deviates from a normal curve, which means the data is not normally distributed.
Now, if we want to know whether the 7-year difference is a fluke, we could try a t-test, but because the data is not normally distributed, the t-test can produce a suspicious p-value.
So, in a case like this, we can use a Monte Carlo method to find the p-value.
In this approach, we make a pool of all the data, repeatedly draw random samples from the pool, and estimate a p-value from those samples.
As a result, I observed that the probability of getting an age difference of 7 years by chance is nearly 0; in the random samples, there was essentially not a single case in which the age difference reached 7 years.
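Here is a minimal sketch of that Monte Carlo (permutation) idea in Python; the column names and race codes are assumptions based on the public dataset, and far fewer simulations are used than in the actual analysis.

```python
# A minimal sketch: pool all ages, repeatedly shuffle, and see how often a
# random split produces a mean difference as large as the observed ~7 years.
import numpy as np
import pandas as pd

df = pd.read_csv("fatal_police_shootings.csv")            # hypothetical file name
black_ages = df.loc[df["race"] == "B", "age"].dropna().to_numpy()
white_ages = df.loc[df["race"] == "W", "age"].dropna().to_numpy()

observed_diff = abs(black_ages.mean() - white_ages.mean())
pool = np.concatenate([black_ages, white_ages])
n_black = len(black_ages)

rng = np.random.default_rng(0)
n_sims = 100_000                                          # fewer sims, for speed
count = 0
for _ in range(n_sims):
    rng.shuffle(pool)
    diff = abs(pool[:n_black].mean() - pool[n_black:].mean())
    if diff >= observed_diff:
        count += 1

print("estimated p-value:", count / n_sims)
```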
So, for the next step, I will try to find a pattern, such as statistics on whether the person who died was armed, and similar breakdowns.
Today in class we learned some new functions and a whole new library, Geopy.
What geopy does is help us locate a geographical location on the globe using third-party geocoders.
This could be helpful for plotting a geomap based on data like this; a small sketch of the geocoding step is shown below.
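As a hedged sketch, geopy's Nominatim geocoder (one of the third-party services it wraps) can turn a place name into coordinates; the place name here is purely illustrative.

```python
# A minimal sketch of geocoding a place name with geopy.
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="shooting_geomap_example")

location = geolocator.geocode("Houston, Texas")
if location is not None:
    print(location.address)
    print(location.latitude, location.longitude)   # coordinates for plotting on a map
```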
Now, here is what we can observe: the number of shootings in coastal areas, such as the East Coast and West Coast, is very large.
There can be several reasons for this, such as population, crime rate, etc.
One thing I suspected is that the number of guns in a state could also affect the shooting data, but on this map Texas does not show as many locations as other states on the West Coast. This observation may change with further analysis, because it can be affected by population: in Texas alone there are two hotspots, perhaps because more people live there. I think that if I dig further into this, it could answer some of these questions.
The second thing I want to know is whether there is any gang influence in the areas where the density of shootings is high. From this, I can get answers to two of my questions: first, whether any gang influence is present, and second, if it is, how much it influences the youth, which could help explain early deaths in some races.
So today to find the answers to the questions from the previous blog I analyzed the data even further.
First, I tried to find some type of pattern: is there any particular community at the top, and is there a significant difference between it and the other communities?
Before answering this question: a Washington Post article, which can be found here, states that the number of deaths in the black community is significant and that there is some sort of racial discrimination by the police. But when I analyzed the data, I found that the number of white people who died is approximately twice the number in the black community, with significant differences from the other communities as well.
Here two things can be observed:
Now, one more question I faced is: are there any negative patterns present in this data, or is this just a normal finding?
In the upcoming days, I will try to find the answers to these questions and develop stronger findings.
Today we discussed the next project that we must do. In this project, we got some data on police shootings, which records how many deaths there were and the race of each person.
So, initially, if we look at the data, we find that the rate of death by police shooting for black people is significantly higher than that for white people.
Why is that? This is one of the questions we try to answer with our analysis.
So today in class when we were discussing this project I asked the professor a question,
What are the initial questions we are trying to answer here?
To this, he explained that we must find the questions by looking at the data. By this he meant that when we initially don’t know what to do, we can simply try to perform some basic commands on the data and just look at it.
So, after I had done that, I made some observations.
This raises several questions.
We will try to answer these questions in this project.
Today we started to write our report and to do so we must do a series of tasks.
Today we are focusing more on the first two steps.
The first thing is collecting all the data and findings from others on this project, like the different approaches they have taken and the results of those approaches.
But the most important aspect of this report is the issue, the central question the report addresses.
So, for today, these are the two things we will tackle, after that, we will move to the next step of this project report.
After trying to find the relation between age and diabetes, I found that it is simply not possible, because if we want to group the data by state, we will end up with even less data for our analysis.
So, I tried other things to get better results from my model, like adjusting the data to perform WLS (weighted least squares).
But even with all this, I am still not able to get a good result out of this model.
Are we trying to answer, for the CDC, whether diabetes can be predicted from this data or not?
If we are trying to make a model, then we have not achieved any success so far; how can we decide that this is the accuracy we are looking for?
This may be a naive but very important question for me, because it changes the final report so much.
And on the topic of the final report: in parallel, we are starting to write the report, because we have to submit it this upcoming Monday.
Continuing from the previous conversation, I got an idea which may or may not help us improve the R-squared score of the model.
The idea was why not introduce a third independent variable?
So, for this, I turned toward Google and began researching the leading cause of diabetes.
After a long and intense search on Google, I came to know about a very interesting relationship between diabetes and the age of the patient.
What I found is that diabetes and age have a roughly linear relationship: the number of patients increases as age increases.
From this, I got the idea of introducing a third independent variable, an “age factor”.
In addition, I found some data on the CDC website, but that data is divided by state rather than by county, so I’ll be modifying the current data.
But I have some doubts that I want to clear up with the professor, so I will first get those doubts cleared and then try to test my theory.
In the previous blog, I mentioned that I tried k-fold cross-validation with 10 folds.
Here a question arises: why 10 folds and not 4, 3, or 5, and does the number of folds even matter?
The simple answer is that it should not matter much how many folds you take.
It is not a rule or norm that you necessarily have to take 5 folds or 10 folds (as in our case); you can take as many folds as you require.
K-fold is just a way to deal with the problem of limited training data, and the choice of k simply depends on what the project requires at the time.
Now we will understand training error and test error.
In simple words, when we apply the model to the same data we trained it on, we are calculating the training error.
And when we apply the model to data that was unknown during training (the test data), we are calculating the test error; a small sketch of the difference is shown below.
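Here is a minimal sketch of the two errors side by side, using a simple linear model on synthetic data (everything here is illustrative, not our project data).

```python
# A minimal sketch of training error vs test error.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)

train_error = mean_squared_error(y_train, model.predict(X_train))  # error on data the model saw
test_error = mean_squared_error(y_test, model.predict(X_test))     # error on unseen data
print("training error:", train_error)
print("test error:", test_error)
```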
One more important point the professor made in class today: when we train the model on training data and then test it on the testing data, if more than one row has the same values, the model will not treat them as separate, and this affects the accuracy of the model.
In this case, the accuracy that we were getting previously was around 33 % which is questionable.
So, to overcome this, what we can do is label every data point to give it a unique identification.
For example, suppose we have two data points:
[x1, y1, z1] and [x2, y2, z2]
Then what we can do is add numbers to these data points to give them a unique identification:
[1, x1, y1, z1] and [2, x2, y2, z2]
For today, I must look at the data and label it, in the hope of getting better accuracy for the model; a minimal sketch of this labelling is shown below.
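A minimal sketch of adding such an identifier with pandas, assuming the merged file described in the next post (the file name is hypothetical):

```python
# A minimal sketch: give every row a unique ID so duplicate feature values
# are not collapsed together.
import pandas as pd

df = pd.read_csv("final.csv")            # hypothetical merged file
df["row_id"] = range(1, len(df) + 1)     # 1, 2, 3, ... one unique ID per row
df = df[["row_id"] + [c for c in df.columns if c != "row_id"]]  # put the ID first
print(df.head())
```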
We made a separate file in which we combined all the values from the different files (“diabetic”, “Obesity”, and “Inactivity”) and named it final.csv. This file contains the ‘county_state’ column, which we obtained by merging the counties with their states.
After that, with the help of an inner join, we obtained the related data from all the files.
After we collected the data, we made a linear model with two independent variables (obesity and inactivity) and one dependent variable (diabetes).
For this, we split the data into two sets.
Here I decided to put 60% of the data into a training set and the remaining 40% into a test set.
After this, we tried to evaluate the performance of our model by looking at its accuracy, but my model was only about 22% accurate (a sketch of this whole workflow is shown below).
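Here is a hedged sketch of the workflow just described: inner-join the three files on the county/state key, fit a linear model with obesity and inactivity as predictors, and score it on the held-out 40%. The file and column names are assumptions based on the description above, and the score reported is R-squared.

```python
# A minimal sketch of the merge + linear-model workflow (names are assumptions).
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

diabetes = pd.read_csv("diabetic.csv")
obesity = pd.read_csv("obesity.csv")
inactivity = pd.read_csv("inactivity.csv")

# inner join on the shared county/state key
final = (diabetes.merge(obesity, on="county_state", how="inner")
                 .merge(inactivity, on="county_state", how="inner"))

X = final[["%obesity", "%inactivity"]]   # hypothetical column names
y = final["%diabetic"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.6, random_state=0)    # 60% train / 40% test

model = LinearRegression().fit(X_train, y_train)
print("R^2 on the test set:", model.score(X_test, y_test))
```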
After that, I tried to plot some scatter plots, like predicted vs. actual values, and then the residual plot.
In this plot, we observe that the data points are sort of clumped in the middle, which is not good for a model; they should be scattered.
[plot]
The same can be observed in the residual plot: the residuals should be scattered all over the place, but they are mostly concentrated in the middle.
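For reference, a minimal sketch of these two diagnostic plots, assuming the fitted `model` and the test split from the earlier sketch:

```python
# A minimal sketch of predicted-vs-actual and residual plots with matplotlib.
import matplotlib.pyplot as plt

y_pred = model.predict(X_test)
residuals = y_test - y_pred

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

ax1.scatter(y_test, y_pred, alpha=0.5)
ax1.set_xlabel("actual %diabetic")
ax1.set_ylabel("predicted %diabetic")
ax1.set_title("Predicted vs actual")

ax2.scatter(y_pred, residuals, alpha=0.5)
ax2.axhline(0, color="red", linewidth=1)   # residuals should scatter evenly around 0
ax2.set_xlabel("predicted %diabetic")
ax2.set_ylabel("residual")
ax2.set_title("Residual plot")

plt.tight_layout()
plt.show()
```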
After this, I tried to apply k-fold cross-validation which we learned today in class.
According to Wikipedia
Cross-validation is a resampling method that uses different portions of the data to test and train a model on different iterations. It is mainly used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in practice.
In simpler words,
In this, we divide the data into several chunks (folds), then train the model on all but one chunk and test it on the remaining chunk, repeating the process so that each chunk is used once for testing.
So, after I applied 10-fold cross-validation to my model, I got an R-squared score of about 33%.
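A minimal sketch of 10-fold cross-validation scored with R-squared, assuming the same X and y as in the earlier workflow sketch:

```python
# A minimal sketch of k-fold cross-validation with scikit-learn.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

scores = cross_val_score(LinearRegression(), X, y, cv=10, scoring="r2")
print("R^2 per fold:", scores)
print("mean R^2:", scores.mean())
```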
Today, after attending the doubt session class, I got an idea: why not try to find some relation by bringing in more data beyond only these two variables?
After this, I tried to find more data on the official site where we found the original data, which can be found here.
With this approach, I looked for data such as whether a county is urban or rural (this idea came from one of my classmates) and data based on economics, but after analyzing those datasets, there was no relation in either case.
After this, I tried multiple regression with diabetes as the dependent variable and inactivity and obesity as the independent variables.
The resulting summary showed an R-squared of approximately 39% for this model.
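A hedged sketch of this regression with statsmodels, assuming the merged DataFrame `final` and the column names used in the earlier sketch:

```python
# A minimal sketch of multiple regression with statsmodels.
import statsmodels.api as sm

X = sm.add_constant(final[["%inactivity", "%obesity"]])  # add the intercept term
y = final["%diabetic"]

ols_model = sm.OLS(y, X).fit()
print(ols_model.summary())          # the summary includes the R-squared value
```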
That was what we did today; tomorrow we will try some modifications in the hope of finding a better R-squared value.
Today in the lecture, we tried to understand a linear model fit to data where both variables are non-normally distributed and skewed, with high variance and high kurtosis.
In this, we take data which contains two measurements: pre-molt and post-molt shell sizes.
Here pre-molt means the size of a crab’s shell before molting, and post-molt means the size of the shell after molting.
Here we try to make a model to predict pre-molt size from post-molt size.
We take the data from the Stat Labs book, chapter 7, page 139.
After we perform descriptive statistics and plot the graph of the post-molt data, we get a graph with high skewness and kurtosis.
We have done the same thing for the pre-molt data and got a similar graph.
Then we compared both graphs side by side and observed that the two graphs are quite similar, with a mean difference of approximately 14.
This observation raises the question of whether this difference in the means is statistically significant or just a fluke.
To check this, we try to do a t-test.
According to the JMP website:
A t-test (also known as Student’s t-test) is a tool for evaluating the means of one or two populations using hypothesis testing. A t-test may be used to evaluate whether a single group differs from a known value (a one-sample t-test), whether two groups differ from each other (an independent two-sample t-test), or whether there is a significant difference in paired measurements (a paired, or dependent samples t-test).
Here we combine the 472 pre-molt values and the 472 post-molt values into a pool of 944 and draw 10 million random resamples from it.
Here the estimated z-score of the difference in means between pre-molt and post-molt is approximately 13.3. We must keep in mind that our 10 million random samples are still a very small fraction of the roughly 3.86 x 10^282 possible arrangements.
Even if we used a supercomputer drawing a trillion samples a second, it would take about 9 x 10^252 ages of our universe to obtain all of those samples.
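Here is a hedged sketch of that resampling approach: pool the pre-molt and post-molt sizes, draw many random splits, and compare the observed mean difference (around 14) to the null distribution via a z-score. The arrays `pre_molt` and `post_molt` are assumed to hold the data, and far fewer samples are drawn than the 10 million used in class.

```python
# A minimal sketch of the pooled resampling and z-score calculation.
import numpy as np

observed_diff = post_molt.mean() - pre_molt.mean()
pool = np.concatenate([pre_molt, post_molt])
n = len(pre_molt)                      # 472

rng = np.random.default_rng(0)
n_sims = 100_000                       # far fewer than 10 million, for speed
null_diffs = np.empty(n_sims)
for i in range(n_sims):
    rng.shuffle(pool)
    null_diffs[i] = pool[n:].mean() - pool[:n].mean()

# how many null standard deviations away the observed difference sits
z = (observed_diff - null_diffs.mean()) / null_diffs.std()
print("z-score of the observed difference:", z)
```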
Today we learned about doing regression with two variables.
For my project, the equation for multiple regression can be given as
Y = β0 + β1X1 + β2X2 + ε
Here Y is the %diabetic, X1 is the %inactivity, and X2 is the %obesity.
Before today’s class, we were trying to find the relation between %diabetic and %inactivity alone, but we got stuck with that approach, so this time we tried multiple regression.
When we look at the relationship between %diabetic and the single variable %inactivity, the R-squared (which is just the square of the Pearson correlation between the two variables) is approximately 0.1952, so only about 20% of the variation in %diabetic is explained by %inactivity alone.
Initially, if we make a linear model with the two variables x1 and x2 (where x1 is inactivity and x2 is obesity), the R-squared of this linear model is approximately 34%, but here it gets interesting.
If we do the same thing but build the linear model after centring the variables, the R-squared of this model is approximately 36%, so this time we observed that the R-squared increased by about 2%.
But what if we try to make a quadratic model??
Now, if we make a quadratic model, we observe that the R-squared becomes approximately 38%.
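A hedged sketch of comparing the linear and quadratic models with statsmodels, assuming the merged DataFrame `final` from the earlier sketches; the variables are centred before squaring, as discussed above.

```python
# A minimal sketch: linear vs quadratic model R-squared comparison.
import pandas as pd
import statsmodels.api as sm

y = final["%diabetic"]
x1 = final["%inactivity"] - final["%inactivity"].mean()   # centred inactivity
x2 = final["%obesity"] - final["%obesity"].mean()         # centred obesity

linear = pd.DataFrame({"x1": x1, "x2": x2})
quadratic = linear.assign(x1_sq=x1 ** 2, x2_sq=x2 ** 2)   # add squared terms

r2_linear = sm.OLS(y, sm.add_constant(linear)).fit().rsquared
r2_quadratic = sm.OLS(y, sm.add_constant(quadratic)).fit().rsquared

print("linear R^2:   ", r2_linear)
print("quadratic R^2:", r2_quadratic)
```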
So why don’t we just increase the power of x until we get a model that has the highest R-squared value?
This was the first thing we thought of when the professor told us about this, but here one more important concept comes into the picture: overfitting.
Let’s understand what overfitting means with an example: if we keep increasing the power of x as mentioned above, the model will fit only the selected data, and if we try to use it on other data sets, it will simply not work.
This concludes today, but I still have a doubt that I will be asking the professor: why did we choose a quadratic model? This was not my original doubt but one a classmate asked, and I didn’t quite understand the answer.
Today I tried to apply all the week’s learning to the project our professor gave us.
Today’s goals were:
So, after continuously staring at the Excel sheet for some time, I tried to plot a linear regression graph for %diabetic and %inactivity.
So, I plotted the graph with %inactivity on the x-axis and %diabetic on the y-axis.
After the graph was plotted, I observed that there were many outliers, and on further inspection an important observation, and a warning sign, is that the linear model is heteroscedastic.
Then I thought: why not compute a p-value, with the null hypothesis being that %inactivity has no effect on %diabetic? But for that, the number of data rows for both parameters should be the same.
In the end, I am currently stuck in this problem for now.
Today we learned one of the most important topics in statistics: the p-value.
At the start of the class, the professor asked whether anybody knew what the p-value means.
After not getting any answers, the professor showed us a video on the p-value and explained the concept to us.
The above video can be used to understand the p-value
P-Value
The p-value is the value we use to measure how likely it is to get a result at least as extreme as the one observed, if the null hypothesis is true.
Let us understand this by an example:
Suppose I am showing you guys a magic trick
Let’s say we have a coin, and I tell you that I got tails 41 times in a row. After this, there are two hypotheses.
Now let’s understand the terms: the null hypothesis, which is denoted by H0, and the alternative hypothesis, which is denoted by Ha.
Here H0 is the hypothesis against which we are providing evidence; in our case, it is that the coin is fair.
Ha is the alternative hypothesis, which is that there is some kind of trickery behind this.
Now say that the first time, I got tails, which is not that suspicious because there is still a 50 per cent chance of that happening. But after 3 or 4 tails in a row, you start to get suspicious, because the chance of getting 4 tails in a row is only .0625 (6.25 per cent).
After this, we found out that the p-value under our null hypothesis is very low, which means the null hypothesis can be rejected, and we found out that the coin was actually a two-tailed coin!!
| Tails in a row | Probability |
| --- | --- |
| 1 | .50 |
| 2 | .25 |
| 3 | .125 |
| 4 | .0625 |
| 5 | .03125 |
In conclusion, what I understood is that if the p-value is sufficiently small, then the null hypothesis is rejected.
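As a minimal sketch of the coin example: under the null hypothesis of a fair coin, the probability of n tails in a row is 0.5 to the power n, which is the one-sided p-value for that outcome.

```python
# A minimal sketch: p-value for n tails in a row under a fair-coin null hypothesis.
for n in [1, 2, 3, 4, 5, 41]:
    p_value = 0.5 ** n
    print(f"{n} tails in a row: p = {p_value}")
```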
On the first day of my journey in data science, I attended my first “Advanced Statistics” class.
In this class, my professor Mr. Garry Davis taught us about our first topic: linear regression. In this topic, there are several things he covered: –
The first thing we learned is simple linear regression. The simplest way to describe simple linear regression is predicting the response y based on a single variable x.
Here y and x are two values which can change according to the project.
Mathematically, simple linear regression is represented by the following formula: Y ≈ β0 + β1X.
(This is a reference from the book “An Introduction to Statistical Learning”[page 70 3.1])
After this, the professor told us about kurtosis. According to my understanding, kurtosis is simply the average of the 4th power of the z-score.
Now, what the z-score means is how many standard deviations a data point is from the mean value; in my own words, it is the standardized distance between the observed value and the mean.
The mathematical representation of this is z = (x − x̄) / s, where x̄ is the mean and s is the standard deviation.
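Here is a minimal sketch of both ideas on a small set of illustrative numbers; note that scipy reports excess kurtosis by default, so fisher=False is needed to get the plain "average of the 4th power of the z-score".

```python
# A minimal sketch of z-scores and kurtosis on illustrative data.
import numpy as np
from scipy.stats import kurtosis

data = np.array([4.1, 4.5, 5.0, 5.2, 5.9, 6.3, 7.8])   # illustrative values

z_scores = (data - data.mean()) / data.std()            # z = (x - x̄) / s
print("z-scores:", z_scores)

# kurtosis as the average of the 4th power of the z-score
print("kurtosis:", kurtosis(data, fisher=False))
print("check:   ", (z_scores ** 4).mean())
```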
The last concept which I learned is heteroscedasticity, which, unlike its name suggests, is a very easy topic to understand.
As I understand it, heteroscedasticity means the error variance is not constant, that is, the errors in the graph are dispersed in a random, uneven way; if the errors are spread in a constant manner, then it is homoscedasticity.