Exploring the trends: seasonal patterns

This blog examines the seasonal patterns in the housing market and tourism industry based on a review of median home prices and hotel occupancy rates.

Comprehending these seasonal patterns is not just a theoretical endeavour. These are important insights for tourism-related businesses to have when planning and strategizing. In a similar vein, real estate agents and prospective homeowners can make more informed choices if they are aware of when the market is likely to peak or cool.

This analysis serves as a helpful reminder that seasonal patterns dictate both the housing market and the tourism industry. It provides a nuanced perspective on how various seasons of the year can influence economic activity and emphasizes the significance of timing in both industries.
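
As a rough illustration (not the exact code used for this analysis), a seasonal comparison could be computed with pandas by averaging each indicator by calendar month; the filename and the med_housing_price / hotel_occup_rate column names come from the dataset post below and may need adjusting:

```python
import pandas as pd

# Assumed filename for the Economic Indicators data described in a later post.
df = pd.read_csv("economic-indicators.csv")

# Average each indicator by calendar month to expose seasonal patterns.
seasonal = df.groupby("Month")[["med_housing_price", "hotel_occup_rate"]].mean()
print(seasonal.round(2))
```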

Exploring the trends in foreclosure

Our most recent research focuses on trends in foreclosures, an important but sometimes disregarded area of the economy. We may learn more about the stability of the housing market and the state of the economy as a whole by looking at the variations in foreclosure petitions and deeds.

Other economic factors are closely associated with these trends. For example, increased joblessness may result in a greater number of foreclosures, and fluctuations in property values may impact the financial choices made by homeowners.

Excessive foreclosure rates affect community stability and represent actual difficulties that people and families face. They have an impact on more than just numbers.

This analysis provides insight into the housing market as well as a distinct viewpoint on the state of the economy as a whole. As we investigate and interpret these economic trends, stay tuned.

Domino effect

Today we are examining the Economic Indicators data, which reveals a fascinating web of interconnections between its components. For example, I'm looking at how passenger volume at Logan Airport can provide information about hotel occupancy rates and the overall health of business and tourism travel. Another interesting area of study is how the housing and employment markets interact: while a slow job market can cause a decline in real estate activity, a robust job market frequently drives strong demand for housing. Major development projects also have a notable impact on local economies, demonstrating how these endeavors can stimulate the housing market and create jobs. This article aims to unravel these economic strands and show how changes in one industry can spread to other areas, presenting a complete picture of our financial environment.
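
One simple way to start untangling these strands (a sketch, assuming the column names listed in the dataset post below) is to look at pairwise correlations between the indicators:

```python
import pandas as pd

# Assumed filename and column names; adjust to match the actual file.
df = pd.read_csv("economic-indicators.csv")
cols = ["logan_passengers", "hotel_occup_rate", "overall_jobs", "med_housing_price"]

# Pairwise correlations give a first look at how the indicators move together.
print(df[cols].corr().round(2))
```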

 

Trends in the housing market

This blog examines the housing market with a particular emphasis on the evolution of median home prices. This journey involves more than just prices; it is a reflection of the broader economy.

The median home price graph that we examined is comparable to a road map, illustrating the highs and lows of the market. Rising prices frequently indicate a robust economy with confident buyers and a strong demand for homes. Conversely, price declines or plateaus may indicate a cooling of the market, perhaps as a result of shifting consumer attitudes or economic difficulties.

However, these tendencies are not isolated. They are entwined with other economic strands such as interest rates, employment rates, and the general state of the economy. For example, a strong job market may increase people's ability to purchase homes, which would raise prices. Similarly, interest rate fluctuations can encourage or deter buyers and thereby influence prices.

It’s interesting to note that we also observed possible seasonal fluctuations in the housing market. Prices may be slightly impacted by periods of increased activity during the year.

It is essential to comprehend these subtleties in housing prices. It provides information on both the real estate market and the overall state of the economy. Buyers, sellers, investors, and policymakers can all benefit greatly from this analysis, which will help them make well-informed decisions in a constantly changing market.

EDA

Using a trend analysis of important economic indicators, we’re going to examine the Boston economy in more detail today. It’s similar to being an economic investigator in that we put together hints to figure out the overall picture.

The unemployment rate, hotel occupancy rates, and median home prices were our three primary points of interest. These all provide us with different insights. The unemployment rate tells us how many people are unemployed, much like a thermometer does for the labor market. It’s excellent news that when this number declines, more people typically have jobs!

We then examined the hotel occupancy rate, or how full hotels are. This rate gives us a glimpse of tourism and business travel. High occupancy frequently indicates more visitors and active business activity, while low occupancy might imply the opposite.

Finally, we investigated the median price of a home. This indicator functions somewhat as a window into the housing market. Rising prices may indicate a robust economy and high demand for homes; conversely, a decline or stagnation in prices may indicate a cooling of the market.

We can gauge the state of the economy by examining these patterns.
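
A minimal sketch of this kind of trend inspection, assuming the data is in a CSV with Year, Month, and indicator columns (the unemp_rate name here is a guess and may differ in the actual file):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Assumed filename and column names.
df = pd.read_csv("economic-indicators.csv")
df["date"] = pd.to_datetime(df["Year"].astype(str) + "-" + df["Month"].astype(str) + "-01")

indicators = ["unemp_rate", "hotel_occup_rate", "med_housing_price"]
fig, axes = plt.subplots(len(indicators), 1, figsize=(8, 8), sharex=True)
for ax, col in zip(axes, indicators):
    ax.plot(df["date"], df[col])  # one panel per indicator over time
    ax.set_title(col)
plt.tight_layout()
plt.show()
```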

New dataset: Economic Indicator

The dataset contains various economic statistics, arranged by month and year. An overview of what each column represents is given below:

  • Year and Month: The time period of each record, given as separate columns for year and month.
  • logan_passengers: The number of passengers travelling through Logan Airport.
  • logan_intl_flights: The number of international flights at Logan Airport.
  • hotel_occup_rate: The hotel occupancy rate.
  • hotel_avg_daily_rate: The average daily cost of lodging.
  • overall_jobs: The total number of jobs.
  • unemployment rate: The rate of unemployment.
  • employee_part_rate: The labour-force participation rate.
  • pipeline_unit: Details about real estate or development projects; may include the unit count.
  • pipeline_total_dev_cost: The total development cost of pipeline projects.
  • pipeline_sqft: The total square footage of pipeline development projects.
  • pipeline_const_jobs: The number of construction jobs created by pipeline projects.
  • number_of_foreclosure_petitions: The number of foreclosure petitions.
  • number_of_foreclosure_deeds: The number of foreclosure deeds.
  • med_housing_price: The median price of a home.
  • housing_sales_vol: The volume of housing sales.
  • New housing construction permits: The total number of new housing construction permits issued.
  • new_affordable_housing_permits: The number of new affordable housing permits issued.
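
A quick way to load and sanity-check the dataset (a sketch; the filename is assumed, and the columns should match the list above):

```python
import pandas as pd

# Assumed filename for the Analyze Boston economic indicators export.
df = pd.read_csv("economic-indicators.csv")

print(df.shape)         # rows (months) x columns (indicators)
print(df.dtypes)        # check that numeric columns were parsed as numbers
print(df.head())        # first few monthly records
print(df.isna().sum())  # missing values per column
```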

SARIMA

In the field of time series analysis, the SARIMA model serves as a foundation. SARIMA (Seasonal Autoregressive Integrated Moving Average), an extension of the ARIMA model, adds another level of complexity to forecasting and is especially helpful when handling seasonal data.

A statistical model called SARIMA forecasts subsequent points in a time series. It excels at processing data with seasonal patterns, such as monthly sales data that peaks around holidays or daily temperature variations from season to season. By incorporating seasonality, the model expands on ARIMA and gains greater adaptability.

Parts:

The components of the SARIMA model are as follows: moving average (MA), autoregressive (AR), integrated (I), and seasonal (S).

  • Seasonal (S): Models the seasonality in the data, capturing patterns that repeat over a fixed period.
  • Autoregressive (AR): Describes the relationship between an observation and a certain number of lagged observations.
  • Integrated (I): Differencing the time series to make it stationary, which is essential for many time series models.
  • Moving Average (MA): Models the relationship between an observation and the residual errors from a moving average model applied to lagged observations.
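
As a minimal sketch of how a SARIMA model could be fitted in Python with statsmodels (the hotel occupancy series and the (1, 1, 1)(1, 1, 1, 12) orders are placeholders, not tuned choices):

```python
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Assumed filename/columns: a monthly series such as hotel occupancy.
df = pd.read_csv("economic-indicators.csv")
dates = pd.to_datetime(df["Year"].astype(str) + "-" + df["Month"].astype(str) + "-01")
y = pd.Series(df["hotel_occup_rate"].values, index=dates)

# order = (p, d, q); seasonal_order = (P, D, Q, s) with s = 12 for monthly data.
model = SARIMAX(y, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))
result = model.fit(disp=False)

print(result.summary())
print(result.forecast(steps=12))  # forecast the next 12 months
```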

“Demystifying Time Series Analysis: A Guide to Forecasting and Pattern Recognition”

A crucial component of data science is time series analysis, which looks at collections of data points accumulated throughout time in sequence. This technique is essential for forecasting future trends based on historical data in various sectors, including meteorology and economics. This blog aims to make time series analysis more approachable for novices while maintaining its technical foundation.

The study of data points gathered at various times is the focus of time series analysis. It forecasts future trends, finds patterns, and extracts useful statistics. Numerous fields, including weather forecasting, market trends prediction, and strategic business planning, depend on this study.

Relevant Ideas:

Crucial concepts include trend (the long-term movement in the data), seasonality (patterns or cycles that repeat), noise (random variability), and stationarity (the assumption that statistical properties stay constant over time).
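
These concepts can be seen directly by decomposing a series into trend, seasonal, and residual (noise) parts; a sketch with statsmodels, reusing the monthly housing-price series as an assumed example:

```python
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

# Assumed filename and columns for a monthly series.
df = pd.read_csv("economic-indicators.csv")
dates = pd.to_datetime(df["Year"].astype(str) + "-" + df["Month"].astype(str) + "-01")
y = pd.Series(df["med_housing_price"].values, index=dates)

# Split the series into trend, seasonality, and residual noise (period = 12 months).
decomposition = seasonal_decompose(y, model="additive", period=12)
decomposition.plot()
plt.show()
```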

 

Making Data Decisions for Project 3: The Search for Perceptive Analysis

As we move forward with Project 3, we have an abundance of choices because the Analyze Boston website has 246 datasets available. Right now, our team is working to determine which of these options best fits the goals of our project. This selection procedure is essential since it establishes the framework for our analysis that follows. Once a dataset has been chosen, our attention will turn to carefully going over each of its details in order to find a compelling and obvious question that arises from the data. Our analysis will be built around this question, which will help us discover fresh perspectives. Our project is in an exciting phase right now, full of possibilities for exploration as well as obstacles to overcome.

“Understanding the Pros and Cons of Decision Trees in Data Analysis”

I have gained knowledge about decision trees in today’s class. In essence, decision trees are graphical depictions of decision-making procedures. Consider them as a sequence of inquiries and decisions that culminate in a decision. You start with the first question on the tree, and as you respond to each one, you move down the branches until you reach the final choice.

Choosing the most instructive questions to pose at each decision tree node is a necessary step in the construction process. Based on different characteristics of the data, these questions are chosen using statistical measures such as entropy, Gini impurity, and information gain. The objective is to choose the most pertinent attributes at each node in order to optimize the decision-making process.
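
A toy sketch of this idea with scikit-learn, where the criterion argument selects the statistical measure (Gini impurity or entropy/information gain) used to pick each split:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Built-in toy dataset, just to illustrate the mechanics.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# criterion="entropy" uses information gain; "gini" uses Gini impurity.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X_train, y_train)
print("test accuracy:", tree.score(X_test, y_test))
```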

Decision trees do have certain drawbacks, though, particularly in situations where the data shows a significant spread or departure from the mean. In our most recent Project 2, we came across a dataset where the mean was significantly off from the majority of data points, which reduced the effectiveness of the decision tree method. This emphasizes how crucial it is to take the distribution and features of the data into account when selecting the best statistical method for analysis. Although decision trees are a useful tool, their effectiveness depends on the type of data they are used on. In certain cases, other statistical techniques may be more appropriate for handling these kinds of scenarios.

Exploring k-means and DBSCAN

A clustering technique called k-means seeks to divide a set of data points into a predetermined number of groups, or “clusters.” The first step of the procedure is to choose “k” starting points, or “centroids,” at random. Each data point is then assigned to the closest centroid, and new centroids are recalculated as the average of all the points in each cluster. Assigning points to the nearest centroid and recalculating the centroids is repeated until the centroids no longer change noticeably. The result is “k” clusters: groups of data points that are closer to one another than they are to points in other clusters. The number “k,” which denotes the desired number of clusters, must be chosen by the user beforehand.

Data points are grouped by the DBSCAN clustering method according to their density and proximity. Unlike k-means, which requires the user to choose the number of clusters ahead of time, DBSCAN analyzes the data to identify high-density zones and distinguishes them from sparse areas. Each data point is given a neighbourhood, and if a sufficient number of points are close to one another (signalling high density), they are regarded as belonging to the same cluster. Data points in low-density zones are treated as noise and are not part of any cluster. Because of this, DBSCAN is particularly helpful for handling noisy data and for finding clusters of different sizes and shapes.
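
A small sketch contrasting the two methods on toy data (two interleaving half-moons, a shape that density-based clustering handles better than k-means):

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# k-means: the number of clusters k must be chosen up front.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN: clusters come from density; eps and min_samples define what "dense" means.
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

print("k-means cluster labels:", np.unique(kmeans_labels))
print("DBSCAN cluster labels (-1 = noise):", np.unique(dbscan_labels))
```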

Possible dangers:

DBSCAN –

  • Requires choosing the density parameters.
  • A poor choice may overlook clusters or merge distinct ones.
  • Struggles when the density of the clusters is variable.
  • Could label sparse clusters as noise.
  • Performance can deteriorate in high-dimensional data, where distance measures become less meaningful.
  • Points near the border of two clusters could be assigned essentially at random.

K-means:

  • The number of clusters must be specified in advance.
  • A poor choice can result in subpar clustering.
  • The final clusters may change depending on the random initialization.
  • Might converge to a local optimum given poor starting positions.
  • Assumes that clusters are roughly the same size and spherical.
  • Struggles with clusters that are elongated or irregular in shape.
  • Prone to distortion by outliers, which can pull cluster centroids.

 

“Exploring the Interplay of Age, Race, and Threat Levels in Relation to Mental Illness in Fatal Police Shootings: A Statistical Analysis”

We have explored the connections between factors such as age, ethnicity, and perceived danger levels in relation to signs of mental illness in our investigation of fatal police shootings.

Age and Mental Health: The investigation has shown a significant relationship between a person's age and the presence of mental illness indicators. A t-test comparing individuals who showed signs of mental illness with those who did not found a substantial age difference, with a t-statistic of 8.51 and a p-value of nearly zero. This result emphasizes how strongly age and mental health issues are associated in these cases.

Ethnicity and Mental Health: We first encountered data issues while examining ethnicity, but they were later resolved, and a chi-square test was carried out. The results showed a significant relationship between mental health symptoms and ethnicity, with a tiny p-value of 3.98×10^-35 and a chi-square value of 171.23.

Danger Perception and Mental Health: A chi-square statistic of 24.48 and a p-value of 4.82×10^-6 indicate a significant association between the perceived danger level and mental health indicators in our study.
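
For reference, a chi-square test like the ones above can be run with scipy on a contingency table of counts; this is only a sketch, and the filename and column names (race, signs_of_mental_illness) are assumptions about the dataset:

```python
import pandas as pd
from scipy import stats

# Assumed filename and column names for the fatal police shootings data.
df = pd.read_csv("fatal_police_shootings.csv")

# Contingency table of race vs. mental illness indicators, then the chi-square test.
table = pd.crosstab(df["race"], df["signs_of_mental_illness"])
chi2, p, dof, expected = stats.chi2_contingency(table)
print(f"chi-square = {chi2:.2f}, p-value = {p:.2e}")
```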

In incidents of lethal police contact, the research has illuminated the strong relationships between age, ethnicity, perceived danger level, and mental health markers. These results pave the way for more thorough studies and improve our understanding of these pivotal moments. The distribution of threat levels and their interactions with other variables will be the main focus of our upcoming study phase.

Analyzing Racial Disparities in Fatal Police Shootings: A Call for Comprehensive Reform

Gaining insights into this pressing issue requires a comprehensive analysis of data spanning from 2015 to 2023. The information indicates a generally stable pattern in the monthly occurrence of events, albeit with minor fluctuations. This demonstrates the enduring nature of the problem over time.

The racial breakdown of these incidents presents a stark scenario. White individuals, who constitute a significant portion of the U.S. population, are involved in approximately 50.89% of fatal police shootings. The data about the black community is particularly alarming. Despite representing only about 13% of the U.S. population, they account for a disproportionately high 27.23% of fatal police shootings. Hispanics follow, making up around 17.98% of such incidents, while other racial groups, including Asians, Native Americans, and others, are represented in smaller percentages, at 1.99%, 1.62%, and 0.29%, respectively.

This information underscores the critical need for deeper exploration and potential enhancements in policing practices, especially considering the stark disparities in how different racial groups are affected.

Solving some flukes

When I dug further into the data, I tried to answer a question I had from the beginning about the age difference between blacks and whites.

The age difference between blacks and whites is approximately 7 years.

Now the question is: can it really be possible, or is this just a fluke?

Now, when we plot histograms of the ages of blacks and whites from the data, we can see that the means of the two distributions are different.

 

We can see that the graph deviates from the normal distribution, which means the data is not normally distributed.

Now, if we want to know whether the 7-year difference is a fluke, we could try a t-test; but because the data is not normally distributed, the t-test can produce a misleading p-value.

So, in a case like this, we can use a Monte Carlo method to find the p-value.

In this approach, we make a pool of all the data, repeatedly draw random groups from the pool, and estimate the p-value from how often the resampled difference is at least as large as the observed one.
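
A sketch of this Monte Carlo (permutation) approach, with the filename and the race/age column names assumed:

```python
import numpy as np
import pandas as pd

# Assumed filename and column names.
df = pd.read_csv("fatal_police_shootings.csv")
ages_b = df.loc[df["race"] == "B", "age"].dropna().to_numpy()
ages_w = df.loc[df["race"] == "W", "age"].dropna().to_numpy()

observed_diff = abs(ages_w.mean() - ages_b.mean())

# Pool both groups, repeatedly reshuffle the pool into two groups of the original
# sizes, and count how often the shuffled mean difference is at least as large as
# the observed one. That fraction is the Monte Carlo p-value.
rng = np.random.default_rng(0)
pool = np.concatenate([ages_b, ages_w])
n_b = len(ages_b)
n_iter = 100_000
count = 0
for _ in range(n_iter):
    rng.shuffle(pool)
    if abs(pool[:n_b].mean() - pool[n_b:].mean()) >= observed_diff:
        count += 1

print("Monte Carlo p-value:", count / n_iter)
```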

As a result, I observed that the probability of getting an age difference of 7 years by chance is nearly 0; in none of the random samples did the difference reach 7 years.

So, for the next step, I will try to find other patterns, such as statistics on whether the person who died was armed.

“Uncovering Patterns: Exploring the Relationship Between Gun Violence, Geography, and Gang Influence in the United States”

Today in class we learned some new functions and a whole new library which is Geopy.

So, what Geopy does is help us locate a geographical position on the globe using third-party geocoders.

This could be helpful for plotting a geographic map based on data like ours.
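
A minimal sketch of using Geopy's Nominatim geocoder (the place name and coordinates are just examples; real use should respect Nominatim's rate limits):

```python
from geopy.geocoders import Nominatim

# Nominatim is a free third-party geocoding service; it requires a user_agent string.
geolocator = Nominatim(user_agent="police-shootings-blog-example")

location = geolocator.geocode("Austin, Texas")      # place name -> coordinates
print(location.latitude, location.longitude)

address = geolocator.reverse("42.3601, -71.0589")   # coordinates -> place name
print(address.address)
```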

Now, here is what we can observe: the number of shootings in coastal areas, such as the East Coast and the West Coast, is very large.

There can be several reasons for this, such as population, crime rate, etc.

One thing I suspected is that the number of guns in a state might also affect the number of shootings, but on this map Texas does not have as many locations as the states on the West Coast. This observation may be affected by population: Texas alone has two hotspots, maybe because more people live there. I think that if I dig further into this, it can answer some of these questions.

The second thing I want to know is whether there is any gang influence in the areas where the density of shootings is high. With this, I can get answers to two of my questions: first, whether any gang influence is present, and second, how much it influences the youth, which could be an explanation for early deaths in some races.

Finding answers

So today to find the answers to the questions from the previous blog I analyzed the data even further.

First, I tried to find some type of pattern: is there any particular community at the top, and is there a significant difference between the top community and the others?

Before answering this question: the Washington Post blog, which can be found here, states that the number of deaths in the black community is significant and that there is some sort of racial discrimination by the police. But when I analyzed the data, I found that the number of white people who died is approximately twice the number in the black community, with significant differences from the other communities as well.

Here two things can be observed:

  1. Because new deaths are continuously being added to the data, the findings can change over time.
  2. When we look at it from a different point of view, we can say that the number of deaths in black communities is higher if we compare it to their share of the population.

Now, one more question I faced: is there anything problematic going on in this data, or is this just a normal finding?

In the upcoming days, I will try to find the answers to these questions and develop stronger findings.

The second project’s first discussion!

Today we discussed the next project that we must do, and in this project we got some data on police shootings. The data records how many deaths there were and the race of each person involved.

So initially if we look at the data, we find that the rate of death by police shooting of black people is significantly higher than that of white people.

Why is that? This is one of the questions we will try to answer with our analysis.

So today in class when we were discussing this project I asked the professor a question,

What are the initial questions we are trying to answer here?

To this, he explained that we must find the questions by looking at the data. By this he meant that when we initially don't know what to do, we can simply run some basic commands on the data and just look at it.

So, after I had done that, I observed the following:

  1. The total number of deaths is highest among whites compared with all other races.
  2. Although the total number of deaths is higher in whites, if we look at deaths as a ratio of population, then blacks have a significantly higher rate.
  3. Deaths at an early age are higher in blacks, followed by Hispanics.

This raises several questions:

Why is there a significant difference in deaths between races?

What are the reasons that the age of death in the black community is so low?

Is this just a fluke, or does the recorded data reflect a real pattern?

We will try to answer these questions in this project.

Report day 1

Today we started to write our report, and to do so we must complete a series of tasks:

  1. Collecting all the data and findings from all the other colleagues
  2. Addressing the issue for the report
  3. Concluding the report based on our findings.
  4. Presenting the results to the person we are addressing in the report.

Today we are focusing more on the first two steps.

The first thing is collecting all the data and findings from others on this project, like the different approaches they have taken and the results of those approaches.

But the most important aspect of this report is the issue.

What is the issue?

Why are we doing this?

What are we trying to find with this project?

So, for today, these are the two things we will tackle, after that, we will move to the next step of this project report.

“Navigating the Challenge of Predicting Diabetes: Defining Goals and Measuring Success”

After trying to find the relation between age and diabetes, I found that it is simply not possible, because if we group the data by state we end up with even less data for our analysis.

So, I tried other things to get better results from my model, like adjusting the data to perform weighted least squares (WLS), as sketched below.
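
For reference, a heavily simplified sketch of what a WLS fit could look like with statsmodels; the filename, column names, and especially the choice of weights are placeholders, not what we actually settled on:

```python
import pandas as pd
import statsmodels.api as sm

# Assumed filename and column names for the merged county-level data.
df = pd.read_csv("final.csv")
X = sm.add_constant(df[["%inactivity", "%obesity"]])
y = df["%diabetic"]

# Placeholder weighting scheme: down-weight observations in proportion to the
# squared residuals of an initial OLS fit (one common rough heuristic).
ols_resid = sm.OLS(y, X).fit().resid
weights = 1.0 / (ols_resid ** 2 + 1e-6)

wls_result = sm.WLS(y, X, weights=weights).fit()
print(wls_result.summary())
```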

But even with all this, I am still not able to get a good result out of this model.

This forced me back to square one and left me with a big question:

What exactly are we trying to do?

Are we trying to make a model to predict diabetes from inactivity and obesity? Or

Are we trying to report back to the CDC on whether it can be predicted or not?

If we are trying to make a model, then we have not achieved much success so far; how can we decide that a given accuracy is the one we are looking for?

This may be a very basic question, but it is very important for me because it changes the final report so much.

And on the topic of the final report: in parallel, we are starting to write it, because we have to submit it this coming Monday.

“Exploring the Relationship Between Diabetes and Age: Adding a Third Independent Variable for Improved Model Accuracy”

Continuing from the previous conversation, I got an idea which may or may not help us to improve the r2 score of the model.

The idea was why not introduce a third independent variable?

So, for this, I turned toward Google and began researching the leading cause of diabetes.

After a long and intense search on Google, I came to know about a very interesting relationship between diabetes and the age of the patient.

What I found is that diabetes and age have a linear relationship with each other: the number of patients increases as age increases.

From this I got the idea of introducing a third independent variable, an “age factor”.

In addition, I found some data on the CDC website, but that data is divided by state and not by county, so I'll be modifying the current data.

But I have some doubts that I want to discuss with the professor, so I will first get those cleared and then test my theory.

“Unveiling K-Fold Cross-Validation: Exploring the Choice of Folds and Grasping Training and Testing Errors”

In the previous blog, I mentioned that I tried k-fold cross-validation with 10 folds.

Here a question arises: why 10 folds and not 3, 4, or 5, and does it even matter how many folds you take?

The simple answer is no, it should not matter much how many folds you choose.

It is not a rule or norm that you necessarily have to take 5 folds or 10 folds (like in our case). You can take as many folds as you require.

K-fold is just a way to solve a problem where the training data is limited. It just depends upon what we require for the project at that time.

Now we will understand training error and test error.

  • Training error

In simple words, when we evaluate the model on the same data we trained it on, we are calculating the training error.

  • Testing error

In simple words, if we evaluate the model on data that was unseen while training the model (the test data), then we are calculating the test error.

One more important point the professor made in class today: when we train the model on the training data and then test it on the testing data, if more than one identical value is present at testing time, the model will not treat them as separate records, and this will affect the accuracy of the model.

In this case, the accuracy we were getting previously was around 33%, which is questionable.

So, to overcome this, we can label every record to give it a unique identification.

For example, we have data.

[x1, y1, z1] and [x2, y2, z2]

Then we can add a number to each record to give it a unique identification:

[1, x1, y1, z1] and [2, x2, y2, z2]

For today, I must look at the data and label it in the hope of getting better accuracy from the model.

Investigating the Correlation Between %Diabetics, %Inactivity, and %Obesity: A Comprehensive Regression Analysis

First we made a separate file, final.csv, in which we combined the values from the different files (“diabetic”, “Obesity”, and “Inactivity”). This file contains the ‘county_state’ column, which we obtained by merging the county and state fields.

After that, with the help of an inner join, we obtained the related data from all the files.

After we collected the data, we made a linear model with two independent variables (obesity and inactivity) and one dependent variable (diabetes).

For this, we split the data into two sets.

  • Training set
  • Testing set

Here I decided to put 60% of the data into the training set and the remaining 40% into the testing set.

After this, we tried to calculate the performance of our model by looking at its accuracy, but my model was only 22% accurate.
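
A sketch of this split-and-fit step with scikit-learn, assuming the column names in final.csv:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Assumed column names in the merged final.csv.
df = pd.read_csv("final.csv")
X = df[["%inactivity", "%obesity"]]
y = df["%diabetic"]

# 60% of the rows for training, the remaining 40% for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.6, random_state=0)

model = LinearRegression().fit(X_train, y_train)
print("R-squared on test data:", r2_score(y_test, model.predict(X_test)))
```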

After that, I plotted some scatter plots: predicted vs. actual values, and then the residual plot.

 

In this plot we observe that the data points are clumped in the middle, which is not good for a model; they should be more evenly scattered.

[plot]

The same can be observed in the residual plot: the residuals should be scattered all over the place, but they are mostly concentrated in the middle.

After this, I tried to apply k-fold cross-validation which we learned today in class.

Cross-validation

According to Wikipedia

Cross-validation is a resampling method that uses different portions of the data to test and train a model on different iterations. It is mainly used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in practice.

In simpler words,

In this, we divide the data into several chunks (folds), hold out one chunk for testing, and train the model on all the remaining chunks, repeating the process so that each chunk is used for testing once.
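
A sketch of 10-fold cross-validation with scikit-learn (column names assumed, as in the earlier sketch):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

df = pd.read_csv("final.csv")
X = df[["%inactivity", "%obesity"]]
y = df["%diabetic"]

# Each of the 10 folds is held out once for testing while the other 9 train the model.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")
print("R-squared per fold:", scores.round(3))
print("mean R-squared:", scores.mean())
```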

So, after I applied k-fold cross-validation to my model with 10 folds, I got an R-squared score of about 33%.

Exploring Additional Variables in Regression Analysis: Uncovering the Relationship Between %Diabetics, %Inactivity, %Obesity, and More

Today, after attending the doubt-clearing session, I got an idea: why not try to find some relationship by bringing in more data beyond these two variables?

After this, I looked for additional data on the official site where we found the original data, which can be found here.

With this approach, I tried data such as whether a county is urban or rural (this idea came from one of my classmates) and economic data, but after analyzing those datasets I found no relationship in either case.

After this, I tried multiple regression with diabetes as the dependent variable and inactivity and obesity as the independent variables.

From the resulting summary, the R-squared was approximately 39%.

That was today's work; tomorrow we will try some modifications in the hope of finding a better R-squared value.

“Uncovering the Statistically Significant Mean Difference in Crab Shell Sizes: A Tale of Skewed Data, Kurtosis, and the Infeasibility of Exhaustive Sampling”

Today in the lecture we tried to understand a linear model which is fit to data where both variables are non–normally distributed, skewed, with high variance and high kurtosis.

In this, we use data which contains two variables:

  • pre-molt
  • post-molt

Here, pre-molt means the size of a crab's shell before molting, and post-molt means the size of the shell after molting.

Here we try to make a model to predict pre-molt size from post-molt size.

We take the data from the Stat Labs book, chapter 7, page 139.

After we compute descriptive statistics and plot the post-molt data, we get a graph with high skewness and kurtosis.

We did the same thing for the pre-molt data and got a similar graph.

 

Then we compared both graphs side by side and observed that the two graphs are quite similar, with a difference in means of approximately 14.

This observation raises the question of whether this difference in means is statistically significant or just a fluke.

To answer this, we try to do a t-test.

T-test

According to the JMP website:

t-test (also known as Student’s t-test) is a tool for evaluating the means of one or two populations using hypothesis testing. A t-test may be used to evaluate whether a single group differs from a known value (a one-sample t-test), whether two groups differ from each other (an independent two-sample t-test), or whether there is a significant difference in paired measurements (a paired, or dependent samples t-test).
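
For reference, a two-sample t-test like this can be run with scipy; the filenames for the pre-molt and post-molt measurements are assumptions:

```python
import numpy as np
from scipy import stats

# Assumed files, each containing one column of shell sizes from the Stat Labs data.
pre_molt = np.loadtxt("pre_molt.csv")
post_molt = np.loadtxt("post_molt.csv")

# Two-sample t-test on the difference in means between post-molt and pre-molt sizes.
t_stat, p_value = stats.ttest_ind(post_molt, pre_molt, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.2e}")
```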

Here we combine 472 values from pre-molt and 472 values from post-molt into a single pool of 944, and then perform 10 million random re-samplings from that pool.

Here, the estimated z-score of the difference in means between pre-molt and post-molt is approximately 13.3. We must keep in mind that our 10 million random samples are still a very small fraction of the roughly 3.86 x 10^282 possible samples.

Even if we take a supercomputer with a trillion samples a second it will take us 9 x 10^252 ages of our universe to obtain all those samples.

Investigating the Correlation Between %Diabetics, %Inactivity, and %Obesity: A Regression Analysis

Today we learned about doing regression with two variables.

For my project, the equation for multiple regression can be given as

Y = β0 + β1X1 + β2X2 + …

Here Y is %diabetics, X1 is %inactivity, and X2 is %obesity.

Now before today’s class, we were trying to find the relation between %diabetics and %inactivity, but in this approach, we got stuck so this time we tried to execute multiple regression.

Diabetics and inactivity

Now, when we look at %diabetics against the single variable %inactivity, the R-squared (which is just the square of the Pearson correlation between the two variables) is approximately 0.1952, so we can say that %inactivity alone explains only about 20% of the variation in %diabetics.

Diabetics, inactivity, and obesity

Now, if we initially make a linear model with two variables x1 and x2 (here x1 is inactivity and x2 is obesity), then the R-squared of this linear model is approximately 34%, but here is where it gets interesting.

If we do the same thing, but build the linear model after centring the variables, then the R-squared of this model is approximately 36%; this time we observe that the R-squared has increased by roughly 2%.

But what if we try to make a quadratic model??

Now, if we make a quadratic model, it is observed that the R-squared becomes approximately 38%.
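
A sketch of the centred and quadratic models with statsmodels (filename and column names assumed):

```python
import pandas as pd
import statsmodels.api as sm

# Assumed column names for the merged county-level data.
df = pd.read_csv("final.csv")
y = df["%diabetic"]

# Centre the predictors (subtract their means), then add squared terms so the
# model can capture quadratic curvature.
x1 = df["%inactivity"] - df["%inactivity"].mean()
x2 = df["%obesity"] - df["%obesity"].mean()
X = pd.DataFrame({"x1": x1, "x2": x2, "x1_sq": x1 ** 2, "x2_sq": x2 ** 2})

result = sm.OLS(y, sm.add_constant(X)).fit()
print("R-squared:", result.rsquared)
```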

So why don’t we just increase the power of x until we get a model that has the highest R-squared value?

This was the first thing we thought of, but then the professor told us about one more important concept that comes into the picture: overfitting.

Overfitting

Let's understand what overfitting means with an example: if we keep increasing the power of the model as mentioned above, it will fit only the selected data, and if we try to use this model on other data sets it will simply not work.

This concludes today but,

I have some doubts I will be asking the professor about: why did we choose a quadratic model? This was not my original doubt but one that a classmate asked, and I didn't quite understand the answer.

Applying Linear Regression and Analyzing Outliers: A Statistical Analysis Journey

Today I tried to apply all the week’s learning to the project our professor gave us.

Today’s goals were:

  1. Plotting a linear regression graph.
  2. Analyzing the graph.
  3. Finding the p-value.

Task 1

So, after continuously staring at the Excel sheet for some time, I tried to plot a linear regression graph for %diabetics and %inactivity.

So, I plotted the graph in which the x-axis is the %inactivity data and the y-axis is the %diabetic data.
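
A sketch of this plot with matplotlib, fitting the line with a simple least-squares polyfit (filename and column names assumed):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Assumed filename and column names for the CDC county data.
df = pd.read_csv("final.csv")
x = df["%inactivity"].to_numpy()
y = df["%diabetic"].to_numpy()

# Scatter the points and overlay a least-squares regression line.
slope, intercept = np.polyfit(x, y, deg=1)
plt.scatter(x, y, s=10, alpha=0.5)
plt.plot(x, slope * x + intercept, color="red")
plt.xlabel("% inactivity")
plt.ylabel("% diabetic")
plt.show()
```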


Task 2

After the graph was plotted, I observed that there were many outliers, and on further inspection an important observation and a warning sign is that the linear model is heteroscedastic.

Task 3

Now I thought: why not find the p-value, taking %diabetics as the null hypothesis and %inactivity as the alternative hypothesis? But for that, the number of data rows for both parameters should be the same.

For now, I am stuck on this problem.

The Significance of P-Value in Statistical Analysis

Today we learned about one of the most important topics in statistics, which was

the p-value.

At the start of the class, the professor asked whether anybody knew what the p-value means.

After not getting any answers, the professor showed us a video on the p-value and explained the concept to us.

That video can be used to understand the p-value.

P-Value

The p-value is the value we use to measure how likely it is to get a result at least as extreme as the one observed if the null hypothesis is true.

Let us understand this by an example:

Suppose I am showing you guys a magic trick.

Let's say we have a coin. Now I say that I got tails 41 times in a row; after this, there are two hypotheses.

  1. The null hypothesis: According to this, I say that the coin is fair.
  2. The alternative hypothesis: In this, you state that getting 41 tails in a row is suspicious and there is some kind of trickery.

Now let's understand the terms: the null hypothesis, which is denoted by H0, and the alternative hypothesis, which is denoted by Ha.

Here, H0 is the hypothesis against which we are providing evidence; in our case, it is that the coin is fair.

Ha is the alternative hypothesis, which is that there is some kind of trickery behind this.

Now say that the first time, I got tails,

which is not that suspicious because there is still a 0.50 probability (a 50 per cent chance) of that happening. But after 3 or 4 tails in a row you start to get suspicious, because the chance of that is only about 0.0625 (6.25 per cent).

After this, we find that the p-value under our null hypothesis is very low, which means the null hypothesis can be rejected, and it turns out the coin was actually a two-tailed coin!!

Tails in a row    Probability
1                 0.50
2                 0.25
3                 0.125
4                 0.0625
5                 0.03125
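
These probabilities are just 0.5 raised to the number of tails in a row; a one-liner check:

```python
# Probability of n tails in a row with a fair coin is 0.5 ** n.
for n in range(1, 6):
    print(n, 0.5 ** n)

print("41 tails in a row:", 0.5 ** 41)  # about 4.5e-13, effectively zero
```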

In conclusion, what I understood is that if the p-value is sufficiently small, then the null hypothesis is rejected.

“How Our First Day of Class Went: A Recap”

On the first day of my journey in data science, I attended my first “Advanced Statistics” class.

In this class, my professor Mr. Garry Davis taught us about our first topic: linear regression. Within this topic, he covered several things:

  • Simple linear regression
  • Kurtosis
  • Heteroscedasticity or heteroskedasticity (both spellings are correct)
  • Z-score

The first thing we learned is simple linear regression. The simplest way to describe simple linear regression is to predict the response of y based on a single variable x.

Here y and x are two values which can change according to the project.

Mathematically, simple linear regression is represented by the following formula:

Y ≈ β0 + β1X

(This is a reference to the book “An Introduction to Statistical Learning”, page 70, equation 3.1.)

After this professor told us about kurtosis. According to my understanding, kurtosis is simply the average of the 4th power of z score.

Now, the z-score tells us how many standard deviations a data point is from the mean; in my own words, it is the deviation of a given value from the mean, measured in standard deviations.

The mathematical representation of this is:

z = (x - x̄) / s

In this, x̄ means the mean and s means the standard deviation.
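
A tiny sketch computing z-scores and the kurtosis-as-average-fourth-power idea on made-up numbers:

```python
import numpy as np

# Made-up data; x_bar is the mean and s the (sample) standard deviation.
x = np.array([4.0, 7.0, 6.0, 5.0, 9.0, 3.0, 8.0])
x_bar = x.mean()
s = x.std(ddof=1)

z = (x - x_bar) / s            # z-score of each data point
kurtosis = np.mean(z ** 4)     # kurtosis as the average 4th power of the z-scores
print(np.round(z, 2), round(float(kurtosis), 2))
```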

 

The professor used a graph to describe this, and it helped me understand the concept very well.

The last concept which I learned is heteroscedasticity which unlike its name is a very easy topic to understand.

As I understand it, heteroscedasticity means that the spread of the errors is not constant: if the errors in the graph are dispersed with varying spread, it is heteroscedasticity, and if the spread stays constant, it is homoscedasticity.