Introduction & Motivation
Research Questions
Data Collection
Data Description
User Score
Preliminary Analysis
Methodology & Diagnostics
Results
Discussion
World Gross
Preliminary Analysis
Methodology & Diagnostics
Results
Discussion
Conclusion
Sources
Film has become a major art form, inspiring writers and directors to tell all sorts of riveting stories. At the same time, films can be regarded as a form of investment that is heavily dependent on capital and industrial standards. Film has also become a major form of entertainment, consuming countless hours of many people's lives. Given how influential films are to creators, investors, and casual movie-goers, it is important to ask: what makes a film "successful"? Naturally, a question like this is difficult to answer directly, since "success" can be defined in many different ways. To address this, we decided to associate a movie's "success" specifically with its user score and world gross. With "success" explicitly defined, we were able to narrow our focus and ask the following questions.
What are the significant factors that impact whether or not a movie gets a good user score?
What are the significant factors that impact the world gross of a movie?
All of our data comes from the following websites: IMDB Top 1000 Movies, IMDB Bottom 1000 Movies, The Numbers Budget & Financial Performance, Oscar Winners, and Insider Top 27 Movie Franchises. We used web scraping to extract the data we wanted from each website and then combined the different datasets into one final dataset. We cleaned our data by dropping any rows with missing values, recategorizing several variables such as genre, and making sure the different datasets matched up well with each other. We used several functions to assist us in creating our final dataset.
Using web scraping, we wrote a function that builds a data frame containing the release date, movie title, production budget, and worldwide gross for 6,345 movies from The Numbers website. The data frame contains every movie on the website, with corrections for data-entry errors such as NA values and unknown release dates. During data cleaning, we used additional functions to standardize movie titles, removing titles with non-English characters, stray symbols, and misspellings, and to match similar titles across the different data frames so that each movie's information could be combined.
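As an illustration, here is a minimal sketch of this kind of scraping function, using requests, BeautifulSoup, and pandas. The URL and the table's cell layout are assumptions about the site's markup, not a verified recipe, and would need adjusting if the page structure differs.

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

def scrape_budgets(url):
    """Scrape one page of a budget table into a DataFrame.
    The column order (rank, release date, title, budget,
    domestic gross, worldwide gross) is an assumption."""
    resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    rows = []
    for tr in soup.select("table tr")[1:]:  # skip the header row
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) >= 6:
            rows.append({
                "release_date": cells[1],
                "title": cells[2],
                "budget": cells[3],
                "gross_wor": cells[5],
            })
    df = pd.DataFrame(rows)
    # strip "$" and "," so the money columns parse as numbers;
    # blank or malformed entries become NaN and are dropped later
    for col in ["budget", "gross_wor"]:
        df[col] = pd.to_numeric(
            df[col].str.replace(r"[$,]", "", regex=True), errors="coerce")
    return df

budgets = scrape_budgets("https://www.the-numbers.com/movie/budgets/all")
```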
By web scraping the IMDB websites, we extracted information about the top 1000 and bottom 1000 movies: movie title, meta score, user score, year, genre, movie rating, duration in minutes, director(s), and lead actor. After extracting these 2000 rows, we dropped any rows with missing values, leaving 1663 rows. Looking at the data, we noticed that each movie can list many different genres. We simplified this by reducing each row to its most dominant genre, which left us with 9 genre categories. We also added several new dummy-variable columns, such as director_one, which indicates whether or not a movie had only one director.
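The genre simplification and dummy-variable steps could look like the sketch below. The mapping choices and column names (genre, directors) are illustrative assumptions, and imdb stands for the scraped IMDB data frame.

```python
import pandas as pd

# Hypothetical mapping from IMDB genre labels to our nine simplified
# categories; the exact grouping choices here are illustrative.
GENRE_MAP = {
    "Action": "action_adv", "Adventure": "action_adv",
    "Animation": "animation", "Biography": "bio",
    "Comedy": "comedy", "Drama": "drama", "Horror": "horror",
    "Romance": "romance", "Sci-Fi": "fantasy_sci", "Fantasy": "fantasy_sci",
}

def dominant_genre(genre_str):
    """Reduce a comma-separated IMDB genre list (e.g. 'Comedy, Drama')
    to a single dominant category."""
    genres = [g.strip() for g in genre_str.split(",")]
    if "Comedy" in genres and "Drama" in genres:
        return "comedy_drama"
    for g in genres:            # otherwise take the first genre we recognize
        if g in GENRE_MAP:
            return GENRE_MAP[g]
    return "drama"              # assumed fallback for unmapped genres

# imdb is the scraped IMDB data frame with 'genre' and 'directors' columns
imdb["genre_simple"] = imdb["genre"].apply(dominant_genre)
imdb["director_one"] = (imdb["directors"].str.count(",") == 0).astype(int)
imdb = pd.concat(
    [imdb, pd.get_dummies(imdb["genre_simple"], prefix="genre")], axis=1)
```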
In addition to this, we utilized an undocumented API when extracting information from the Oscars website. More specifically, we extracted all Oscar winners from 1934 to 2021 and then filtered for winners who were either actors or directors. Afterward, we created functions that produce new binary variables telling us which movies have an Oscar-winning lead and which have an Oscar-winning director.
We also web scraped the Insider website, extracting movies that belong to the top franchises selected by movie critics. Notable franchises include Spider-Man, Batman, Star Wars, and James Bond. After extracting these titles, we created functions that produce a new indicator variable marking the movies in our dataset that are part of an established franchise/IP.
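All three indicator variables (ip, oscar_lead, oscar_director) follow the same pattern: check each movie's title against a reference set. A hedged sketch, where franchise_titles, oscar_lead_titles, and oscar_director_titles are assumed to be the title lists produced by the scraping steps above, and movies is the combined data frame:

```python
def flag_membership(df, titles, col_name):
    """Add a 0/1 column marking whether each movie's title appears
    in a reference set (franchise titles, Oscar-winning leads, etc.)."""
    title_set = set(titles)
    df[col_name] = df["title"].isin(title_set).astype(int)
    return df

movies = flag_membership(movies, franchise_titles, "ip")
movies = flag_membership(movies, oscar_lead_titles, "oscar_lead")
movies = flag_membership(movies, oscar_director_titles, "oscar_director")
```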
 | year | gross_wor | budget | score_user | score_user_good | score_meta | ip | oscar_lead | director_one | oscar_director | ... | rating_r | genre_action_adv | genre_animation | genre_bio | genre_comedy | genre_comedy_drama | genre_drama | genre_horror | genre_romance | genre_fantasy_sci
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 2008 | 2.690657e+08 | 105000000.0 | 5.1 | 0 | 34.0 | 0 | 0 | 1 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 1995 | 1.688415e+08 | 29000000.0 | 8.0 | 1 | 74.0 | 0 | 0 | 1 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
2 | 2013 | 1.807651e+08 | 20000000.0 | 8.1 | 1 | 96.0 | 0 | 0 | 1 | 0 | ... | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 2019 | 3.891404e+08 | 100000000.0 | 8.2 | 1 | 78.0 | 0 | 0 | 1 | 1 | ... | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 2003 | 5.966762e+07 | 20000000.0 | 7.6 | 1 | 70.0 | 0 | 0 | 1 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
930 | 2007 | 8.308008e+07 | 85000000.0 | 7.7 | 1 | 78.0 | 0 | 0 | 1 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
931 | 2011 | 1.708055e+08 | 80000000.0 | 5.2 | 0 | 30.0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
932 | 2016 | 5.534869e+07 | 50000000.0 | 4.7 | 0 | 34.0 | 0 | 0 | 1 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
933 | 2006 | 1.250619e+07 | 35000000.0 | 4.3 | 0 | 26.0 | 0 | 0 | 1 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
934 | 2016 | 1.004630e+09 | 150000000.0 | 8.0 | 1 | 78.0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
935 rows × 27 columns
In the end, our analysis utilizes the dataset shown above, with 935 observations and 27 variables: year, gross_wor, budget, score_user, score_user_good, score_meta, ip, oscar_lead, director_one, oscar_director, mins, log_year, log_gross_wor, log_budget, log_mins, rating_pg, rating_pg_13, rating_r, genre_action_adv, genre_animation, genre_bio, genre_comedy, genre_comedy_drama, genre_drama, genre_horror, genre_romance, and genre_fantasy_sci. year is the year a movie was released. gross_wor is the world gross of a movie in USD. budget is the production budget of a movie in USD. score_user and score_meta are user ratings (scaled from 0 to 10) and critic ratings (scaled from 0 to 100), with the dummy variable score_user_good indicating whether or not the user rating is at least 7. ip is a dummy variable indicating whether or not a movie is part of an established IP, such as Spider-Man, Batman, Star Wars, or James Bond. oscar_lead and oscar_director are dummy variables telling us whether or not a movie has an Oscar-winning lead actor or an Oscar-winning director, respectively. director_one tells us whether or not a movie has exactly one director. mins is the duration of a movie in minutes. log_year, log_gross_wor, log_budget, and log_mins are the logs of year, gross_wor, budget, and mins, respectively. The rating- and genre-related variables are indicators of whether a movie has a particular rating or belongs to a particular genre.
Before applying any methodology, we conducted some preliminary analysis on score_user. We started by creating scatterplots with score_user as our response variable and year, gross_wor, budget, mins, and score_meta as separate explanatory variables. Rather than following a linear or dispersed pattern, the points in each plot cluster into two bands, one below and one above score_user = 7. This indicates that score_user behaves less like a continuous response and more like a binary one, with one category where score_user is less than 7 ("bad") and another where score_user is at least 7 ("good").
This binary characteristic of score_user is also highlighted in its distribution. As shown above, the distribution is bimodal, with one center around score_user = 5 and the other around score_user = 8. Again, this indicates that we can divide score_user into two groups, one representing "bad" user scores and the other representing "good" user scores.
Given the "binary" characteristic revealed by the scatterplots and distribution, we believed that it would be most appropriate to apply a logististic regression. However, with this technique, the response of each observation must independently follow a Bernoulli distribution. Now, it is quite difficult for us to confirm independence since we lack information on the specific people writing movie reviews (i.e. the score of a horror film will likely not be independent from the score of a romance film if they are both reviewed by people who are hardcore horror fans). And so, for the purposes of our analysis, we assume that there is independence.
With that said, we started by creating a new feature called score_user_good, which converts score_user into a Bernoulli variable equal to 1 if score_user is at least 7 ("good") and 0 if score_user is less than 7 ("bad").
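In pandas, this conversion is a one-liner, where movies is our combined data frame:

```python
# 1 if the user score is "good" (>= 7), 0 otherwise
movies["score_user_good"] = (movies["score_user"] >= 7).astype(int)
```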
With the addition of score_user_good, we can conduct our preliminary analysis with a new perspective. Instead of considering how score_user is correlated with other continuous variables, we can compare different distributions to gain some understanding of how different values for a continuous input affect the probability of a movie getting a good user score.
It is worth mentioning that we also had to create the features log_year, log_gross_wor, log_budget, and log_mins, which log-transform year, gross_wor, budget, and mins respectively. This is because there were extremely low values for year and extremely high values for gross_wor, budget, and mins, which could dilute the effect of other covariates. Applying a log-transformation puts these variables on a common log scale so they won't dilute the impact of any other covariates. We did not, however, apply this to score_meta, since certain observations have score_meta = 0 (NOTE: log(0) is undefined). We believe this will not cause many issues, since score_meta has no values extreme enough to dilute other covariates.
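A sketch of the transformation, assuming numpy and the movies data frame from before:

```python
import numpy as np

# put the heavy-tailed continuous inputs on a common log scale;
# score_meta is left untouched because some movies have score_meta = 0
for col in ["year", "gross_wor", "budget", "mins"]:
    movies[f"log_{col}"] = np.log(movies[col])
```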
With that said, the plots above show the distributions of log_year, log_gross_wor, log_budget, log_mins, and score_meta given that a movie gets a good user score (marked in green) or a bad user score (marked in red). For the log_gross_wor and log_budget plots, there is a lot of overlap between the good and bad distributions, which suggests that log_gross_wor and log_budget may each be independent of score_user_good. For the log_year and log_mins plots, there is still a considerable amount of overlap; however, the probability of a movie getting a good user score appears slightly higher when it is recent and slightly lower when it is longer. This suggests that log_year and log_mins may have a slight effect on whether or not a movie gets a good user score. The score_meta plot shows the least overlap, and we can clearly see that the probability of a good user score is higher given a higher meta score. This suggests that score_meta may have a strong effect on whether or not a movie gets a good user score.
 | P(score_user_good=1|x=1) | P(score_user_good=1|x=0)
---|---|---
ip | 0.915254 | 0.425799 |
oscar_lead | 0.701031 | 0.428401 |
director_one | 0.452244 | 0.515152 |
oscar_director | 0.803922 | 0.436652 |
rating_pg | 0.407821 | 0.468254 |
rating_pg_13 | 0.334262 | 0.532986 |
rating_r | 0.589421 | 0.358736 |
genre_action_adv | 0.429204 | 0.465444 |
genre_animation | 0.568182 | 0.451178 |
genre_bio | 0.943396 | 0.427438 |
genre_comedy | 0.169492 | 0.498164 |
genre_comedy_drama | 0.603175 | 0.446101 |
genre_drama | 0.746575 | 0.403042 |
genre_horror | 0.198473 | 0.498756 |
genre_romance | 0.391304 | 0.465854 |
genre_fantasy_sci | 0.435897 | 0.457589 |
In order to gain some understanding of how our binary covariates may affect score_user_good, we calculated the sample probabilities of getting a good user score when a covariate is either 1 or 0 (see the sketch below). In the table above, several covariates have roughly equal probabilities in both columns (NOTE: the difference tolerance we set is at most 10 percentage points). This suggests that a movie getting a good user score may not depend on whether or not it is directed by one director, PG, action/adventure, romance, or sci-fi/fantasy. This leaves us with the remaining covariates, where the probability of getting a good user score when x = 1 differs from when x = 0. This suggests that a movie getting a good user score may depend on whether or not it is part of an established IP, PG-13, R, animation, biography, comedy, comedy-drama, drama, horror, has an Oscar lead, or has an Oscar director.
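The table above can be reproduced with a groupby over each binary covariate; this sketch assumes the movies data frame defined earlier:

```python
import pandas as pd

binary_covs = ["ip", "oscar_lead", "director_one", "oscar_director",
               "rating_pg", "rating_pg_13", "rating_r",
               "genre_action_adv", "genre_animation", "genre_bio",
               "genre_comedy", "genre_comedy_drama", "genre_drama",
               "genre_horror", "genre_romance", "genre_fantasy_sci"]

# mean of a 0/1 response within each group is exactly the
# sample probability P(score_user_good = 1 | x = 0 or 1)
probs = pd.DataFrame({
    x: movies.groupby(x)["score_user_good"].mean() for x in binary_covs
}).T
probs.columns = ["P(good | x=0)", "P(good | x=1)"]
```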
 | coef | std err | z | P>|z|
---|---|---|---|---
const | 1381.323500 | 291.223000 | 4.743000 | 0.000000 |
log_year | -269.034500 | 55.629000 | -4.836000 | 0.000000 |
log_budget | -1.873400 | 0.362000 | -5.174000 | 0.000000 |
log_gross_wor | 0.915600 | 0.216000 | 4.244000 | 0.000000 |
log_mins | 11.755400 | 2.550000 | 4.609000 | 0.000000 |
score_meta | 0.202000 | 0.024000 | 8.569000 | 0.000000 |
ip | 0.712800 | 1.065000 | 0.669000 | 0.503000 |
oscar_lead | 1.510200 | 1.072000 | 1.409000 | 0.159000 |
director_one | -0.439600 | 1.044000 | -0.421000 | 0.674000 |
oscar_director | 0.531600 | 1.470000 | 0.362000 | 0.718000 |
rating_pg | 459.395300 | 96.729000 | 4.749000 | 0.000000 |
rating_pg_13 | 460.527000 | 97.189000 | 4.738000 | 0.000000 |
rating_r | 461.401200 | 97.309000 | 4.742000 | 0.000000 |
genre_action_adv | 152.872000 | 32.308000 | 4.732000 | 0.000000 |
genre_animation | 157.433600 | 33.004000 | 4.770000 | 0.000000 |
genre_bio | 154.614600 | 32.416000 | 4.770000 | 0.000000 |
genre_comedy | 151.378200 | 32.095000 | 4.717000 | 0.000000 |
genre_comedy_drama | 153.490200 | 32.403000 | 4.737000 | 0.000000 |
genre_drama | 154.426000 | 32.464000 | 4.757000 | 0.000000 |
genre_horror | 150.792100 | 32.141000 | 4.692000 | 0.000000 |
genre_romance | 152.767500 | 32.247000 | 4.737000 | 0.000000 |
genre_fantasy_sci | 153.549300 | 32.273000 | 4.758000 | 0.000000 |
 | VIF
---|---
rating_pg_13 | inf |
rating_r | inf |
genre_romance | inf |
genre_horror | inf |
genre_drama | inf |
genre_comedy_drama | inf |
genre_comedy | inf |
genre_bio | inf |
genre_animation | inf |
genre_action_adv | inf |
genre_fantasy_sci | inf |
rating_pg | inf |
log_mins | 2.365346 |
log_budget | 2.336267 |
score_meta | 2.064101 |
log_gross_wor | 2.048218 |
log_year | 1.410934 |
ip | 1.308119 |
director_one | 1.160660 |
oscar_director | 1.090596 |
oscar_lead | 1.083236 |
 | coef | std err | z | P>|z|
---|---|---|---|---
const | -1.036500 | 0.376000 | -2.760000 | 0.006000 |
ip | 3.441800 | 0.503000 | 6.849000 | 0.000000 |
oscar_lead | 0.930200 | 0.281000 | 3.309000 | 0.001000 |
director_one | -0.559000 | 0.329000 | -1.698000 | 0.089000 |
oscar_director | 1.625100 | 0.415000 | 3.917000 | 0.000000 |
rating_pg | 0.661200 | 0.263000 | 2.511000 | 0.012000 |
rating_r | 1.661300 | 0.202000 | 8.244000 | 0.000000 |
genre_animation | 0.856600 | 0.424000 | 2.021000 | 0.043000 |
genre_bio | 3.425900 | 0.633000 | 5.414000 | 0.000000 |
genre_comedy | -1.097400 | 0.319000 | -3.439000 | 0.001000 |
genre_comedy_drama | 0.865200 | 0.330000 | 2.620000 | 0.009000 |
genre_drama | 1.366300 | 0.273000 | 4.997000 | 0.000000 |
genre_horror | -1.472600 | 0.315000 | -4.671000 | 0.000000 |
genre_romance | 0.191400 | 0.273000 | 0.701000 | 0.483000 |
genre_fantasy_sci | -0.009200 | 0.405000 | -0.023000 | 0.982000 |
 | VIF
---|---
director_one | 4.358739 |
rating_r | 2.284861 |
rating_pg | 1.929548 |
genre_drama | 1.732625 |
genre_horror | 1.619540 |
genre_comedy | 1.557111 |
genre_romance | 1.486488 |
genre_animation | 1.395103 |
genre_comedy_drama | 1.301074 |
genre_bio | 1.273952 |
oscar_lead | 1.167202 |
genre_fantasy_sci | 1.152787 |
ip | 1.132938 |
oscar_director | 1.088296 |
As of now, we suspect that variables such as log_year, log_mins, score_meta, ip, oscar_lead, oscar_director, rating_pg_13, rating_r, genre_animation, genre_bio, genre_comedy, genre_comedy_drama, genre_drama, and genre_horror may be significant factors that impact whether or not a movie gets a good user score. We put this to the test by first fitting a full logistic model on score_user_good. Although we cannot confirm the independence of each observation, we feel logistic regression is the most appropriate technique because, as shown earlier, score_user behaves like a binary variable.
With that said, the first pair of tables above shows the full-model summary along with VIF measures for each covariate (NOTE: VIF measures how correlated a given covariate is with the other covariates, where a value greater than 5 indicates high correlation). As shown in the VIF table, the full model has covariates with extremely large VIF values, indicating that multicollinearity is present. With multicollinearity, we are more uncertain about the true effect a particular covariate has on the response, which may explain why some of our standard errors are extremely large. We remedied this by removing the covariate with the highest VIF, one at a time, until all covariates had VIF values below 5, which left us with the reduced model summarized above.
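A sketch of this procedure using statsmodels, where covariates is the full list of explanatory variable names; the helper prune_by_vif is our own illustrative function, not a library routine:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def prune_by_vif(X, threshold=5.0):
    """Drop the covariate with the largest VIF, one at a time,
    until every remaining covariate has VIF below the threshold."""
    X = X.copy()
    while True:
        vifs = pd.Series(
            [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
            index=X.columns)
        if vifs.max() < threshold:
            return X
        X = X.drop(columns=vifs.idxmax())

# full model: may warn about collinearity, matching the inf VIFs above
X_full = sm.add_constant(movies[covariates])
full_model = sm.Logit(movies["score_user_good"], X_full).fit()

# reduced model after VIF-based elimination
X_reduced = sm.add_constant(prune_by_vif(movies[covariates]))
reduced_model = sm.Logit(movies["score_user_good"], X_reduced).fit()
print(reduced_model.summary())
```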
Based on our reduced model summary, ip, oscar_lead, oscar_director, rating_pg, rating_r, genre_animation, genre_bio, genre_comedy, genre_comedy_drama, genre_drama, and genre_horror are the significant factors that impact whether or not a movie gets a good user score. Specifically, holding all other variables constant, the estimated difference in log-odds of getting a good user score is 3.44 between movies that are part of an established IP and those that aren't, 0.93 between movies with and without an Oscar lead, 1.63 between movies with and without an Oscar director, 0.66 between PG and non-PG movies, 1.66 between R and non-R movies, 0.86 between animation and non-animation movies, 3.43 between biography and non-biography movies, -1.10 between comedy and non-comedy movies, 0.87 between comedy-drama and non-comedy-drama movies, 1.37 between drama and non-drama movies, and -1.47 between horror and non-horror movies.
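Because these are log-odds differences, exponentiating converts them into odds ratios; for example, the ip coefficient of 3.44 corresponds to roughly exp(3.44) ≈ 31 times the odds of a good user score, all else equal:

```python
import numpy as np

# odds ratio for each covariate in the reduced model
odds_ratios = np.exp(reduced_model.params)
```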
In other words, it is more likely for a movie to get a good user score when it is part of an established IP, PG, R, animation, biography, comedy-drama, drama, Oscar-led, or Oscar-directed. On the other hand, it is less likely for a movie to get a good user score when it is comedy or horror.
However, it is worth noting that some of the coefficient p-values in our reduced model may be smaller than they should be, overstating significance. As such, the reported effects on the likelihood of getting a good user score may be stronger than the actual truth.
Although some of our reported effects may be "exaggerated", many of the general trends that we are seeing are quite sensible.
For instance, it is not peculiar to see IP play a significant role in increasing the likelihood of getting a good user score. One possible explanation is that movies part of an established IP like Spider-Man or Batman can bring back warm, nostalgic memories. This may result in casual viewers feeling overly joyous and sentimental, so much so that they feel compelled to write a good review.
In addition to this, it is not surprising that Oscar leads and directors also increase the chance of users giving good scores. This could be explained through the simple notion that these acclaimed individuals possess the acting and directing skills necessary to provide the most exciting, enthralling movie experience. Such captivation is bound to leave audience members more than satisfied to give a good rating.
It is also not far-fetched to see genres like biography and drama boost a movie's user score. This is likely connected to how biography and drama are serious, grounded genres, which would push writers to tell deep, compelling stories that engage with movie-goers. This may eventually result in viewers developing a profound connection with a movie and thus giving a good score. A similar explanation could be provided for R movies, but rather than serious genres, we'd be talking about serious ratings.
Animation also appears to boost user score. This could be because animated movies are generally geared towards a younger audience, striving to teach children meaningful life lessons. It is likely that many parents pick up on these underlying messages and may feel obligated to post good reviews in order to let other parents know that a particular movie is worth watching with their children. A similar explanation could be provided for PG movies.
While these factors are shown to increase the chance of getting a good user score, comedy and horror movies are shown to decrease it. One possible explanation is that comedy and horror movies are generally geared towards teenagers, who often go to the movies to "turn off their brains". This likely results in many producers spending money on "cool fight scenes" rather than talented writers. Without people to write engaging stories, more mature viewers are bound to feel disappointed with their movie experience and leave a poor score.
Before applying any methodology, it is worth noting that the scatterplots show weak positive or no correlation between log_gross_wor and the other numeric variables, with one exception: the scatterplot of log_budget vs. log_gross_wor shows a clear positive correlation. In context, this suggests that movies with higher budgets tend to gross more worldwide, while factors like year, mins, and score_meta won't necessarily increase worldwide gross.
For the boxplots of score_user_good, ip, oscar_lead, director_one, and oscar_director, only the boxplots for oscar_lead and director_one have similar medians and overlapping boxes between the two groups within each variable, which indicates that having an Oscar lead or having only one director might not make a difference to a movie's worldwide gross. For the other variables, the medians and box lengths differ between groups, which indicates that having a good user score, being an IP movie, or having an Oscar director tends to make a noticeable difference to a movie's worldwide gross.
For the boxplots of different ratings, it is worth pointing out that all boxplots have roughly the same center and shape, which suggests that ratings might not influence a movie's worldwide gross.
For the boxplots of different genres, it is worth pointing out that all boxplots have roughly the same center and shape, which suggests that genre might not influence a movie's worldwide gross.
As of now, we suspect that variables such as log_budget, score_user_good, ip, and oscar_director may be significant factors that impact a movie's worldwide gross. However, we put this to the test by first fitting a full linear regression model on log_gross_wor and checking for homoskedasticity, normality, and influential points.
The fitted-versus-residuals plot shows the spread of the residuals decreasing as the fitted values change, forming a funnel shape. This tells us the variance is not constant, so we conclude that there is heteroskedasticity. The plot has dashed lines at -3 and 3, with the regions below -3 and above 3 shaded; any point in these shaded regions is an outlier. From this, we know there are several outliers.
The second plot is a normal QQ plot, which tells us about the distribution of the residuals. If the errors are normally distributed, we expect almost all residuals to align with the straight blue line. In this case, however, the distribution is not normal but slightly left-skewed.
In the third plot, the leverage-versus-residuals plot, there do not appear to be any influential points. Although there are a considerable number of outliers, none of them have extremely large leverage.
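The three diagnostic plots can be generated from the fitted model's influence measures; this sketch assumes gross_covariates is our list of explanatory variable names for log_gross_wor:

```python
import matplotlib.pyplot as plt
import statsmodels.api as sm

ols_model = sm.OLS(movies["log_gross_wor"],
                   sm.add_constant(movies[gross_covariates])).fit()

influence = ols_model.get_influence()
std_resid = influence.resid_studentized_internal  # standardized residuals
leverage = influence.hat_matrix_diag

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# fitted vs residuals: a funnel shape signals heteroskedasticity
axes[0].scatter(ols_model.fittedvalues, std_resid, s=10)
axes[0].axhline(3, ls="--")
axes[0].axhline(-3, ls="--")
axes[0].set(xlabel="fitted values", ylabel="standardized residual")

# QQ plot: departures from the line indicate non-normal errors
sm.qqplot(std_resid, line="45", ax=axes[1])

# leverage vs residuals: points far right AND far out are influential
axes[2].scatter(leverage, std_resid, s=10)
axes[2].set(xlabel="leverage", ylabel="standardized residual")
plt.show()
```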
 | coef | std err | P>|t|
---|---|---|---
Intercept | -158.300400 | 36.108000 | 0.000000 |
log_year | 30.554700 | 6.866000 | 0.000000 |
log_budget | 0.632400 | 0.039000 | 0.000000 |
log_mins | 0.657700 | 0.282000 | 0.020000 |
score_meta | 0.006800 | 0.003000 | 0.018000 |
score_user_good | 0.922500 | 0.156000 | 0.000000 |
ip | 0.591900 | 0.163000 | 0.000000 |
oscar_lead | -0.189300 | 0.118000 | 0.110000 |
director_one | -0.309800 | 0.146000 | 0.034000 |
oscar_director | -0.020200 | 0.159000 | 0.899000 |
rating_pg | -52.443400 | 12.014000 | 0.000000 |
rating_pg_13 | -52.736600 | 12.047000 | 0.000000 |
rating_r | -53.120400 | 12.048000 | 0.000000 |
genre_action_adv | -17.615400 | 4.003000 | 0.000000 |
genre_animation | -17.731800 | 4.045000 | 0.000000 |
genre_bio | -17.716300 | 4.015000 | 0.000000 |
genre_comedy | -17.499800 | 4.006000 | 0.000000 |
genre_comedy_drama | -17.881700 | 4.019000 | 0.000000 |
genre_drama | -17.594300 | 4.011000 | 0.000000 |
genre_horror | -17.129400 | 4.011000 | 0.000000 |
genre_romance | -17.651800 | 4.011000 | 0.000000 |
genre_fantasy_sci | -17.479900 | 4.003000 | 0.000000 |
 | abs_res
---|---
67 | 7.182360 |
900 | 5.495361 |
517 | 4.384742 |
762 | 4.306547 |
691 | 3.986845 |
618 | 3.793118 |
882 | 3.767409 |
358 | 3.601483 |
860 | 3.559064 |
901 | 3.445995 |
738 | 3.066468 |
701 | 3.024348 |
 | stat | conclude
---|---|---
error_corr_dw | 1.865000 | NOT CORRELATED |
r_sq_adj | 0.520000 | MODEL EXPLAIN AT LEAST HALF VAR |
The third table shows that the adjusted R-squared is 0.52, meaning 52% of the variability observed in the response variable is explained by the regression model. We also applied the Durbin-Watson test for autocorrelation in the residuals. The Durbin-Watson statistic ranges from 0 to 4, and our test statistic is 1.87, which indicates there is no autocorrelation of the errors.
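Both statistics come directly from the fitted statsmodels results; a minimal sketch, reusing ols_model from the diagnostics above:

```python
from statsmodels.stats.stattools import durbin_watson

dw = durbin_watson(ols_model.resid)  # values near 2 mean no autocorrelation
adj_r2 = ols_model.rsquared_adj      # 0.52 in our fit
```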
After dropping outliers such as observations 67, 900, 517, 762, 691, 618, 882, 358, 860, 901, 701, 260, 659, 496, 738, 830, 718, and 838, the fitted-versus-residuals plot shows the spread of the residuals remaining fairly constant as the fitted values change. This suggests that homoskedasticity may be upheld.
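One way to implement the drop, assuming the outliers are exactly the observations with a standardized residual above 3 in absolute value (the specific indices listed above would come from inspecting a table like abs_res), and that movies has a default integer index aligned with std_resid from the diagnostics above:

```python
# keep only observations whose standardized residual is within [-3, 3],
# then refit the linear model on the remaining rows
keep = abs(std_resid) <= 3
refit = sm.OLS(movies.loc[keep, "log_gross_wor"],
               sm.add_constant(movies.loc[keep, gross_covariates])).fit()
```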