For example, a research study might claim that ice cream consumption affects the crime rate. This can be expressed mathematically as follows:

Crime_Rate = b0 + b1*(Ice_Cream_Consumption_Rate) + ε

Here, b0 captures the part of the crime rate that may change independently of the ice cream consumption rate or any other variable, and ε is the error term, which may contain factors that are not considered in the study. This relationship omits variables, such as weather, that may affect the crime rate. Hence, without the weather variable we may be tempted to infer that ice cream sales affect crime rates.
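The effect of omitting weather can be illustrated with a small simulation. This is only a sketch under invented assumptions: the variable names, the temperature distribution, and every coefficient in the data-generating process are hypothetical, and ice cream is deliberately given no direct effect on crime.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares coefficients for y ~ X (X already contains an intercept column)."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def simulate(n=5000, seed=0):
    rng = np.random.default_rng(seed)
    # Hypothetical process: temperature drives BOTH variables,
    # and ice cream consumption has no direct effect on crime.
    temperature = rng.normal(20.0, 8.0, n)
    ice_cream = 2.0 * temperature + rng.normal(0.0, 5.0, n)
    crime = 1.5 * temperature + rng.normal(0.0, 5.0, n)

    ones = np.ones(n)
    # Regression with weather omitted: crime ~ ice_cream
    short = fit_ols(np.column_stack([ones, ice_cream]), crime)
    # Regression with weather included: crime ~ ice_cream + temperature
    full = fit_ols(np.column_stack([ones, ice_cream, temperature]), crime)
    return short[1], full[1]

b1_omitted, b1_controlled = simulate()
print(f"ice cream coefficient, weather omitted:  {b1_omitted:.3f}")
print(f"ice cream coefficient, weather included: {b1_controlled:.3f}")
```

Under these assumptions the coefficient on ice cream consumption is clearly positive when weather is left out and close to zero once weather is included, even though ice cream never enters the crime equation.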

Warm weather is a time when criminals may be encouraged to go out and commit robbery or other crimes. However, warm weather may also induce the general population to purchase ice cream, which drives up ice cream sales. Weather therefore moves both variables, even if neither one causes the other.

The movie regression is heavily biased for the same reason: it only takes into account the effect of a movie's rank on its subsequent earnings. These earnings may also depend on the star power of the cast (for example, does it feature Bradley Cooper, Seth Rogen, Adam Sandler, or Amy Adams?) and/or the brand of the production house (is it produced by Walt Disney, Pixar Studios, Marvel Studios, etc.?). Omitted variable bias is the most important factor that affects the regression.

Reverse Causation

To explain this factor, let us assume there are two events, A and B. If the two are correlated, we cannot infer whether A happens because of B or B happens because of A. For example, suppose a dating website analyzes its user data and estimates a correlation between men's hygiene levels, H, and the rate at which women select men, R. It cannot be proved whether R increases due to H or H increases due to R. Similarly, it cannot be inferred whether a movie's rank affects its earnings or its earnings affect its rank.

The relationship can be either one of the following:

Subsequent Earnings = b0*exponential(b1*Release Rank)
Release Rank = b0*exponential(b1*Subsequent Earnings)

Sampling Bias

This factor arises from a flaw in selecting our sample, which may then not be an accurate representation of the population. A simple example of this bias is taking a sample of high school students to measure illegal drug use. This may not be an accurate sample because, by only considering high school students, we are excluding home-schooled students and dropouts, who are also part of the population.
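The high school example can be made concrete with a short simulation. This is a sketch with made-up numbers: the group sizes and drug-use rates below are hypothetical and are not taken from any real survey.

```python
import numpy as np

def survey(seed=1, sample_size=5_000):
    rng = np.random.default_rng(seed)
    # (group size, share using drugs): invented values for illustration only.
    groups = {
        "in_school": (80_000, 0.10),
        "home_schooled": (5_000, 0.05),
        "dropout": (15_000, 0.40),
    }
    # Integer group label for every person in the population.
    group_id = np.concatenate(
        [np.full(size, i) for i, (size, _) in enumerate(groups.values())]
    )
    # True/False drug-use indicator for every person.
    uses = np.concatenate(
        [rng.random(size) < rate for size, rate in groups.values()]
    )
    population_rate = uses.mean()

    # Biased sample: drawn only from students currently in school (group 0).
    in_school = np.flatnonzero(group_id == 0)
    biased = rng.choice(in_school, size=sample_size, replace=False)

    # Simple random sample: drawn from the whole population.
    random_sample = rng.choice(uses.size, size=sample_size, replace=False)

    return population_rate, uses[biased].mean(), uses[random_sample].mean()

population_rate, biased_rate, random_rate = survey()
print(f"population rate:      {population_rate:.3f}")
print(f"school-only sample:   {biased_rate:.3f}")
print(f"random sample:        {random_rate:.3f}")
```

With these assumed rates, the school-only sample understates usage because the excluded dropouts use drugs more often, while the random sample stays close to the population figure.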

Even if they are included in the sample, we may not have included a representative number of home-schooled students and dropouts relative to the population. Similarly, the movies in our database may be only of selected genres or from particular nations, and a genre such as action may not be included.

We can use more data to reduce the effect of the factors listed above on the regression. This data would include more movie-defining vectors such as Production House Release, Actors in the Movie, Marketing Budget, Genre, Languages Released In, etc.

More data, such as the reasoning behind the selection of the movies in our movie database, would help in understanding whether this data is an accurate representation of the entire movie population.

Part ID

To explain why these regressions are biased, let us assume an equation of the following type:

y = b1*x + ε

This might be the true model of the data. Here, a naive inference can be drawn that x, through its coefficient b1, affects y. However, ε also contains many variables which have a causal effect on y. If any of them is also correlated with x, the estimated b1 picks up part of their effect.

In our movie database regression, ε may contain many variables, such as star power, movie genre, and production house, which also affect y. Omitted variable bias could still arise here because we may be missing some vectors that classify movies on further parameters. However, we use our best judgment and data to reduce this.

Part 3 a)

The data is captured in such a way that the relationship between Subsequent Earnings and the Rank of the movie is represented by an exponential equation such as:

Subsequent Earnings = b0*exponential(b1*Release Rank)

However, there are many cases where the movie has secured a rank but has earned zero Subsequent Earnings.
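One way to see why zero earnings are awkward for an exponential form such as Subsequent Earnings = b0*exponential(b1*Release Rank) is sketched below. It assumes, purely for illustration, that the model is fitted by taking logarithms and running a linear regression; the source does not say how the fit is done, and the ranks and earnings are invented.

```python
import numpy as np

# Hypothetical ranks and subsequent earnings; two movies earned nothing.
release_rank = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
subsequent_earnings = np.array([120.0, 75.0, 0.0, 30.0, 0.0, 11.0])

# Taking logs turns  earnings = b0*exp(b1*rank)
# into               log(earnings) = log(b0) + b1*rank.
with np.errstate(divide="ignore"):
    log_earnings = np.log(subsequent_earnings)

# log(0) is -inf, so zero-earning movies cannot enter the log-linear fit.
usable = np.isfinite(log_earnings)
n_dropped = int((~usable).sum())

# np.polyfit with deg=1 returns (slope, intercept).
b1, log_b0 = np.polyfit(release_rank[usable], log_earnings[usable], deg=1)
b0 = np.exp(log_b0)

print(f"movies dropped because earnings are zero: {n_dropped}")
print(f"fitted b0 = {b0:.2f}, b1 = {b1:.3f}")
```

In this sketch the zero-earning movies have to be left out before fitting, and a curve with positive b0 can never predict exactly zero earnings at any rank.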