powerpoint presentation

Model Comparison

Movie Breakeven Analysis in the U.S. Market

Liu Jialin | Priyadarshini Majumdar | Zhang Jiexi

Data Analytics Lab Project Challenge from Nov 23rd onwards at a theatre near YOU

INTRODUCTION

METHODOLOGY

What plays the most important role in making a movie profitable?

Movie technical

• Language 4.3/10
• Content rating 4/10
• Aspect ratio 3/10
• Budget 2.5/10
• Duration 1.5/10
• Colour or B&W 1/10

IMDB website influence

• No. of IMDB users who voted 9/10
• No. of users who reviewed 8/10
• No. of critics for reviews 6/10
• IMDB score 5/10

Facebook influence

• Movie Facebook likes 4.5/10
• Actor 3 Facebook likes > Actor 2 > Actor 1
• Cast total Facebook likes 3.5/10
• Director Facebook likes 3.2/10

Poster and promotional materials

• No. of faces in a poster 2.6/10

objectives

Data Processing

Remove duplicate entries in JMP.

Calculate gross profit = (gross − budget) / budget, as a percentage.

Create the binary Profit/Loss target variable and remove missing values.
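The gross profit calculation and the binary target can be sketched in a few lines of Python. This is only an illustration, not the JMP workflow used in the project: the field names `gross` and `budget`, the sample rows, and the rule that a profit percentage of zero or more counts as breaking even are all assumptions.

```python
def profit_pct(gross, budget):
    """Return (gross - budget) / budget as a percentage."""
    return (gross - budget) / budget * 100.0


def label_movies(rows):
    """Drop rows with missing gross/budget and add profit_pct plus a 0/1 target."""
    labelled = []
    for row in rows:
        gross, budget = row.get("gross"), row.get("budget")
        # Rows with missing values (or an unusable budget) are removed.
        if gross is None or budget is None or budget <= 0:
            continue
        pct = profit_pct(gross, budget)
        # Assumption: a profit percentage >= 0 is labelled 1 (breaks even).
        labelled.append({**row, "profit_pct": pct, "profit": 1 if pct >= 0 else 0})
    return labelled


# Hypothetical rows, not taken from the IMDB data set.
movies = [
    {"title": "A", "gross": 150.0, "budget": 100.0},
    {"title": "B", "gross": 40.0, "budget": 80.0},
    {"title": "C", "gross": None, "budget": 50.0},
]
labelled = label_movies(movies)
```

Here movie A gets a profit of 50% and target 1, movie B gets −50% and target 0, and movie C is dropped because its gross is missing.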

SAS Enterprise Miner:

• Import the JMP file using the File Import and Save Data nodes.

• Change the level for Aspect Ratio to nominal in the File Import node.

• Conduct text parsing, text clustering and text filtering on plot keywords and genres.

• Use the MultiPlot node to view the distribution of the variables.

• Recode missing values and erroneous entries using the Replacement node.

• Split the data into a training set and a validation set using the Data Partition node.
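As a rough illustration of what the partition step does, the sketch below splits labelled rows into training and validation sets while keeping the Profit/Loss mix similar in both. The 70/30 ratio, the stratification, the seed and the key name `profit` are assumptions; the poster does not state the settings used in the Data Partition node.

```python
import random


def stratified_split(rows, target_key, train_fraction=0.7, seed=42):
    """Split rows into (train, valid), sampling each target class separately."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[target_key], []).append(row)
    train, valid = [], []
    for group in by_class.values():
        shuffled = group[:]          # copy so the caller's list is not reordered
        rng.shuffle(shuffled)
        cut = round(len(shuffled) * train_fraction)
        train.extend(shuffled[:cut])
        valid.extend(shuffled[cut:])
    return train, valid


# Hypothetical rows: 10 profitable and 10 loss-making movies.
rows = [{"id": i, "profit": i % 2} for i in range(20)]
train, valid = stratified_split(rows, "profit")
```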

Before running the parametric models, fill in all missing values using the Impute node and transform the interval variables with skewed distributions using the Transform Variables node.
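The imputation and transformation steps can be illustrated as follows. This is a sketch under assumptions: median imputation and a log(1 + x) transform are common choices, but the poster does not say which methods were selected in SAS Enterprise Miner, and the vote counts are made up.

```python
import math
import statistics


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    med = statistics.median(observed)
    return [med if v is None else v for v in values]


def log_transform(values):
    """Compress a right-skewed, non-negative variable with log(1 + x)."""
    return [math.log1p(v) for v in values]


# Hypothetical, heavily right-skewed vote counts with one missing value.
votes = [10, 200, None, 50, 100000]
filled = impute_median(votes)
transformed = log_transform(filled)
```

After the transform the largest value is about 11.5 instead of 100000, so a single blockbuster no longer dominates the scale of the variable.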

Predictive Model Construction

Decision Tree

A decision tree is a nonparametric algorithm capable of fitting a large number of functional forms and mapping observations to categorical targets.
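To make the idea concrete, the sketch below finds the single best split of one variable for a 0/1 target, which is the step a decision tree repeats recursively. Gini impurity is assumed as the splitting criterion (the poster does not name the one used), and the vote counts and labels are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best split `value <= threshold`."""
    best = (None, gini(labels))
    # The largest value is skipped so the right branch is never empty.
    for threshold in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical data: number of IMDB users who voted, and the Profit/Loss target.
num_voted_users = [100, 250, 900, 4000, 12000, 30000]
profit = [0, 0, 0, 1, 1, 1]
threshold, impurity = best_split(num_voted_users, profit)
```

On this toy data the split at 900 votes separates the two classes perfectly, so the weighted impurity drops to zero.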

Model Comparison

Conclusion

Background: Movies are one of the top-grossing industries in the world today; in the U.S. alone, it was a 38 billion dollar market as of 2016.

Motivation: IMDB is one of the most visited sites, and viewers often use it to decide whether to watch a movie. Hence it has a direct effect on whether a movie makes a profit or a loss.

Primary Objective: To develop a model that can predict whether a movie will break even in the U.S. market or not.

Secondary Objective: To inform promoters who use social media for movie promotion which factors affect the outcome of a movie.

Confusion Matrix for Model Comparison

Gradient Boosting

A Gradient Boosting model builds a strong learner from a sequence of weak learning trees, using the gradient descent algorithm. It is computationally intensive and performs very well with a moderate number of variables after fine-tuning.
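A minimal sketch of this idea for a 0/1 target is shown below: each round fits a one-split tree (a stump) to the pseudo-residuals, that is, the target minus the current predicted probability, and adds a scaled copy of it to the running score. It is a toy illustration and not the SAS Enterprise Miner implementation; the single input variable, the number of rounds, the learning rate and the data are all assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(xs, residuals):
    """Fit a one-split regression stump to the residuals by least squares."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, rounds=50, learning_rate=0.5):
    """Return a function mapping x to the predicted probability of class 1."""
    base_rate = sum(ys) / len(ys)
    initial = math.log(base_rate / (1.0 - base_rate))  # starting log-odds
    stumps = []
    scores = [initial] * len(xs)
    for _ in range(rounds):
        # Pseudo-residuals: negative gradient of the logistic loss.
        residuals = [y - sigmoid(s) for y, s in zip(ys, scores)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        scores = [s + learning_rate * stump(x) for s, x in zip(scores, xs)]
    return lambda x: sigmoid(initial + learning_rate * sum(st(x) for st in stumps))


# Hypothetical data: one input variable and a noisy 0/1 target.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
predict = gradient_boost(xs, ys)
```

Calling `predict(1)` gives a probability below 0.5 and `predict(8)` a probability above 0.5, because every round nudges the scores in the direction that reduces the remaining residuals.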

Logistic Regression

Logistic regression describes the relationship between a categorical target variable and independent variables by estimating the probability from a cumulative logistic distribution.
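In code, the probability estimate is the logistic function applied to a weighted sum of the inputs. The weights, intercept and feature values below are hypothetical and are not coefficients from the fitted model.

```python
import math


def logistic_probability(features, weights, intercept):
    """Probability of class 1 from the cumulative logistic distribution."""
    linear = intercept + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-linear))


# Hypothetical coefficients for two already-transformed input variables.
weights = [0.8, -0.3]
intercept = -1.0
p = logistic_probability([2.0, 1.0], weights, intercept)
```

A movie would then be classified as breaking even when `p` exceeds a chosen cut-off, commonly 0.5.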

Neural Network

A neural network is a parametric model that accommodates a wide variety of nonlinear relationships. It also keeps in check the curse of dimensionality, which bedevils attempts to model nonlinear functions with a large number of variables.

Data set

5043 movie titles

28 variables

The data set was scraped from IMDB using Python's Scrapy library. This resulted in 5043 observations of 28 variables.

Random Forest

Random forest is an ensemble of decision trees. It averages the predicted probabilities of a large number of overtrained decision trees, and is thus more robust against overfitting and generalizes better than a single decision tree.
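The averaging step can be sketched as below. This is a simplified illustration only: each tree is a one-split stump trained on a bootstrap sample of a single variable, whereas a real random forest grows deep trees and also samples the input variables at each split. The data, tree count and seed are assumptions.

```python
import random


def fit_stump(xs, ys):
    """Fit a one-split tree; each leaf predicts the share of 1s that fall in it."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        p_left, p_right = sum(left) / len(left), sum(right) / len(right)
        # Weighted Gini impurity, up to a constant factor.
        score = len(left) * p_left * (1 - p_left) + len(right) * p_right * (1 - p_right)
        if best is None or score < best[0]:
            best = (score, threshold, p_left, p_right)
    if best is None:  # the bootstrap sample contained a single distinct x value
        rate = sum(ys) / len(ys)
        return lambda x: rate
    _, threshold, p_left, p_right = best
    return lambda x: p_left if x <= threshold else p_right


def fit_forest(xs, ys, n_trees=25, seed=0):
    """Train each tree on a bootstrap sample and average their probabilities."""
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(xs)) for _ in xs]  # sample rows with replacement
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(tree(x) for tree in trees) / len(trees)


# Hypothetical data: one input variable and a noisy 0/1 target.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
forest = fit_forest(xs, ys)
```

Because every tree sees a slightly different sample, their individual errors partly cancel when the probabilities are averaged.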

Most influential factors

2nd most influential factors

3rd most influential factors

Least influential factor


Target percentages show how accurate the model's predictions are likely to be on future data sets. Outcome percentages, on the other hand, indicate the accuracy of the model's predictions on the sample data set. For Gradient Boosting and Neural Network, the Outcome 1/1 percentages are above 75%, which means the models correctly identified more than 75% of the breakeven movies. The Target 1/1 percentages are above 70%, which means more than 70% of the movies predicted to break even actually did, so the models' predictions are reliable. Hence, Gradient Boosting and Neural Network are the models chosen to predict the breakeven status of future movies in the U.S. market.
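The two percentages can be computed from confusion-matrix counts as sketched below. The sketch assumes that the Target 1/1 percentage is the share of predicted breakeven movies that actually broke even, and that the Outcome 1/1 percentage is the share of actual breakeven movies the model identified. The counts are hypothetical and are not the poster's results.

```python
def target_outcome_percentages(tp, fp, fn):
    """Return (Target 1/1 %, Outcome 1/1 %) from confusion-matrix counts.

    tp: breakeven movies predicted as breakeven
    fp: loss-making movies predicted as breakeven
    fn: breakeven movies predicted as loss-making
    """
    target_1_1 = 100.0 * tp / (tp + fp)   # of predicted breakevens, share that are real
    outcome_1_1 = 100.0 * tp / (tp + fn)  # of real breakevens, share that were found
    return target_1_1, outcome_1_1


# Hypothetical counts, not taken from the poster's confusion matrix.
target_pct, outcome_pct = target_outcome_percentages(tp=300, fp=100, fn=75)
```

With these counts the Target 1/1 percentage is 75% and the Outcome 1/1 percentage is 80%.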

Misclassification rate takes both false positives and false negatives into consideration. Of all the models, Gradient Boosting has the lowest misclassification rate. This is not surprising given its refined algorithm, which seeks to minimise the intermediate pseudo-residuals rather than simply relying on one splitting criterion as in Decision Tree and Random Forest. Neural Network 2 performs second best, suggesting that its complex algorithm, which imitates the human mind, has some advantage in building predictive models.
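For reference, the misclassification rate is the share of all movies that fall in the two error cells of the confusion matrix. The counts below are hypothetical and only illustrate the arithmetic.

```python
def misclassification_rate(tp, fp, fn, tn):
    """Share of observations that are false positives or false negatives."""
    return (fp + fn) / (tp + fp + fn + tn)


# Hypothetical counts, not taken from the poster's confusion matrix.
rate = misclassification_rate(tp=300, fp=100, fn=75, tn=325)
```

Here 175 of the 800 movies are misclassified, a rate of 0.21875.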

The analysis and data set rely heavily on online data, since the data was extracted from a movie rating website. Online data, however, is not the only defining factor.

• Hence, further analysis on predicting movie success should also take into consideration traditional promotional channels, such as theatre data.

• Additionally, this data was collected over a period of time, and a movie's popularity grows over time. Hence, for a more accurate analysis, time-stamps of the metrics must be collected and taken into consideration.

• The most important insight from the above predictive analysis is that the online popularity of a movie is the best indicator of its success.

• IMDB is a sought-after site for movie opinions, and hence movie votes, critic reviews and general public reviews are the greatest influencers.

• Among Facebook likes, Actor 3 Facebook likes are a better indicator than Actor 2 and Actor 1 Facebook likes.

$\text{gross profit} = \frac{\text{gross} - \text{budget}}{\text{budget}} \times 100\%$

future work
