powerpoint presentation


Upload: priyadarshini-majumdar

Post on 09-Jan-2017


TRANSCRIPT

Page 1: PowerPoint Presentation

Model Comparison

Movie Breakeven Analysis in the U.S. Market

Liu Jialin | Priyadarshini Majumdar | Zhang Jiexi

Data Analytics Lab Project Challenge from Nov 23rd onwards at a theatre near YOU

INTRODUCTION

METHODOLOGY

What plays the most important role in making a movie profitable?

Movie technical

• Language 4.3/10
• Content rating 4/10
• Aspect ratio 3/10
• Budget 2.5/10
• Duration 1.5/10
• Colour or B&W 1/10

IMDB website influence

• No. of IMDB users who voted 9/10
• No. of users who reviewed 8/10
• No. of critics for reviews 6/10
• IMDB score 5/10

Facebook influence

• Movie Facebook likes 4.5/10
• Actor 3 Facebook likes > Actor 2 > Actor 1
• Cast total Facebook likes 3.5/10
• Director Facebook likes 3.2/10

Poster and promotional materials

• No. of faces in a poster 2.6/10

Objectives

Data Processing

Remove duplicate entries in JMP.

Calculate gross profit as (gross − budget) / budget, expressed as a percentage.

Create the binary Profit/Loss target variable and remove missing values.
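The profit calculation and target construction can be sketched in Python. This is only an illustration, not the JMP workflow used in the project: the field names `gross` and `budget`, the helper names, the sample rows, and the rule that a movie counts as breaking even when its gross profit is non-negative are all assumptions.

```python
def gross_profit_pct(gross, budget):
    """Return (gross - budget) / budget as a percentage."""
    return (gross - budget) / budget * 100.0


def add_profit_target(movies):
    """Drop rows with missing gross/budget and attach a binary target.

    Assumption: target = 1 (breaks even) when the gross profit
    percentage is non-negative, otherwise 0.
    """
    labelled = []
    for movie in movies:
        gross, budget = movie.get("gross"), movie.get("budget")
        if gross is None or budget is None or budget == 0:
            continue  # rows with missing or unusable values are removed
        pct = gross_profit_pct(gross, budget)
        labelled.append({**movie, "profit_pct": pct, "target": int(pct >= 0)})
    return labelled


# Hypothetical rows, for illustration only.
sample = [
    {"title": "A", "gross": 150.0, "budget": 100.0},
    {"title": "B", "gross": 40.0, "budget": 80.0},
    {"title": "C", "gross": None, "budget": 50.0},
]
result = add_profit_target(sample)
```

Row "C" is dropped because its gross value is missing, so `result` holds two labelled rows.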

SAS Enterprise Miner:

• Import the JMP file using the File Import and Save Data nodes.

• Change the level for Aspect Ratio to nominal in the File Import node.

• Conduct text parsing, text clustering and text filtering on plot keywords and genres.

• Use the Multiplot node to view the distribution of the variables.

• Recode missing values and erroneous entries using the Replacement node.

• Sample the data into a Training Set and a Validation Set using the Data Partition node.

Before running the parametric models, fill in all missing values using the Impute node and transform the interval variables with skewed distributions using the Transform node.
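The partitioning, imputation and transformation steps can be approximated outside SAS Enterprise Miner. The sketch below is not the nodes' actual behaviour. The 70/30 split, the random seed, median imputation and the log(1 + x) transform are assumptions, since the poster does not state which settings were used.

```python
import math
import random
import statistics


def partition(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of the rows and split into training and validation sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed)
    return [fill if v is None else v for v in values]


def log_transform(values):
    """Compress a right-skewed, non-negative variable with log(1 + x)."""
    return [math.log1p(v) for v in values]


# Hypothetical budget values with one missing entry.
budgets = [1e6, None, 5e6, 2e8, 3e6]
filled = impute_median(budgets)
transformed = log_transform(filled)

# Hypothetical row identifiers standing in for movie records.
train, valid = partition(list(range(10)))
```

In practice the imputation value would normally be estimated on the training set only and then reused for the validation set.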

Predictive Model Construction

Decision Tree

As a nonparametric algorithm, a decision tree can fit a large number of functional forms and map observations to categorical targets.
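A tree grows by choosing, at each node, the split that best separates the target classes. The sketch below scores candidate splits with Gini impurity. The poster does not say which splitting criterion was used, so Gini is an assumption here, as are the function names and the toy data.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def split_impurity(values, labels, threshold):
    """Weighted Gini impurity after splitting on values <= threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


# Hypothetical feature values and profit/loss labels.
values = [1, 2, 3, 4]
labels = [0, 0, 1, 1]
pure_split = split_impurity(values, labels, threshold=2)
mixed_split = split_impurity(values, labels, threshold=1)
```

The split at 2 separates the classes perfectly and therefore scores lower than the split at 1, so the tree would prefer it.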

Model Comparison

Conclusion

Background: Movies are one of the top-grossing industries in the world today; in the U.S. alone, it was a 38-billion-dollar market as of 2016.

Motivation: IMDB is one of the most visited sites through which viewers often decide whether to watch a movie or not. Hence it has a direct effect on whether a movie makes a profit or a loss.

Primary Objective: To develop a model that can predict whether a movie will break even in the U.S. market or not.

Secondary Objective: To inform promoters who use social media for movie promotion about which factors affect the outcome of a movie.

Confusion Matrix for Model Comparison

Gradient Boosting

A Gradient Boosting model builds a strong learner from a base set of weak learner trees, using a gradient descent algorithm. It is computationally intensive and performs very well with a moderate number of variables after fine-tuning.
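The idea of building a strong learner from weak trees can be illustrated with a minimal sketch, which is not the SAS Enterprise Miner implementation. It boosts one-split stumps on the pseudo-residuals of the log loss for a single feature. The learning rate, the number of rounds, the function names and the toy data are all assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(xs, residuals):
    """Fit a one-split regression stump on a single feature by least squares."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left) + sum(
            (r - right_mean) ** 2 for r in right
        )
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, rounds=50, learning_rate=0.5):
    """Boost stumps on the pseudo-residuals y - p of the log loss."""
    prior = sum(ys) / len(ys)
    base = math.log(prior / (1.0 - prior))  # initial score: log-odds of the target
    stumps = []

    def score(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    for _ in range(rounds):
        # Pseudo-residuals: negative gradient of the log loss w.r.t. the score.
        residuals = [y - sigmoid(score(x)) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return lambda x: sigmoid(score(x))


# Hypothetical data: one scaled feature and a 0/1 breakeven target.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
predict = gradient_boost(xs, ys)
```

Each round fits a new stump to what the current ensemble still gets wrong, which is why the method is more expensive than growing a single tree.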

Logistic Regression

Logistic regression describes the relationship between a categorical target variable and the independent variables by estimating the probability from a cumulative logistic distribution.
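For a binary target, the cumulative logistic distribution takes the standard form below. The symbols are assumptions introduced for illustration: $y$ is the profit/loss target, $x_1, \dots, x_k$ are the independent variables, and $\beta_0, \dots, \beta_k$ are the fitted coefficients.

```latex
P(y = 1 \mid x_1, \dots, x_k)
  = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}}
```

A movie would then be classified as breaking even when this probability exceeds a chosen cutoff, commonly 0.5.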

Neural Network

A neural network is a parametric model that accommodates a wide variety of nonlinear relationships. It also keeps in check the curse of dimensionality, which bedevils attempts to model nonlinear functions with a large number of variables.

Data set

5043 movie titles

28 variables

The data set was scraped from IMDB using Python's Scrapy library. This resulted in 5043 observations of 28 variables.

Random Forest

A random forest is an ensemble of decision trees. It averages the predicted probabilities of a large number of overtrained decision trees, and is thus more robust against overfitting and generalizes better than a single decision tree.
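The averaging step can be shown with a small sketch. The three "trees" are hypothetical hand-written rules standing in for fitted trees, and the feature names and thresholds are assumptions chosen only for illustration.

```python
def forest_probability(trees, features):
    """Average the predicted probabilities of the individual trees."""
    votes = [tree(features) for tree in trees]
    return sum(votes) / len(votes)


# Three hypothetical trees, each returning a hard 0/1 probability.
trees = [
    lambda f: 1.0 if f["num_voted_users"] > 50_000 else 0.0,
    lambda f: 1.0 if f["imdb_score"] > 6.5 else 0.0,
    lambda f: 1.0 if f["num_user_for_reviews"] > 200 else 0.0,
]

# Hypothetical movie record.
movie = {"num_voted_users": 80_000, "imdb_score": 6.0, "num_user_for_reviews": 350}
probability = forest_probability(trees, movie)
```

Two of the three trees vote for breakeven, so the averaged probability is two thirds. Individual trees can disagree, and the average smooths out their errors.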

Most influential factors

2nd most influential factors

3rd most influential factors

Least influential factor


Target percentages show how accurate the model's predictions are likely to be on future data sets. Outcome percentages, on the other hand, indicate the accuracy of the model's predictions on the sample data set. For Gradient Boosting and Neural Network, the Outcome 1/1 percentages are above 75%, which means the models correctly identified more than 75% of the breakeven movies. The Target 1/1 percentages are above 70%, which means the models' predictions are reliable. Hence, Gradient Boosting and Neural Network are the models chosen to predict the breakeven status of future movies in the U.S. market.
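Both percentages can be computed from the four cells of a confusion matrix. The sketch below reads the Target 1/1 percentage as the share of predicted breakeven movies that actually broke even, and the Outcome 1/1 percentage as the share of actual breakeven movies that the model identified. This reading is an assumption consistent with the interpretation given in the text. The counts are hypothetical and are not the poster's results.

```python
def confusion_percentages(tp, fp, fn, tn):
    """Return (target 1/1 %, outcome 1/1 %) from confusion matrix counts.

    tp: actual 1, predicted 1    fp: actual 0, predicted 1
    fn: actual 1, predicted 0    tn: actual 0, predicted 0 (unused here)
    """
    target_pct = 100.0 * tp / (tp + fp)   # share of predicted 1s that are truly 1
    outcome_pct = 100.0 * tp / (tp + fn)  # share of actual 1s predicted as 1
    return target_pct, outcome_pct


# Hypothetical counts, for illustration only.
target_pct, outcome_pct = confusion_percentages(tp=80, fp=30, fn=20, tn=70)
```

Under this reading the two figures correspond to what is elsewhere called precision and recall for the breakeven class.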

Misclassification rate takes both false positives and false negatives into consideration. Of all the models, Gradient Boosting has the lowest misclassification rate. This is not surprising, given that its algorithm seeks to minimise the intermediate pseudo-residuals rather than simply relying on a single splitting criterion as Decision Tree and Random Forest do. Neural Network 2 performs second best, suggesting that its more complex algorithm, which imitates the human mind, has some advantage in building predictive models.
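The misclassification rate is the share of all predictions that fall in the two error cells of the confusion matrix. A minimal sketch follows. The counts are hypothetical and are not taken from the poster.

```python
def misclassification_rate(tp, fp, fn, tn):
    """Fraction of observations that the model classifies incorrectly."""
    return (fp + fn) / (tp + fp + fn + tn)


# Hypothetical counts, for illustration only.
rate = misclassification_rate(tp=80, fp=30, fn=20, tn=70)
```

With these counts, 50 of the 200 movies are misclassified, giving a rate of 0.25.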

The analysis and data set rely heavily on online data, given that the data is extracted from a movie rating website. Online data, however, is not the only defining factor.

• Hence, further analysis on predicting movie success should also take into consideration traditional promotional channels such as theatre data.

• Additionally, this data was collected over a period of time, and a movie's popularity grows over time. Hence, for a more accurate analysis, time-stamps of the metrics must be collected and taken into consideration.

• The most important insight from the above predictive analysis is that the online popularity of a movie is the best indicator of its success.

• IMDB is a sought-after site for movie opinions, and hence movie votes, critic reviews and general public reviews are the greatest influencers.

• Among Facebook likes, Actor 3 Facebook likes are a better indicator than Actor 2 and Actor 1 Facebook likes.

$\text{gross profit} = \dfrac{\text{gross} - \text{budget}}{\text{budget}}\,\%$

Future Work