Movie Score Portal

Welcome to the central intelligence hub. Select a module below to analyze movie performance and audience personas.

📊 Platform Variance Report | 📈 Executive MS Index | 🎯 Audience DNA Radar | ⬇️ Download Project Source Code
Part 1

Importing Score Data & Computing the Movie Score (MS)

Data Source: OMDb API

Instead of building separate scrapers or API integrations for each platform, the pipeline uses one centralized source: the OMDb API. A single API call per movie returns the IMDB, Rotten Tomatoes, and Metacritic ratings together in one structured JSON response.


Step 1: Extract

For each movie in the target list, the OMDb endpoint is called and the response is parsed into a raw record:

{
  "Title": "Inception",
  "imdbRating": "8.8",   // IMDB → out of 10
  "Metascore": "74",     // Metacritic → out of 100
  "Ratings": [
    {"Source": "Rotten Tomatoes", "Value": "87%"}
  ]
}

This is written to data/raw/movies.csv, the raw landing table:

Title | IMDB | RottenTomatoes | Metacritic
Inception | 8.8 | 87% | 74
Titanic | 8.0 | 88% | 75
The Avengers | 8.0 | 91% | 69
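
The extraction step could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the project's actual code: the function names fetch_movie and to_raw_record are hypothetical, the API key is supplied by the caller, and the request is assumed to use OMDb's `t` (title) and `apikey` query parameters. Missing values are kept as the string "N/A" so the raw table stays untouched.

```python
import json
import urllib.parse
import urllib.request

OMDB_URL = "https://www.omdbapi.com/"  # assumed endpoint; requires a personal API key


def fetch_movie(title, api_key):
    """Call OMDb once for a title and return the decoded JSON response."""
    query = urllib.parse.urlencode({"t": title, "apikey": api_key})
    with urllib.request.urlopen(f"{OMDB_URL}?{query}", timeout=10) as resp:
        return json.load(resp)


def to_raw_record(response):
    """Flatten an OMDb response into one row of the raw landing table."""
    # Rotten Tomatoes only appears inside the "Ratings" list, so index it by source name.
    by_source = {r["Source"]: r["Value"] for r in response.get("Ratings", [])}
    return {
        "Title": response["Title"],
        "IMDB": response.get("imdbRating", "N/A"),
        "RottenTomatoes": by_source.get("Rotten Tomatoes", "N/A"),
        "Metacritic": response.get("Metascore", "N/A"),
    }


# The sample response from the text, parsed without any network call.
sample = {
    "Title": "Inception",
    "imdbRating": "8.8",
    "Metascore": "74",
    "Ratings": [{"Source": "Rotten Tomatoes", "Value": "87%"}],
}
record = to_raw_record(sample)
print(record)
```

Keeping the parsing separate from the network call means the record logic can be checked offline against a saved response.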

Step 2: Transform & Normalize

The three platforms use different scales. Everything is normalized to a common 0–100 scale before combining:

• IMDB_norm = IMDB × 10 → 8.8 becomes 88
• RottenTomatoes_norm → strip the %, already 0–100 → 87
• Metacritic_norm → already 0–100 → 74

Title | IMDB_norm | RottenTomatoes_norm | Metacritic_norm
Inception | 88.0 | 87.0 | 74.0
Titanic | 80.0 | 88.0 | 75.0
The Avengers | 80.0 | 91.0 | 69.0
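
The three normalization rules above can be sketched as one small function. This is an illustrative sketch only: the name normalize_record is hypothetical, and treating "N/A" or empty strings as missing (returned as None) is an assumption about how gaps in the raw data should be handled.

```python
def normalize_record(raw):
    """Convert one raw row (string values) to the 0-100 scale; missing scores become None."""

    def to_float(text, scale=1.0):
        text = text.strip().rstrip("%")  # "87%" -> "87"
        if text in ("", "N/A"):
            return None
        # Round to one decimal so floating-point noise does not leak into the table.
        return round(float(text) * scale, 1)

    return {
        "Title": raw["Title"],
        "IMDB_norm": to_float(raw["IMDB"], scale=10),  # 0-10 -> 0-100
        "RottenTomatoes_norm": to_float(raw["RottenTomatoes"]),  # already 0-100
        "Metacritic_norm": to_float(raw["Metacritic"]),  # already 0-100
    }


raw_rows = [
    {"Title": "Inception", "IMDB": "8.8", "RottenTomatoes": "87%", "Metacritic": "74"},
    {"Title": "Titanic", "IMDB": "8.0", "RottenTomatoes": "88%", "Metacritic": "75"},
    {"Title": "The Avengers", "IMDB": "8.0", "RottenTomatoes": "91%", "Metacritic": "69"},
]
normalized = [normalize_record(row) for row in raw_rows]
for row in normalized:
    print(row)
```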

Step 3: Compute MS

The Movie Score is the simple average of the three normalized scores, written to data/processed/movie_scores.csv:

df['MovieScore'] = df[['IMDB_norm', 'RottenTomatoes_norm', 'Metacritic_norm']].mean(axis=1)

Title | MovieScore
Inception | 83.0
Titanic | 81.0
The Avengers | 80.0
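
As a cross-check of the table, the same average can be reproduced without pandas. One detail worth knowing: pandas' mean skips missing values by default (skipna=True), so a film missing one platform score would be averaged over the remaining two. The sketch below mirrors that behaviour; the function name movie_score is hypothetical, and whether partial averages are acceptable is a design decision the source does not state.

```python
from statistics import mean

NORM_COLUMNS = ("IMDB_norm", "RottenTomatoes_norm", "Metacritic_norm")


def movie_score(row):
    """Average the available normalized scores; None if every platform is missing."""
    scores = [row[col] for col in NORM_COLUMNS if row[col] is not None]
    return round(mean(scores), 1) if scores else None


rows = [
    {"Title": "Inception", "IMDB_norm": 88.0, "RottenTomatoes_norm": 87.0, "Metacritic_norm": 74.0},
    {"Title": "Titanic", "IMDB_norm": 80.0, "RottenTomatoes_norm": 88.0, "Metacritic_norm": 75.0},
    {"Title": "The Avengers", "IMDB_norm": 80.0, "RottenTomatoes_norm": 91.0, "Metacritic_norm": 69.0},
]
scores = {row["Title"]: movie_score(row) for row in rows}
print(scores)
```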

Pipeline Orchestration: Apache Airflow

All three steps are wired into a single DAG on a @daily schedule, with Airflow running inside Docker Compose:

extract_data → transform_data → compute_ms

Every day, new scores are fetched automatically and the MS updates without any manual intervention.
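
A DAG of this shape might look like the sketch below. It assumes Airflow 2.4 or later (where the DAG argument is named `schedule`) and uses PythonOperator; the dag_id, the start date, and the empty task bodies are placeholders, not the project's actual definitions.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_data():
    """Placeholder: call OMDb and write data/raw/movies.csv."""


def transform_data():
    """Placeholder: normalize the three scores to the 0-100 scale."""


def compute_ms():
    """Placeholder: average the normalized scores into data/processed/movie_scores.csv."""


with DAG(
    dag_id="movie_score_pipeline",  # hypothetical name
    schedule="@daily",
    start_date=datetime(2024, 1, 1),  # hypothetical start date
    catchup=False,  # do not backfill runs for past dates
) as dag:
    extract = PythonOperator(task_id="extract_data", python_callable=extract_data)
    transform = PythonOperator(task_id="transform_data", python_callable=transform_data)
    compute = PythonOperator(task_id="compute_ms", python_callable=compute_ms)

    # Same ordering as the pipeline above: each task runs only after the previous one succeeds.
    extract >> transform >> compute
```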


Part 2

Predicting the Score of an Unreleased Movie

Since an unreleased film has no IMDB, Rotten Tomatoes, or Metacritic scores yet, a machine learning regression model is trained on historical data from films that have already been released.

Step 1: Build a Historical Training Dataset

The training table combines pre-release metadata with the actual MS each film eventually received:

film_id | genre | director_avg_ms | lead_actor_avg_ms | budget_usd | studio_tier | trailer_views_7d | release_season | actual_ms
tt1234 | Sci-Fi | 81.2 | 77.5 | 165M | A | 12.4M | Summer | 83.0

Step 2: Feature Engineering (Pre-Release Signals)

• Director & cast history: average MS of their previous films
• Studio quality score: the studio's average MS over the past 5 years
• Production budget: higher budgets tend to correlate with higher production quality
• Trailer engagement: YouTube/social media views and like ratios in the first week
• Genre: different genres have different historical MS distributions
• Release timing: summer blockbuster vs. awards season (Nov–Dec)
• Screenwriter track record: same logic as director history
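
The track-record features in this list (director, cast, screenwriter) share one subtlety: the average should only use films released before the film being predicted, otherwise later results leak into the training data. A minimal sketch of that idea follows; the function name, the record layout, and all names and values in `history` are made up for illustration.

```python
from datetime import date
from statistics import mean


def prior_average_ms(person, before, history, role="director"):
    """Average MS of a person's films released strictly before `before`; None if there are none."""
    scores = [
        film["ms"]
        for film in history
        if film[role] == person and film["released"] < before
    ]
    return round(mean(scores), 1) if scores else None


# Toy history with invented directors and scores.
history = [
    {"director": "Director A", "released": date(2015, 7, 1), "ms": 80.0},
    {"director": "Director A", "released": date(2018, 7, 1), "ms": 84.0},
    {"director": "Director A", "released": date(2022, 7, 1), "ms": 70.0},
    {"director": "Director B", "released": date(2017, 3, 1), "ms": 60.0},
]

# For a film releasing in 2020, only the 2015 and 2018 films count.
director_avg_ms = prior_average_ms("Director A", date(2020, 1, 1), history)
print(director_avg_ms)
```

A first-time director has no history, so the function returns None and the model (or an imputation step) has to handle that case explicitly.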

Step 3: Model Selection

• Random Forest / XGBoost: captures non-linear feature interactions, interpretable via feature importance
• Ridge Regression: robust against overfitting when data is limited
• LightGBM: fast and highly effective on larger historical datasets
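
One common way to choose between candidates like these is to compare their cross-validated error on the historical table. The sketch below shows the idea with the standard library only. It is a deliberately simplified stand-in: a one-feature ridge regression in closed form is compared against a predict-the-mean baseline, and the toy numbers are invented. In practice the Random Forest, XGBoost, or LightGBM models would come from their own libraries and be plugged into the same kind of comparison.

```python
from statistics import mean


def fit_mean(xs, ys):
    """Baseline: always predict the training mean."""
    y_bar = mean(ys)
    return lambda x: y_bar


def fit_ridge_1d(xs, ys, alpha=1.0):
    """One-feature ridge regression in closed form (the intercept is not penalized)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / (sxx + alpha)  # alpha shrinks the slope toward zero
    return lambda x: y_bar + slope * (x - x_bar)


def cv_rmse(fit, xs, ys, k=3):
    """k-fold cross-validated RMSE; fold i holds out every k-th sample starting at i."""
    squared_errors = []
    pairs = list(zip(xs, ys))
    for i in range(k):
        train = [p for j, p in enumerate(pairs) if j % k != i]
        held_out = [p for j, p in enumerate(pairs) if j % k == i]
        predict = fit([x for x, _ in train], [y for _, y in train])
        squared_errors += [(predict(x) - y) ** 2 for x, y in held_out]
    return mean(squared_errors) ** 0.5


# Toy data: director_avg_ms as the single feature, actual_ms as the target.
director_avg = [60.0, 65.0, 70.0, 75.0, 80.0, 85.0]
actual_ms = [62.0, 63.0, 71.0, 74.0, 82.0, 84.0]

baseline_rmse = cv_rmse(fit_mean, director_avg, actual_ms)
ridge_rmse = cv_rmse(fit_ridge_1d, director_avg, actual_ms)
print(f"baseline RMSE: {baseline_rmse:.2f}, ridge RMSE: {ridge_rmse:.2f}")
```

A model is only worth keeping if it beats the baseline on held-out films, not just on the films it was trained on.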

Step 4: Output: Score + Confidence Interval

Rather than a single number, the model outputs a range:

Predicted MS: 76 ± 5 (95% confidence interval: 71–81)
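
The source does not say how the range is computed. One simple possibility, sketched here as an assumption, is to take the model's absolute errors on held-out historical films and use the error size that covers 95% of them as the half-width. Strictly speaking this yields a prediction interval for a single film. The function name and the residual values below are invented for illustration.

```python
import math


def interval_from_residuals(prediction, abs_residuals, coverage=0.95):
    """Symmetric interval whose half-width covers at least `coverage` of held-out absolute errors."""
    ranked = sorted(abs_residuals)
    # Index of the smallest error that is >= the requested share of held-out errors.
    idx = min(len(ranked) - 1, math.ceil(coverage * len(ranked)) - 1)
    half_width = ranked[idx]
    return prediction - half_width, prediction + half_width


# Invented absolute errors |predicted - actual| on 20 held-out films.
held_out_errors = [
    0.4, 0.8, 1.1, 1.3, 1.6, 1.9, 2.1, 2.4, 2.6, 2.9,
    3.1, 3.3, 3.6, 3.8, 4.1, 4.3, 4.6, 4.8, 5.0, 7.2,
]
low, high = interval_from_residuals(76.0, held_out_errors)
print(f"Predicted MS: 76 (95% interval: {low:.0f}-{high:.0f})")
```

With these toy errors, 19 of the 20 held-out films fall within 5 points, which gives the 71 to 81 range shown above.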

Step 5: Real-Time Signal Updates

As the release date approaches, the prediction is progressively refined with new signals: critics' screening reactions, film festival reception (Sundance, Cannes, TIFF), social media sentiment analysis, and advance ticket sales data.

Step 6: Airflow Integration

A new DAG monitors an "upcoming films" table. When a new entry appears, it triggers feature collection, runs the model, and writes a predicted_ms column. Once the film is released and the real MS is computed, a prediction_error column is populated, providing a continuous feedback loop to retrain and improve the model over time.
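
The feedback part of this loop can be sketched independently of Airflow. In the sketch below, the sign convention (predicted minus actual), the retraining rule based on mean absolute error, the threshold value, and the film IDs are all assumptions made for illustration.

```python
def fill_prediction_errors(films):
    """Populate prediction_error for released films; unreleased films keep None."""
    for film in films:
        if film.get("actual_ms") is not None and film.get("predicted_ms") is not None:
            film["prediction_error"] = round(film["predicted_ms"] - film["actual_ms"], 1)
        else:
            film["prediction_error"] = None
    return films


def needs_retraining(films, mae_threshold=5.0):
    """Flag a retrain when the mean absolute error on released films exceeds the threshold."""
    errors = [abs(f["prediction_error"]) for f in films if f["prediction_error"] is not None]
    return bool(errors) and sum(errors) / len(errors) > mae_threshold


# Toy rows from a hypothetical upcoming-films table.
upcoming_films = [
    {"film_id": "tt0001", "predicted_ms": 76.0, "actual_ms": 83.0},  # released
    {"film_id": "tt0002", "predicted_ms": 70.0, "actual_ms": 68.0},  # released
    {"film_id": "tt0003", "predicted_ms": 80.0, "actual_ms": None},  # not yet released
]
fill_prediction_errors(upcoming_films)
print([f["prediction_error"] for f in upcoming_films])
print(needs_retraining(upcoming_films))
```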


Summary

Aspect | Part 1 | Part 2
Data | Live API scores (IMDB, RT, Metacritic) | Historical MS + pre-release metadata
Processing | Normalize → Average | Feature engineering → ML regression
Output | Exact MS | Predicted MS with confidence interval
Orchestration | Airflow DAG (@daily) | Airflow DAG (event-triggered)
Storage | movies_clean.csv → movie_scores.csv | upcoming_films → predicted_ms column