Company A | Movie Intelligence Portal


Part 1

Importing Score Data & Computing the Movie Score (MS)

Data Source: OMDb API

Instead of building separate scrapers or API integrations for each platform, a single centralized source is used: the OMDb API. One API call per movie returns the IMDB, Rotten Tomatoes, and Metacritic ratings together in a single structured JSON response.

Step 1 — Extract

For each movie in the target list, the OMDb endpoint is called and the response is parsed into a raw record:

{
  "Title": "Inception",
  "imdbRating": "8.8",        // IMDB → out of 10
  "Metascore": "74",          // Metacritic → out of 100
  "Ratings": [
    {"Source": "Rotten Tomatoes", "Value": "87%"}
  ]
}

This is written to data/raw/movies.csv — the raw landing table:

Title	IMDB	RottenTomatoes	Metacritic
Inception	8.8	87%	74
Titanic	8.0	88%	75
The Avengers	8.0	91%	69
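
The extract step can be sketched as follows, assuming the response has the shape shown above. The names `fetch_movie`, `parse_omdb_record`, and `write_raw_csv` are illustrative and not part of the original pipeline; the `t` and `apikey` query parameters follow OMDb's public documentation, and the API key is one you would supply yourself. OMDb reports missing values as the string "N/A", so the same marker is used here as the default.

```python
import csv
import json
import urllib.parse
import urllib.request

OMDB_URL = "https://www.omdbapi.com/"  # public OMDb endpoint


def fetch_movie(title, api_key):
    """Call OMDb for one title and return the decoded JSON payload."""
    query = urllib.parse.urlencode({"t": title, "apikey": api_key})
    with urllib.request.urlopen(f"{OMDB_URL}?{query}", timeout=10) as resp:
        return json.load(resp)


def parse_omdb_record(payload):
    """Flatten an OMDb payload into one raw row (values kept as strings)."""
    # Rotten Tomatoes only appears inside the "Ratings" list.
    by_source = {r["Source"]: r["Value"] for r in payload.get("Ratings", [])}
    return {
        "Title": payload.get("Title", ""),
        "IMDB": payload.get("imdbRating", "N/A"),
        "RottenTomatoes": by_source.get("Rotten Tomatoes", "N/A"),
        "Metacritic": payload.get("Metascore", "N/A"),
    }


def write_raw_csv(rows, path):
    """Write raw rows to the landing table, e.g. data/raw/movies.csv."""
    fields = ["Title", "IMDB", "RottenTomatoes", "Metacritic"]
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fields)
        writer.writeheader()
        writer.writerows(rows)


# Parse the sample payload from above without touching the network.
sample = {
    "Title": "Inception",
    "imdbRating": "8.8",
    "Metascore": "74",
    "Ratings": [{"Source": "Rotten Tomatoes", "Value": "87%"}],
}
row = parse_omdb_record(sample)
```

Keeping the raw values as strings at this stage means the landing table mirrors the API response, and all parsing decisions stay in the transform step.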

Step 2 — Transform & Normalize

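The platforms do not share a scale or format: IMDB is out of 10, Rotten Tomatoes is a percentage string, and Metacritic is out of 100. Everything is converted to a common numeric 0–100 scale before combining: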

• IMDB_norm = IMDB × 10 → 8.8 becomes 88
• RottenTomatoes_norm → strip the %, already 0–100 → 87
• Metacritic_norm → already 0–100 → 74

Title	IMDB_norm	RottenTomatoes_norm	Metacritic_norm
Inception	88.0	87.0	74.0
Titanic	80.0	88.0	75.0
The Avengers	80.0	91.0	69.0
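
A minimal sketch of the normalization rules above, written with the standard library only. The helper names `to_float` and `normalize_row` are assumptions made for illustration, as is the choice to represent a missing score (OMDb's "N/A") as `None`.

```python
def to_float(value):
    """Parse a raw score string; return None for missing values such as 'N/A'."""
    try:
        return float(str(value).strip().rstrip("%"))
    except ValueError:
        return None


def normalize_row(raw):
    """Map one raw row onto the common 0-100 scale."""
    imdb = to_float(raw["IMDB"])
    return {
        "Title": raw["Title"],
        # IMDB is out of 10; round to hide float noise (8.8 * 10 is not exactly 88).
        "IMDB_norm": None if imdb is None else round(imdb * 10, 1),
        # Rotten Tomatoes: the percent sign is stripped, the value is already 0-100.
        "RottenTomatoes_norm": to_float(raw["RottenTomatoes"]),
        # Metacritic is already 0-100.
        "Metacritic_norm": to_float(raw["Metacritic"]),
    }


inception = normalize_row(
    {"Title": "Inception", "IMDB": "8.8", "RottenTomatoes": "87%", "Metacritic": "74"}
)
```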

Step 3 — Compute MS

The Movie Score is the simple average of the three normalized scores, written to data/processed/movie_scores.csv:

df['MovieScore'] = df[['IMDB_norm', 'RottenTomatoes_norm', 'Metacritic_norm']].mean(axis=1)

Title	MovieScore
Inception	83.0
Titanic	81.0
The Avengers	80.0
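
As a worked check, Inception's score is (88 + 87 + 74) / 3 = 83.0. One detail to decide is what happens when a platform has no score yet: pandas `mean(axis=1)` skips missing values by default, so such a film would silently be averaged over fewer platforms. The sketch below makes that policy explicit; the `min_sources` parameter is hypothetical and not part of the original pipeline.

```python
from statistics import mean

SCORE_COLUMNS = ("IMDB_norm", "RottenTomatoes_norm", "Metacritic_norm")


def movie_score(row, min_sources=3):
    """Average the available normalized scores, or return None if too few exist."""
    available = [row[c] for c in SCORE_COLUMNS if row.get(c) is not None]
    if len(available) < min_sources:
        return None
    return round(mean(available), 1)


films = [
    {"Title": "Inception", "IMDB_norm": 88.0, "RottenTomatoes_norm": 87.0, "Metacritic_norm": 74.0},
    {"Title": "Titanic", "IMDB_norm": 80.0, "RottenTomatoes_norm": 88.0, "Metacritic_norm": 75.0},
    {"Title": "The Avengers", "IMDB_norm": 80.0, "RottenTomatoes_norm": 91.0, "Metacritic_norm": 69.0},
]
scores = {film["Title"]: movie_score(film) for film in films}
```

With the default `min_sources=3`, a film missing any platform gets no MS at all; lowering it to 2 would reproduce the skip-missing behavior for films with two scores.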

Pipeline Orchestration: Apache Airflow

All three steps are wired into a single Airflow DAG on a @daily schedule, with Airflow itself running under Docker Compose:

extract_data → transform_data → compute_ms

Every day, new scores are fetched automatically and the MS updates without any manual intervention.
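
A DAG file for this chain might look roughly like the sketch below. It assumes Airflow 2.4 or later (where `DAG` accepts a `schedule` argument); the `dag_id`, the start date, and the stub task bodies are placeholders rather than the original implementation. The guarded import is only there so the task functions can be imported and unit-tested on a machine without Airflow installed.

```python
from datetime import datetime


def extract_data():
    """Placeholder: call OMDb and write data/raw/movies.csv."""


def transform_data():
    """Placeholder: normalize the raw scores to the 0-100 scale."""


def compute_ms():
    """Placeholder: average the scores into data/processed/movie_scores.csv."""


# Order in which the tasks must run.
TASK_ORDER = [extract_data, transform_data, compute_ms]

try:
    from airflow import DAG
    from airflow.operators.python import PythonOperator
except ImportError:  # Airflow not installed, e.g. in a local test run
    DAG = None

if DAG is not None:
    with DAG(
        dag_id="movie_score_pipeline",    # hypothetical name
        start_date=datetime(2024, 1, 1),  # placeholder start date
        schedule="@daily",
        catchup=False,                    # do not backfill past days
    ) as dag:
        tasks = [
            PythonOperator(task_id=fn.__name__, python_callable=fn)
            for fn in TASK_ORDER
        ]
        # Chain the tasks: extract_data >> transform_data >> compute_ms
        for upstream, downstream in zip(tasks, tasks[1:]):
            upstream >> downstream
```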

Part 2

Predicting the Score of an Unreleased Movie

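Since an unreleased film has no IMDB, Rotten Tomatoes, or Metacritic scores yet, its MS cannot be computed directly. Instead, a machine learning regression model is trained on historical data from films that have already been released, then applied to the pre-release signals of the new film.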

Step 1 — Build a Historical Training Dataset

The training table combines pre-release metadata with the actual MS each film eventually received:

film_id	genre	director_avg_ms	lead_actor_avg_ms	budget_usd	studio_tier	trailer_views_7d	release_season	actual_ms
tt1234	Sci-Fi	81.2	77.5	165M	A	12.4M	Summer	83.0

Step 2 — Feature Engineering (Pre-Release Signals)

• Director & cast history — average MS of their previous films
• Studio quality score — the studio's average MS over the past 5 years
• Production budget — higher budgets tend to correlate with higher production quality
• Trailer engagement — YouTube/social media views and like ratios in the first week
• Genre — different genres have different historical MS distributions
• Release timing — summer blockbuster vs. awards season (Nov–Dec)
• Screenwriter track record — same logic as director history
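
History-based features such as director_avg_ms need one precaution: only films released before the target film should count, otherwise the feature leaks information that was not available before release. The sketch below shows one way to compute such a feature; the function name `prior_avg_ms`, the row layout, and the sample history are all invented for illustration.

```python
from datetime import date
from statistics import mean


def prior_avg_ms(person, before, history, key="director"):
    """Average MS of a person's films released strictly before `before`.

    Returns None when the person has no earlier films (e.g. a debut).
    """
    scores = [
        film["ms"]
        for film in history
        if film[key] == person and film["release_date"] < before
    ]
    return round(mean(scores), 1) if scores else None


# Illustrative history rows; names and scores are made up for the example.
history = [
    {"director": "Director X", "release_date": date(2015, 6, 1), "ms": 80.0},
    {"director": "Director X", "release_date": date(2018, 7, 1), "ms": 84.0},
    {"director": "Director X", "release_date": date(2022, 7, 1), "ms": 70.0},
    {"director": "Director Y", "release_date": date(2016, 3, 1), "ms": 60.0},
]

# Feature for a Director X film releasing in 2020: the 2022 film is excluded.
feature = prior_avg_ms("Director X", date(2020, 1, 1), history)
```

The same function would cover lead actors or screenwriters by passing a different `key`, provided the history rows carry that column.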

Step 3 — Model Selection

• Random Forest / XGBoost — captures non-linear feature interactions, interpretable via feature importance
• Ridge Regression — robust against overfitting when data is limited
• LightGBM — fast and highly effective on larger historical datasets
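
To illustrate why ridge regression is the cautious choice on small datasets, here is a single-feature version written from scratch. It is a teaching sketch under simplifying assumptions (one predictor, unpenalized intercept, invented toy data); a real pipeline would more likely use a library implementation such as scikit-learn's `Ridge` with the full feature set.

```python
from statistics import mean


def fit_ridge_1d(xs, ys, alpha=1.0):
    """Fit y ~ intercept + slope * x with an L2 penalty on the slope only."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    # The penalty alpha inflates the denominator and shrinks the slope toward 0.
    slope = sxy / (sxx + alpha)
    intercept = y_bar - slope * x_bar
    return intercept, slope


def predict(model, x):
    """Apply a fitted (intercept, slope) pair to one feature value."""
    intercept, slope = model
    return intercept + slope * x


# Toy data, invented for illustration: director_avg_ms vs. actual_ms.
director_avg = [60.0, 70.0, 80.0, 90.0]
actual_ms = [62.0, 68.0, 79.0, 85.0]

ols = fit_ridge_1d(director_avg, actual_ms, alpha=0.0)      # no penalty
ridge = fit_ridge_1d(director_avg, actual_ms, alpha=100.0)  # strong penalty
```

The penalized fit has a flatter slope, so its predictions stay closer to the average MS of the training films. That is the behavior that limits overfitting when only a few films are available.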

Step 4 — Output: Score + Confidence Interval

Rather than a single number, the model outputs a range:

Predicted MS: 76 ± 5 (95% confidence interval: 71–81)
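
One simple way to obtain such a range is to measure the model's errors on held-out films and widen the point prediction accordingly. The sketch below assumes the residuals are roughly normal and centered on zero, and uses z = 1.96 for 95% coverage; the function name and the residual values are invented. A range for an individual film's score is usually called a prediction interval, and quantile regression or conformal methods are alternatives when the normality assumption is doubtful.

```python
from math import sqrt


def prediction_interval(predicted_ms, holdout_residuals, z=1.96):
    """Return (lower, upper) as predicted_ms +/- z * RMSE of held-out residuals."""
    rmse = sqrt(sum(r * r for r in holdout_residuals) / len(holdout_residuals))
    half_width = z * rmse
    # Clamp to the 0-100 scale on which MS is defined.
    lower = max(0.0, predicted_ms - half_width)
    upper = min(100.0, predicted_ms + half_width)
    return lower, upper


# Invented residuals (actual_ms - predicted_ms) on held-out films.
residuals = [-2.0, 2.0, -2.0, 2.0]
low, high = prediction_interval(76.0, residuals)
```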

Step 5 — Real-Time Signal Updates

As the release date approaches, the prediction is progressively refined with new signals: critics' screening reactions, film festival reception (Sundance, Cannes, TIFF), social media sentiment analysis, and advance ticket sales data.

Step 6 — Airflow Integration

A new DAG monitors an "upcoming films" table. When a new entry appears, it triggers feature collection, runs the model, and writes a predicted_ms column. Once the film is released and the real MS is computed, a prediction_error column is populated — providing a continuous feedback loop to retrain and improve the model over time.
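
The feedback half of this loop can be sketched in a few functions. The sign convention (actual minus predicted), the MAE-based retraining rule, and the threshold of 5 points are assumptions chosen for illustration; the film IDs and scores are invented.

```python
def fill_prediction_error(rows):
    """Set prediction_error = actual_ms - predicted_ms once actual_ms is known."""
    for row in rows:
        if row.get("actual_ms") is not None and row.get("predicted_ms") is not None:
            row["prediction_error"] = round(row["actual_ms"] - row["predicted_ms"], 1)
    return rows


def mean_absolute_error(rows):
    """MAE over rows that already have a prediction_error; None if there are none."""
    errors = [
        abs(r["prediction_error"])
        for r in rows
        if r.get("prediction_error") is not None
    ]
    return sum(errors) / len(errors) if errors else None


def should_retrain(rows, mae_threshold=5.0):
    """Flag a retrain when the observed MAE exceeds the (hypothetical) threshold."""
    mae = mean_absolute_error(rows)
    return mae is not None and mae > mae_threshold


# Invented rows from the "upcoming films" table; film_c is not released yet.
upcoming = [
    {"film_id": "film_a", "predicted_ms": 76.0, "actual_ms": 83.0},
    {"film_id": "film_b", "predicted_ms": 70.0, "actual_ms": 65.0},
    {"film_id": "film_c", "predicted_ms": 81.0, "actual_ms": None},
]
fill_prediction_error(upcoming)
```

Unreleased films are left without a prediction_error value, so they neither count toward the MAE nor trigger a retrain on their own.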

Summary

	Part 1	Part 2
Data	Live API scores (IMDB, RT, Metacritic)	Historical MS + pre-release metadata
Processing	Normalize → Average	Feature engineering → ML regression
Output	Exact MS	Predicted MS with confidence interval
Orchestration	Airflow DAG (@daily)	Airflow DAG (event-triggered)
Storage	data/raw/movies.csv → data/processed/movie_scores.csv	upcoming_films → predicted_ms column
