What’s Really on Netflix? A Data Deep-Dive into the Streaming Giant
Introduction
As one of my very first independent data analysis projects, I wanted to dive into a dataset that was not only rich in volume but also grounded in real-world relevance. The Netflix dataset turned out to be a perfect fit: over 6,000 titles, plenty of messy (and realistic) data, and a topic I, like many others, found genuinely interesting. After all, Netflix is arguably the most globally recognized streaming platform.
This project was more than just an attempt to uncover trends or answer questions; it was my personal introduction to the world of data analysis. My main goal was to learn by doing: to get hands-on experience, build confidence, and become more comfortable working with real data. I wanted to understand what it really means to explore, clean, and analyze a dataset from scratch.
That said, the dataset did offer plenty of interesting questions: What exactly does Netflix’s catalog consist of? Are movies or TV shows more prevalent? How has its content evolved over time? Which genres dominate, and which countries, actors, and directors shape the platform’s global reach? Through a mix of cleaning, visualizing, and exploring, I uncovered a number of trends, surprises, and storytelling opportunities, all while developing the foundational skills every data analyst needs.
View full code on GitHub.
1. Data overview and cleaning
Before diving into any deeper analysis, I began with a systematic exploration of the dataset to verify it loaded correctly and to understand its structure and quality.
Following this initial overview, I focused on cleaning the dataset by identifying and handling missing values and duplicates, and reformatting key columns to make the data easier to work with and more accessible for analysis.
What I did:
- Inspected the dataset’s structure using .head(), .tail(), .info(), and .columns to preview the data, inspect data types, and assess completeness.
- Checked for null values across all variables with .isnull().sum().
- Removed or handled missing data depending on the column and context.
- Reformatted date_added into proper datetime format and extracted year_added and month_added to enable temporal analysis.
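# Strip stray whitespace so every date string parses cleanly.
movies["date_added"] = movies["date_added"].str.strip()
movies["date_added"] = pd.to_datetime(movies["date_added"])
# Derive year and month columns for the temporal analysis.
movies["year_added"] = movies["date_added"].dt.year
movies["month_added"] = movies["date_added"].dt.month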
- Parsed the duration column into:
  - duration_minutes for movies (e.g., “90 min”)
  - seasons for TV shows (e.g., “2 Seasons”)
# Movies store a runtime such as "90 min"; TV shows store a count such as "2 Seasons".
movies["duration_minutes"] = movies["duration"].apply(lambda x: int(x.split(" ")[0]) if "min" in x else None)
movies["seasons"] = movies["duration"].apply(lambda x: int(x.split(" ")[0]) if "min" not in x else None)
# The original column is now redundant.
movies = movies.drop(columns=['duration'])
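The inspection and duplicate checks described in this section can be sketched roughly as follows. This is a minimal illustration rather than the exact code from the project: the rows are invented stand-ins for the real file, and the variable names null_counts and n_duplicates are hypothetical.

```python
import pandas as pd

# Toy stand-in for the Netflix data (invented rows, not real catalog entries).
movies = pd.DataFrame({
    "title": ["A", "B", "B", "C"],
    "director": ["Jane Doe", "Sam Poe", "Sam Poe", None],
    "duration": ["90 min", "2 Seasons", "2 Seasons", "1 Season"],
})

# Preview the data and check structure, data types, and completeness.
print(movies.head())
print(movies.tail())
print(movies.columns)
movies.info()

# Missing values per column.
null_counts = movies.isnull().sum()
print(null_counts)

# Count exact duplicate rows, then drop them.
n_duplicates = int(movies.duplicated().sum())
movies = movies.drop_duplicates().reset_index(drop=True)
print(n_duplicates, len(movies))
```

Running the checks before any cleaning makes it easier to tell afterwards which gaps came from the source data and which were introduced by later transformations.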
Findings:
Several columns (like director, cast, country) had substantial missing values, typical of real-world datasets.
The missing data in each variable could have different explanations. For instance, the large number of missing values for director might be related to the lack of an actual director for some kinds of shows, such as reality shows. In addition, TV productions typically involve multiple directors across episodes, making it harder to assign a single name. A closer look at the observations with a missing director value reveals that the majority of them are TV Shows: 2226 observations, against 163 for Movies.
On the other hand, more movies than TV shows are missing cast data (426 versus 292), possibly because cast listings for TV shows are more prominently featured and curated, while smaller, lesser-known, or older movies may lack complete metadata.
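A breakdown of missing values by content type, like the director and cast counts above, could be produced along these lines. This is only a sketch: it assumes the dataset has a type column with the labels "Movie" and "TV Show", the rows are invented, and missing_by_type is a hypothetical name.

```python
import pandas as pd

# Invented sample; the "type" column and its labels are assumptions.
movies = pd.DataFrame({
    "type": ["Movie", "TV Show", "TV Show", "Movie", "TV Show"],
    "director": ["Jane Doe", None, None, None, "John Roe"],
    "cast": [None, "Actor A", "Actor B", None, None],
})

# Flag missing cells, then sum the flags within each content type.
missing_by_type = (
    movies[["director", "cast"]].isnull().groupby(movies["type"]).sum()
)
print(missing_by_type)
```

Each cell of the result is the number of titles of that type with no value in that column, which is the kind of comparison reported above.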
Moreover, values in country could be missing due to lack of information or multi-country production.
Finally, the duration_minutes and seasons columns were created during data cleaning, when I split the original duration field into two distinct variables, so missing values in each are expected: a movie logically has no seasons value, and a TV show has no duration in minutes.
Despite the significant number of null values in some variables, removing these rows entirely could lead to a loss of valuable data. Instead, I handled them according to the analysis context, creating temporary filtered subsets (excluding nulls) for variable-specific analyses, such as identifying the most frequent directors or countries.
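The idea of temporary filtered subsets might look like the sketch below. The rows are invented, and the assumption that co-productions are stored as a comma-separated string in country is mine; top_directors and top_countries are hypothetical names.

```python
import pandas as pd

# Invented sample rows for illustration only.
movies = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["Jane Doe", None, "Jane Doe", "John Roe"],
    "country": ["United States, India", None, "India", "United States"],
})

# Temporary subset: rows without a director are excluded only for this question.
top_directors = movies.dropna(subset=["director"])["director"].value_counts()

# Same idea for countries; co-productions (assumed comma-separated) are split
# so that each listed country is counted once per title.
top_countries = (
    movies.dropna(subset=["country"])["country"]
    .str.split(", ")
    .explode()
    .value_counts()
)

print(top_directors.head(10))
print(top_countries.head(10))
```

Because dropna returns a new object, the full table keeps all of its rows for analyses that do not depend on these columns.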
2. Distribution of Contents
Netflix’s catalog is made up of two main types of content: Movies and TV Shows. Breaking down the dataset, I found a clear split between the two, with Movies representing just over two thirds (about 69%) of the content on the platform. In absolute terms, Netflix hosts 5377 movies and 2410 TV shows.
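As a quick check of the split, the shares can be computed directly from the two counts above. On the full dataset, the counts themselves would presumably come from something like a value_counts() call on a content-type column, but that column name is an assumption and is not used here.

```python
import pandas as pd

# Counts as reported in the text.
counts = pd.Series({"Movie": 5377, "TV Show": 2410})

# Share of each content type, in percent.
shares = (counts / counts.sum() * 100).round(1)
print(counts.sum())
print(shares)
```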
3. Content Trends Over Time
While both Movies and TV Shows are well represented, it is worth looking at how the catalog has changed over the years. By organizing titles by the year and month they were added to the platform, I was able to spot some clear trends. From 2008 to 2015, the growth in content remained relatively flat. Starting in 2016, however, there is a noticeable uptick in the volume of both Movies and TV Shows added to the platform. This surge likely coincides with Netflix’s global expansion and its increasing investment in original productions, which significantly boosted the platform’s popularity. The upward trend continues sharply in the following years, especially for Movies, which show a steeper growth trajectory than TV Shows. Despite the slower pace, the increase in TV Shows remains substantial.
Interestingly, between 2020 and 2021, the final years captured in this dataset, the trend flattens again, and the number of both Movies and TV Shows on the platform remains almost unchanged. This stabilization may reflect broader shifts in the streaming landscape, including the emergence of strong competitors and a more saturated market.
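A yearly view like the one described here could be built roughly as follows. This is a sketch under assumptions: it reuses the year_added column created during cleaning, assumes a type column with the labels "Movie" and "TV Show", and runs on invented rows. The names added_per_year and catalog_size are hypothetical.

```python
import pandas as pd

# Invented sample; "type" and its labels are assumptions.
movies = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "Movie", "TV Show", "Movie"],
    "year_added": [2015, 2016, 2016, 2017, 2017, 2017],
})

# Titles added per year, one column per content type.
added_per_year = (
    movies.groupby(["year_added", "type"]).size().unstack(fill_value=0)
)

# Running total, i.e. how many titles of each type had been added by each year.
catalog_size = added_per_year.cumsum()

print(added_per_year)
print(catalog_size)
```

The two tables answer slightly different questions: the first shows how many titles arrived each year, while the second shows how the catalog accumulated, so it helps to be explicit about which one a chart is based on.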