Data Analysis of IMDB Data
[youtube https://www.youtube.com/watch?v=mS3dzczv1ZQ?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&start=1&wmode=transparent]
We all are surrounded by data and it reveals lot of things to us to make our decisions and recommends the next steps. Data is collected from different sources such as Web, Database, log files etc. and then it is thoroughly cleaned and reshaped, and further used for analysis and explored to determine the hidden patterns and trends which is really essential for any business decision making, Extracting data from web is always easy with the help of API’s but what if website doesn’t provide any API’s, In such case, Web Scraping is an excellent way to extract the unstructured data from web and put that in structured format like excel,csv, database etc..
Web Scraping with Selenium WebdriverPermalink
- if there is any content on the page rendered by javascript then Selenium webdriver wait for the entire page to load before crwaling whereas other libs like BeautifulSoup,Scrapy and Requests works only on static pages.
- Any browser actions can be done with the help of Selenium webdriver, if there is any content on the page displayed by on button click or Scrolling or Page Navigation
Using IPython NotebookPermalink
Some ImportsPermalink
Open Page in Chrome Browser
Python Function to Scrap Data using Selenium WebdriverPermalink
Just provide the address(Xpath or any other locator) of the data to be extracted and Selenium webdriver extracts all the data from the page just with the help of one api(find_element_by_xpath), See how easy it is. Isn’t it?
Data is extracted from the page with the help of webdriver and is stored in a list, So individual list for all the following data is created:
- Movie Name
- Release Year
- IMDB Rating
- Votes
- Director
- Genre
How does Extracted data looks like?Permalink
Sample Data set for Movie Name, Votes and Director is displayed here, Rest of the data is also stored in individual python list
**_[‘Bajirao Mastani’, ‘Queen’, ‘Bhaag Milkha Bhaag’, ‘Barfi!’, ‘Zindagi Na Milegi Dobara’]_**
**_[‘17,362’, ‘39,518’, ‘39,731’, ‘52,308’, ‘41,731’]_**
**_[‘Director: Sanjay Leela Bhansali’, ‘Director: Vikas Bahl’, ‘Director: Anurag Basu’]_**
Don’t see any co-relation between these data, if someone have to pick the release year and director for a movie then it’s difficult to get it from these lists, So let’s put the data in a Structured and more meaningful format which will make sense for someone looking at this data. So lets bind the data in Python dictionary
Python DictionaryPermalink
Structured DataPermalink
{
“Director”: “Director: Sanjay Leela Bhansali”,
“Votes”: “17,362”,
“RunTime”: “A historical … (158 mins.)”,
“Year”: 2015,
“Genre”: “Drama”,
“Movie Name”: “Bajirao Mastani”,
“Rating”: “7.2”
}
This Data in python dictionary(Key:value pair) looks good and make more sense now, However if you look carefully the data is not in correct format for data manipulation, Votes value contains comma, Director contains unwanted text “Director:” and Ratings and Runtime are not in correct data type. Lets Clean this data to bring it in shape for performing analysis
Data CleansingPermalink
{
“Director”: “Sanjay Leela Bhansali”,
“Votes”: 17362,
“RunTime”: 158,
“Year”: 2015,
“Genre”: “Drama”,
“Movie Name”: “Bajirao Mastani”,
“Rating”: 7.2
}
Data in Pandas DataframePermalink
The entire movie data is stored in python dictionary but for doing further analysis this data needs to be consumed by Pandas Dataframe so that by using Pandas rich data structures and built-in function we can do some analysis on this data. Import data in Dataframe.
Records with Missing ValuesPermalink
There are some missing values in this data, But Pandas provides excellent feature to handle missing and null values. So for these 3 movies RunTime data is not available on the page. so for further analysis we will replace this missing data with the mean value of the available data for RunTimeColumn
Replace Missing Values with MeanPermalink
Movies with Highest RatingsPermalink
Top five movies since 1955Permalink
Ratings Trending GraphPermalink
Movies with Lowest RatingsPermalink
Movies with Maximum Run TimePermalink
Top 10 movies
RunTime Trending GraphPermalink
Average Movie RunTimePermalink
Movies IMDB RatingsPermalink
Movies with rating Greater than 7
Ratings Visualization using Bar GraphPermalink
Percentage distributionPermalink
Best Movies By GenrePermalink
Directors won more than OncePermalink
Conclusion:Permalink
Movies most likely to be selcted for Best PicturePermalink
- Rating greater than 7
- Run time more than 2hrs
- Category Drama and Musical