IBM Data Science Challenge


Problem Statement :

Given the data of Bollywood Movies, please perform one or more of the following data analysis tasks :
  • Enable multimodal question answering on top of this dataset.
    Users should be able to ask questions (in text format), and the output should be text and/or an image.
    Users may also provide an image as input, and the output should be the plot/points relevant to that image.
  • Convert the movie plot into an entity-relationship graph where each path traversal provides a different story arc of the movie.
  • The dataset has been used to show gender bias present in Bollywood.
  • You may extend this work to change the movie text to be gender neutral.
  • Can you show a relationship between the backdrop of a movie and the gender bias presented in it?
    For example - Is gender bias more prevalent in movies with a rural setting than with an urban one?
  • You may come up with any other innovative use of the dataset that proposes new text, image, or video tasks and early solutions.
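
As an illustration of the entity-relationship graph task above, the following is a minimal sketch rather than a prescribed approach. It assumes the plot has already been reduced to (subject, relation, object) triples; the triples, entity names, and the helper functions build_graph and story_arcs are all hypothetical and not part of the dataset. Each maximal path from a starting entity is treated as one candidate story arc.

```python
from collections import defaultdict

# Hypothetical triples; in practice these would be extracted from a movie plot.
TRIPLES = [
    ("Hero", "loves", "Heroine"),
    ("Heroine", "is threatened by", "Villain"),
    ("Hero", "confronts", "Villain"),
]


def build_graph(triples):
    """Build a directed adjacency list: subject -> [(relation, object), ...]."""
    graph = defaultdict(list)
    for subj, rel, obj in triples:
        graph[subj].append((rel, obj))
    return graph


def story_arcs(graph, start, path=None, seen=None):
    """Yield every maximal simple path from `start` as a list of triples."""
    path = path or []
    seen = (seen or set()) | {start}  # new set per call, so branches stay independent
    extended = False
    for rel, obj in graph.get(start, []):
        if obj not in seen:  # skip entities already on this path to avoid cycles
            extended = True
            yield from story_arcs(graph, obj, path + [(start, rel, obj)], seen)
    if not extended and path:
        yield path


graph = build_graph(TRIPLES)
arcs = list(story_arcs(graph, "Hero"))
for arc in arcs:
    print(" -> ".join(f"{s} {r} {o}" for s, r, o in arc))
```

With the sample triples, the traversal produces two arcs starting from "Hero": one that passes through "Heroine" before reaching "Villain", and one that goes to "Villain" directly.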

Dataset :
Please find the link to the data -

Data Description :

This repository contains three types of Bollywood Data:
  • scripts-data
  • trailers-data
  • wikipedia-data

Trailers data
This dataset contains the gender and emotion data for all Bollywood Movie Trailers released from 2008 to 2017.

The dataset includes the following folder:
individual-trailer-data: Contains the gender and emotion data detected at each frame of the trailer video.
The repository also includes the following files :
  • trailers_list.csv: Contains movie names and year of release of all the trailers in the dataset
  • complete-data.csv: It has gender and emotion information for each of the trailers in the data folder. It has the following columns :
    1. frame_number - the frame number of the trailer in which emotion and gender detection occurred
    2. man/woman - whether the detected person was a man or a woman
    3. emotion - the emotion portrayed by the man/woman detected in the image
    4. year - the year in which the movie was released
    5. movie_name - the name of the movie
  • : Compressed and zipped file of all individual trailers' data.
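
To make the column layout of complete-data.csv concrete, here is a minimal sketch that counts detections per year and gender. It assumes the CSV header uses exactly the column names listed above (including the literal name "man/woman") and that the gender values are the strings "man" and "woman"; the sample rows, movie names, and the helper gender_counts_by_year are invented for illustration and are not taken from the dataset.

```python
import csv
import io
from collections import Counter

# Hypothetical sample in the documented column layout; values are made up.
SAMPLE = """frame_number,man/woman,emotion,year,movie_name
1,man,happy,2015,Movie A
2,woman,sad,2015,Movie A
3,man,angry,2016,Movie B
4,man,happy,2016,Movie B
"""


def gender_counts_by_year(rows):
    """Count detections for each (year, gender) pair."""
    counts = Counter()
    for row in rows:
        gender = row["man/woman"].strip().lower()
        counts[(int(row["year"]), gender)] += 1
    return counts


# For the real file, replace io.StringIO(SAMPLE) with
# open("complete-data.csv", newline="").
rows = list(csv.DictReader(io.StringIO(SAMPLE)))
counts = gender_counts_by_year(rows)
for (year, gender), n in sorted(counts.items()):
    print(year, gender, n)
```

The same grouping can be applied to the emotion column to compare which emotions are detected for men and women in each year.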

Wikipedia Data
This dataset contains data collected from Wikipedia for Bollywood movies. It also contains data files generated by processing the Wikipedia output.
Details of each file are as follows -
  1. avg_wv_relation.csv - Contains word vector relations data used in Inter sentence level
  2. coref_plot.csv - Contains coreferenced plot using OpenIE
  3. female_adjectives.csv - Contains adjectives used for females extracted using Stanford Dependency Parser
  4. female_adjverb.csv - Contains adjectives and verbs generated using Stanford Dependency Parser
  5. female_centrality.csv - Contains centrality for females across all movies in text
  6. female_mentions_centrality.csv - Contains centrality and mentions of females in movies
  7. female_verb.csv - Contains verbs used for females generated using Stanford Dependency Parser
  8. male_adjectives.csv - Contains adjectives used for males extracted using Stanford Dependency Parser
  9. male_adjverb.csv - Contains adjectives and verbs generated using Stanford Dependency Parser
  10. male_centrality.csv - Contains centrality for males across all movies
  11. male_mentions_centrality.csv - Contains centrality and mentions of males in movies in text
  12. male_verb.csv - Contains verbs used for males generated using Stanford Dependency Parser
  13. songsDB.csv - contains soundtrack information
  14. songsFrequency.csv - contains soundtrack frequency
  15. image_and_plot_mentions_fequency.csv - contains poster mentions and text mentions for males and females in each movie

Submission Format:
Please submit an A2-sized poster explaining your work.

Submission Deadline: March 31, 2018

Contact Us

Research Showcase 2k18

Indraprastha Institute of Information Technology, Delhi
Okhla Industrial Estate, Phase III
(Near Govind Puri Metro Station)
New Delhi, India - 110020
Phone No: 91-11-26907400-7404 (5 lines)
Fax: 91-11-26907405