Measuring on-screen portrayal: Guidelines for evaluation

Computer vision is increasingly used to analyze visual media and measure on-screen representation. This approach, relevant to media regulators, broadcasters, researchers, and film and TV fans alike, focuses on three aspects of diversity often called the '3Ps': presence, prominence, and portrayal.

By moving from presence to prominence and portrayal, computer vision can bring new value and prompt new questions. The framework excludes off-screen inclusion and is most applicable to groups falling under Equality Act protected characteristics, such as gender, gender identity, age, ethnicity, sexual orientation, and disability.

Measuring Prominence

Computer vision can measure the prominence of on-screen representation by analyzing visual features such as the location, size, and saliency of objects within the frame. Object detection, semantic and instance segmentation, and saliency maps are central to this process.

Object detection and segmentation models can identify and locate various on-screen entities (people, objects) by drawing bounding boxes or pixel-level masks around them. The size and position of these detected objects relative to the entire screen can quantify their visual prominence.
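As a minimal sketch of how size and position could be turned into numbers, the Python below takes one bounding box and reports the share of the frame it covers and how far its centre sits from the frame centre. The `Box` type, the function names, and the choice to normalize the offset so that a frame corner scores 1 are illustrative assumptions, not part of the framework described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in pixels: (x0, y0) is top-left, (x1, y1) is bottom-right."""
    x0: float
    y0: float
    x1: float
    y1: float


def area_fraction(box: Box, frame_w: int, frame_h: int) -> float:
    """Share of the frame covered by the box, after clipping the box to the frame."""
    w = max(0.0, min(box.x1, frame_w) - max(box.x0, 0.0))
    h = max(0.0, min(box.y1, frame_h) - max(box.y0, 0.0))
    return (w * h) / (frame_w * frame_h)


def centre_offset(box: Box, frame_w: int, frame_h: int) -> float:
    """Distance from box centre to frame centre: 0 is dead centre, 1 is a frame corner."""
    cx = (box.x0 + box.x1) / 2
    cy = (box.y0 + box.y1) / 2
    dx = (cx - frame_w / 2) / (frame_w / 2)
    dy = (cy - frame_h / 2) / (frame_h / 2)
    return ((dx * dx + dy * dy) / 2) ** 0.5


# Hypothetical detection: a person box covering the middle of a 1920x1080 frame.
person = Box(480, 270, 1440, 810)
print(area_fraction(person, 1920, 1080), centre_offset(person, 1920, 1080))
```

A larger area fraction and a smaller centre offset would both point toward greater visual prominence, although how the two should be weighted is a judgment call for the analyst.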

Saliency maps highlight regions that attract viewers’ attention either through human visual behavior modeling or by revealing the focus areas of computer vision models. They can measure which parts of the screen are most visually significant, indicating the prominence of different on-screen elements.
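One way to relate a saliency map to a detected person is to ask what share of the total saliency falls inside that person's bounding box. The sketch below assumes the saliency map is already available as a grid of non-negative values (how it is produced is out of scope here) and that the box uses integer pixel coordinates with exclusive right and bottom edges; the function name `saliency_share` is hypothetical.

```python
def saliency_share(saliency, box):
    """Fraction of the total saliency mass that falls inside a box.

    saliency: list of rows, each a list of non-negative numbers.
    box: (x0, y0, x1, y1) in integer pixels, with x1 and y1 exclusive.
    """
    x0, y0, x1, y1 = box
    total = sum(sum(row) for row in saliency)
    if total == 0:
        return 0.0  # a blank map carries no attention signal
    inside = sum(sum(row[x0:x1]) for row in saliency[y0:y1])
    return inside / total


# Toy 4x4 saliency map with attention concentrated in the middle columns.
toy_map = [
    [0, 0, 0, 0],
    [0, 3, 1, 0],
    [0, 3, 1, 0],
    [0, 0, 0, 0],
]
print(saliency_share(toy_map, (1, 1, 2, 3)))  # box over the brighter column
```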

Deep learning models, such as convolutional neural networks (CNNs), including residual networks (ResNets), can extract hierarchical features from images to detect and classify objects. This enables the assessment of prominence by quantifying how much of the screen the detected objects occupy or how much emphasis the model places on those regions.

In practical terms, prominence assessment might involve four steps: (1) running an object detection or segmentation model to identify all people or key objects on screen; (2) calculating the relative size (area) of these detections as a percentage of screen space; (3) using saliency maps to determine which objects attract the most attention; and (4) aggregating these metrics to provide a quantitative measure of how prominent each on-screen representation is.
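The aggregation step could look something like the following sketch. It assumes an upstream stage has already produced, for every frame, a list of detections carrying a `track_id` (an identifier for a character occurrence, not a demographic label), an `area_fraction`, and a `saliency_share`. The field names, the equal default weights, and the way the score multiplies screen time by a weighted mean are all illustrative assumptions rather than a recommended metric.

```python
from collections import defaultdict


def aggregate_prominence(frames, w_area=0.5, w_saliency=0.5):
    """Summarize per-frame measurements into one record per tracked person.

    frames: list of frames; each frame is a list of dicts with keys
            'track_id', 'area_fraction', and 'saliency_share'.
    Assumes each track appears at most once per frame.
    """
    n_frames = len(frames)
    sums = defaultdict(lambda: {"frames": 0, "area": 0.0, "saliency": 0.0})
    for detections in frames:
        for det in detections:
            s = sums[det["track_id"]]
            s["frames"] += 1
            s["area"] += det["area_fraction"]
            s["saliency"] += det["saliency_share"]

    summary = {}
    for track_id, s in sums.items():
        screen_time = s["frames"] / n_frames          # share of frames on screen
        mean_area = s["area"] / s["frames"]           # average size while on screen
        mean_saliency = s["saliency"] / s["frames"]   # average attention while on screen
        summary[track_id] = {
            "screen_time_share": screen_time,
            "mean_area_fraction": mean_area,
            "mean_saliency_share": mean_saliency,
            "score": screen_time * (w_area * mean_area + w_saliency * mean_saliency),
        }
    return summary


# Hypothetical four-frame clip with two tracked people, "A" and "B".
clip = [
    [{"track_id": "A", "area_fraction": 0.2, "saliency_share": 0.6},
     {"track_id": "B", "area_fraction": 0.1, "saliency_share": 0.2}],
    [{"track_id": "A", "area_fraction": 0.4, "saliency_share": 0.8}],
    [],
    [],
]
for track, stats in aggregate_prominence(clip).items():
    print(track, stats)
```

Keeping the three components alongside the combined score lets a reader see whether a person ranks highly because they are on screen often, large in frame, or the focus of attention.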

These techniques enable automated, scalable, and objective measurement of representation in visual media, useful for studies of screen time diversity, bias, or content analysis.

Considerations and Recommendations

The framework proposes three questions to consider when measuring representation: what aspect of diversity is being measured, who is being tracked on screen, and how character occurrences are identified. It also raises considerations of feasibility (can we?) and ethics (should we?) when tracking a group on screen and using visual approaches.

Computer vision models can speed up the identification of character occurrences, but should not be used to infer demographic attributes. Technical recommendations and data standards specific to representation metrics can be developed in the longer term.

Intersectionality refers to the interconnected nature of social categories: for example, how race/ethnicity, socio-economic background, disability, and other characteristics intersect with gender. Data compilation methods for representation analysis can vary, with each method capturing different aspects of diversity.

The optimal model parameters for clustering faces with computer vision vary by the type of program being analyzed. Programs with higher variance in viewpoint, more crowds, and darker lighting may yield less reliable face clusters.
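To illustrate why clustering parameters matter, the sketch below groups face embeddings with a deliberately simple greedy scheme: each embedding joins the first cluster whose representative lies within a cosine-distance threshold, otherwise it starts a new cluster. This is a stand-in chosen for brevity, not the clustering method used by the framework; the two-dimensional vectors and the threshold values are made up, and real face embeddings would come from a separate model and have many more dimensions.

```python
import math


def cosine_distance(a, b):
    """1 minus cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def greedy_cluster(embeddings, threshold):
    """Assign each embedding to the first cluster representative within threshold."""
    representatives = []  # first embedding seen for each cluster
    labels = []
    for emb in embeddings:
        for label, rep in enumerate(representatives):
            if cosine_distance(emb, rep) <= threshold:
                labels.append(label)
                break
        else:
            # No existing cluster is close enough, so start a new one.
            representatives.append(emb)
            labels.append(len(representatives) - 1)
    return labels


# Toy embeddings: two faces, each seen under slightly different conditions.
faces = [[1.0, 0.0], [0.98, 0.2], [0.0, 1.0], [0.2, 0.98]]
print(greedy_cluster(faces, threshold=0.1))   # looser threshold merges the variants
print(greedy_cluster(faces, threshold=0.01))  # stricter threshold splits them apart
```

The same four detections collapse into two identities or fragment into four depending only on the threshold, which is one reason footage with varied viewpoints, crowds, or dark lighting may need different settings.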

In the next blog post, computer vision will be demonstrated for measuring the relative prominence of people on screen. The framework advises against computationally inferring the demographics of characters or people, but computer vision can still be used to identify character occurrences.

Computer vision can quantify the prominence of on-screen representation by focusing on factors like object location, size, and saliency within the frame.
Object detection, semantic and instance segmentation, and saliency maps are the main techniques for analyzing visual media and measuring visual prominence.
Deep learning models such as convolutional neural networks (CNNs), including residual networks (ResNets), underpin these techniques and can provide valuable insights into the prominence of on-screen elements.
By assessing the size and position of detected objects, calculating screen space percentage, and utilizing saliency maps, computer vision offers an automated, scalable, and objective approach to measuring representation in various industries, including media, research, and education.
Beyond measuring prominence, the framework asks which aspect of diversity is being measured, who is being tracked on screen, and whether tracking and analyzing that data is feasible and ethical.
Intersectionality should be considered when analyzing representation, and face clustering parameters should be tuned to the type of program, since viewpoint, crowd density, and lighting conditions all affect how reliable the clusters are.
Data compilation methods for representation analysis should be chosen carefully, as each captures different aspects of diversity, and the inference of demographic attributes should be avoided.
