The dangers of video metadata and how to avoid them

Thomas Ward August 16, 2020

Short summary

Metadata, or "data about data," can uniquely identify patients. Surgical videos used to train AI systems can store large amounts of metadata unknown to the surgeons and scientists who use the videos, which could lead to accidental patient privacy breaches. Every effort must be made to maximally "de-identify" surgical datasets to protect our patients' privacy. To help address this, I wrote a program called deidentify_videos.py, available on my GitHub, that uses FFmpeg to quickly strip metadata from video collections.

Introduction to metadata and its issues

Metadata is confusingly defined as "data about data" but is best understood through an example. Take text messages: the "data" is the contents of the message, like "Hi, where do you want to meet up?" The text message's metadata includes the message's recipients, what time you sent it, what cell towers the message routed through to make it from your phone to the other phone, etc. Even without reading your text message contents, your cellular network knows a vast amount of information about you through the metadata alone!

Metadata is a powerful identifier of people. You can uniquely identify 87% of US citizens by their zip code, date of birth, and gender alone.^[1] This makes "anonymized" patient databases not really anonymized at all. Using only the data available in newspaper stories, researchers could identify 40% of the articles' subjects and retrieve all the information on their hospital stay from an "anonymized" database in Washington.^[2]
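To make this kind of re-identification risk concrete, here is a minimal Python sketch that measures what fraction of records in a table are unique on a given set of fields. The function name and the toy records are made up for illustration; they are not data from the cited studies.

```python
from collections import Counter


def unique_fraction(records, keys):
    """Return the fraction of records whose combination of `keys` is unique."""
    if not records:
        return 0.0
    # Count how often each combination of field values occurs.
    combos = Counter(tuple(r[k] for k in keys) for r in records)
    unique = sum(1 for r in records if combos[tuple(r[k] for k in keys)] == 1)
    return unique / len(records)


# Hypothetical toy records (not real patients).
records = [
    {"zip": "02114", "dob": "1960-01-02", "gender": "F"},
    {"zip": "02114", "dob": "1960-01-02", "gender": "F"},
    {"zip": "02114", "dob": "1971-05-09", "gender": "M"},
    {"zip": "02139", "dob": "1971-05-09", "gender": "M"},
]

print(unique_fraction(records, ["zip", "dob", "gender"]))  # 0.5
print(unique_fraction(records, ["gender"]))                # 0.0
```

Any record that is unique on these fields can be matched to a single person by anyone who knows those fields from another source.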

True "anonymization" is a myth. I prefer to use the term "de-identification," which is the process of removing identifying information from data. Complete "de-identification" of data is impossible, but avoiding the term "anonymization" prevents the false belief that the data is now anonymous and can therefore be handled less carefully (sharing publicly, storing on unsecured devices, etc.).

AI researchers use tremendous amounts of data to teach AI programs. For researchers in the healthcare space, this becomes highly dangerous: identifiers may lurk in the data and lead to inadvertent privacy violations for the patients. As the push for sharing data and algorithms to validate AI becomes stronger, we will need to release our data to the public. This is a good thing: open science is better science, and the collective inspection of the community can catch errors. However, extreme vigilance will be necessary to prevent inadvertent privacy leaks and harm to our patients.

Metadata in videos

Surgical videos are rich sources of information: a minute of surgical video has 25 times as much information as a high-resolution CT scan.^[3] Obvious identifiers in video data include a clock on the wall caught on camera (and therefore the time of the surgery), the patient's sex through visualization of anatomical structures, the patient's skin color, signs in the operating room, etc. As much as possible, these should be blurred or removed prior to sharing. More insidious identifiers, though, are present in the metadata.

At MGH's Surgical AI and Innovation Laboratory (Saiil), we use surgical video to teach computers surgery, which has led to some surprising findings on how much "hidden" data each video file stores. One metadata source is a video's filename. Many OR system recorders save videos with names like "video-date-time.mp4," which is an almost unique identifier linking the surgery to a patient.
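To show how little effort it takes to recover a recording time from such a name, here is a short Python sketch. The filename layout (an 8-digit date, optionally followed by a 6-digit time) is an assumption for illustration; real recorders may use other layouts.

```python
import re
from datetime import datetime

# Assumed layout: YYYYMMDD, optionally followed by HHMMSS.
_STAMP = re.compile(r"(\d{8})(?:[-_T]?(\d{6}))?")


def timestamp_from_filename(name):
    """Return a datetime parsed from the filename, or None if none is found."""
    match = _STAMP.search(name)
    if match is None:
        return None
    date_part, time_part = match.groups()
    try:
        if time_part:
            return datetime.strptime(date_part + time_part, "%Y%m%d%H%M%S")
        return datetime.strptime(date_part, "%Y%m%d")
    except ValueError:
        return None  # the digits did not form a valid date or time


print(timestamp_from_filename("video-20200814-123730.mp4"))
```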

A less obvious source is the metadata stored in the video file itself. Video files consist of three major components:

A video stream or streams
An audio stream or streams
A container that holds all the streams in one file (e.g., mp4)

Each component can store metadata. Take an example video recorded by one OR recording device (this is not a patient video, just a test recording I made for demonstration purposes). Running the ffprobe command from the FFmpeg software suite on it shows the following:

$ ffprobe test_video.mp4
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from 'test_video.mp4':
  Metadata:
    major_brand     : mp42
    minor_version   : 0
    compatible_brands: isom
    creation_time   : 2020-08-14T12:37:30.000000Z
    title           : Anonymous patient YYYY-MM-DD ID MF
    comment         : Input DVI 1920x1080p59.94 Q=0
    artist          : Anonymous
    album           :
    composer        : MediCap USB300,SN 3014279,FW 140423 MediCapture Inc.
  Duration: 00:00:21.23, start: 0.000000, bitrate: 11542 kb/s
    Stream #0:0(eng): Video: h264 (Baseline) (avc1 / 0x31637661), yuvj420p(pc), 1920x1080 [SAR 1:1 DAR 16:9], 11582 kb/s, 29.97 fps, 29.97 tbr, 60k tbn, 120k tbc (default)
    Metadata:
      creation_time   : 2020-08-14T12:37:30.000000Z
      encoder         : VC Coding

You'll see the following data:

Time and date the video was recorded in both the video stream's metadata and the video container's metadata (Aug 14, 2020 at 12:37 in this test case)
The title which tells you it contains patient data (the title is "Anonymous patient")
The recording device used (a MediCapture USB300 in this case), along with its serial number and firmware version
The length of the video (21.23 seconds for this one)

Without even looking at the video itself, you already know exactly when it was recorded and have some sense of where, just from the device used to record it. Different hospitals use different recorders, and recorders can differ even within the same hospital (as they do at MGH), so the device can help localize where the patient had their operation. From the length of the video, you could even guess at what type of case it was.
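If you want to inspect many files, reading this output by eye gets tedious. Below is a minimal Python sketch that asks ffprobe for a JSON report and lists every metadata tag with its location. The function names are my own for illustration, and the sample report is a hand-written excerpt modeled on the output above rather than real ffprobe output.

```python
import json
import subprocess


def probe(path):
    """Run ffprobe on `path` and return its JSON report as a dict."""
    result = subprocess.run(
        ["ffprobe", "-v", "error", "-print_format", "json",
         "-show_format", "-show_streams", str(path)],
        stdout=subprocess.PIPE, check=True,
    )
    return json.loads(result.stdout)


def collect_tags(report):
    """Return (location, key, value) for every metadata tag in a report."""
    found = []
    for key, value in report.get("format", {}).get("tags", {}).items():
        found.append(("container", key, value))
    for stream in report.get("streams", []):
        location = f"stream {stream.get('index')} ({stream.get('codec_type')})"
        for key, value in stream.get("tags", {}).items():
            found.append((location, key, value))
    return found


# Hand-written excerpt in the shape of an ffprobe JSON report (assumption).
sample_report = {
    "format": {"tags": {"creation_time": "2020-08-14T12:37:30.000000Z",
                        "title": "Anonymous patient YYYY-MM-DD ID MF"}},
    "streams": [{"index": 0, "codec_type": "video",
                 "tags": {"encoder": "VC Coding"}}],
}

for location, key, value in collect_tags(sample_report):
    print(f"{location}: {key} = {value}")
```

With FFmpeg installed, collect_tags(probe("test_video.mp4")) would list the tags of a real file.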

Some recorders can include even more data (especially those tightly integrated with the electronic medical record which could embed unique medical record numbers for each patient into the video). This metadata is a hidden and insidious privacy nightmare that we must remove when gathering surgical video to protect our patients. I provide some guidance on how to do so in the next section.

How to de-identify a video with FFmpeg

The gold standard software tool for manipulating videos is FFmpeg.^[4] It is a powerful piece of software and can efficiently strip excess metadata from your videos. Unfortunately, it is, to put it mildly, extraordinarily difficult to use (just look at the documentation page for starters), and it takes numerous search engine trips, Stack Overflow questions, and deep dives into obscure video recording internet forums to figure out the command line options that make your video editing wishes come true.

To strip metadata from a video with FFmpeg, the following command will work:

$ ffmpeg -i input_vid.mp4 -map 0:v -map 0:a? -c copy -map_metadata -1 \
        -map_metadata:s:v -1 -map_metadata:s:a -1 -map_chapters -1 \
        -disposition 0 stripped_metadata_vid.mp4

Here's the command broken down:

ffmpeg -i input_vid.mp4: calls ffmpeg to work on input_vid.mp4
-map 0:v: keep the video streams
-map 0:a?: also keep any audio streams; the trailing ? makes this optional, so the command does not fail when there is no audio
-c copy: do not transcode, just copy the streams (makes this process orders of magnitude faster)
-map_metadata -1: removes the global container metadata
-map_metadata:s:v -1: removes the video stream metadata
-map_metadata:s:a -1: removes the audio stream metadata (works even if there is no audio stream)
-map_chapters -1: removes any video chapter information
-disposition 0: removes any disposition metadata (minimal metadata, but why not remove it)
stripped_metadata_vid.mp4: name of the video to create, which is input_vid.mp4 without any unnecessary metadata
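If you would rather call this from Python than type it by hand, the same options can be passed to FFmpeg as an argument list. This is a minimal sketch: the function names are my own for illustration, and it assumes ffmpeg is on your PATH.

```python
import subprocess


def build_strip_command(src, dst):
    """Return the FFmpeg argument list that copies streams and drops metadata."""
    return [
        "ffmpeg", "-i", str(src),
        "-map", "0:v", "-map", "0:a?",   # video streams, plus audio if present
        "-c", "copy",                    # copy streams without transcoding
        "-map_metadata", "-1",           # drop container metadata
        "-map_metadata:s:v", "-1",       # drop video stream metadata
        "-map_metadata:s:a", "-1",       # drop audio stream metadata
        "-map_chapters", "-1",           # drop chapters
        "-disposition", "0",             # clear disposition flags
        str(dst),
    ]


def strip_metadata(src, dst):
    """Run FFmpeg; raises CalledProcessError if FFmpeg exits with an error."""
    subprocess.run(build_strip_command(src, dst), check=True)


print(" ".join(build_strip_command("input_vid.mp4", "stripped_metadata_vid.mp4")))
```

Passing the arguments as a list means no shell is involved, so the ? in 0:a? is never treated as a wildcard.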

Here is the output from a video (previously seen in the article) without metadata stripped:

$ ffprobe test_video.mp4
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from 'test_video.mp4':
  Metadata:
    major_brand     : mp42
    minor_version   : 0
    compatible_brands: isom
    creation_time   : 2020-08-14T12:37:30.000000Z
    title           : Anonymous patient YYYY-MM-DD ID MF
    comment         : Input DVI 1920x1080p59.94 Q=0
    artist          : Anonymous
    album           :
    composer        : MediCap USB300,SN 3014279,FW 140423 MediCapture Inc.
  Duration: 00:00:21.23, start: 0.000000, bitrate: 11542 kb/s
    Stream #0:0(eng): Video: h264 (Baseline) (avc1 / 0x31637661), yuvj420p(pc), 1920x1080 [SAR 1:1 DAR 16:9], 11582 kb/s, 29.97 fps, 29.97 tbr, 60k tbn, 120k tbc (default)
    Metadata:
      creation_time   : 2020-08-14T12:37:30.000000Z
      encoder         : VC Coding

And here it is after it's been stripped of metadata:

$ ffmpeg -i test_video.mp4 -map 0:v -map 0:a? -c copy -map_metadata -1 \
        -map_metadata:s:v -1 -map_metadata:s:a -1 -map_chapters -1 \
        -disposition 0 stripped_metadata_vid.mp4

$ ffprobe stripped_metadata_vid.mp4

Input #0, mov,mp4,m4a,3gp,3g2,mj2, from 'stripped_metadata_vid.mp4':
  Metadata:
    major_brand     : isom
    minor_version   : 512
    compatible_brands: isomiso2avc1mp41
    encoder         : Lavf58.45.100
  Duration: 00:24:20.66, start: 0.000000, bitrate: 11748 kb/s
    Stream #0:0(und): Video: h264 (Baseline) (avc1 / 0x31637661), yuvj420p(pc), 1920x1080 [SAR 1:1 DAR 16:9], 11747 kb/s, 29.97 fps, 29.97 tbr, 60k tbn, 120k tbc (default)
    Metadata:
      handler_name    : VideoHandler

All identifying metadata is now removed from the video, and your video is one step closer to maximal de-identification.
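One way to double-check a stripped file without reading the output by eye is to compare its tags against a short allow-list. The sketch below assumes a report dict in the shape produced by ffprobe -v error -print_format json -show_format -show_streams. The allow-list holds the tag names seen in the stripped output above; treating exactly those as harmless is my assumption, so adjust it to your own files.

```python
# Tag names that FFmpeg itself wrote into the stripped file shown above.
ALLOWED_TAGS = {"major_brand", "minor_version", "compatible_brands",
                "encoder", "handler_name"}


def leftover_tags(report, allowed=ALLOWED_TAGS):
    """Return sorted tag names in an ffprobe JSON report not in `allowed`."""
    tag_groups = [report.get("format", {}).get("tags", {})]
    tag_groups += [s.get("tags", {}) for s in report.get("streams", [])]
    names = {key.lower() for tags in tag_groups for key in tags}
    return sorted(names - allowed)


# Hand-written reports modeled on the before and after output (assumption).
before = {
    "format": {"tags": {"major_brand": "mp42", "title": "Anonymous patient",
                        "creation_time": "2020-08-14T12:37:30.000000Z",
                        "composer": "MediCap USB300"}},
    "streams": [{"tags": {"creation_time": "2020-08-14T12:37:30.000000Z",
                          "encoder": "VC Coding"}}],
}
after = {
    "format": {"tags": {"major_brand": "isom", "minor_version": "512",
                        "compatible_brands": "isomiso2avc1mp41",
                        "encoder": "Lavf58.45.100"}},
    "streams": [{"tags": {"handler_name": "VideoHandler"}}],
}

print(leftover_tags(before))  # ['composer', 'creation_time', 'title']
print(leftover_tags(after))   # []
```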

Streamlining de-identification

The above FFmpeg command works. However, it is complex, it is easy to miss an option and accidentally leave some metadata behind, and it is a pain to run on multiple videos. It also fails to remove the metadata stored in the name of the video. To solve these issues, I wrote a small Python program that leverages FFmpeg, called deidentify_videos.py. It is located here in my tmw-misc repository.

deidentify_videos.py takes a directory full of videos (including videos located in deeper directories), creates random new names for them (either a UUID4 or a sequential name like video001.mp4, video002.mp4, etc.), strips all metadata, then saves each video into a new directory. Below is a sample use:

deidentify_videos.py videos/ stripped_metadata_videos/

This will create a directory stripped_metadata_videos, randomize the names as UUID4s, and output a CSV so you can translate the original names to the randomized names. This CSV has the following structure:

original,randomized
videos/vid-20200813.mp4,stripped_metadata_videos/fbb4b9a25d13419a387b779ffa51c022.mp4
videos/vid-20200814.mp4,stripped_metadata_videos/fcd8b3afdd13419aadcaaee432551aeb.mp4
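For readers curious how such a mapping can be produced, here is a minimal Python sketch of the idea. It is not the code in deidentify_videos.py; the function names and the example paths are assumptions for illustration.

```python
import csv
import io
import uuid
from pathlib import Path


def plan_uuid_names(paths, dst_dir):
    """Pair each original path with a random UUID4-based path in dst_dir."""
    pairs = []
    for path in paths:
        path = Path(path)
        new_name = uuid.uuid4().hex + path.suffix.lower()
        pairs.append((path, Path(dst_dir) / new_name))
    return pairs


def write_mapping(pairs, handle):
    """Write the original-to-randomized mapping as CSV to an open file."""
    writer = csv.writer(handle)
    writer.writerow(["original", "randomized"])
    for original, randomized in pairs:
        writer.writerow([original.as_posix(), randomized.as_posix()])


# Hypothetical paths; real code might gather them with
# Path("videos").rglob("*.mp4") to include nested directories.
pairs = plan_uuid_names(
    ["videos/vid-20200813.mp4", "videos/vid-20200814.mp4"],
    "stripped_metadata_videos",
)
buffer = io.StringIO()
write_mapping(pairs, buffer)
print(buffer.getvalue())
```

Keep in mind that the mapping CSV links every new name back to its original, so it needs the same protection as the original videos.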

If you want to generate sequential names rather than UUIDs, add the command line switch -s to the program's invocation as follows:

deidentify_videos.py -s videos/ stripped_metadata_videos/

and the CSV will look like the following:

original,randomized
videos/vid-20200813.mp4,stripped_metadata_videos/video03.mp4
videos/vid-20200814.mp4,stripped_metadata_videos/video01.mp4
videos/vid-20200815.mp4,stripped_metadata_videos/video02.mp4

Note that the video numbers form a sequence (e.g. 1 to 3), but they are shuffled relative to the order of the videos in the original directory to prevent information leakage (otherwise, someone could tell, for example, that video01 was filmed before video02).
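The shuffling idea can be sketched in a few lines of Python. This is an illustration of the technique, not the code in deidentify_videos.py; the function name, the "video" prefix, and the two-digit minimum padding are assumptions chosen to match the example above.

```python
import random


def shuffled_sequential_names(originals, prefix="video", suffix=".mp4", rng=None):
    """Map each original name to a numbered name, assigned in random order."""
    if rng is None:
        rng = random.Random()
    numbers = list(range(1, len(originals) + 1))
    rng.shuffle(numbers)  # break the link between file order and number
    width = max(2, len(str(len(originals))))  # zero-pad to a common width
    return {
        original: f"{prefix}{number:0{width}d}{suffix}"
        for original, number in zip(originals, numbers)
    }


names = shuffled_sequential_names(
    ["vid-20200813.mp4", "vid-20200814.mp4", "vid-20200815.mp4"]
)
for original, new in names.items():
    print(original, "->", new)
```

Every number from 1 to the number of videos is used exactly once, but which video receives which number changes from run to run.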

Lastly, if you like the names of your videos and want to keep them while only stripping metadata, use the program with the -m flag as below:

deidentify_videos.py -m videos/ stripped_metadata_videos/

The CSV logging the filename changes will then look like the following:

original,randomized
videos/vid-20200813.mp4,stripped_metadata_videos/vid-20200813.mp4
videos/vid-20200814.mp4,stripped_metadata_videos/vid-20200814.mp4
videos/vid-20200815.mp4,stripped_metadata_videos/vid-20200815.mp4

Requirements for deidentify_videos.py

For the program to work, you will need the following:

Python 3.6 or higher
docopt Python package
FFmpeg installed
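A quick way to check these three requirements before running anything is sketched below. The function name is hypothetical, and looking for an executable named ffmpeg on your PATH is my assumption about how FFmpeg is installed on your system.

```python
import importlib.util
import shutil
import sys


def missing_requirements():
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if sys.version_info < (3, 6):
        problems.append("Python 3.6 or higher is required")
    if importlib.util.find_spec("docopt") is None:
        problems.append("the docopt package is not installed")
    if shutil.which("ffmpeg") is None:
        problems.append("ffmpeg was not found on PATH")
    return problems


for problem in missing_requirements():
    print("Missing:", problem)
```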

Comments, questions, input, concerns?

Please contact me with any questions or input on the article using any of the methods on my contact page.

References

[1] Sweeney L. Simple demographics often identify people uniquely. Health (San Francisco). 2000;671:1-34.
[2] Sweeney L. Only You, Your Doctor, and Many Others May Know. Technology Science. Published online September 29, 2015. Accessed September 12, 2019. /a/2015092903/
[3] Natarajan P, Frenzel JC, Smaltz DH. Demystifying Big Data and Machine Learning for Healthcare. CRC Press, Taylor & Francis Group; 2017.
[4] FFmpeg. Published September 12, 2019. Accessed September 12, 2019. https://www.ffmpeg.org/