Video sharing applications today lack the functionality for users to search videos by their content. As a solution, I developed a searchable video library that processes videos and returns exact matches to queries using machine learning techniques, including speech recognition, optical character recognition, and object detection.
Applications for video sharing and storage could enhance the user experience by letting users search for videos by their content, such as specific words or objects that appear in them. One of the most popular video sharing apps right now is TikTok, where users can save the videos they like to their profile but cannot search through those liked videos.
Because that functionality is missing, millions of users are forced to scroll through every video they have ever liked to find a single clip, over and over again. To address this problem, I create a library of TikTok videos and build a search engine that breaks each video down into several features and returns exact matches to any given query.
A sample of 140 videos is provided in the videos folder of this repository to demonstrate the end-to-end process I performed. These sample videos were originally saved from my personal user account; in addition, I downloaded two datasets of 1,000 videos each from Kaggle (found here and here). Altogether I analyzed over 2,000 videos for the project and uploaded them to Google Drive. You may download all the videos to explore the complete dataset.
A video is a complex data type that can be broken down in many different ways. Through feature engineering, I turned the raw videos into multiple data features, extracted using the following approaches:

Multimedia Data | Data Processing |
---|---|
Audio | moviepy, pydub, and speech_recognition |
Visual text | opencv-python, PIL, and pytesseract |
Visual objects | opencv-python and the YOLOv3 algorithm |

Using the above packages and models, the features are extracted as text, so I applied Natural Language Processing (NLP) to process the text and create a corpus of all the words to search through. Lastly, I built the search engine using BM25 and deployed the full app via Streamlit.
The first feature to extract from the videos is the audio. Audio processing is the fastest step in the whole feature engineering process.
To transcribe the speech, I create an instance of the `Recognizer` class from SpeechRecognition and call the method `recognize_google`, which uses the Google Speech Recognition API, inside the function I define as `transcribe_audio`.

def transcribe_audio(file_path):
    '''
    Converts video to audio and returns audio transcription.
    Parameters:
        file_path (str): file path of video to be transcribed.
    Returns:
        full_text (str): full text transcription of video's audio.
    '''
    # Write the audio file from the video using MoviePy
    # to convert the MP4 or MOV video format to a wav file
    transcribed_audio_file = './data/audio/transcribed_audio.wav'
    audioclip = AudioFileClip(file_path)
    audioclip.write_audiofile(transcribed_audio_file)
    try:
        sound = AudioSegment.from_file(file_path, 'mp3')
    except:
        sound = AudioSegment.from_file(file_path, format='mp4')
    # Split the wav file into chunks where there is silence
    # for 500 milliseconds or more using PyDub
    chunks = split_on_silence(sound, min_silence_len = 500,
                              silence_thresh = sound.dBFS-14, keep_silence=500)
    # Create a folder called audio_chunks to save the chunks of wav files
    folder_name = './data/audio/audio_chunks'
    if not os.path.isdir(folder_name):
        os.mkdir(folder_name)
    full_text = ''
    # Create an instance of the Recognizer class from SpeechRecognition
    r = sr.Recognizer()
    # Call the method that uses Google Speech Recognition API
    # to transcribe the audio and return a string of text
    for i, audio_chunk in enumerate(chunks, start=1):
        chunk_filename = os.path.join(folder_name, f'chunk{i}.wav')
        audio_chunk.export(chunk_filename, format='wav')
        with sr.AudioFile(chunk_filename) as source:
            audio_listened = r.record(source)
            try:
                text = r.recognize_google(audio_listened)
            except sr.UnknownValueError as e:
                print('Error:', str(e))
            else:
                text = f'{text.capitalize()}.'
                print(chunk_filename, ':', text)
                full_text += text
    return full_text
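For instance, the transcription can be run over the whole sample folder to collect the audio text of every clip. A minimal sketch, assuming the clips sit in the repository's videos folder (the folder name and dictionary layout here are illustrative):

```python
import os

# Transcribe every sample video and keep the results keyed by file path
audio_text = {}
for filename in sorted(os.listdir('./videos')):
    if filename.lower().endswith(('.mp4', '.mov')):
        file_path = os.path.join('./videos', filename)
        audio_text[file_path] = transcribe_audio(file_path)
```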
The other features to extract from the videos are the visual content. Extracting the visual text is a faster process than extracting the visual objects. To capture the video frames as images, I use `VideoCapture` from OpenCV in the function I define as `save_frames`. I then recognize the text in each frame with Tesseract's `image_to_string` method inside the function `extract_visual_text`; the text is returned as a string and processed later using NLP.

image_frames = './data/images/image_frames'
def save_frames(file_path):
    '''
    Creates image folder and saves video frames in the folder.
    Parameters:
        file_path (str): file path of video to be captured as images.
    Returns:
        image_frames folder where the video frames are stored.
    '''
    try:
        os.remove(image_frames)
    except OSError:
        pass
    # Create a folder called image_frames to save the images or frames of the video
    if not os.path.exists(image_frames):
        os.makedirs(image_frames)
    # Capture every 20th frame of the video using cv2 from OpenCV and save to folder
    src_vid = cv2.VideoCapture(file_path)
    index = 0
    while src_vid.isOpened():
        ret, frame = src_vid.read()
        if not ret:
            break
        name = './data/images/image_frames/frame' + str(index) + '.png'
        if index % 20 == 0:
            print('Extracting frames...' + name)
            cv2.imwrite(name, frame)
        index = index + 1
        if cv2.waitKey(10) & 0xFF == ord('q'):
            break
    src_vid.release()
    cv2.destroyAllWindows()

def extract_visual_text(file_path):
    '''
    Extracts visual text from images saved of video frames.
    Parameters:
        file_path (str): file path of video from which to extract the visual text.
    Returns:
        full_text (str): text as seen in the video taken from every 20th frame.
    '''
    save_frames(file_path)
    print('Folder created.')
    text_list = []
    # Sort the frames in the folder with sorted_alphanumeric for correct ordering
    image_list = sorted_alphanumeric(os.listdir(image_frames))
    # Open each image frame using PIL, and pass as argument in a function that uses
    # Google Tesseract OCR to recognize text in the image
    for i in image_list:
        print(str(i))
        single_frame = Image.open(image_frames + '/' + i)
        text = pytesseract.image_to_string(single_frame, lang='eng')
        text_list.append(text)
    # Remove the new line character `\n` and the word TikTok
    # from the strings of text returned and joined together
    full_text = ' '.join([i for i in text_list])
    full_text = full_text.replace('\n', '').replace('\x0c', '').replace('TikTok', '')
    # Remove the folder to erase the image frames of the video
    shutil.rmtree('./data/images/image_frames/')
    print('Folder removed.')
    return full_text
To extract the username from the visual text, I define the function `extract_username`, which executes the following steps: it lists every word that contains an @ sign, keeps the part of each word after the @, and passes the resulting candidates to the helper function `most_frequent`. The most frequent word is most likely the username, because TikTok automatically displays the username for the full duration of the video, that is, in every single video frame.
def most_frequent(username_list):
    '''Takes in a list of strings and returns the most frequent word in the list or none.'''
    most_frequent = max(set(username_list), key = username_list.count)
    if most_frequent == '':
        return np.nan
    else:
        return most_frequent

def extract_username(visual_text):
    '''
    Lists possible usernames from visual text and returns the most frequent one, which is most likely the username.
    Parameters:
        visual_text (str): full visual text extracted from video.
    Returns:
        username (str): most frequent word that starts with @ sign; if none, returns none.
    '''
    visual_text = ''.join([i for i in visual_text.lower() if not i.isdigit()])
    text = ' '.join(visual_text.split())
    text_list = [word for word in text.lower().split()]
    username_list = []
    for word in text_list:
        if re.search(r'[@]', word):
            username_list.extend([word.rsplit('@')[-1]])
    if username_list == []:
        return np.nan
    else:
        username_list = ' '.join([username for username in username_list])
        username_list = [username for username in username_list.strip().split()]
        try:
            return most_frequent(username_list)
        except:
            return ' '.join(username_list)
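A quick illustration with made-up visual text (the handle below is hypothetical):

```python
# The handle appears in most frames, so it wins the frequency count
sample_text = 'Best pasta in NYC @pastalover follow @pastalover for more @pastalover'
print(extract_username(sample_text))  # pastalover
```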
The final feature to extract is the set of objects in the videos, which is accomplished with YOLO, a state-of-the-art object detection system based on a deep learning algorithm.
Deep Learning
YOLO applies a convolutional neural network whose architecture is shown in the original paper (image source: https://arxiv.org/pdf/1506.02640.pdf).
Compared to prior object detection systems, YOLO uses a completely different approach: the network looks at the image only once, hence the name You Only Look Once.
Transfer Learning
The weights from the pre-trained YOLO network model are reused on our data without retraining. To load the network model, download the weight and configuration files from Darknet. The configuration file describes the layout of the network, block by block.
“YOLOv3 is extremely fast and accurate. In mAP measured at .5 IOU YOLOv3 is on par with Focal Loss but about 4x faster. Moreover, you can easily tradeoff between speed and accuracy simply by changing the size of the model, no retraining required!” (Darknet)
I choose `YOLOv3-spp` for accuracy, since it is the model with the highest mean average precision (60.6 mAP) on the COCO dataset; for speed, you may instead try `YOLOv3-tiny`, the model with the highest frame rate at 220 FPS.
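If you do not have the files locally yet, they can be fetched with a short script. A minimal sketch, assuming the usual Darknet download locations for the YOLOv3-spp files and the COCO names (verify the URLs on the Darknet site before running):

```python
import os
import urllib.request

# Assumed download locations for the configuration, weights, and class names
files = {
    './data/yolo/yolov3-spp.cfg': 'https://raw.githubusercontent.com/pjreddie/darknet/master/cfg/yolov3-spp.cfg',
    './data/yolo/yolov3-spp.weights': 'https://pjreddie.com/media/files/yolov3-spp.weights',
    './data/yolo/coco.names': 'https://raw.githubusercontent.com/pjreddie/darknet/master/data/coco.names',
}
os.makedirs('./data/yolo', exist_ok=True)
for path, url in files.items():
    if not os.path.exists(path):
        urllib.request.urlretrieve(url, path)
```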
# Load the network model into OpenCV using the configuration and weight files
net = cv2.dnn.readNetFromDarknet('./data/yolo/yolov3-spp.cfg', './data/yolo/yolov3-spp.weights')
This YOLO neural network consists of 263 parts, such as convolutional layers (`conv`), batch normalization layers (`bn`), and so on. Printing them…
ln = net.getLayerNames()
print(len(ln), ln)
…returns the following:
263 ('conv_0', 'bn_0', 'leaky_1', 'conv_1', 'bn_1', 'leaky_2', 'conv_2', 'bn_2', 'leaky_3', 'conv_3', 'bn_3', 'leaky_4', 'shortcut_4', 'conv_5', 'bn_5', 'leaky_6', 'conv_6', 'bn_6', 'leaky_7', 'conv_7', 'bn_7', 'leaky_8', 'shortcut_8', 'conv_9', 'bn_9', 'leaky_10', ...)
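Only a handful of these 263 layers are the actual YOLO output layers; OpenCV can list them directly, and `detect_object` below passes them to the forward pass:

```python
# The unconnected output layers are the detection layers whose outputs we need
output_layers_names = net.getUnconnectedOutLayersNames()
print(output_layers_names)
```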
COCO
To get the labels of the model trained on the COCO dataset, download the COCO names file, which contains the names of all the classes; the model can detect a total of 80 objects. The names file, with the full list of object classes, is available in the data folder of the repository, along with the weights and configuration files.
I detect the objects in the video frames in the function I define as `detect_object`.

def detect_object(file_path):
    '''
    Uses YOLO algorithm to detect objects in video frames.
    Parameters:
        file_path (str): file path of video from which to detect objects in the frames.
    Returns:
        object_set (list): list of unique objects detected in the video.
    '''
    classes = []
    with open('./data/yolo/coco.names', 'r') as f:
        classes = f.read().splitlines()
    try:
        cap = cv2.VideoCapture(file_path)
        count = 0
        object_list = []
        while cap.isOpened():
            ret, img = cap.read()
            if not ret:
                break
            if ret:
                cv2.imwrite('frame{:d}.jpg'.format(count), img)
                count += 50
                cap.set(cv2.CAP_PROP_POS_FRAMES, count)
                height, width, _ = img.shape
                blob = cv2.dnn.blobFromImage(img, 1/255, (416, 416), (0,0,0), swapRB=True, crop=False)
                net.setInput(blob)
                output_layers_names = net.getUnconnectedOutLayersNames()
                layerOutputs = net.forward(output_layers_names)
                boxes = []
                confidences = []
                class_ids = []
                for output in layerOutputs:
                    for detection in output:
                        scores = detection[5:]
                        class_id = np.argmax(scores)
                        confidence = scores[class_id]
                        if confidence > 0.5:
                            center_x = int(detection[0]*width)
                            center_y = int(detection[1]*height)
                            w = int(detection[2]*width)
                            h = int(detection[3]*height)
                            x = int(center_x - w/2)
                            y = int(center_y - h/2)
                            boxes.append([x, y, w, h])
                            confidences.append(float(confidence))
                            class_ids.append(class_id)
                print(len(boxes))
                indexes = cv2.dnn.NMSBoxes(boxes, confidences, 0.5, 0.4)
                if len(indexes) > 0:
                    print(indexes.flatten())
                    for i in indexes.flatten():
                        label = str(classes[class_ids[i]])
                        object_list.append(label)
            else:
                cap.release()
                cv2.destroyAllWindows()
                break
        cap.release()
        cv2.destroyAllWindows()
        object_set = list(set(object_list))
        print('Done detecting object in this video.')
        print(f'These are the objects detected: {object_set}')
        return object_set
    except:
        print(f'{file_path} did not work.')
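With the three extraction functions defined, the features can be assembled into the table that the text processing below works on. A minimal sketch, assuming the repository's videos folder and the column names used later in this post (the username column is my own addition here):

```python
import os

import pandas as pd

# Run the full feature extraction over the sample videos and collect one row per clip
rows = []
for filename in sorted(os.listdir('./videos')):
    if filename.lower().endswith(('.mp4', '.mov')):
        file_path = os.path.join('./videos', filename)
        visual_text = extract_visual_text(file_path)
        rows.append({
            'file_path': file_path,
            'audio_text': transcribe_audio(file_path),
            'visual_text': visual_text,
            'username': extract_username(visual_text),
            'object_list': detect_object(file_path),
        })
data = pd.DataFrame(rows)
```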
To process the text features, I utilize the Natural Language Toolkit (`nltk`) library for standardization: making the letters lowercase, removing punctuation marks and stopwords, tokenization, and lemmatization. Likewise, I utilize `wordsegment` to segment strings of words that have no spaces between them.
# Convert speech transcribed to lowercase and remove full stop
data['standardized_audio_text'] = data['audio_text'].apply(lambda x: x.lower().replace('.', ''))
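The stopword removal, tokenization, and lemmatization steps mentioned above are not shown in this snippet. A minimal sketch of how that standardization might look with nltk (the helper name `standardize_text` is illustrative; the standardized features are presumably combined later into the preprocessed_text column that the search index is built on):

```python
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Download the resources nltk needs for tokenization, stopwords, and lemmatization
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def standardize_text(text):
    '''Lowercases, strips punctuation, removes stopwords, and lemmatizes a string.'''
    text = text.lower().translate(str.maketrans('', '', string.punctuation))
    tokens = word_tokenize(text)
    return ' '.join(lemmatizer.lemmatize(token) for token in tokens if token not in stop_words)
```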
def process_visual_text(text):
    '''Processes string of text by removing punctuation marks and segmenting words.'''
    text = text.lower()
    text = re.sub(r'([^A-Za-z0-9|\s|[:punct:]]*)', '', text)
    text = text.replace('|', '').replace(':', '')
    text = wordsegment.segment(text)
    text = ' '.join([i for i in text if i in words])
    return text
# Apply the function and add a column to the table for the processed visual text
data['processed_visual_text'] = data['visual_text'].apply(process_visual_text)
def segment_text(text):
    '''Segments strings of words without spaces between them.'''
    text = wordsegment.segment(text)
    text = ' '.join([i for i in text])
    return text
# Convert list to string and add a column to the table for the processed objects
data['object_text'] = data['object_list'].apply(lambda x: ' '.join([word for word in x]))
data['object_text'] = data['object_text'].apply(segment_text)
For instance:
Extracted Text | Processed Text |
---|---|
XeThis NY restaurant will “make you feel <like you’re inItaly! { | This NY restaurant will make you feel like you’re in Italy |
\:@ Unique and diverse Italian \|menu! %Private & romantic dininga | Unique and diverse Italian menu Private & romantic dining |
Open 7 days a week withbrunch options on {Saturdays & Sundays!— “ Make your reservationasap! =ro mm tT | Open 7 days a week with brunch options on Saturdays & Sundays Make your reservations asap |
To create the search engine, I implement the Okapi BM25 algorithm from the package `rank-bm25`:
“In information retrieval, Okapi BM25 (BM is an abbreviation of best matching) is a ranking function used by search engines to estimate the relevance of documents to a given search query” (Wikipedia)
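For reference, the standard BM25 scoring function ranks a document $D$ against a query $Q = q_1, \dots, q_n$ as

$$\text{score}(D, Q) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1\left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)}$$

where $f(q_i, D)$ is the frequency of term $q_i$ in document $D$, $|D|$ is the document length, avgdl is the average document length in the corpus, and $k_1$ and $b$ are free parameters.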
nlp = spacy.load('en_core_web_sm')
tokenized_corpus = []
# Tokenize processed text using spaCy
for doc in tqdm(nlp.pipe(data['preprocessed_text'].fillna('').str.lower().values, disable=['tagger', 'parser', 'ner'])):
    tokenized = [token.text for token in doc if token.is_alpha]
    tokenized_corpus.append(tokenized)
# Instantiate class and read corpus
bm25 = BM25Okapi(tokenized_corpus)
def search_video(query, result=3, n=1):
    '''
    Returns matching video to the search query.
    Parameters:
        query (str): word or phrases to search.
        result (int): the number of top results to retrieve.
        n (int): the nth result to display.
    Returns:
        video to display from the list of results.
    '''
    tokenized_query = query.lower().split(' ')
    results = bm25.get_top_n(tokenized_query, data['file_path'], result)
    results_list = [video for video in results]
    video = Video(results_list[n-1], width=300)
    print(results_list[n-1])
    return video
search_video('italian restaurant ny umbrella')
search_video('chinatown dumplings')
Finally, I build the web app: the Machine Learning-Powered Video Library.
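A minimal sketch of what the Streamlit front end might look like; the deployed app in the repository may differ, and the `data` table and `bm25` index are assumed to be loaded as above:

```python
import streamlit as st

st.title('Machine Learning-Powered Video Library')
query = st.text_input('Search the video library', 'italian restaurant ny')

if query:
    tokenized_query = query.lower().split(' ')
    # Retrieve the top three matching file paths using the BM25 index built earlier
    results = bm25.get_top_n(tokenized_query, data['file_path'], n=3)
    for file_path in results:
        st.write(file_path)
        with open(file_path, 'rb') as video_file:
            st.video(video_file.read())
```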
Source Code: GitHub Repository
Feel free to contact me with any questions, and connect with me on LinkedIn.