Monday, June 26, 2023

Create your own private offline AI Q&A chatbot trained on Veeam data: Part 1

Hi all! In two blog posts I'll share my journey of building a self-hosted Large Language Model chatbot, trained without the need for fancy high-end CPUs/GPUs and running completely offline, without dependencies on public or paid services.

This first part is focused on data gathering. As we all know, rubbish in is rubbish out: training your model with correct and valuable data is what gives you good results in the end.

I'll share a method to get valuable information out of public videos and use it to train your model.

In the second post I'll talk about actually building the chatbot and training it on your own data sources. Our chatbot can digest a lot of file types, so we can create as rich a dataset as possible.

The supported extensions are:

  • .csv: CSV
  • .docx: Word Document
  • .doc: Word Document
  • .enex: EverNote
  • .eml: Email
  • .epub: EPub
  • .html: HTML File
  • .md: Markdown
  • .msg: Outlook Message
  • .odt: Open Document Text
  • .pdf: Portable Document Format (PDF)
  • .pptx: PowerPoint Document
  • .ppt: PowerPoint Document
  • .txt: Text file (UTF-8)

The goals:

  • Building a chatbot without the need for internet, for internal company use
  • No data is exposed externally on public services (keep your data private)
  • Train the bot on your own set of data in different formats

Where do I get my training data?

I had already been thinking about doing this for some weeks when Ben Young posted his VannyGPT 1.0 on the Veeam community website.

Ben has done a terrific job in using the Veeam KB articles as a datasource.

Since there is no ready-made clean dataset available to train your model on, this is already a good starting point for gathering the necessary data.

But ... I wanted to take it to the next level. 

There are a lot of good videos available on YouTube in several channels and playlists.

Some examples on the official Veeam channel:

  • The Tech Bites Livestreams give valuable information
  • Veeam How To Series, the technical LinkedIn Live sessions with some deep dives.
  • The Veeam How-To Video Series is designed to help users better understand Veeam products and how to use them most effectively.
  • How-To Series: MSP backup

Wouldn't it be nice to get all this content included in our model, to train it with more and topic-specific Veeam knowledge?

Well, that's possible!

I'm not a programmer, just a script kiddie, so my code is far from perfect. But with the help of the public ChatGPT's Python assistance, I was able to construct some Python scripts that grab all this content into text files to build a valid training set.

What are our ingredients?

The youtube-transcript-api project, which makes it easy to get transcripts from YouTube videos (both auto-generated and regular transcripts).

https://pypi.org/project/youtube-transcript-api/

The Google API Discovery Service

We'll need this to fetch all the videos that are part of a certain playlist, based on the playlist ID.

To be able to use this Google API you need a developer key, which you can get for free in your Google account.

How to create this key can be found at:

https://cloud.google.com/docs/authentication/api-keys
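
Both libraries are available on PyPI as youtube-transcript-api and google-api-python-client (plus requests for the raw HTTP calls later on). Once they're installed and you have your key, building the YouTube API client is a one-liner; a minimal sketch, with a placeholder key:

  import googleapiclient.discovery

  # Build a YouTube Data API v3 client with your developer key (placeholder value)
  api_key = 'InsertYourAPIKeyHere'
  youtube = googleapiclient.discovery.build('youtube', 'v3', developerKey=api_key)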

First of all, I just played around with a single video to start with. As you probably know, every video has its own YouTube video ID (the part after v= in the URL), so for this step I didn't need a Google API key yet.

  from youtube_transcript_api import YouTubeTranscriptApi

  srt = YouTubeTranscriptApi.get_transcript("em1M98GiQ0c", languages=['en'])

When using the youtube-transcript-api library, you'll see that you get a ton of information on the transcripts of videos: not only the text, but also the starting timestamp of each line and its duration.

A point of attention is that this library uses an undocumented part of the YouTube API, which is called by the YouTube web client. So there is no guarantee that it won't stop working tomorrow if YouTube changes how things work.

So the raw output looks like:

 [{'text': 'okay welcome back to the next upgrade', 'start': 0.78, 'duration': 4.2},
  {'text': 'Center video', 'start': 3.6, 'duration': 4.8},
  {'text': 'so the first two that are out there', 'start': 4.98, 'duration': 6.6},
  {'text': "was Kirsten's upgrading V1 and then my", 'start': 8.4, 'duration': 5.819},
  {'text': "Enterprise Manager and it's important to", 'start': 11.58, 'duration': 5.279},
  {'text': 'do these in the right sequence here so', 'start': 14.219, 'duration': 4.381},
  {'text': 'start with those', 'start': 16.859, 'duration': 4.201},
  {'text': "and what we're going to do now is go", 'start': 18.6, 'duration': 5.519},
  {'text': 'into a BNR server upgrade and then the', 'start': 21.06, 'duration': 6.539},
  {'text': 'components and agents so um to get your', 'start': 24.119, 'duration': 6.301},
  {'text': 'ISO your download installable when you', 'start': 27.599, 'duration': 3.721},
  {'text': 'log in', 'start': 30.42, 'duration': 3.659},
  {'text': 'just click on downloads up here and then', 'start': 31.32, 'duration': 5.28},
  {'text': "you'll go down and um", 'start': 34.079, 'duration': 5.281},
  {'text': "I've been getting the advanced ...

Playing around with some regex allows us to filter out the exact part of the text we need for further processing.

 pattern = r"'text': '(.*?)', 'start'"
 for i in srt:
     # Extract the transcript text from the string representation of each entry
     match = re.findall(pattern, str(i))
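
As a side note: each element the library returns is already a Python dictionary, so instead of the regex you could also read the text field directly; a minimal sketch of that alternative:

 # Alternative to the regex: read the 'text' key of each transcript entry directly
 lines = [entry['text'] for entry in srt]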

The “de-um-ifyer”

But then, when checking the output of several videos, I noticed something about the subtitles created by YouTube:


When people are giving demos or just talking over slides during presentations, a lot of stop words (uh, um, yeah and the like) end up in the transcript and pollute our source data.

A large language model (LLM) gives a certain weight to words and their relations. So if we leave it like this, these words will be quite frequent in our learning data and our chatbot will start using them too.

So I introduced a small check, let’s call it the “de-um-ifyer”

 if match:
     extracted_text = match[0]
     words_to_remove = ['yeah', 'uh', 'um', '[Music]']
     modified_text = extracted_text
     for word in words_to_remove:
         modified_text = modified_text.replace(word, '')


For each line of text that the transcript API spits out, we check it against a predefined set of words we don't want to include in our dataset. YouTube typically also indicates when music is playing by inserting the text [Music] into the generated transcript. We don't need that in our model either, so we filter it out.

So when we bring this together, we can already get a good transcript of every YouTube video we can imagine.
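
To make that concrete, here is a minimal sketch that glues the pieces above together for a single video (the video ID is the one from the example and the output filename is just a placeholder):

 import re
 from youtube_transcript_api import YouTubeTranscriptApi

 # Fetch the English transcript of a single video (example video ID from above)
 srt = YouTubeTranscriptApi.get_transcript("em1M98GiQ0c", languages=['en'])

 pattern = r"'text': '(.*?)', 'start'"
 words_to_remove = ['yeah', 'uh', 'um', '[Music]']

 # Write the cleaned-up transcript line by line to a text file (placeholder filename)
 with open("single_video_transcript.txt", "w", encoding='utf8') as f:
     for i in srt:
         match = re.findall(pattern, str(i))
         if match:
             modified_text = match[0]
             for word in words_to_remove:
                 modified_text = modified_text.replace(word, '')
             f.write(modified_text + '\n')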

Scaling up

Time to make it more scalable now. Just entering the video IDs by hand is a painful job. We could collect them in a file, cycle through it and get the transcripts, but most of the videos I want to use to train my model are organised in playlists. So wouldn't there be an easier solution where we just give the playlist ID?

Just fetch the IDs of all the underlying videos in that playlist and get their transcripts.

This is possible! For that we need the Google API key I mentioned above.

Using this key, we can call the YouTube API via the googleapis.com endpoint and get back all the items in a playlist.

Within the returned JSON you'll find the videoId, which we need to download the transcripts of the individual videos.

The call I've used returns the results in several pages if there are more than 50 videos in one playlist, so we have to cycle through the pages.
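
For reference, the parsed response (data = response.json()) is a nested dictionary; trimmed down and with placeholder values, the parts we care about look roughly like this:

 # Trimmed sketch of a playlistItems response (values are placeholders)
 data = {
     'nextPageToken': 'EAAaBlBUOkNBVQ',  # only present when there are more pages
     'items': [
         {'contentDetails': {'videoId': 'em1M98GiQ0c',
                             'videoPublishedAt': '2023-01-01T00:00:00Z'}},
     ]
 }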

# Function to retrieve all video IDs from a playlist
def get_video_ids(playlist_id):
    video_ids = []
    next_page_token = None
    while True:
        # Construct the request URL
        url = f'https://www.googleapis.com/youtube/v3/playlistItems?part=contentDetails&maxResults=50&playlistId={playlist_id}&key={api_key}'
        if next_page_token:
            url += f'&pageToken={next_page_token}'

        # Send the GET request to the YouTube Data API
        response = requests.get(url)
        data = response.json()

        # Extract video IDs from the response
        for item in data['items']:
            video_ids.append(item['contentDetails']['videoId'])

        # Check if there are more pages
        next_page_token = data.get('nextPageToken')
        if not next_page_token:
            break

    return video_ids
 

Get the video titles and use them as filenames

Maintaining your source training data can be hard when you have a lot of documents, so a unique name per file is a good start. Therefore I fetch the video title and use it as the filename when writing the transcript to a text file.

The function looks like this:

 def get_video_title(video_id):
    # Request the video resource
    video_response = youtube.videos().list(
        part='snippet',
        id=video_id
    ).execute()
    # Extract the title from the response
    video_title = video_response['items'][0]['snippet']['title']
    return video_title
  

Taping it all together:

Time to glue everything together in one Python script. This script will fetch all video IDs of a playlist, get the English transcript of each video, remove the stop words and save it as a UTF-8 text file.

 import requests
 from youtube_transcript_api import YouTubeTranscriptApi
 import re
 import googleapiclient.discovery
 import os
 # Set your API key
 api_key = 'InsertYourAPIKeyHere'

 # Playlist ID taken from the YouTube playlist URL
 playlist_id = 'PL0afnnnx_OVCW8nmECmiR3z34beBe2l0-'
 youtube = googleapiclient.discovery.build('youtube', 'v3', developerKey=api_key)

 # Function to retrieve all video IDs from a playlist
 def get_video_ids(playlist_id):
     video_ids = []
     next_page_token = None

     while True:
         # Construct the request URL
         url = f'https://www.googleapis.com/youtube/v3/playlistItems?part=contentDetails&maxResults=50&playlistId={playlist_id}&key={api_key}'
         if next_page_token:
             url += f'&pageToken={next_page_token}'

         # Send the GET request to the YouTube Data API
         response = requests.get(url)
         data = response.json()

         # Extract video IDs from the response
         for item in data['items']:
             video_ids.append(item['contentDetails']['videoId'])

         # Check if there are more pages
         next_page_token = data.get('nextPageToken')
         if not next_page_token:
             break

     return video_ids
 def get_video_title(video_id):
     # Request the video resource
     video_response = youtube.videos().list(
         part='snippet',
         id=video_id
     ).execute()

     # Extract the title from the response
     video_title = video_response['items'][0]['snippet']['title']

     return video_title

 # Call the function to get the video IDs
 video_ids = get_video_ids(playlist_id)

 # Loop through the video IDs and write each transcript to a text file
 for video_id in video_ids:
     try:
         srt = YouTubeTranscriptApi.get_transcript(video_id, languages=['en'])
     except Exception:
         print(f"{video_id} doesn't have a transcript")  # Skip videos with no transcript available
         continue
     title = get_video_title(video_id)
     title = str(title.replace('/', '-'))  # Replace slashes so the title is a valid filename
     pattern = r"'text': '(.*?)', 'start'"

     with open(title+".txt", "w",encoding='utf8') as f:   
   
         # iterating through each element of list srt
         for i in srt:
             # writing each element of srt on a new line
             match = re.findall(pattern, str(i))

             if match:
                 extracted_text = match[0]
                 words_to_remove = ['yeah', 'uh', 'um','[Music]']
                 modified_text = extracted_text

                 for word in words_to_remove:
                     modified_text = modified_text.replace(word, '')
                 f.write(modified_text+'\n')

What's next?

These files will be part of the many data sources we're going to use to train our model, which I'll explain in my next post. The model is completely standalone and doesn't need any internet service or paid account to get it working.


This means that all the data is kept private, and it allows you to use this model to work with internal documents without fearing that data leaks to the outside world.