In machine learning, data is represented by vectors. Essentially, training a learning algorithm means finding more descriptive representations of that data through a series of transformations.
Linear algebra is the study of vector spaces and their transformations.
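The idea above can be sketched in plain Python: a data point is a vector, and a transformation is a matrix applied to it. The matrix and values here are purely illustrative.

```python
def transform(matrix, vector):
    """Apply a linear transformation (a matrix) to a vector."""
    return [sum(row[i] * vector[i] for i in range(len(vector)))
            for row in matrix]

# A data point represented as a 2-D vector.
point = [1.0, 2.0]

# A simple linear map: scales x by 2 and y by 3.
scale = [[2.0, 0.0],
         [0.0, 3.0]]

print(transform(scale, point))  # → [2.0, 6.0]
```

A learned model chains many such transformations (usually interleaved with nonlinearities), each producing a more useful representation than the last.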
GPT-4o is slower than Flash, more expensive, chattier, and very stubborn (it doesn't like to stick to my prompts).
Next week, I'll post a step-by-step video on how to build this.
The first request takes longer (warming up), but subsequent requests are faster.
A few opportunities to improve this:
1. Stream answers from the model (instead of waiting for the full answer).
2. Add the ability to interrupt the assistant.
3. Run Whisper on a GPU.
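Item 1 can be sketched generically: instead of waiting for the complete answer, consume chunks as they arrive and hand each one off immediately. The `fake_stream` generator below is a stand-in for the model's streamed response (with the OpenAI SDK, passing `stream=True` gives you a similar iterator of chunks).

```python
import time

def fake_stream(text, delay=0.0):
    """Stand-in for a model's streamed response: yields one token at a time."""
    for token in text.split():
        time.sleep(delay)  # simulate network latency between chunks
        yield token + " "

def consume_as_it_arrives(stream):
    """Process chunks as they arrive instead of waiting for the full answer."""
    answer = []
    for chunk in stream:
        print(chunk, end="", flush=True)  # in the assistant, hand off to TTS here
        answer.append(chunk)
    print()
    return "".join(answer)

consume_as_it_arrives(fake_stream("Streaming cuts perceived latency"))
```

The win is perceived latency: the assistant can start speaking after the first chunk rather than after the whole answer is generated.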
Unfortunately, no local model supports text+images (as far as I know), so I'm stuck running online models.
The TTS API (synthesizing text to audio) can also be replaced by a local version. I tried, but the available voices suck (too robotic), so I kept OpenAI's.