WEBVTT
Kind: captions
Language: en

00:00:00.000 --> 00:00:01.630 align:start position:0%
 
If<00:00:00.200><c> you</c><00:00:00.280><c> grew</c><00:00:00.440><c> up</c><00:00:00.680><c> watching</c><00:00:01.160><c> Code</c><00:00:01.360><c> Bullet</c>

00:00:01.630 --> 00:00:01.640 align:start position:0%
If you grew up watching Code Bullet
 

00:00:01.640 --> 00:00:03.550 align:start position:0%
If you grew up watching Code Bullet
just<00:00:01.840><c> like</c><00:00:02.040><c> me,</c><00:00:02.240><c> then</c><00:00:02.480><c> we</c><00:00:02.680><c> most</c><00:00:02.920><c> likely</c><00:00:03.280><c> at</c><00:00:03.360><c> one</c>

00:00:03.550 --> 00:00:03.560 align:start position:0%
just like me, then we most likely at one
 

00:00:03.560 --> 00:00:05.150 align:start position:0%
just like me, then we most likely at one
point<00:00:03.840><c> thought</c><00:00:04.120><c> the</c><00:00:04.280><c> AI</c><00:00:04.520><c> in</c><00:00:04.600><c> the</c><00:00:04.680><c> future</c><00:00:05.040><c> is</c>

00:00:05.150 --> 00:00:05.160 align:start position:0%
point thought the AI in the future is
 

00:00:05.160 --> 00:00:06.630 align:start position:0%
point thought the AI in the future is
probably<00:00:05.440><c> going</c><00:00:05.640><c> to</c><00:00:05.720><c> be</c><00:00:05.840><c> trained</c><00:00:06.200><c> based</c><00:00:06.480><c> on</c>

00:00:06.630 --> 00:00:06.640 align:start position:0%
probably going to be trained based on
 

00:00:06.640 --> 00:00:09.030 align:start position:0%
probably going to be trained based on
some<00:00:06.840><c> sort</c><00:00:07.200><c> of</c><00:00:07.400><c> genetic</c><00:00:07.960><c> algorithms</c><00:00:08.520><c> or</c><00:00:08.720><c> even</c>

00:00:09.030 --> 00:00:09.040 align:start position:0%
some sort of genetic algorithms or even
 

00:00:09.040 --> 00:00:10.750 align:start position:0%
some sort of genetic algorithms or even
evolution<00:00:09.560><c> strategies.</c><00:00:10.280><c> Because</c><00:00:10.480><c> most</c><00:00:10.680><c> of</c>

00:00:10.750 --> 00:00:10.760 align:start position:0%
evolution strategies. Because most of
 

00:00:10.760 --> 00:00:12.590 align:start position:0%
evolution strategies. Because most of
the<00:00:10.840><c> game</c><00:00:11.120><c> related</c><00:00:11.560><c> AIs</c><00:00:11.920><c> back</c><00:00:12.120><c> in</c><00:00:12.200><c> the</c><00:00:12.280><c> days</c>

00:00:12.590 --> 00:00:12.600 align:start position:0%
the game related AIs back in the days
 

00:00:12.600 --> 00:00:14.390 align:start position:0%
the game related AIs back in the days
were<00:00:12.720><c> made</c><00:00:13.040><c> through</c><00:00:13.280><c> this</c><00:00:13.600><c> simple</c><00:00:14.160><c> yet</c>

00:00:14.390 --> 00:00:14.400 align:start position:0%
were made through this simple yet
 

00:00:14.400 --> 00:00:16.150 align:start position:0%
were made through this simple yet
powerful<00:00:14.800><c> idea</c><00:00:15.320><c> that</c><00:00:15.680><c> it</c><00:00:15.760><c> feels</c><00:00:16.000><c> like</c>

00:00:16.150 --> 00:00:16.160 align:start position:0%
powerful idea that it feels like
 

00:00:16.160 --> 00:00:18.190 align:start position:0%
powerful idea that it feels like
anything<00:00:16.480><c> can</c><00:00:16.800><c> be</c><00:00:16.960><c> trained</c><00:00:17.320><c> with</c><00:00:17.480><c> evolution.</c>

00:00:18.190 --> 00:00:18.200 align:start position:0%
anything can be trained with evolution.
 

00:00:18.200 --> 00:00:19.910 align:start position:0%
anything can be trained with evolution.
Not<00:00:18.400><c> to</c><00:00:18.480><c> mention</c><00:00:18.920><c> we</c><00:00:19.120><c> humans</c><00:00:19.480><c> became</c><00:00:19.760><c> this</c>

00:00:19.910 --> 00:00:19.920 align:start position:0%
Not to mention we humans became this
 

00:00:19.920 --> 00:00:21.510 align:start position:0%
Not to mention we humans became this
intelligent<00:00:20.560><c> thanks</c><00:00:20.800><c> to</c><00:00:20.920><c> this</c><00:00:21.080><c> natural</c>

00:00:21.510 --> 00:00:21.520 align:start position:0%
intelligent thanks to this natural
 

00:00:21.520 --> 00:00:23.590 align:start position:0%
intelligent thanks to this natural
phenomenon.<00:00:22.280><c> So,</c><00:00:22.560><c> I</c><00:00:22.720><c> guess</c><00:00:23.000><c> it's</c><00:00:23.200><c> not</c><00:00:23.440><c> too</c>

00:00:23.590 --> 00:00:23.600 align:start position:0%
phenomenon. So, I guess it's not too
 

00:00:23.600 --> 00:00:25.470 align:start position:0%
phenomenon. So, I guess it's not too
crazy<00:00:24.080><c> betting</c><00:00:24.400><c> on</c><00:00:24.600><c> evolution</c><00:00:25.040><c> strategies</c>

00:00:25.470 --> 00:00:25.480 align:start position:0%
crazy betting on evolution strategies
 

00:00:25.480 --> 00:00:27.550 align:start position:0%
crazy betting on evolution strategies
being<00:00:25.680><c> the</c><00:00:25.800><c> one</c><00:00:26.040><c> that</c><00:00:26.120><c> will</c><00:00:26.200><c> bring</c><00:00:26.440><c> us</c><00:00:26.600><c> to</c><00:00:26.800><c> AGI</c>

00:00:27.550 --> 00:00:27.560 align:start position:0%
being the one that will bring us to AGI
 

00:00:27.560 --> 00:00:29.750 align:start position:0%
being the one that will bring us to AGI
10<00:00:27.840><c> years</c><00:00:28.120><c> ago.</c><00:00:28.480><c> However,</c><00:00:29.160><c> as</c><00:00:29.320><c> you</c><00:00:29.360><c> can</c><00:00:29.560><c> see</c>

00:00:29.750 --> 00:00:29.760 align:start position:0%
10 years ago. However, as you can see
 

00:00:29.760 --> 00:00:31.550 align:start position:0%
10 years ago. However, as you can see
now,<00:00:30.080><c> none</c><00:00:30.360><c> of</c><00:00:30.480><c> the</c><00:00:30.560><c> current</c><00:00:30.880><c> AI</c><00:00:31.040><c> methods</c>

00:00:31.550 --> 00:00:31.560 align:start position:0%
now, none of the current AI methods
 

00:00:31.560 --> 00:00:33.990 align:start position:0%
now, none of the current AI methods
incorporate<00:00:32.400><c> any</c><00:00:32.680><c> evolution</c><00:00:33.280><c> strategies</c><00:00:33.800><c> at</c>

00:00:33.990 --> 00:00:34.000 align:start position:0%
incorporate any evolution strategies at
 

00:00:34.000 --> 00:00:35.510 align:start position:0%
incorporate any evolution strategies at
all.<00:00:34.280><c> And</c><00:00:34.400><c> it</c><00:00:34.520><c> might</c><00:00:34.720><c> as</c><00:00:34.840><c> well</c><00:00:35.000><c> be</c><00:00:35.160><c> a</c><00:00:35.240><c> dead</c>

00:00:35.510 --> 00:00:35.520 align:start position:0%
all. And it might as well be a dead
 

00:00:35.520 --> 00:00:37.230 align:start position:0%
all. And it might as well be a dead
optimization<00:00:36.120><c> method</c><00:00:36.520><c> that</c><00:00:36.640><c> we</c><00:00:36.760><c> should</c><00:00:36.960><c> frame</c>

00:00:37.230 --> 00:00:37.240 align:start position:0%
optimization method that we should frame
 

00:00:37.240 --> 00:00:39.430 align:start position:0%
optimization method that we should frame
up<00:00:37.440><c> in</c><00:00:37.640><c> the</c><00:00:37.760><c> museum.</c><00:00:38.360><c> Or</c><00:00:38.680><c> not.</c><00:00:39.200><c> Because</c>

00:00:39.430 --> 00:00:39.440 align:start position:0%
up in the museum. Or not. Because
 

00:00:39.440 --> 00:00:41.510 align:start position:0%
up in the museum. Or not. Because
recently<00:00:40.080><c> evolution</c><00:00:40.600><c> strategies</c><00:00:41.160><c> have</c>

00:00:41.510 --> 00:00:41.520 align:start position:0%
recently evolution strategies have
 

00:00:41.520 --> 00:00:44.190 align:start position:0%
recently evolution strategies have
emerged<00:00:42.080><c> again</c><00:00:42.520><c> in</c><00:00:42.720><c> the</c><00:00:42.920><c> LLM</c><00:00:43.360><c> literature.</c><00:00:44.040><c> But</c>

00:00:44.190 --> 00:00:44.200 align:start position:0%
emerged again in the LLM literature. But
 

00:00:44.200 --> 00:00:46.350 align:start position:0%
emerged again in the LLM literature. But
how<00:00:44.440><c> is</c><00:00:44.640><c> this</c><00:00:44.840><c> abandoned</c><00:00:45.440><c> method</c><00:00:45.960><c> suddenly</c>

00:00:46.350 --> 00:00:46.360 align:start position:0%
how is this abandoned method suddenly
 

00:00:46.360 --> 00:00:48.150 align:start position:0%
how is this abandoned method suddenly
making<00:00:46.680><c> a</c><00:00:46.760><c> comeback?</c><00:00:47.320><c> Well,</c><00:00:47.560><c> the</c><00:00:47.680><c> bottleneck</c>

00:00:48.150 --> 00:00:48.160 align:start position:0%
making a comeback? Well, the bottleneck
 

00:00:48.160 --> 00:00:50.470 align:start position:0%
making a comeback? Well, the bottleneck
it<00:00:48.360><c> had</c><00:00:48.960><c> has</c><00:00:49.440><c> apparently</c><00:00:49.680><c> been</c><00:00:49.840><c> solved</c><00:00:50.280><c> and</c><00:00:50.400><c> it</c>

00:00:50.470 --> 00:00:50.480 align:start position:0%
it had has apparently been solved and it
 

00:00:50.480 --> 00:00:52.590 align:start position:0%
it had has apparently been solved and it
came<00:00:50.800><c> along</c><00:00:51.160><c> with</c><00:00:51.400><c> even</c><00:00:51.720><c> more</c><00:00:51.960><c> upsides</c><00:00:52.440><c> than</c>

00:00:52.590 --> 00:00:52.600 align:start position:0%
came along with even more upsides than
 

00:00:52.600 --> 00:00:54.390 align:start position:0%
came along with even more upsides than
we<00:00:52.720><c> initially</c><00:00:53.200><c> expected.</c><00:00:53.800><c> But</c><00:00:53.880><c> before</c><00:00:54.120><c> we</c><00:00:54.200><c> dive</c>

00:00:54.390 --> 00:00:54.400 align:start position:0%
we initially expected. But before we dive
 

00:00:54.400 --> 00:00:56.630 align:start position:0%
we initially expected. But before we dive
into<00:00:54.720><c> it,</c><00:00:54.960><c> as the AI</c><00:00:55.440><c> competition</c><00:00:55.920><c> is</c><00:00:56.120><c> constantly</c>

00:00:56.630 --> 00:00:56.640 align:start position:0%
into it, as the AI competition is constantly
 

00:00:56.640 --> 00:00:58.630 align:start position:0%
into it, as the AI competition is constantly
changing,<00:00:57.360><c> bouncing</c><00:00:57.680><c> around</c><00:00:58.040><c> five</c><00:00:58.320><c> different</c>

00:00:58.630 --> 00:00:58.640 align:start position:0%
changing, bouncing around five different
 

00:00:58.640 --> 00:00:59.990 align:start position:0%
changing, bouncing around five different
chatbots<00:00:59.120><c> with</c><00:00:59.400><c> five</c><00:00:59.680><c> different</c>

00:00:59.990 --> 00:01:00.000 align:start position:0%
chatbots with five different
 

00:01:00.000 --> 00:01:01.870 align:start position:0%
chatbots with five different
subscriptions<00:01:00.760><c> is</c><00:01:00.960><c> just</c><00:01:01.120><c> not</c><00:01:01.320><c> it.</c><00:01:01.600><c> Because</c>

00:01:01.870 --> 00:01:01.880 align:start position:0%
subscriptions is just not it. Because
 

00:01:01.880 --> 00:01:03.750 align:start position:0%
subscriptions is just not it. Because
why<00:01:02.080><c> even</c><00:01:02.360><c> bother</c><00:01:02.680><c> to</c><00:01:02.800><c> pay</c><00:01:02.960><c> 100</c><00:01:03.360><c> bucks</c><00:01:03.600><c> in</c>

00:01:03.750 --> 00:01:03.760 align:start position:0%
why even bother to pay 100 bucks in
 

00:01:03.760 --> 00:01:05.430 align:start position:0%
why even bother to pay 100 bucks in
subscription<00:01:04.320><c> fee</c><00:01:04.600><c> and</c><00:01:04.720><c> not</c><00:01:04.920><c> using</c><00:01:05.120><c> the</c><00:01:05.199><c> full</c>

00:01:05.430 --> 00:01:05.440 align:start position:0%
subscription fee and not using the full
 

00:01:05.440 --> 00:01:07.150 align:start position:0%
subscription fee and not using the full
value<00:01:05.760><c> of</c><00:01:05.920><c> it</c><00:01:06.120><c> when</c><00:01:06.240><c> you</c><00:01:06.320><c> can</c><00:01:06.440><c> just</c><00:01:06.600><c> pay</c><00:01:06.840><c> 10</c>

00:01:07.150 --> 00:01:07.160 align:start position:0%
value of it when you can just pay 10
 

00:01:07.160 --> 00:01:08.670 align:start position:0%
value of it when you can just pay 10
bucks<00:01:07.480><c> on</c><00:01:07.640><c> Mammouth.</c><00:01:08.120><c> And</c><00:01:08.280><c> you</c><00:01:08.360><c> don't</c><00:01:08.520><c> just</c>

00:01:08.670 --> 00:01:08.680 align:start position:0%
bucks on Mammouth. And you don't just
 

00:01:08.680 --> 00:01:10.990 align:start position:0%
bucks on Mammouth. And you don't just
get<00:01:08.920><c> five</c><00:01:09.360><c> other</c><00:01:09.600><c> models,</c><00:01:10.080><c> but</c><00:01:10.320><c> even</c><00:01:10.600><c> more.</c>

00:01:10.990 --> 00:01:11.000 align:start position:0%
get five other models, but even more.
 

00:01:11.000 --> 00:01:13.510 align:start position:0%
get five other models, but even more.
Ranging<00:01:11.240><c> from</c><00:01:11.440><c> Claude,</c><00:01:11.920><c> GPT,</c><00:01:12.520><c> Gemini,</c><00:01:13.000><c> Llama,</c>

00:01:13.510 --> 00:01:13.520 align:start position:0%
Ranging from Claude, GPT, Gemini, Llama,
 

00:01:13.520 --> 00:01:15.510 align:start position:0%
Ranging from Claude, GPT, Gemini, Llama,
Mistral,<00:01:14.000><c> Grok,</c><00:01:14.400><c> DeepSeek,</c><00:01:14.840><c> Perplexity,</c>

00:01:15.510 --> 00:01:15.520 align:start position:0%
Mistral, Grok, DeepSeek, Perplexity,
 

00:01:15.520 --> 00:01:18.030 align:start position:0%
Mistral, Grok, DeepSeek, Perplexity,
Flux,<00:01:15.920><c> Nano</c><00:01:16.200><c> Banana,</c><00:01:16.760><c> Recraft.</c><00:01:17.520><c> And</c><00:01:17.680><c> instead</c>

00:01:18.030 --> 00:01:18.040 align:start position:0%
Flux, Nano Banana, Recraft. And instead
 

00:01:18.040 --> 00:01:19.310 align:start position:0%
Flux, Nano Banana, Recraft. And instead
of<00:01:18.120><c> betting</c><00:01:18.480><c> everything</c><00:01:18.920><c> in</c><00:01:19.120><c> one</c>

00:01:19.310 --> 00:01:19.320 align:start position:0%
of betting everything in one
 

00:01:19.320 --> 00:01:21.310 align:start position:0%
of betting everything in one
subscription,<00:01:20.120><c> this</c><00:01:20.320><c> platform</c><00:01:20.720><c> also</c><00:01:21.000><c> lets</c>

00:01:21.310 --> 00:01:21.320 align:start position:0%
subscription, this platform also lets
 

00:01:21.320 --> 00:01:23.390 align:start position:0%
subscription, this platform also lets
you<00:01:21.440><c> re-prompt,</c><00:01:22.160><c> compare</c><00:01:22.560><c> answers,</c><00:01:23.040><c> and</c><00:01:23.200><c> have</c>

00:01:23.390 --> 00:01:23.400 align:start position:0%
you re-prompt, compare answers, and have
 

00:01:23.400 --> 00:01:25.110 align:start position:0%
you re-prompt, compare answers, and have
models<00:01:23.800><c> challenge</c><00:01:24.200><c> each</c><00:01:24.440><c> other</c><00:01:24.720><c> so</c><00:01:24.840><c> you</c><00:01:24.920><c> get</c>

00:01:25.110 --> 00:01:25.120 align:start position:0%
models challenge each other so you get
 

00:01:25.120 --> 00:01:26.950 align:start position:0%
models challenge each other so you get
higher<00:01:25.400><c> quality</c><00:01:25.880><c> outputs</c><00:01:26.280><c> without</c><00:01:26.600><c> vendor</c>

00:01:26.950 --> 00:01:26.960 align:start position:0%
higher quality outputs without vendor
 

00:01:26.960 --> 00:01:28.910 align:start position:0%
higher quality outputs without vendor
lock-in<00:01:27.720><c> while</c><00:01:27.960><c> not</c><00:01:28.240><c> needing</c><00:01:28.560><c> to</c><00:01:28.680><c> switch</c>

00:01:28.910 --> 00:01:28.920 align:start position:0%
lock-in while not needing to switch
 

00:01:28.920 --> 00:01:31.230 align:start position:0%
lock-in while not needing to switch
between<00:01:29.240><c> 10</c><00:01:29.560><c> tabs</c><00:01:30.000><c> to</c><00:01:30.120><c> check</c><00:01:30.400><c> manually</c><00:01:31.000><c> which</c>

00:01:31.230 --> 00:01:31.240 align:start position:0%
between 10 tabs to check manually which
 

00:01:31.240 --> 00:01:33.150 align:start position:0%
between 10 tabs to check manually which
one<00:01:31.440><c> is</c><00:01:31.640><c> the</c><00:01:31.760><c> best.</c><00:01:32.120><c> This</c><00:01:32.360><c> multi-model</c>

00:01:33.150 --> 00:01:33.160 align:start position:0%
one is the best. This multi-model
 

00:01:33.160 --> 00:01:34.830 align:start position:0%
one is the best. This multi-model
loadout<00:01:33.560><c> would</c><00:01:33.760><c> also</c><00:01:34.000><c> be</c><00:01:34.120><c> capable</c><00:01:34.600><c> of</c>

00:01:34.830 --> 00:01:34.840 align:start position:0%
loadout would also be capable of
 

00:01:34.840 --> 00:01:37.190 align:start position:0%
loadout would also be capable of
analyzing<00:01:35.320><c> documents</c><00:01:35.840><c> and</c><00:01:36.000><c> images,</c><00:01:36.720><c> use</c><00:01:37.000><c> deep</c>

00:01:37.190 --> 00:01:37.200 align:start position:0%
analyzing documents and images, use deep
 

00:01:37.200 --> 00:01:39.310 align:start position:0%
analyzing documents and images, use deep
research<00:01:37.640><c> through</c><00:01:37.840><c> Perplexity,</c><00:01:38.680><c> and</c><00:01:38.880><c> even</c><00:01:39.160><c> do</c>

00:01:39.310 --> 00:01:39.320 align:start position:0%
research through Perplexity, and even do
 

00:01:39.320 --> 00:01:41.230 align:start position:0%
research through Perplexity, and even do
voice<00:01:39.640><c> chat</c><00:01:39.840><c> and</c><00:01:39.960><c> dictation</c><00:01:40.600><c> when</c><00:01:40.760><c> you</c><00:01:40.880><c> want</c>

00:01:41.230 --> 00:01:41.240 align:start position:0%
voice chat and dictation when you want
 

00:01:41.240 --> 00:01:43.070 align:start position:0%
voice chat and dictation when you want
to<00:01:41.320><c> move</c><00:01:41.560><c> faster.</c><00:01:42.160><c> And</c><00:01:42.280><c> if</c><00:01:42.360><c> your</c><00:01:42.520><c> task</c><00:01:42.880><c> is</c>

00:01:43.070 --> 00:01:43.080 align:start position:0%
to move faster. And if your task is
 

00:01:43.080 --> 00:01:44.990 align:start position:0%
to move faster. And if your task is
repetitive,<00:01:43.800><c> you</c><00:01:43.880><c> can</c><00:01:44.000><c> just</c><00:01:44.200><c> use</c><00:01:44.400><c> the</c><00:01:44.520><c> project</c>

00:01:44.990 --> 00:01:45.000 align:start position:0%
repetitive, you can just use the project
 

00:01:45.000 --> 00:01:46.830 align:start position:0%
repetitive, you can just use the project
function<00:01:45.480><c> where</c><00:01:45.600><c> you</c><00:01:45.680><c> can</c><00:01:45.840><c> create</c><00:01:46.200><c> custom</c>

00:01:46.830 --> 00:01:46.840 align:start position:0%
function where you can create custom
 

00:01:46.840 --> 00:01:49.070 align:start position:0%
function where you can create custom
Mammouths<00:01:47.320><c> with</c><00:01:47.520><c> your</c><00:01:47.720><c> own</c><00:01:47.920><c> instructions,</c><00:01:48.840><c> set</c><00:01:49.040><c> a</c>

00:01:49.070 --> 00:01:49.080 align:start position:0%
Mammouths with your own instructions, set a
 

00:01:49.080 --> 00:01:51.070 align:start position:0%
Mammouths with your own instructions, set a
default<00:01:49.480><c> model</c><00:01:49.800><c> that</c><00:01:49.960><c> matches</c><00:01:50.320><c> your</c><00:01:50.480><c> habits,</c>

00:01:51.070 --> 00:01:51.080 align:start position:0%
default model that matches your habits,
 

00:01:51.080 --> 00:01:53.030 align:start position:0%
default model that matches your habits,
and<00:01:51.240><c> keep</c><00:01:51.520><c> everything</c><00:01:52.040><c> organized</c><00:01:52.520><c> instead</c><00:01:52.920><c> of</c>

00:01:53.030 --> 00:01:53.040 align:start position:0%
and keep everything organized instead of
 

00:01:53.040 --> 00:01:55.230 align:start position:0%
and keep everything organized instead of
starting<00:01:53.480><c> from</c><00:01:53.720><c> scratch</c><00:01:54.200><c> every</c><00:01:54.520><c> time.</c><00:01:55.000><c> So,</c><00:01:55.120><c> if</c>

00:01:55.230 --> 00:01:55.240 align:start position:0%
starting from scratch every time. So, if
 

00:01:55.240 --> 00:01:57.230 align:start position:0%
starting from scratch every time. So, if
you<00:01:55.360><c> want</c><00:01:55.720><c> one</c><00:01:55.960><c> clean</c><00:01:56.280><c> place</c><00:01:56.600><c> to</c><00:01:56.720><c> use</c><00:01:56.880><c> the</c><00:01:57.000><c> best</c>

00:01:57.230 --> 00:01:57.240 align:start position:0%
you want one clean place to use the best
 

00:01:57.240 --> 00:01:59.150 align:start position:0%
you want one clean place to use the best
models<00:01:57.680><c> without</c><00:01:57.960><c> investing</c><00:01:58.560><c> in</c><00:01:58.720><c> so</c><00:01:58.920><c> much</c>

00:01:59.150 --> 00:01:59.160 align:start position:0%
models without investing in so much
 

00:01:59.160 --> 00:02:00.950 align:start position:0%
models without investing in so much
subscription<00:01:59.720><c> money,</c><00:02:00.240><c> check</c><00:02:00.400><c> them</c><00:02:00.560><c> out</c><00:02:00.720><c> now</c>

00:02:00.950 --> 00:02:00.960 align:start position:0%
subscription money, check them out now
 

00:02:00.960 --> 00:02:02.270 align:start position:0%
subscription money, check them out now
using<00:02:01.200><c> the</c><00:02:01.280><c> link</c><00:02:01.440><c> down</c><00:02:01.600><c> in</c><00:02:01.640><c> the</c><00:02:01.720><c> description,</c>

00:02:02.270 --> 00:02:02.280 align:start position:0%
using the link down in the description,
 

00:02:02.280 --> 00:02:03.990 align:start position:0%
using the link down in the description,
and<00:02:02.480><c> thank</c><00:02:02.760><c> you</c><00:02:02.880><c> Mammouth</c><00:02:03.400><c> for</c><00:02:03.520><c> sponsoring</c>

00:02:03.990 --> 00:02:04.000 align:start position:0%
and thank you Mammouth for sponsoring
 

00:02:04.000 --> 00:02:05.670 align:start position:0%
and thank you Mammouth for sponsoring
this<00:02:04.160><c> video.</c><00:02:04.560><c> Anyways,</c><00:02:05.000><c> the</c><00:02:05.080><c> main</c><00:02:05.360><c> idea</c>

00:02:05.670 --> 00:02:05.680 align:start position:0%
this video. Anyways, the main idea
 

00:02:05.680 --> 00:02:07.470 align:start position:0%
this video. Anyways, the main idea
behind<00:02:06.040><c> evolution</c><00:02:06.440><c> strategies</c><00:02:06.920><c> is</c><00:02:07.120><c> actually</c>

00:02:07.470 --> 00:02:07.480 align:start position:0%
behind evolution strategies is actually
 

00:02:07.480 --> 00:02:09.270 align:start position:0%
behind evolution strategies is actually
very<00:02:07.720><c> simple.</c><00:02:08.200><c> You</c><00:02:08.320><c> start</c><00:02:08.520><c> with</c><00:02:08.679><c> one</c><00:02:08.920><c> version</c>

00:02:09.270 --> 00:02:09.280 align:start position:0%
very simple. You start with one version
 

00:02:09.280 --> 00:02:10.830 align:start position:0%
very simple. You start with one version
of<00:02:09.360><c> your</c><00:02:09.479><c> model,</c><00:02:09.960><c> then</c><00:02:10.160><c> you</c><00:02:10.280><c> create</c><00:02:10.520><c> several</c>

00:02:10.830 --> 00:02:10.840 align:start position:0%
of your model, then you create several
 

00:02:10.840 --> 00:02:12.390 align:start position:0%
of your model, then you create several
slightly<00:02:11.240><c> different</c><00:02:11.560><c> versions</c><00:02:11.920><c> of</c><00:02:12.040><c> it</c><00:02:12.160><c> by</c>

00:02:12.390 --> 00:02:12.400 align:start position:0%
slightly different versions of it by
 

00:02:12.400 --> 00:02:14.190 align:start position:0%
slightly different versions of it by
adding<00:02:12.760><c> small</c><00:02:13.040><c> random</c><00:02:13.360><c> changes.</c><00:02:13.920><c> Let's</c><00:02:14.080><c> take</c>

00:02:14.190 --> 00:02:14.200 align:start position:0%
adding small random changes. Let's take
 

00:02:14.200 --> 00:02:16.070 align:start position:0%
adding small random changes. Let's take
a<00:02:14.280><c> genetic</c><00:02:14.720><c> algorithm</c><00:02:15.240><c> as</c><00:02:15.360><c> an</c><00:02:15.480><c> example.</c><00:02:16.000><c> You</c>

00:02:16.070 --> 00:02:16.080 align:start position:0%
a genetic algorithm as an example. You
 

00:02:16.080 --> 00:02:18.190 align:start position:0%
a genetic algorithm as an example. You
basically<00:02:16.400><c> copy</c><00:02:16.720><c> a</c><00:02:16.800><c> group</c><00:02:17.040><c> of</c><00:02:17.160><c> the</c><00:02:17.240><c> same</c><00:02:17.520><c> DNA</c>

00:02:18.190 --> 00:02:18.200 align:start position:0%
basically copy a group of the same DNA
 

00:02:18.200 --> 00:02:20.110 align:start position:0%
basically copy a group of the same DNA
and<00:02:18.360><c> slightly</c><00:02:18.720><c> mutate</c><00:02:19.240><c> them,</c><00:02:19.520><c> then</c><00:02:19.680><c> you</c><00:02:19.800><c> test</c>

00:02:20.110 --> 00:02:20.120 align:start position:0%
and slightly mutate them, then you test
 

00:02:20.120 --> 00:02:21.870 align:start position:0%
and slightly mutate them, then you test
each<00:02:20.280><c> of</c><00:02:20.360><c> these</c><00:02:20.560><c> copies</c><00:02:21.080><c> and</c><00:02:21.200><c> measure</c><00:02:21.560><c> how</c>

00:02:21.870 --> 00:02:21.880 align:start position:0%
each of these copies and measure how
 

00:02:21.880 --> 00:02:23.430 align:start position:0%
each of these copies and measure how
well<00:02:22.120><c> they</c><00:02:22.320><c> perform.</c><00:02:22.800><c> This</c><00:02:22.960><c> performance</c>

00:02:23.430 --> 00:02:23.440 align:start position:0%
well they perform. This performance
 

00:02:23.440 --> 00:02:25.510 align:start position:0%
well they perform. This performance
score<00:02:23.760><c> is</c><00:02:23.920><c> called</c><00:02:24.400><c> their</c><00:02:24.680><c> fitness.</c><00:02:25.320><c> Some</c>

00:02:25.510 --> 00:02:25.520 align:start position:0%
score is called their fitness. Some
 

00:02:25.520 --> 00:02:27.310 align:start position:0%
score is called their fitness. Some
copies<00:02:25.920><c> will</c><00:02:26.040><c> do</c><00:02:26.200><c> better</c><00:02:26.560><c> and</c><00:02:26.760><c> some</c><00:02:26.960><c> will</c><00:02:27.120><c> do</c>

00:02:27.310 --> 00:02:27.320 align:start position:0%
copies will do better and some will do
 

00:02:27.320 --> 00:02:29.350 align:start position:0%
copies will do better and some will do
worse.<00:02:27.680><c> Once</c><00:02:27.920><c> you</c><00:02:28.040><c> see</c><00:02:28.280><c> which</c><00:02:28.480><c> copies</c><00:02:28.960><c> perform</c>

00:02:29.350 --> 00:02:29.360 align:start position:0%
worse. Once you see which copies perform
 

00:02:29.360 --> 00:02:31.310 align:start position:0%
worse. Once you see which copies perform
better,<00:02:29.960><c> you</c><00:02:30.000><c> then</c><00:02:30.200><c> use</c><00:02:30.400><c> that</c><00:02:30.560><c> information</c><00:02:31.200><c> to</c>

00:02:31.310 --> 00:02:31.320 align:start position:0%
better, you then use that information to
 

00:02:31.320 --> 00:02:33.190 align:start position:0%
better, you then use that information to
guide<00:02:31.560><c> the</c><00:02:31.640><c> next</c><00:02:31.880><c> step.</c><00:02:32.360><c> And</c><00:02:32.560><c> the</c><00:02:32.640><c> copies</c><00:02:33.040><c> that</c>

00:02:33.190 --> 00:02:33.200 align:start position:0%
guide the next step. And the copies that
 

00:02:33.200 --> 00:02:35.070 align:start position:0%
guide the next step. And the copies that
have<00:02:33.400><c> high</c><00:02:33.560><c> fitness</c><00:02:33.920><c> score</c><00:02:34.400><c> will</c><00:02:34.600><c> influence</c>

00:02:35.070 --> 00:02:35.080 align:start position:0%
have high fitness score will influence
 

00:02:35.080 --> 00:02:36.870 align:start position:0%
have high fitness score will influence
the<00:02:35.160><c> next</c><00:02:35.480><c> version</c><00:02:35.840><c> a</c><00:02:35.920><c> lot</c><00:02:36.120><c> more</c><00:02:36.280><c> strongly.</c>

00:02:36.870 --> 00:02:36.880 align:start position:0%
the next version a lot more strongly.
 

00:02:36.880 --> 00:02:39.030 align:start position:0%
the next version a lot more strongly.
The<00:02:36.960><c> ones</c><00:02:37.200><c> that</c><00:02:37.360><c> did</c><00:02:37.520><c> poorly</c><00:02:38.120><c> influence</c><00:02:38.600><c> less</c>

00:02:39.030 --> 00:02:39.040 align:start position:0%
The ones that did poorly influence less
 

00:02:39.040 --> 00:02:41.310 align:start position:0%
The ones that did poorly influence less
or<00:02:39.320><c> are</c><00:02:39.520><c> completely</c><00:02:40.000><c> discarded.</c><00:02:40.640><c> Then,</c><00:02:41.120><c> you</c>

00:02:41.310 --> 00:02:41.320 align:start position:0%
or are completely discarded. Then, you
 

00:02:41.320 --> 00:02:42.670 align:start position:0%
or are completely discarded. Then, you
repeat<00:02:41.720><c> the</c><00:02:41.800><c> whole</c><00:02:42.000><c> process.</c><00:02:42.480><c> So,</c><00:02:42.600><c> you</c>

00:02:42.670 --> 00:02:42.680 align:start position:0%
repeat the whole process. So, you
 

00:02:42.680 --> 00:02:44.750 align:start position:0%
repeat the whole process. So, you
basically<00:02:43.080><c> create</c><00:02:43.400><c> new</c><00:02:43.560><c> variations,</c><00:02:44.440><c> test</c>

00:02:44.750 --> 00:02:44.760 align:start position:0%
basically create new variations, test
 

00:02:44.760 --> 00:02:46.590 align:start position:0%
basically create new variations, test
them,<00:02:45.080><c> and</c><00:02:45.240><c> move</c><00:02:45.440><c> toward</c><00:02:45.760><c> the</c><00:02:45.840><c> variations</c>

00:02:46.590 --> 00:02:46.600 align:start position:0%
them, and move toward the variations
 

00:02:46.600 --> 00:02:48.190 align:start position:0%
them, and move toward the variations
that<00:02:46.800><c> worked</c><00:02:47.080><c> best.</c><00:02:47.480><c> And</c><00:02:47.600><c> over</c><00:02:47.800><c> time,</c><00:02:48.080><c> the</c>

00:02:48.190 --> 00:02:48.200 align:start position:0%
that worked best. And over time, the
 

00:02:48.200 --> 00:02:50.110 align:start position:0%
that worked best. And over time, the
model<00:02:48.480><c> improves</c><00:02:49.040><c> because</c><00:02:49.320><c> it</c><00:02:49.440><c> keeps</c><00:02:49.760><c> shifting</c>

00:02:50.110 --> 00:02:50.120 align:start position:0%
model improves because it keeps shifting
 

00:02:50.120 --> 00:02:51.790 align:start position:0%
model improves because it keeps shifting
towards<00:02:50.480><c> changes</c><00:02:50.880><c> that</c><00:02:51.040><c> increase</c><00:02:51.560><c> its</c>

00:02:51.790 --> 00:02:51.800 align:start position:0%
towards changes that increase its
 

00:02:51.800 --> 00:02:53.870 align:start position:0%
towards changes that increase its
fitness.<00:02:52.320><c> And</c><00:02:52.480><c> it</c><00:02:52.560><c> can</c><00:02:52.720><c> run</c><00:02:52.920><c> infinitely</c><00:02:53.520><c> until</c>

00:02:53.870 --> 00:02:53.880 align:start position:0%
fitness. And it can run infinitely until
 

00:02:53.880 --> 00:02:55.790 align:start position:0%
fitness. And it can run infinitely until
you<00:02:54.000><c> decide</c><00:02:54.360><c> to</c><00:02:54.440><c> stop.</c><00:02:54.840><c> So,</c><00:02:55.080><c> this</c><00:02:55.360><c> seemingly</c>
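
NOTE
The loop described above (copy, mutate, score, let the fittest copies shape the next version, repeat) can be sketched in Python. This is a minimal illustration under stated assumptions, not the method of any particular paper or of the video: the toy fitness function, TARGET, and the names evolve, population, elite, and sigma are all hypothetical.
```python
import random
# Hypothetical toy problem (an assumption for illustration): fitness peaks at TARGET.
TARGET = [3.0, -2.0]
def fitness(params):
    # Higher is better; this is the negative squared distance to TARGET.
    return -sum((p - t) ** 2 for p, t in zip(params, TARGET))
def evolve(params, generations=200, population=20, elite=5, sigma=0.1, seed=0):
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    for _ in range(generations):
        # Create slightly different copies by adding small random changes.
        copies = [[p + rng.gauss(0.0, sigma) for p in params] for _ in range(population)]
        # Test each copy and rank by fitness, best first.
        copies.sort(key=fitness, reverse=True)
        # Discard the poor performers; better-ranked survivors get larger weights.
        survivors = copies[:elite]
        weights = [elite - i for i in range(elite)]
        total = sum(weights)
        # The next version is the weighted average of the survivors.
        params = [sum(w * c[d] for w, c in zip(weights, survivors)) / total for d in range(len(params))]
    return params
best = evolve([0.0, 0.0])
```
Rank-based weights are only one simple way to let fitter copies influence the next version more strongly; other weighting schemes are equally plausible.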

00:02:55.790 --> 00:02:55.800 align:start position:0%
you decide to stop. So, this seemingly
 

00:02:55.800 --> 00:02:57.990 align:start position:0%
you decide to stop. So, this seemingly
bulletproof<00:02:56.400><c> idea</c><00:02:56.840><c> sounds</c><00:02:57.120><c> like</c><00:02:57.480><c> it</c><00:02:57.640><c> should</c>

00:02:57.990 --> 00:02:58.000 align:start position:0%
bulletproof idea sounds like it should
 

00:02:58.000 --> 00:03:00.630 align:start position:0%
bulletproof idea sounds like it should
work<00:02:58.320><c> everywhere.</c><00:02:59.040><c> But</c><00:02:59.200><c> in</c><00:02:59.320><c> practice,</c><00:03:00.400><c> it is</c>

00:03:00.630 --> 00:03:00.640 align:start position:0%
work everywhere. But in practice, it is
 

00:03:00.640 --> 00:03:02.950 align:start position:0%
work everywhere. But in practice, it is
actually<00:03:01.000><c> not</c><00:03:01.400><c> as</c><00:03:01.680><c> linear</c><00:03:02.120><c> as</c><00:03:02.320><c> it</c><00:03:02.440><c> seems.</c><00:03:02.880><c> In</c>

00:03:02.950 --> 00:03:02.960 align:start position:0%
actually not as linear as it seems. In
 

00:03:02.960 --> 00:03:04.550 align:start position:0%
actually not as linear as it seems. In
the<00:03:03.120><c> early</c><00:03:03.360><c> days</c><00:03:03.640><c> of</c><00:03:03.760><c> deep</c><00:03:04.000><c> learning,</c>

00:03:04.550 --> 00:03:04.560 align:start position:0%
the early days of deep learning,
 

00:03:04.560 --> 00:03:06.110 align:start position:0%
the early days of deep learning,
researchers<00:03:05.040><c> trained</c><00:03:05.360><c> neural</c><00:03:05.600><c> networks</c><00:03:06.040><c> to</c>

00:03:06.110 --> 00:03:06.120 align:start position:0%
researchers trained neural networks to
 

00:03:06.120 --> 00:03:08.430 align:start position:0%
researchers trained neural networks to
do<00:03:06.320><c> things</c><00:03:06.640><c> like</c><00:03:06.920><c> play</c><00:03:07.200><c> Atari</c><00:03:07.680><c> games.</c><00:03:08.120><c> Those</c>

00:03:08.430 --> 00:03:08.440 align:start position:0%
do things like play Atari games. Those
 

00:03:08.440 --> 00:03:10.230 align:start position:0%
do things like play Atari games. Those
networks<00:03:08.880><c> usually</c><00:03:09.160><c> had</c><00:03:09.320><c> around</c><00:03:09.720><c> 2</c><00:03:09.960><c> million</c>

00:03:10.230 --> 00:03:10.240 align:start position:0%
networks usually had around 2 million
 

00:03:10.240 --> 00:03:12.470 align:start position:0%
networks usually had around 2 million
parameters.<00:03:10.920><c> That</c><00:03:11.240><c> may</c><00:03:11.520><c> not</c><00:03:11.840><c> sound</c><00:03:12.120><c> huge</c>

00:03:12.470 --> 00:03:12.480 align:start position:0%
parameters. That may not sound huge
 

00:03:12.480 --> 00:03:14.470 align:start position:0%
parameters. That may not sound huge
today,<00:03:12.880><c> but</c><00:03:13.160><c> at</c><00:03:13.280><c> the</c><00:03:13.400><c> time,</c><00:03:13.880><c> especially</c><00:03:14.320><c> for</c>

00:03:14.470 --> 00:03:14.480 align:start position:0%
today, but at the time, especially for
 

00:03:14.480 --> 00:03:16.790 align:start position:0%
today, but at the time, especially for
evolutionary<00:03:15.120><c> methods,</c><00:03:15.800><c> it</c><00:03:16.000><c> was</c><00:03:16.280><c> already</c>

00:03:16.790 --> 00:03:16.800 align:start position:0%
evolutionary methods, it was already
 

00:03:16.800 --> 00:03:19.070 align:start position:0%
evolutionary methods, it was already
extremely<00:03:17.320><c> large.</c><00:03:17.720><c> And</c><00:03:17.920><c> optimizing</c><00:03:18.520><c> it</c><00:03:18.760><c> using</c>

00:03:19.070 --> 00:03:19.080 align:start position:0%
extremely large. And optimizing it using
 

00:03:19.080 --> 00:03:20.790 align:start position:0%
extremely large. And optimizing it using
evolution<00:03:19.560><c> strategies</c><00:03:20.120><c> is</c><00:03:20.240><c> like</c><00:03:20.400><c> trying</c><00:03:20.680><c> to</c>

00:03:20.790 --> 00:03:20.800 align:start position:0%
evolution strategies is like trying to
 

00:03:20.800 --> 00:03:22.910 align:start position:0%
evolution strategies is like trying to
randomly<00:03:21.360><c> tweak</c><00:03:21.760><c> 2</c><00:03:22.000><c> million</c><00:03:22.400><c> knobs</c><00:03:22.760><c> at</c><00:03:22.840><c> the</c>

00:03:22.910 --> 00:03:22.920 align:start position:0%
randomly tweak 2 million knobs at the
 

00:03:22.920 --> 00:03:24.910 align:start position:0%
randomly tweak 2 million knobs at the
same<00:03:23.160><c> time</c><00:03:23.560><c> and</c><00:03:23.760><c> hoping</c><00:03:24.120><c> you</c><00:03:24.240><c> would</c><00:03:24.440><c> randomly</c>

00:03:24.910 --> 00:03:24.920 align:start position:0%
same time and hoping you would randomly
 

00:03:24.920 --> 00:03:26.950 align:start position:0%
same time and hoping you would randomly
get<00:03:25.120><c> improvements</c><00:03:25.760><c> out</c><00:03:25.920><c> of</c><00:03:26.040><c> that,</c><00:03:26.360><c> which</c>

00:03:26.950 --> 00:03:26.960 align:start position:0%
get improvements out of that, which
 

00:03:26.960 --> 00:03:29.310 align:start position:0%
get improvements out of that, which
seems<00:03:27.520><c> worse</c><00:03:27.880><c> than</c><00:03:28.120><c> gambling.</c><00:03:28.840><c> Most</c><00:03:29.000><c> random</c>

00:03:29.310 --> 00:03:29.320 align:start position:0%
seems worse than gambling. Most random
 

00:03:29.320 --> 00:03:31.150 align:start position:0%
seems worse than gambling. Most random
changes<00:03:29.760><c> will</c><00:03:29.960><c> also</c><00:03:30.200><c> completely</c><00:03:30.680><c> scramble</c>

00:03:31.150 --> 00:03:31.160 align:start position:0%
changes will also completely scramble
 

00:03:31.160 --> 00:03:32.870 align:start position:0%
changes will also completely scramble
the<00:03:31.240><c> model's</c><00:03:31.600><c> behavior,</c><00:03:32.080><c> so</c><00:03:32.200><c> the</c><00:03:32.320><c> model</c><00:03:32.640><c> might</c>

00:03:32.870 --> 00:03:32.880 align:start position:0%
the model's behavior, so the model might
 

00:03:32.880 --> 00:03:35.110 align:start position:0%
the model's behavior, so the model might
go<00:03:33.040><c> from</c><00:03:33.320><c> playing</c><00:03:33.680><c> somewhat</c><00:03:34.080><c> reasonably</c><00:03:34.680><c> to</c>

00:03:35.110 --> 00:03:35.120 align:start position:0%
go from playing somewhat reasonably to
 

00:03:35.120 --> 00:03:36.910 align:start position:0%
go from playing somewhat reasonably to
acting<00:03:35.520><c> almost</c><00:03:35.920><c> randomly.</c><00:03:36.480><c> And</c><00:03:36.760><c> when</c>

00:03:36.910 --> 00:03:36.920 align:start position:0%
acting almost randomly. And when
 

00:03:36.920 --> 00:03:38.510 align:start position:0%
acting almost randomly. And when
nearly<00:03:37.280><c> all</c><00:03:37.400><c> mutations</c><00:03:38.040><c> destroy</c>

00:03:38.510 --> 00:03:38.520 align:start position:0%
nearly all mutations destroy
 

00:03:38.520 --> 00:03:40.430 align:start position:0%
nearly all mutations destroy
performance,<00:03:39.240><c> it</c><00:03:39.360><c> becomes</c><00:03:39.800><c> very</c><00:03:40.120><c> hard</c><00:03:40.360><c> to</c>

00:03:40.430 --> 00:03:40.440 align:start position:0%
performance, it becomes very hard to
 

00:03:40.440 --> 00:03:42.430 align:start position:0%
performance, it becomes very hard to
find<00:03:40.680><c> the</c><00:03:40.800><c> rare</c><00:03:41.160><c> ones</c><00:03:41.480><c> that</c><00:03:41.640><c> actually</c><00:03:41.960><c> improve</c>

00:03:42.430 --> 00:03:42.440 align:start position:0%
find the rare ones that actually improve
 

00:03:42.440 --> 00:03:44.590 align:start position:0%
find the rare ones that actually improve
it,<00:03:42.720><c> resulting</c><00:03:43.280><c> in</c><00:03:43.440><c> the</c><00:03:43.600><c> good</c><00:03:43.880><c> signal</c><00:03:44.280><c> getting</c>

00:03:44.590 --> 00:03:44.600 align:start position:0%
it, resulting in the good signal getting
 

00:03:44.600 --> 00:03:46.870 align:start position:0%
it, resulting in the good signal getting
buried<00:03:45.160><c> under</c><00:03:45.440><c> noise.</c><00:03:45.920><c> On</c><00:03:46.000><c> top</c><00:03:46.200><c> of</c><00:03:46.320><c> that,</c><00:03:46.640><c> deep</c>

00:03:46.870 --> 00:03:46.880 align:start position:0%
buried under noise. On top of that, deep
 

00:03:46.880 --> 00:03:48.310 align:start position:0%
buried under noise. On top of that, deep
neural<00:03:47.120><c> network</c><00:03:47.440><c> parameters</c><00:03:47.960><c> are</c><00:03:48.080><c> not</c>

00:03:48.310 --> 00:03:48.320 align:start position:0%
neural network parameters are not
 

00:03:48.320 --> 00:03:50.630 align:start position:0%
neural network parameters are not
independent<00:03:48.880><c> knobs</c><00:03:49.240><c> like</c><00:03:49.560><c> genes</c><00:03:49.920><c> in</c><00:03:50.120><c> genetic</c>

00:03:50.630 --> 00:03:50.640 align:start position:0%
independent knobs like genes in genetic
 

00:03:50.640 --> 00:03:52.310 align:start position:0%
independent knobs like genes in genetic
algorithms,<00:03:51.280><c> which</c><00:03:51.480><c> are</c><00:03:51.600><c> simple</c><00:03:52.120><c> and</c>

00:03:52.310 --> 00:03:52.320 align:start position:0%
algorithms, which are simple and
 

00:03:52.320 --> 00:03:54.710 align:start position:0%
algorithms, which are simple and
unrelated<00:03:52.960><c> DNA</c><00:03:53.400><c> strings.</c><00:03:53.880><c> The</c><00:03:54.000><c> parameters</c><00:03:54.600><c> in</c>

00:03:54.710 --> 00:03:54.720 align:start position:0%
unrelated DNA strings. The parameters in
 

00:03:54.720 --> 00:03:56.150 align:start position:0%
unrelated DNA strings. The parameters in
neural<00:03:54.920><c> networks</c><00:03:55.400><c> are</c><00:03:55.720><c> highly</c>

00:03:56.150 --> 00:03:56.160 align:start position:0%
neural networks are highly
 

00:03:56.160 --> 00:03:57.790 align:start position:0%
neural networks are highly
interconnected<00:03:56.880><c> with</c><00:03:57.080><c> each</c><00:03:57.280><c> other.</c><00:03:57.640><c> So,</c>

00:03:57.790 --> 00:03:57.800 align:start position:0%
interconnected with each other. So,
 

00:03:57.800 --> 00:04:00.510 align:start position:0%
interconnected with each other. So,
changing<00:03:58.480><c> one</c><00:03:58.800><c> weight</c><00:03:59.120><c> slightly</c><00:03:59.680><c> can</c><00:04:00.200><c> change</c>

00:04:00.510 --> 00:04:00.520 align:start position:0%
changing one weight slightly can change
 

00:04:00.520 --> 00:04:01.790 align:start position:0%
changing one weight slightly can change
how<00:04:00.640><c> many</c><00:04:00.800><c> other</c><00:04:01.040><c> weights</c><00:04:01.360><c> behave</c>

00:04:01.790 --> 00:04:01.800 align:start position:0%
how many other weights behave
 

00:04:01.800 --> 00:04:03.590 align:start position:0%
how many other weights behave
downstream.<00:04:02.440><c> So,</c><00:04:02.600><c> usually</c><00:04:03.160><c> it'll</c><00:04:03.360><c> probably</c>

00:04:03.590 --> 00:04:03.600 align:start position:0%
downstream. So, usually it'll probably
 

00:04:03.600 --> 00:04:05.190 align:start position:0%
downstream. So, usually it'll probably
bring<00:04:03.920><c> more</c><00:04:04.200><c> destruction</c><00:04:05.040><c> than</c>

00:04:05.190 --> 00:04:05.200 align:start position:0%
bring more destruction than
 

00:04:05.200 --> 00:04:07.430 align:start position:0%
bring more destruction than
improvements.<00:04:06.080><c> Some</c><00:04:06.320><c> methods</c><00:04:06.800><c> did</c><00:04:06.960><c> try</c><00:04:07.280><c> to</c>

00:04:07.430 --> 00:04:07.440 align:start position:0%
improvements. Some methods did try to
 

00:04:07.440 --> 00:04:09.310 align:start position:0%
improvements. Some methods did try to
model<00:04:07.800><c> how</c><00:04:07.920><c> parameters</c><00:04:08.400><c> interact</c><00:04:08.920><c> with</c><00:04:09.080><c> each</c>

00:04:09.310 --> 00:04:09.320 align:start position:0%
model how parameters interact with each
 

00:04:09.320 --> 00:04:11.190 align:start position:0%
model how parameters interact with each
other<00:04:09.560><c> by</c><00:04:09.720><c> learning</c><00:04:10.040><c> a</c><00:04:10.120><c> large</c><00:04:10.480><c> covariance</c>

00:04:11.190 --> 00:04:11.200 align:start position:0%
other by learning a large covariance
 

00:04:11.200 --> 00:04:13.390 align:start position:0%
other by learning a large covariance
matrix.<00:04:11.720><c> But</c><00:04:11.960><c> for</c><00:04:12.160><c> a</c><00:04:12.240><c> network</c><00:04:12.720><c> with</c><00:04:12.920><c> 2</c><00:04:13.120><c> million</c>

00:04:13.390 --> 00:04:13.400 align:start position:0%
matrix. But for a network with 2 million
 

00:04:13.400 --> 00:04:15.590 align:start position:0%
matrix. But for a network with 2 million
parameters,<00:04:14.080><c> that</c><00:04:14.360><c> matrix</c><00:04:14.880><c> would</c><00:04:15.120><c> contain</c>

00:04:15.590 --> 00:04:15.600 align:start position:0%
parameters, that matrix would contain
 

00:04:15.600 --> 00:04:17.630 align:start position:0%
parameters, that matrix would contain
trillions<00:04:16.239><c> of</c><00:04:16.440><c> entries.</c><00:04:17.000><c> And</c><00:04:17.120><c> if</c><00:04:17.239><c> you</c><00:04:17.359><c> store</c>

00:04:17.630 --> 00:04:17.640 align:start position:0%
trillions of entries. And if you store
 

00:04:17.640 --> 00:04:19.470 align:start position:0%
trillions of entries. And if you store
and<00:04:17.760><c> update</c><00:04:18.000><c> something</c><00:04:18.320><c> that</c><00:04:18.560><c> large,</c><00:04:19.160><c> it</c><00:04:19.320><c> is</c>

00:04:19.470 --> 00:04:19.480 align:start position:0%
and update something that large, it is
 

00:04:19.480 --> 00:04:21.430 align:start position:0%
and update something that large, it is
pretty<00:04:19.720><c> much</c><00:04:20.040><c> impossible</c><00:04:20.560><c> to</c><00:04:20.680><c> train</c><00:04:21.000><c> and</c><00:04:21.200><c> even</c>

00:04:21.430 --> 00:04:21.440 align:start position:0%
pretty much impossible to train and even
 

00:04:21.440 --> 00:04:23.430 align:start position:0%
pretty much impossible to train and even
use.<00:04:21.799><c> So,</c><00:04:22.040><c> older</c><00:04:22.320><c> evolution</c><00:04:22.800><c> strategies</c>

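NOTE
A rough sketch of the arithmetic behind the "trillions of entries" remark above.
Assumptions not stated in the video: the covariance matrix is stored densely in
float32 (4 bytes per entry), and the variable names are purely illustrative.
```python
# Size of a full covariance matrix over 2 million parameters.
n_params = 2_000_000
# One entry per pair of parameters.
entries = n_params ** 2
# Assumption: dense float32 storage, 4 bytes per entry.
bytes_needed = entries * 4
terabytes = bytes_needed / 1e12
print(f"{entries:.1e} entries, about {terabytes:.0f} TB")
```
Storing only one triangle of the symmetric matrix would roughly halve this,
which still leaves it far beyond what can be stored and updated in practice.
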
00:04:23.430 --> 00:04:23.440 align:start position:0%
use. So, older evolution strategies
 

00:04:23.440 --> 00:04:25.710 align:start position:0%
use. So, older evolution strategies
simply<00:04:23.920><c> could</c><00:04:24.160><c> not</c><00:04:24.440><c> scale</c><00:04:24.920><c> to</c><00:04:25.160><c> deep</c><00:04:25.440><c> neural</c>

00:04:25.710 --> 00:04:25.720 align:start position:0%
simply could not scale to deep neural
 

00:04:25.720 --> 00:04:27.950 align:start position:0%
simply could not scale to deep neural
networks.<00:04:26.280><c> But</c><00:04:26.440><c> in</c><00:04:26.600><c> OpenAI's</c><00:04:27.080><c> 2017</c><00:04:27.600><c> paper</c>

00:04:27.950 --> 00:04:27.960 align:start position:0%
networks. But in OpenAI's 2017 paper
 

00:04:27.960 --> 00:04:29.550 align:start position:0%
networks. But in OpenAI's 2017 paper
called<00:04:28.400><c> Evolution</c><00:04:28.880><c> Strategies</c><00:04:29.400><c> as</c><00:04:29.520><c> a</c>

00:04:29.550 --> 00:04:29.560 align:start position:0%
called Evolution Strategies as a
 

00:04:29.560 --> 00:04:31.310 align:start position:0%
called Evolution Strategies as a
Scalable<00:04:30.120><c> Alternative</c><00:04:30.520><c> to</c><00:04:30.680><c> Reinforcement</c>

00:04:31.310 --> 00:04:31.320 align:start position:0%
Scalable Alternative to Reinforcement
 

00:04:31.320 --> 00:04:32.710 align:start position:0%
Scalable Alternative to Reinforcement
Learning,<00:04:31.800><c> they</c><00:04:31.960><c> changed</c><00:04:32.320><c> the</c><00:04:32.400><c> way</c><00:04:32.560><c> of</c>

00:04:32.710 --> 00:04:32.720 align:start position:0%
Learning, they changed the way of
 

00:04:32.720 --> 00:04:34.830 align:start position:0%
Learning, they changed the way of
implementing<00:04:33.320><c> it</c><00:04:33.520><c> for</c><00:04:33.680><c> neural</c><00:04:34.080><c> networks.</c><00:04:34.760><c> So,</c>

00:04:34.830 --> 00:04:34.840 align:start position:0%
implementing it for neural networks. So,
 

00:04:34.840 --> 00:04:36.350 align:start position:0%
implementing it for neural networks. So,
instead<00:04:35.160><c> of</c><00:04:35.240><c> trying</c><00:04:35.480><c> to</c><00:04:35.640><c> learn</c><00:04:35.840><c> a</c><00:04:35.920><c> huge</c><00:04:36.240><c> and</c>

00:04:36.350 --> 00:04:36.360 align:start position:0%
instead of trying to learn a huge and
 

00:04:36.360 --> 00:04:38.030 align:start position:0%
instead of trying to learn a huge and
complicated<00:04:36.880><c> structure</c><00:04:37.240><c> that</c><00:04:37.440><c> models</c><00:04:37.800><c> how</c>

00:04:38.030 --> 00:04:38.040 align:start position:0%
complicated structure that models how
 

00:04:38.040 --> 00:04:39.630 align:start position:0%
complicated structure that models how
all<00:04:38.160><c> the</c><00:04:38.280><c> parameters</c><00:04:38.760><c> interact</c><00:04:39.240><c> with</c><00:04:39.440><c> each</c>

00:04:39.630 --> 00:04:39.640 align:start position:0%
all the parameters interact with each
 

00:04:39.640 --> 00:04:41.550 align:start position:0%
all the parameters interact with each
other,<00:04:40.040><c> for</c><00:04:40.200><c> example,</c><00:04:40.760><c> the</c><00:04:40.880><c> covariance</c>

00:04:41.550 --> 00:04:41.560 align:start position:0%
other, for example, the covariance
 

00:04:41.560 --> 00:04:44.270 align:start position:0%
other, for example, the covariance
matrix,<00:04:42.200><c> they</c><00:04:42.480><c> used</c><00:04:42.840><c> basic</c><00:04:43.320><c> Gaussian</c><00:04:43.800><c> noise.</c>

00:04:44.270 --> 00:04:44.280 align:start position:0%
matrix, they used basic Gaussian noise.
 

00:04:44.280 --> 00:04:45.870 align:start position:0%
matrix, they used basic Gaussian noise.
This<00:04:44.440><c> slightly</c><00:04:44.800><c> nudges</c><00:04:45.200><c> all</c><00:04:45.320><c> the</c><00:04:45.440><c> parameters</c>

00:04:45.870 --> 00:04:45.880 align:start position:0%
This slightly nudges all the parameters
 

00:04:45.880 --> 00:04:47.990 align:start position:0%
This slightly nudges all the parameters
in<00:04:46.040><c> random</c><00:04:46.320><c> directions</c><00:04:47.040><c> and</c><00:04:47.320><c> then</c><00:04:47.560><c> measures</c>

00:04:47.990 --> 00:04:48.000 align:start position:0%
in random directions and then measures
 

00:04:48.000 --> 00:04:49.750 align:start position:0%
in random directions and then measures
how<00:04:48.240><c> the</c><00:04:48.360><c> performance</c><00:04:49.200><c> changes.</c><00:04:49.640><c> For</c>

00:04:49.750 --> 00:04:49.760 align:start position:0%
how the performance changes. For
 

00:04:49.760 --> 00:04:51.510 align:start position:0%
how the performance changes. For
example,<00:04:50.360><c> let's</c><00:04:50.560><c> say</c><00:04:50.680><c> you</c><00:04:50.760><c> have</c><00:04:50.880><c> a</c><00:04:50.920><c> population</c>

00:04:51.510 --> 00:04:51.520 align:start position:0%
example, let's say you have a population
 

00:04:51.520 --> 00:04:53.790 align:start position:0%
example, let's say you have a population
of<00:04:51.680><c> nine</c><00:04:51.960><c> models</c><00:04:52.440><c> in</c><00:04:52.640><c> one</c><00:04:52.880><c> iteration</c><00:04:53.560><c> where</c>

00:04:53.790 --> 00:04:53.800 align:start position:0%
of nine models in one iteration where
 

00:04:53.800 --> 00:04:55.470 align:start position:0%
of nine models in one iteration where
every<00:04:54.040><c> model</c><00:04:54.320><c> will</c><00:04:54.480><c> get</c><00:04:54.680><c> a</c><00:04:54.720><c> full</c><00:04:55.000><c> parameter</c>

00:04:55.470 --> 00:04:55.480 align:start position:0%
every model will get a full parameter
 

00:04:55.480 --> 00:04:57.710 align:start position:0%
every model will get a full parameter
update<00:04:55.880><c> in</c><00:04:56.120><c> all</c><00:04:56.320><c> nine</c><00:04:56.600><c> random</c><00:04:56.960><c> directions.</c>

00:04:57.710 --> 00:04:57.720 align:start position:0%
update in all nine random directions.
 

00:04:57.720 --> 00:04:59.590 align:start position:0%
update in all nine random directions.
You<00:04:57.920><c> then</c><00:04:58.200><c> evaluate</c><00:04:58.760><c> all</c><00:04:58.920><c> nine</c><00:04:59.160><c> updated</c>

00:04:59.590 --> 00:04:59.600 align:start position:0%
You then evaluate all nine updated
 

00:04:59.600 --> 00:05:01.270 align:start position:0%
You then evaluate all nine updated
models<00:05:00.200><c> and</c><00:05:00.240><c> see</c><00:05:00.520><c> who</c><00:05:00.720><c> has</c><00:05:00.920><c> the</c><00:05:01.040><c> best</c>

00:05:01.270 --> 00:05:01.280 align:start position:0%
models and see who has the best
 

00:05:01.280 --> 00:05:03.110 align:start position:0%
models and see who has the best
performance<00:05:01.760><c> given</c><00:05:02.040><c> a</c><00:05:02.120><c> task.</c><00:05:02.640><c> And</c><00:05:02.880><c> there</c>

00:05:03.110 --> 00:05:03.120 align:start position:0%
performance given a task. And there
 

00:05:03.120 --> 00:05:05.030 align:start position:0%
performance given a task. And there
would<00:05:03.360><c> be</c><00:05:03.520><c> a</c><00:05:03.600><c> scalar</c><00:05:03.960><c> reward</c><00:05:04.360><c> to</c><00:05:04.480><c> indicate</c><00:05:04.880><c> the</c>

00:05:05.030 --> 00:05:05.040 align:start position:0%
would be a scalar reward to indicate the
 

00:05:05.040 --> 00:05:06.870 align:start position:0%
would be a scalar reward to indicate the
random<00:05:05.440><c> direction's</c><00:05:05.960><c> performance.</c><00:05:06.600><c> After</c>

00:05:06.870 --> 00:05:06.880 align:start position:0%
random direction's performance. After
 

00:05:06.880 --> 00:05:08.550 align:start position:0%
random direction's performance. After
you<00:05:07.040><c> evaluate</c><00:05:07.560><c> every</c><00:05:07.800><c> random</c><00:05:08.080><c> direction's</c>

00:05:08.550 --> 00:05:08.560 align:start position:0%
you evaluate every random direction's
 

00:05:08.560 --> 00:05:09.990 align:start position:0%
you evaluate every random direction's
effectiveness,<00:05:09.360><c> everything</c><00:05:09.680><c> will</c><00:05:09.800><c> be</c>

00:05:09.990 --> 00:05:10.000 align:start position:0%
effectiveness, everything will be
 

00:05:10.000 --> 00:05:11.990 align:start position:0%
effectiveness, everything will be
weighted<00:05:10.480><c> according</c><00:05:11.080><c> to</c><00:05:11.200><c> its</c><00:05:11.360><c> performance.</c>

00:05:11.990 --> 00:05:12.000 align:start position:0%
weighted according to its performance.
 

00:05:12.000 --> 00:05:13.870 align:start position:0%
weighted according to its performance.
And<00:05:12.120><c> a</c><00:05:12.160><c> proper</c><00:05:12.480><c> update</c><00:05:12.840><c> for</c><00:05:13.040><c> all</c><00:05:13.240><c> nine</c><00:05:13.480><c> models</c>

00:05:13.870 --> 00:05:13.880 align:start position:0%
And a proper update for all nine models
 

00:05:13.880 --> 00:05:15.710 align:start position:0%
And a proper update for all nine models
of<00:05:13.960><c> the</c><00:05:14.080><c> weighted</c><00:05:14.480><c> average</c><00:05:15.000><c> will</c><00:05:15.240><c> be</c><00:05:15.440><c> shared</c>

00:05:15.710 --> 00:05:15.720 align:start position:0%
of the weighted average will be shared
 

00:05:15.720 --> 00:05:17.630 align:start position:0%
of the weighted average will be shared
across<00:05:16.120><c> all</c><00:05:16.280><c> models.</c><00:05:16.920><c> Then,</c><00:05:17.200><c> you</c><00:05:17.280><c> basically</c>

00:05:17.630 --> 00:05:17.640 align:start position:0%
across all models. Then, you basically
 

00:05:17.640 --> 00:05:19.270 align:start position:0%
across all models. Then, you basically
repeat<00:05:18.000><c> this</c><00:05:18.160><c> process.</c><00:05:18.720><c> If</c><00:05:18.840><c> you</c><00:05:18.960><c> scale</c><00:05:19.200><c> the</c>

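NOTE
The loop described above (perturb with Gaussian noise, score each copy with a
scalar reward, weight each direction by its reward, apply one shared update,
repeat) might look roughly like the sketch below. This is an illustration under
assumptions, not the paper's implementation: es_step, the toy fitness function,
the reward standardization, and the values of sigma and lr are all hypothetical.
```python
import numpy as np
def es_step(theta, fitness_fn, rng, population=9, sigma=0.1, lr=0.02):
    """One evolution-strategies update (hypothetical helper, not from the paper)."""
    # One Gaussian direction per population member, each covering every parameter.
    eps = rng.standard_normal((population, theta.size))
    # Evaluate each perturbed copy; every copy yields a single scalar reward.
    rewards = np.array([fitness_fn(theta + sigma * e) for e in eps])
    # Assumption: standardize rewards so better-than-average directions get positive weight.
    weights = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Reward-weighted average of the directions, applied as one shared update.
    return theta + lr * (weights @ eps) / (population * sigma)
target = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
def fitness(params):
    # Toy task (assumption): higher is better, with the peak at `target`.
    return -float(np.sum((params - target) ** 2))
rng = np.random.default_rng(0)
theta = np.zeros(5)
initial_fitness = fitness(theta)
for _ in range(300):
    theta = es_step(theta, fitness, rng)
final_fitness = fitness(theta)
print(f"fitness before: {initial_fitness:.2f}, after: {final_fitness:.2f}")
```
No gradients of the fitness function are computed anywhere; the only feedback
is the scalar reward of each perturbed copy.
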
00:05:19.270 --> 00:05:19.280 align:start position:0%
repeat this process. If you scale the
 

00:05:19.280 --> 00:05:21.390 align:start position:0%
repeat this process. If you scale the
population<00:05:19.880><c> by</c><00:05:20.080><c> running</c><00:05:20.520><c> hundreds</c><00:05:20.920><c> or</c><00:05:21.120><c> even</c>

00:05:21.390 --> 00:05:21.400 align:start position:0%
population by running hundreds or even
 

00:05:21.400 --> 00:05:23.630 align:start position:0%
population by running hundreds or even
more<00:05:21.600><c> than</c><00:05:21.800><c> a</c><00:05:21.840><c> thousand</c><00:05:22.400><c> at</c><00:05:22.600><c> once,</c><00:05:23.120><c> they</c><00:05:23.320><c> could</c>

00:05:23.630 --> 00:05:23.640 align:start position:0%
more than a thousand at once, they could
 

00:05:23.640 --> 00:05:26.110 align:start position:0%
more than a thousand at once, they could
average<00:05:24.120><c> over</c><00:05:24.400><c> many</c><00:05:24.720><c> random</c><00:05:25.240><c> perturbations</c>

00:05:26.110 --> 00:05:26.120 align:start position:0%
average over many random perturbations
 

00:05:26.120 --> 00:05:28.270 align:start position:0%
average over many random perturbations
and<00:05:26.440><c> eventually</c><00:05:27.120><c> the</c><00:05:27.280><c> noise</c><00:05:27.760><c> will</c><00:05:27.880><c> start</c><00:05:28.200><c> to</c>

00:05:28.270 --> 00:05:28.280 align:start position:0%
and eventually the noise will start to
 

00:05:28.280 --> 00:05:30.070 align:start position:0%
and eventually the noise will start to
cancel<00:05:28.680><c> out</c><00:05:28.880><c> and</c><00:05:29.000><c> the</c><00:05:29.080><c> useful</c><00:05:29.320><c> direction</c>

00:05:30.070 --> 00:05:30.080 align:start position:0%
cancel out and the useful direction
 

00:05:30.080 --> 00:05:31.550 align:start position:0%
cancel out and the useful direction
would<00:05:30.240><c> naturally</c><00:05:30.720><c> emerge.</c><00:05:31.120><c> That's</c><00:05:31.360><c> why</c>

00:05:31.550 --> 00:05:31.560 align:start position:0%
would naturally emerge. That's why
 

00:05:31.560 --> 00:05:33.830 align:start position:0%
would naturally emerge. That's why
evolution<00:05:32.000><c> strategies</c><00:05:32.640><c> were</c><00:05:33.000><c> revived.</c><00:05:33.680><c> Not</c>

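NOTE
To illustrate the claim that noise cancels out as the population grows, the
sketch below compares the reward-weighted direction from 10 and from 1000
perturbations against the true gradient of a toy objective. Everything here is
an assumption made for illustration: the quadratic objective, the 100-parameter
size, the reward standardization, and the helper names.
```python
import numpy as np
def es_direction(theta, fitness_fn, rng, population, sigma=0.1):
    # Reward-weighted average of Gaussian directions (standardized rewards are an assumption).
    eps = rng.standard_normal((population, theta.size))
    rewards = np.array([fitness_fn(theta + sigma * e) for e in eps])
    weights = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    return weights @ eps / population
def cosine(a, b):
    # Cosine similarity: 1.0 means the two vectors point the same way.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
def fitness(params):
    # Toy objective (assumption): maximized at the origin.
    return -float(np.sum(params ** 2))
rng = np.random.default_rng(0)
theta = np.ones(100)
# Analytic gradient of the toy objective, used only as a reference direction.
true_gradient = -2.0 * theta
cos_small = cosine(es_direction(theta, fitness, rng, population=10), true_gradient)
cos_large = cosine(es_direction(theta, fitness, rng, population=1000), true_gradient)
print(f"10 directions: {cos_small:.2f}, 1000 directions: {cos_large:.2f}")
```
With more perturbations, the averaged direction should line up more closely
with the reference gradient, which is the sense in which the noise cancels.
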
00:05:33.830 --> 00:05:33.840 align:start position:0%
evolution strategies were revived. Not
 

00:05:33.840 --> 00:05:35.590 align:start position:0%
evolution strategies were revived. Not
because<00:05:34.120><c> the</c><00:05:34.200><c> core</c><00:05:34.440><c> idea</c><00:05:34.760><c> changed,</c><00:05:35.320><c> but</c>

00:05:35.590 --> 00:05:35.600 align:start position:0%
because the core idea changed, but
 

00:05:35.600 --> 00:05:37.470 align:start position:0%
because the core idea changed, but
because<00:05:36.040><c> the</c><00:05:36.240><c> engineering</c><00:05:36.760><c> made</c><00:05:37.000><c> the</c><00:05:37.080><c> method</c>

00:05:37.470 --> 00:05:37.480 align:start position:0%
because the engineering made the method
 

00:05:37.480 --> 00:05:38.990 align:start position:0%
because the engineering made the method
viable<00:05:37.880><c> for</c><00:05:38.040><c> deep</c><00:05:38.320><c> neural</c><00:05:38.600><c> networks,</c>

00:05:38.990 --> 00:05:39.000 align:start position:0%
viable for deep neural networks,
 

00:05:39.000 --> 00:05:40.470 align:start position:0%
viable for deep neural networks,
especially<00:05:39.480><c> in</c><00:05:39.640><c> deep</c><00:05:39.880><c> reinforcement</c>

00:05:40.470 --> 00:05:40.480 align:start position:0%
especially in deep reinforcement
 

00:05:40.480 --> 00:05:42.510 align:start position:0%
especially in deep reinforcement
learning<00:05:40.840><c> for</c><00:05:41.040><c> Atari</c><00:05:41.520><c> games.</c><00:05:41.960><c> This</c><00:05:42.160><c> OpenAI</c>

00:05:42.510 --> 00:05:42.520 align:start position:0%
learning for Atari games. This OpenAI
 

00:05:42.520 --> 00:05:44.030 align:start position:0%
learning for Atari games. This OpenAI
research<00:05:42.920><c> was</c><00:05:43.120><c> the</c><00:05:43.240><c> first</c><00:05:43.600><c> time</c><00:05:43.880><c> that</c>

00:05:44.030 --> 00:05:44.040 align:start position:0%
research was the first time that
 

00:05:44.040 --> 00:05:46.230 align:start position:0%
research was the first time that
evolution<00:05:44.600><c> strategies</c><00:05:45.320><c> worked</c><00:05:45.760><c> on</c><00:05:46.000><c> deep</c>

00:05:46.230 --> 00:05:46.240 align:start position:0%
evolution strategies worked on deep
 

00:05:46.240 --> 00:05:48.270 align:start position:0%
evolution strategies worked on deep
neural<00:05:46.520><c> networks,</c><00:05:46.960><c> which</c><00:05:47.240><c> is</c><00:05:47.600><c> a</c><00:05:47.680><c> pivotal</c>

00:05:48.270 --> 00:05:48.280 align:start position:0%
neural networks, which is a pivotal
 

00:05:48.280 --> 00:05:50.350 align:start position:0%
neural networks, which is a pivotal
paper.<00:05:48.680><c> So,</c><00:05:48.880><c> as</c><00:05:49.040><c> we</c><00:05:49.160><c> now</c><00:05:49.400><c> know</c><00:05:49.680><c> that</c><00:05:49.960><c> evolution</c>

00:05:50.350 --> 00:05:50.360 align:start position:0%
paper. So, as we now know that evolution
 

00:05:50.360 --> 00:05:52.670 align:start position:0%
paper. So, as we now know that evolution
strategies<00:05:50.880><c> could</c><00:05:51.280><c> be</c><00:05:51.440><c> a</c><00:05:51.520><c> good</c><00:05:51.760><c> optimizer,</c>

00:05:52.670 --> 00:05:52.680 align:start position:0%
strategies could be a good optimizer,
 

00:05:52.680 --> 00:05:55.110 align:start position:0%
strategies could be a good optimizer,
then<00:05:53.120><c> what</c><00:05:53.280><c> if</c><00:05:53.400><c> we</c><00:05:53.560><c> use</c><00:05:53.760><c> it</c><00:05:53.880><c> to</c><00:05:54.000><c> optimize</c><00:05:54.480><c> LLMs?</c>

00:05:55.110 --> 00:05:55.120 align:start position:0%
then what if we use it to optimize LLMs?
 

00:05:55.120 --> 00:05:57.110 align:start position:0%
then what if we use it to optimize LLMs?
Well,<00:05:55.400><c> the</c><00:05:55.520><c> practical</c><00:05:55.960><c> truth</c><00:05:56.280><c> is,</c><00:05:56.680><c> due</c><00:05:56.840><c> to</c><00:05:56.960><c> how</c>

00:05:57.110 --> 00:05:57.120 align:start position:0%
Well, the practical truth is, due to how
 

00:05:57.120 --> 00:05:58.950 align:start position:0%
Well, the practical truth is, due to how
the<00:05:57.280><c> learning</c><00:05:57.680><c> is</c><00:05:57.840><c> set</c><00:05:58.120><c> up,</c><00:05:58.520><c> next</c><00:05:58.760><c> token</c>

00:05:58.950 --> 00:05:58.960 align:start position:0%
the learning is set up, next token
 

00:05:58.960 --> 00:06:01.470 align:start position:0%
the learning is set up, next token
prediction<00:05:59.520><c> is</c><00:05:59.920><c> easy</c><00:06:00.360><c> for</c><00:06:00.560><c> gradients,</c><00:06:01.120><c> but</c>

00:06:01.470 --> 00:06:01.480 align:start position:0%
prediction is easy for gradients, but
 

00:06:01.480 --> 00:06:03.310 align:start position:0%
prediction is easy for gradients, but
hard<00:06:01.840><c> for</c><00:06:02.040><c> evolution</c><00:06:02.480><c> strategies.</c><00:06:03.080><c> Because</c>

00:06:03.310 --> 00:06:03.320 align:start position:0%
hard for evolution strategies. Because
 

00:06:03.320 --> 00:06:05.190 align:start position:0%
hard for evolution strategies. Because
with<00:06:03.480><c> next</c><00:06:03.720><c> token</c><00:06:03.960><c> prediction,</c><00:06:04.640><c> you</c><00:06:04.840><c> have</c><00:06:05.120><c> a</c>

00:06:05.190 --> 00:06:05.200 align:start position:0%
with next token prediction, you have a
 

00:06:05.200 --> 00:06:07.990 align:start position:0%
with next token prediction, you have a
clear<00:06:05.520><c> teacher</c><00:06:05.880><c> signal</c><00:06:06.400><c> at</c><00:06:06.880><c> every</c><00:06:07.400><c> token.</c><00:06:07.880><c> So,</c>

00:06:07.990 --> 00:06:08.000 align:start position:0%
clear teacher signal at every token. So,
 

00:06:08.000 --> 00:06:09.430 align:start position:0%
clear teacher signal at every token. So,
the<00:06:08.120><c> correct</c><00:06:08.400><c> next</c><00:06:08.720><c> word</c><00:06:09.000><c> will</c><00:06:09.200><c> always</c>

00:06:09.430 --> 00:06:09.440 align:start position:0%
the correct next word will always
 

00:06:09.440 --> 00:06:11.790 align:start position:0%
the correct next word will always
provide<00:06:09.840><c> good</c><00:06:10.160><c> loss</c><00:06:10.560><c> that</c><00:06:10.800><c> is</c><00:06:11.040><c> both</c><00:06:11.400><c> smooth</c>

00:06:11.790 --> 00:06:11.800 align:start position:0%
provide good loss that is both smooth
 

00:06:11.800 --> 00:06:13.670 align:start position:0%
provide good loss that is both smooth
and<00:06:12.000><c> differentiable.</c><00:06:12.840><c> But</c><00:06:13.000><c> in</c><00:06:13.160><c> evolution</c>

00:06:13.670 --> 00:06:13.680 align:start position:0%
and differentiable. But in evolution
 

00:06:13.680 --> 00:06:15.390 align:start position:0%
and differentiable. But in evolution
strategies'<00:06:14.160><c> case,</c><00:06:14.560><c> we</c><00:06:14.840><c> are</c><00:06:14.960><c> basically</c>

00:06:15.390 --> 00:06:15.400 align:start position:0%
strategies' case, we are basically
 

00:06:15.400 --> 00:06:17.190 align:start position:0%
strategies' case, we are basically
throwing<00:06:15.760><c> away</c><00:06:16.040><c> most</c><00:06:16.280><c> of</c><00:06:16.360><c> that</c><00:06:16.520><c> information</c>

00:06:17.190 --> 00:06:17.200 align:start position:0%
throwing away most of that information
 

00:06:17.200 --> 00:06:19.390 align:start position:0%
throwing away most of that information
and<00:06:17.360><c> replacing</c><00:06:18.000><c> it</c><00:06:18.200><c> with</c><00:06:18.440><c> a</c><00:06:18.520><c> single</c><00:06:19.040><c> scalar</c>

00:06:19.390 --> 00:06:19.400 align:start position:0%
and replacing it with a single scalar
 

00:06:19.400 --> 00:06:21.550 align:start position:0%
and replacing it with a single scalar
reward,<00:06:19.920><c> which</c><00:06:20.080><c> is</c><00:06:20.240><c> kind</c><00:06:20.440><c> of</c><00:06:20.520><c> like</c><00:06:20.800><c> an</c><00:06:21.080><c> average</c>

00:06:21.550 --> 00:06:21.560 align:start position:0%
reward, which is kind of like an average
 

00:06:21.560 --> 00:06:23.910 align:start position:0%
reward, which is kind of like an average
loss<00:06:21.920><c> of</c><00:06:22.280><c> everything.</c><00:06:22.920><c> And</c><00:06:23.080><c> this</c><00:06:23.400><c> is</c><00:06:23.600><c> a</c><00:06:23.640><c> lot</c>

00:06:23.910 --> 00:06:23.920 align:start position:0%
loss of everything. And this is a lot
 

00:06:23.920 --> 00:06:25.670 align:start position:0%
loss of everything. And this is a lot
less<00:06:24.120><c> meaningful</c><00:06:24.680><c> than</c><00:06:24.960><c> what</c><00:06:25.200><c> next</c><00:06:25.440><c> token</c>

00:06:25.670 --> 00:06:25.680 align:start position:0%
less meaningful than what next token
 

00:06:25.680 --> 00:06:27.950 align:start position:0%
less meaningful than what next token
prediction<00:06:26.160><c> can</c><00:06:26.400><c> provide.</c><00:06:27.040><c> On</c><00:06:27.200><c> top</c><00:06:27.440><c> of</c><00:06:27.560><c> that,</c>

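NOTE
A rough sketch of the contrast described in this passage, not taken from the video: next token prediction yields one loss term per token, while an outcome-only setup collapses the sequence to one scalar. The probabilities and variable names are made up for illustration.
```python
import math
# Hypothetical toy data: probability the model assigns to the correct next token at each position.
p_correct = [0.9, 0.2, 0.6, 0.05]
# Next token prediction: one cross-entropy term per token, so every position gets its own signal.
per_token_loss = [-math.log(p) for p in p_correct]
# Outcome-only view: the whole sequence is summarized by a single scalar (here the negated mean loss).
scalar_reward = -sum(per_token_loss) / len(per_token_loss)
# The per-token view can point at the worst position; the scalar alone cannot.
worst_position = per_token_loss.index(max(per_token_loss))
print(per_token_loss)
print(scalar_reward)
print(worst_position)
```
The per-token list keeps the information about which position went wrong, and that is the part a single averaged number discards.
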
00:06:27.950 --> 00:06:27.960 align:start position:0%
prediction can provide. On top of that,
 

00:06:27.960 --> 00:06:29.910 align:start position:0%
prediction can provide. On top of that,
evolution<00:06:28.360><c> strategies</c><00:06:28.960><c> takes</c><00:06:29.360><c> a</c><00:06:29.400><c> huge</c><00:06:29.680><c> amount</c>

00:06:29.910 --> 00:06:29.920 align:start position:0%
evolution strategies takes a huge amount
 

00:06:29.920 --> 00:06:31.910 align:start position:0%
evolution strategies takes a huge amount
of<00:06:30.000><c> compute</c><00:06:30.760><c> while</c><00:06:30.840><c> giving</c><00:06:31.200><c> very</c><00:06:31.520><c> fuzzy</c>

00:06:31.910 --> 00:06:31.920 align:start position:0%
of compute while giving very fuzzy
 

00:06:31.920 --> 00:06:33.870 align:start position:0%
of compute while giving very fuzzy
signal.<00:06:32.480><c> But</c><00:06:32.680><c> one</c><00:06:32.880><c> gradient</c><00:06:33.240><c> step</c><00:06:33.520><c> in</c><00:06:33.640><c> next</c>

00:06:33.870 --> 00:06:33.880 align:start position:0%
signal. But one gradient step in next
 

00:06:33.880 --> 00:06:35.830 align:start position:0%
signal. But one gradient step in next
token<00:06:34.080><c> prediction</c><00:06:34.560><c> tells</c><00:06:34.840><c> you</c><00:06:34.960><c> exactly</c><00:06:35.520><c> how</c>

00:06:35.830 --> 00:06:35.840 align:start position:0%
token prediction tells you exactly how
 

00:06:35.840 --> 00:06:38.070 align:start position:0%
token prediction tells you exactly how
wrong<00:06:36.160><c> you</c><00:06:36.320><c> are.</c><00:06:36.640><c> However,</c><00:06:37.160><c> all</c><00:06:37.400><c> hope</c><00:06:37.720><c> is</c><00:06:37.880><c> not</c>

00:06:38.070 --> 00:06:38.080 align:start position:0%
wrong you are. However, all hope is not
 

00:06:38.080 --> 00:06:39.990 align:start position:0%
wrong you are. However, all hope is not
lost.<00:06:38.480><c> Reinforcement</c><00:06:39.120><c> learning</c><00:06:39.440><c> in</c><00:06:39.600><c> LLM</c>

00:06:39.990 --> 00:06:40.000 align:start position:0%
lost. Reinforcement learning in LLM
 

00:06:40.000 --> 00:06:42.390 align:start position:0%
lost. Reinforcement learning in LLM
fine-tuning<00:06:40.720><c> is</c><00:06:41.000><c> the</c><00:06:41.280><c> opposite</c><00:06:41.720><c> situation</c>

00:06:42.390 --> 00:06:42.400 align:start position:0%
fine-tuning is the opposite situation
 

00:06:42.400 --> 00:06:44.390 align:start position:0%
fine-tuning is the opposite situation
compared<00:06:42.920><c> to</c><00:06:43.080><c> next</c><00:06:43.360><c> token</c><00:06:43.600><c> prediction</c><00:06:44.160><c> used</c>

00:06:44.390 --> 00:06:44.400 align:start position:0%
compared to next token prediction used
 

00:06:44.400 --> 00:06:46.350 align:start position:0%
compared to next token prediction used
during<00:06:44.680><c> pre-training.</c><00:06:45.360><c> In</c><00:06:45.560><c> LLM</c><00:06:45.960><c> RL</c>

00:06:46.350 --> 00:06:46.360 align:start position:0%
during pre-training. In LLM RL
 

00:06:46.360 --> 00:06:48.670 align:start position:0%
during pre-training. In LLM RL
fine-tuning,<00:06:47.040><c> you</c><00:06:47.240><c> often</c><00:06:47.600><c> only</c><00:06:47.840><c> get</c><00:06:48.040><c> a</c><00:06:48.120><c> single</c>

00:06:48.670 --> 00:06:48.680 align:start position:0%
fine-tuning, you often only get a single
 

00:06:48.680 --> 00:06:50.910 align:start position:0%
fine-tuning, you often only get a single
score<00:06:49.080><c> for</c><00:06:49.320><c> the</c><00:06:49.520><c> whole</c><00:06:49.880><c> generated</c><00:06:50.360><c> answer.</c>

00:06:50.910 --> 00:06:50.920 align:start position:0%
score for the whole generated answer.
 

00:06:50.920 --> 00:06:52.630 align:start position:0%
score for the whole generated answer.
You<00:06:51.040><c> do</c><00:06:51.160><c> not</c><00:06:51.360><c> get</c><00:06:51.520><c> a</c><00:06:51.600><c> clean</c><00:06:52.080><c> signal.</c><00:06:52.520><c> For</c>

00:06:52.630 --> 00:06:52.640 align:start position:0%
You do not get a clean signal. For
 

00:06:52.640 --> 00:06:54.430 align:start position:0%
You do not get a clean signal. For
example,<00:06:53.240><c> which</c><00:06:53.480><c> token</c><00:06:53.880><c> makes</c><00:06:54.080><c> the</c><00:06:54.200><c> most</c>

00:06:54.430 --> 00:06:54.440 align:start position:0%
example, which token makes the most
 

00:06:54.440 --> 00:06:56.190 align:start position:0%
example, which token makes the most
difference<00:06:54.920><c> in</c><00:06:55.080><c> this</c><00:06:55.280><c> sentence.</c><00:06:55.920><c> So,</c><00:06:56.040><c> we</c>

00:06:56.190 --> 00:06:56.200 align:start position:0%
difference in this sentence. So, we
 

00:06:56.200 --> 00:06:58.550 align:start position:0%
difference in this sentence. So, we
often<00:06:56.520><c> see</c><00:06:56.720><c> in</c><00:06:56.920><c> RLVR</c><00:06:57.600><c> research</c><00:06:58.160><c> that</c><00:06:58.440><c> the</c>

00:06:58.550 --> 00:06:58.560 align:start position:0%
often see in RLVR research that the
 

00:06:58.560 --> 00:07:00.510 align:start position:0%
often see in RLVR research that the
learning<00:06:58.920><c> signal</c><00:06:59.320><c> is</c><00:06:59.520><c> way</c><00:06:59.640><c> too</c><00:07:00.000><c> sparse</c><00:07:00.240><c> due</c><00:07:00.360><c> to</c>

00:07:00.510 --> 00:07:00.520 align:start position:0%
learning signal is way too sparse due to
 

00:07:00.520 --> 00:07:02.390 align:start position:0%
learning signal is way too sparse due to
how<00:07:00.760><c> for</c><00:07:00.960><c> a</c><00:07:01.040><c> piece</c><00:07:01.320><c> of</c><00:07:01.440><c> training</c><00:07:01.800><c> data,</c><00:07:02.240><c> you</c>

00:07:02.390 --> 00:07:02.400 align:start position:0%
how for a piece of training data, you
 

00:07:02.400 --> 00:07:04.830 align:start position:0%
how for a piece of training data, you
sometimes<00:07:02.960><c> only</c><00:07:03.280><c> get</c><00:07:03.600><c> a</c><00:07:03.680><c> binary</c><00:07:04.360><c> feedback</c>

00:07:04.830 --> 00:07:04.840 align:start position:0%
sometimes only get a binary feedback
 

00:07:04.840 --> 00:07:06.710 align:start position:0%
sometimes only get a binary feedback
that<00:07:05.040><c> is</c><00:07:05.240><c> then</c><00:07:05.520><c> used</c><00:07:05.720><c> to</c><00:07:05.800><c> update</c><00:07:06.080><c> a</c><00:07:06.120><c> billion</c><00:07:06.600><c> or</c>

00:07:06.710 --> 00:07:06.720 align:start position:0%
that is then used to update a billion or
 

00:07:06.720 --> 00:07:08.470 align:start position:0%
that is then used to update a billion or
even<00:07:06.960><c> a</c><00:07:07.040><c> trillion</c><00:07:07.480><c> parameter</c><00:07:07.880><c> model.</c><00:07:08.360><c> So,</c>

00:07:08.470 --> 00:07:08.480 align:start position:0%
even a trillion parameter model. So,
 

00:07:08.480 --> 00:07:09.670 align:start position:0%
even a trillion parameter model. So,
with<00:07:08.640><c> a</c><00:07:08.680><c> lot</c><00:07:08.880><c> of</c><00:07:08.960><c> new</c><00:07:09.120><c> research</c><00:07:09.440><c> trying</c><00:07:09.600><c> to</c>

00:07:09.670 --> 00:07:09.680 align:start position:0%
with a lot of new research trying to
 

00:07:09.680 --> 00:07:11.150 align:start position:0%
with a lot of new research trying to
figure<00:07:09.960><c> out</c><00:07:10.040><c> how</c><00:07:10.160><c> to</c><00:07:10.240><c> provide</c><00:07:10.600><c> more</c><00:07:10.800><c> learning</c>

00:07:11.150 --> 00:07:11.160 align:start position:0%
figure out how to provide more learning
 

00:07:11.160 --> 00:07:13.390 align:start position:0%
figure out how to provide more learning
signals<00:07:11.640><c> in</c><00:07:11.840><c> RLVR</c><00:07:12.520><c> processing,</c><00:07:13.200><c> for</c>

00:07:13.390 --> 00:07:13.400 align:start position:0%
signals in RLVR processing, for
 

00:07:13.400 --> 00:07:15.310 align:start position:0%
signals in RLVR processing, for
instance,<00:07:13.920><c> token</c><00:07:14.200><c> level</c><00:07:14.440><c> credit</c><00:07:14.760><c> assignment,</c>

00:07:15.310 --> 00:07:15.320 align:start position:0%
instance, token level credit assignment,
 

00:07:15.320 --> 00:07:16.870 align:start position:0%
instance, token level credit assignment,
which<00:07:15.520><c> I</c><00:07:15.600><c> talked</c><00:07:15.920><c> about</c><00:07:16.120><c> before.</c><00:07:16.680><c> This</c>

00:07:16.870 --> 00:07:16.880 align:start position:0%
which I talked about before. This
 

00:07:16.880 --> 00:07:18.910 align:start position:0%
which I talked about before. This
situation,<00:07:17.360><c> however,</c><00:07:17.720><c> is</c><00:07:17.920><c> exactly</c><00:07:18.480><c> the</c><00:07:18.600><c> kind</c>

00:07:18.910 --> 00:07:18.920 align:start position:0%
situation, however, is exactly the kind
 

00:07:18.920 --> 00:07:20.790 align:start position:0%
situation, however, is exactly the kind
of<00:07:19.040><c> setting</c><00:07:19.480><c> where</c><00:07:19.720><c> evolution</c><00:07:20.200><c> strategies</c>

00:07:20.790 --> 00:07:20.800 align:start position:0%
of setting where evolution strategies
 

00:07:20.800 --> 00:07:22.390 align:start position:0%
of setting where evolution strategies
would<00:07:21.120><c> actually</c><00:07:21.440><c> make</c><00:07:21.640><c> sense.</c><00:07:22.160><c> Since</c>

00:07:22.390 --> 00:07:22.400 align:start position:0%
would actually make sense. Since
 

00:07:22.400 --> 00:07:24.430 align:start position:0%
would actually make sense. Since
evolution<00:07:22.800><c> strategies</c><00:07:23.320><c> only</c><00:07:23.600><c> needs</c><00:07:23.880><c> a</c><00:07:23.960><c> reward</c>

00:07:24.430 --> 00:07:24.440 align:start position:0%
evolution strategies only needs a reward
 

00:07:24.440 --> 00:07:26.030 align:start position:0%
evolution strategies only needs a reward
for<00:07:24.520><c> the</c><00:07:24.680><c> whole</c><00:07:25.000><c> outcome,</c><00:07:25.520><c> and</c><00:07:25.680><c> it</c><00:07:25.760><c> doesn't</c>

00:07:26.030 --> 00:07:26.040 align:start position:0%
for the whole outcome, and it doesn't
 

00:07:26.040 --> 00:07:27.790 align:start position:0%
for the whole outcome, and it doesn't
need<00:07:26.200><c> to</c><00:07:26.320><c> backpropagate</c><00:07:27.200><c> through</c><00:07:27.440><c> a</c><00:07:27.520><c> long</c>

00:07:27.790 --> 00:07:27.800 align:start position:0%
need to backpropagate through a long
 

00:07:27.800 --> 00:07:30.030 align:start position:0%
need to backpropagate through a long
sequence<00:07:28.240><c> or</c><00:07:28.360><c> decide</c><00:07:28.760><c> which</c><00:07:28.960><c> token</c><00:07:29.360><c> deserves</c>

00:07:30.030 --> 00:07:30.040 align:start position:0%
sequence or decide which token deserves
 

00:07:30.040 --> 00:07:31.390 align:start position:0%
sequence or decide which token deserves
the<00:07:30.120><c> credit.</c><00:07:30.520><c> The</c><00:07:30.640><c> way</c><00:07:30.800><c> that</c><00:07:30.960><c> it</c><00:07:31.080><c> treats</c><00:07:31.320><c> the</c>

00:07:31.390 --> 00:07:31.400 align:start position:0%
the credit. The way that it treats the
 

00:07:31.400 --> 00:07:33.270 align:start position:0%
the credit. The way that it treats the
model<00:07:31.680><c> as</c><00:07:31.760><c> a</c><00:07:31.800><c> black</c><00:07:32.080><c> box</c><00:07:32.440><c> consequently</c>

00:07:33.270 --> 00:07:33.280 align:start position:0%
model as a black box consequently
 

00:07:33.280 --> 00:07:35.470 align:start position:0%
model as a black box consequently
provides<00:07:33.760><c> larger</c><00:07:34.120><c> parameter</c><00:07:34.640><c> updates,</c><00:07:35.200><c> which</c>

00:07:35.470 --> 00:07:35.480 align:start position:0%
provides larger parameter updates, which
 

00:07:35.480 --> 00:07:37.310 align:start position:0%
provides larger parameter updates, which
theoretically<00:07:36.280><c> should</c><00:07:36.520><c> be</c><00:07:36.760><c> able</c><00:07:37.000><c> to</c><00:07:37.080><c> give</c>

00:07:37.310 --> 00:07:37.320 align:start position:0%
theoretically should be able to give
 

00:07:37.320 --> 00:07:40.230 align:start position:0%
theoretically should be able to give
stronger<00:07:37.720><c> feedback</c><00:07:38.200><c> than</c><00:07:38.400><c> RLVR</c><00:07:39.240><c> like</c><00:07:39.440><c> GRPO.</c>

00:07:40.230 --> 00:07:40.240 align:start position:0%
stronger feedback than RLVR like GRPO.
 

00:07:40.240 --> 00:07:42.470 align:start position:0%
stronger feedback than RLVR like GRPO.
And<00:07:40.440><c> this</c><00:07:40.880><c> is</c><00:07:41.160><c> exactly</c><00:07:41.800><c> what</c><00:07:41.960><c> the</c><00:07:42.040><c> paper</c>

00:07:42.470 --> 00:07:42.480 align:start position:0%
And this is exactly what the paper
 

00:07:42.480 --> 00:07:44.510 align:start position:0%
And this is exactly what the paper
Evolution<00:07:43.000><c> Strategies</c><00:07:43.520><c> at</c><00:07:43.680><c> Scale</c><00:07:44.200><c> published</c>

00:07:44.510 --> 00:07:44.520 align:start position:0%
Evolution Strategies at Scale published
 

00:07:44.520 --> 00:07:47.150 align:start position:0%
Evolution Strategies at Scale published
back<00:07:44.720><c> in</c><00:07:44.840><c> September</c><00:07:45.320><c> 2025</c><00:07:46.160><c> has</c><00:07:46.440><c> found</c><00:07:46.760><c> out.</c><00:07:47.040><c> In</c>

00:07:47.150 --> 00:07:47.160 align:start position:0%
back in September 2025 has found out. In
 

00:07:47.160 --> 00:07:48.910 align:start position:0%
back in September 2025 has found out. In
their<00:07:47.320><c> setup,</c><00:07:47.800><c> evolution</c><00:07:48.200><c> strategies</c><00:07:48.680><c> does</c>

00:07:48.910 --> 00:07:48.920 align:start position:0%
their setup, evolution strategies does
 

00:07:48.920 --> 00:07:50.950 align:start position:0%
their setup, evolution strategies does
not<00:07:49.160><c> need</c><00:07:49.360><c> token</c><00:07:49.680><c> level</c><00:07:49.960><c> rewards</c><00:07:50.440><c> and</c><00:07:50.640><c> only</c>

00:07:50.950 --> 00:07:50.960 align:start position:0%
not need token level rewards and only
 

00:07:50.960 --> 00:07:52.870 align:start position:0%
not need token level rewards and only
needs<00:07:51.200><c> a</c><00:07:51.280><c> response</c><00:07:51.760><c> level</c><00:07:52.040><c> reward</c><00:07:52.360><c> for</c><00:07:52.640><c> each</c>

00:07:52.870 --> 00:07:52.880 align:start position:0%
needs a response level reward for each
 

00:07:52.880 --> 00:07:54.630 align:start position:0%
needs a response level reward for each
batch<00:07:53.160><c> of</c><00:07:53.320><c> perturbations,</c><00:07:54.200><c> which</c><00:07:54.400><c> kind</c><00:07:54.560><c> of</c>

00:07:54.630 --> 00:07:54.640 align:start position:0%
batch of perturbations, which kind of
 

00:07:54.640 --> 00:07:56.110 align:start position:0%
batch of perturbations, which kind of
makes<00:07:54.800><c> it</c><00:07:54.880><c> a</c><00:07:54.960><c> perfect</c><00:07:55.360><c> match</c><00:07:55.680><c> for</c><00:07:55.880><c> long</c>

00:07:56.110 --> 00:07:56.120 align:start position:0%
makes it a perfect match for long
 

00:07:56.120 --> 00:07:58.190 align:start position:0%
makes it a perfect match for long
horizon<00:07:56.520><c> outcome</c><00:07:56.880><c> only</c><00:07:57.160><c> tasks</c><00:07:57.720><c> where</c><00:07:57.880><c> credit</c>

00:07:58.190 --> 00:07:58.200 align:start position:0%
horizon outcome only tasks where credit
 

00:07:58.200 --> 00:08:00.350 align:start position:0%
horizon outcome only tasks where credit
assignment<00:07:58.640><c> is</c><00:07:58.800><c> a</c><00:07:58.840><c> lot</c><00:07:59.120><c> harder</c><00:07:59.520><c> to</c><00:07:59.680><c> attribute.</c>

00:08:00.350 --> 00:08:00.360 align:start position:0%
assignment is a lot harder to attribute.
 

00:08:00.360 --> 00:08:02.270 align:start position:0%
assignment is a lot harder to attribute.
On<00:08:00.400><c> top</c><00:08:00.680><c> of</c><00:08:00.800><c> that,</c><00:08:01.200><c> this</c><00:08:01.400><c> paper</c><00:08:01.680><c> is</c><00:08:01.800><c> the</c><00:08:01.920><c> first</c>

00:08:02.270 --> 00:08:02.280 align:start position:0%
On top of that, this paper is the first
 

00:08:02.280 --> 00:08:04.150 align:start position:0%
On top of that, this paper is the first
paper<00:08:02.560><c> that</c><00:08:02.800><c> tested</c><00:08:03.120><c> evolution</c><00:08:03.600><c> strategies</c>

00:08:04.150 --> 00:08:04.160 align:start position:0%
paper that tested evolution strategies
 

00:08:04.160 --> 00:08:06.390 align:start position:0%
paper that tested evolution strategies
on<00:08:04.360><c> a</c><00:08:04.400><c> model</c><00:08:04.840><c> with</c><00:08:05.160><c> billions</c><00:08:05.640><c> of</c><00:08:05.760><c> parameters.</c>

00:08:06.390 --> 00:08:06.400 align:start position:0%
on a model with billions of parameters.
 

00:08:06.400 --> 00:08:08.150 align:start position:0%
on a model with billions of parameters.
It<00:08:06.520><c> replaced</c><00:08:07.000><c> the</c><00:08:07.120><c> idea</c><00:08:07.400><c> of</c><00:08:07.600><c> action</c><00:08:07.920><c> space</c>

00:08:08.150 --> 00:08:08.160 align:start position:0%
It replaced the idea of action space
 

00:08:08.160 --> 00:08:10.030 align:start position:0%
It replaced the idea of action space
exploration<00:08:08.880><c> with</c><00:08:09.040><c> parameter</c><00:08:09.640><c> space</c>

00:08:10.030 --> 00:08:10.040 align:start position:0%
exploration with parameter space
 

00:08:10.040 --> 00:08:11.950 align:start position:0%
exploration with parameter space
exploration.<00:08:10.840><c> Because</c><00:08:11.120><c> in</c><00:08:11.360><c> action</c><00:08:11.680><c> space</c>

00:08:11.950 --> 00:08:11.960 align:start position:0%
exploration. Because in action space
 

00:08:11.960 --> 00:08:14.310 align:start position:0%
exploration. Because in action space
exploration,<00:08:12.920><c> each</c><00:08:13.120><c> sampled</c><00:08:13.560><c> sequence</c><00:08:14.040><c> is</c><00:08:14.240><c> a</c>

00:08:14.310 --> 00:08:14.320 align:start position:0%
exploration, each sampled sequence is a
 

00:08:14.320 --> 00:08:16.470 align:start position:0%
exploration, each sampled sequence is a
small<00:08:14.600><c> variation</c><00:08:15.280><c> of</c><00:08:15.520><c> what</c><00:08:15.720><c> the</c><00:08:15.800><c> same</c><00:08:16.120><c> model</c>

00:08:16.470 --> 00:08:16.480 align:start position:0%
small variation of what the same model
 

00:08:16.480 --> 00:08:18.430 align:start position:0%
small variation of what the same model
would<00:08:16.640><c> normally</c><00:08:17.120><c> say.</c><00:08:17.560><c> The</c><00:08:17.640><c> model's</c><00:08:17.960><c> internal</c>

00:08:18.430 --> 00:08:18.440 align:start position:0%
would normally say. The model's internal
 

00:08:18.440 --> 00:08:20.430 align:start position:0%
would normally say. The model's internal
reasoning<00:08:18.800><c> structure</c><00:08:19.320><c> is</c><00:08:19.640><c> unchanged.</c><00:08:20.320><c> You're</c>

00:08:20.430 --> 00:08:20.440 align:start position:0%
reasoning structure is unchanged. You're
 

00:08:20.440 --> 00:08:21.990 align:start position:0%
reasoning structure is unchanged. You're
just<00:08:20.680><c> basically</c><00:08:21.040><c> sampling</c><00:08:21.480><c> from</c><00:08:21.720><c> what</c><00:08:21.920><c> the</c>

00:08:21.990 --> 00:08:22.000 align:start position:0%
just basically sampling from what the
 

00:08:22.000 --> 00:08:24.110 align:start position:0%
just basically sampling from what the
model<00:08:22.440><c> already</c><00:08:22.840><c> knows.</c><00:08:23.280><c> But</c><00:08:23.520><c> in</c><00:08:23.680><c> parameter</c>

00:08:24.110 --> 00:08:24.120 align:start position:0%
model already knows. But in parameter
 

00:08:24.120 --> 00:08:26.310 align:start position:0%
model already knows. But in parameter
space<00:08:24.400><c> exploration,</c><00:08:25.440><c> each</c><00:08:25.680><c> perturbation</c>

00:08:26.310 --> 00:08:26.320 align:start position:0%
space exploration, each perturbation
 

00:08:26.320 --> 00:08:28.270 align:start position:0%
space exploration, each perturbation
slightly<00:08:26.840><c> changes</c><00:08:27.360><c> the</c><00:08:27.440><c> model's</c><00:08:27.880><c> reasoning</c>

00:08:28.270 --> 00:08:28.280 align:start position:0%
slightly changes the model's reasoning
 

00:08:28.280 --> 00:08:30.430 align:start position:0%
slightly changes the model's reasoning
behavior<00:08:28.760><c> itself.</c><00:08:29.400><c> One</c><00:08:29.520><c> perturbation</c><00:08:30.160><c> might</c>

00:08:30.430 --> 00:08:30.440 align:start position:0%
behavior itself. One perturbation might
 

00:08:30.440 --> 00:08:32.190 align:start position:0%
behavior itself. One perturbation might
make<00:08:30.600><c> the</c><00:08:30.720><c> model</c><00:08:31.040><c> more</c><00:08:31.200><c> concise,</c><00:08:31.840><c> another</c>

00:08:32.190 --> 00:08:32.200 align:start position:0%
make the model more concise, another
 

00:08:32.200 --> 00:08:34.469 align:start position:0%
make the model more concise, another
might<00:08:32.440><c> make</c><00:08:32.640><c> it</c><00:08:32.760><c> more</c><00:08:33.000><c> verbose,</c><00:08:33.640><c> maybe</c><00:08:34.200><c> even</c>

00:08:34.469 --> 00:08:34.479 align:start position:0%
might make it more verbose, maybe even
 

00:08:34.479 --> 00:08:36.430 align:start position:0%
might make it more verbose, maybe even
discovering<00:08:35.080><c> a</c><00:08:35.159><c> new</c><00:08:35.520><c> reasoning</c><00:08:35.919><c> approach.</c>

00:08:36.430 --> 00:08:36.440 align:start position:0%
discovering a new reasoning approach.
 

00:08:36.440 --> 00:08:38.070 align:start position:0%
discovering a new reasoning approach.
Because<00:08:36.719><c> what</c><00:08:36.880><c> evolution</c><00:08:37.320><c> strategies</c><00:08:37.800><c> does</c>

00:08:38.070 --> 00:08:38.080 align:start position:0%
Because what evolution strategies does
 

00:08:38.080 --> 00:08:40.550 align:start position:0%
Because what evolution strategies does
is<00:08:38.400><c> provide</c><00:08:38.840><c> structural</c><00:08:39.440><c> behavior</c><00:08:40.000><c> changes,</c>

00:08:40.550 --> 00:08:40.560 align:start position:0%
is provide structural behavior changes,
 

00:08:40.560 --> 00:08:42.230 align:start position:0%
is provide structural behavior changes,
not<00:08:40.800><c> just</c><00:08:41.039><c> token</c><00:08:41.360><c> level</c><00:08:41.599><c> randomness.</c>

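NOTE
A minimal sketch of parameter space exploration with a basic evolution strategies update, assuming a toy two-parameter problem. This is not the paper's implementation; the reward function, population size, sigma, and learning rate are hypothetical choices for illustration.
```python
import random
def reward(theta):
    # Hypothetical black-box score for a whole outcome; higher is better, best at theta = [3, 3].
    return -sum((t - 3.0) ** 2 for t in theta)
def es_step(theta, rng, pop=50, sigma=0.1, lr=0.05):
    # Sample one Gaussian perturbation of the parameters per population member.
    noises = [[rng.gauss(0.0, 1.0) for _ in theta] for _ in range(pop)]
    # One scalar reward per perturbed parameter vector; no backpropagation is involved.
    rewards = [reward([t + sigma * n for t, n in zip(theta, eps)]) for eps in noises]
    mean_r = sum(rewards) / pop
    # The reward-weighted average of the noise estimates an ascent direction.
    grad = [
        sum((r - mean_r) * eps[i] for r, eps in zip(rewards, noises)) / (pop * sigma)
        for i in range(len(theta))
    ]
    return [t + lr * g for t, g in zip(theta, grad)]
rng = random.Random(0)
theta = [0.0, 0.0]
start = reward(theta)
for _ in range(200):
    theta = es_step(theta, rng)
print(reward(theta) > start)
```
Each population member is a slightly different model evaluated as a black box, so the only feedback the update uses is one number per perturbation.
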
00:08:42.230 --> 00:08:42.240 align:start position:0%
not just token level randomness.
 

00:08:42.240 --> 00:08:44.110 align:start position:0%
not just token level randomness.
Especially<00:08:42.719><c> when</c><00:08:42.919><c> it</c><00:08:43.039><c> is</c><00:08:43.200><c> restricted</c><00:08:43.800><c> to</c><00:08:43.919><c> its</c>

00:08:44.110 --> 00:08:44.120 align:start position:0%
Especially when it is restricted to its
 

00:08:44.120 --> 00:08:45.550 align:start position:0%
Especially when it is restricted to its
own<00:08:44.280><c> knowledge</c><00:08:44.680><c> base</c><00:08:45.000><c> and</c><00:08:45.200><c> just</c><00:08:45.400><c> be</c>

00:08:45.550 --> 00:08:45.560 align:start position:0%
own knowledge base and just be
 

00:08:45.560 --> 00:08:47.550 align:start position:0%
own knowledge base and just be
reinforcing<00:08:46.360><c> a</c><00:08:46.440><c> pre-existing</c><00:08:47.160><c> sampling</c>

00:08:47.550 --> 00:08:47.560 align:start position:0%
reinforcing a pre-existing sampling
 

00:08:47.560 --> 00:08:49.510 align:start position:0%
reinforcing a pre-existing sampling
distribution.<00:08:48.240><c> This</c><00:08:48.520><c> blew</c><00:08:48.760><c> away</c><00:08:49.200><c> all</c>

00:08:49.510 --> 00:08:49.520 align:start position:0%
distribution. This blew away all
 

00:08:49.520 --> 00:08:51.430 align:start position:0%
distribution. This blew away all
previous<00:08:50.000><c> expectations</c><00:08:50.760><c> of</c><00:08:50.960><c> evolution</c>

00:08:51.430 --> 00:08:51.440 align:start position:0%
previous expectations of evolution
 

00:08:51.440 --> 00:08:53.550 align:start position:0%
previous expectations of evolution
strategies,<00:08:52.040><c> especially</c><00:08:52.480><c> the</c><00:08:52.640><c> assumption</c><00:08:53.320><c> that</c>

00:08:53.550 --> 00:08:53.560 align:start position:0%
strategies, especially the assumption that
 

00:08:53.560 --> 00:08:55.030 align:start position:0%
strategies, especially the assumption that
it<00:08:53.680><c> cannot</c><00:08:54.040><c> scale</c><00:08:54.400><c> beyond</c><00:08:54.760><c> a million</c>

00:08:55.030 --> 00:08:55.040 align:start position:0%
it cannot scale beyond a million
 

00:08:55.040 --> 00:08:56.870 align:start position:0%
it cannot scale beyond a million
parameters.<00:08:55.720><c> And</c><00:08:55.840><c> the</c><00:08:55.960><c> reason</c><00:08:56.440><c> why</c><00:08:56.600><c> it's</c><00:08:56.760><c> so</c>

00:08:56.870 --> 00:08:56.880 align:start position:0%
parameters. And the reason why it's so
 

00:08:56.880 --> 00:08:58.430 align:start position:0%
parameters. And the reason why it's so
surprising<00:08:57.360><c> is</c><00:08:57.480><c> that</c><00:08:57.720><c> ever</c><00:08:57.920><c> since</c><00:08:58.200><c> that</c>

00:08:58.430 --> 00:08:58.440 align:start position:0%
surprising is that ever since that
 

00:08:58.440 --> 00:09:00.590 align:start position:0%
surprising is that ever since that
OpenAI<00:08:58.840><c> paper,</c><00:08:59.280><c> it</c><00:08:59.400><c> was</c><00:08:59.600><c> widely</c><00:09:00.040><c> assumed</c>

00:09:00.590 --> 00:09:00.600 align:start position:0%
OpenAI paper, it was widely assumed
 

00:09:00.600 --> 00:09:02.310 align:start position:0%
OpenAI paper, it was widely assumed
evolution<00:09:01.080><c> strategies</c><00:09:01.600><c> would</c><00:09:01.800><c> not</c><00:09:02.000><c> be</c><00:09:02.120><c> able</c>

00:09:02.310 --> 00:09:02.320 align:start position:0%
evolution strategies would not be able
 

00:09:02.320 --> 00:09:04.710 align:start position:0%
evolution strategies would not be able
to<00:09:02.400><c> scale</c><00:09:02.640><c> up</c><00:09:02.800><c> to</c><00:09:03.000><c> LLM</c><00:09:03.440><c> sized</c><00:09:03.840><c> models.</c><00:09:04.440><c> This is</c>

00:09:04.710 --> 00:09:04.720 align:start position:0%
to scale up to LLM sized models. This is
 

00:09:04.720 --> 00:09:06.590 align:start position:0%
to scale up to LLM sized models. This is
simply<00:09:05.080><c> because</c><00:09:05.400><c> exploring</c><00:09:06.080><c> in</c><00:09:06.200><c> parameter</c>

00:09:06.590 --> 00:09:06.600 align:start position:0%
simply because exploring in parameter
 

00:09:06.600 --> 00:09:08.550 align:start position:0%
simply because exploring in parameter
space<00:09:07.000><c> gets</c><00:09:07.320><c> harder</c><00:09:07.840><c> as</c><00:09:08.040><c> the</c><00:09:08.120><c> number</c><00:09:08.440><c> of</c>

00:09:08.550 --> 00:09:08.560 align:start position:0%
space gets harder as the number of
 

00:09:08.560 --> 00:09:11.150 align:start position:0%
space gets harder as the number of
parameters<00:09:09.120><c> grows,</c><00:09:09.560><c> and</c><00:09:09.760><c> modern</c><00:09:10.120><c> LLMs</c><00:09:10.720><c> have</c>

00:09:11.150 --> 00:09:11.160 align:start position:0%
parameters grows, and modern LLMs have
 

00:09:11.160 --> 00:09:12.910 align:start position:0%
parameters grows, and modern LLMs have
billions<00:09:11.600><c> of</c><00:09:11.760><c> them.</c><00:09:12.080><c> Especially</c><00:09:12.600><c> how</c><00:09:12.800><c> the</c>

00:09:12.910 --> 00:09:12.920 align:start position:0%
billions of them. Especially how the
 

00:09:12.920 --> 00:09:15.150 align:start position:0%
billions of them. Especially how the
relationships<00:09:13.800><c> would</c><00:09:14.120><c> be</c><00:09:14.360><c> in</c><00:09:14.520><c> much</c><00:09:14.880><c> higher</c>

00:09:15.150 --> 00:09:15.160 align:start position:0%
relationships would be in much higher
 

00:09:15.160 --> 00:09:17.310 align:start position:0%
relationships would be in much higher
dimensions<00:09:15.880><c> to</c><00:09:16.040><c> map</c><00:09:16.400><c> them</c><00:09:16.640><c> all</c><00:09:16.880><c> out.</c><00:09:17.160><c> So,</c>

00:09:17.310 --> 00:09:17.320 align:start position:0%
dimensions to map them all out. So,
 

00:09:17.320 --> 00:09:19.150 align:start position:0%
dimensions to map them all out. So,
doing<00:09:17.640><c> evolution</c><00:09:18.080><c> strategy</c><00:09:18.440><c> optimization</c>

00:09:19.150 --> 00:09:19.160 align:start position:0%
doing evolution strategy optimization
 

00:09:19.160 --> 00:09:21.030 align:start position:0%
doing evolution strategy optimization
directly<00:09:19.800><c> looked</c><00:09:20.240><c> infeasible</c>

00:09:21.030 --> 00:09:21.040 align:start position:0%
directly looked infeasible
 

00:09:21.040 --> 00:09:23.190 align:start position:0%
directly looked infeasible
computationally,<00:09:22.040><c> and</c><00:09:22.240><c> most</c><00:09:22.560><c> prior</c><00:09:22.880><c> work</c>

00:09:23.190 --> 00:09:23.200 align:start position:0%
computationally, and most prior work
 

00:09:23.200 --> 00:09:25.070 align:start position:0%
computationally, and most prior work
tried<00:09:23.560><c> to</c><00:09:23.680><c> avoid</c><00:09:24.000><c> the</c><00:09:24.080><c> problem</c><00:09:24.480><c> by</c><00:09:24.600><c> shrinking</c>

00:09:25.070 --> 00:09:25.080 align:start position:0%
tried to avoid the problem by shrinking
 

00:09:25.080 --> 00:09:26.630 align:start position:0%
tried to avoid the problem by shrinking
the<00:09:25.160><c> search</c><00:09:25.440><c> space</c><00:09:25.880><c> or</c><00:09:26.080><c> reducing</c><00:09:26.560><c> the</c>

00:09:26.630 --> 00:09:26.640 align:start position:0%
the search space or reducing the
 

00:09:26.640 --> 00:09:28.110 align:start position:0%
the search space or reducing the
dimensions.<00:09:27.480><c> What's</c><00:09:27.640><c> even</c><00:09:27.880><c> more</c>

00:09:28.110 --> 00:09:28.120 align:start position:0%
dimensions. What's even more
 

00:09:28.120 --> 00:09:30.070 align:start position:0%
dimensions. What's even more
jaw-dropping<00:09:28.760><c> is</c><00:09:28.880><c> that</c><00:09:29.160><c> all</c><00:09:29.320><c> prior</c><00:09:29.600><c> works</c><00:09:29.920><c> are</c>

00:09:30.070 --> 00:09:30.080 align:start position:0%
jaw-dropping is that all prior works are
 

00:09:30.080 --> 00:09:32.390 align:start position:0%
jaw-dropping is that all prior works are
perturbing<00:09:30.640><c> from</c><00:09:30.800><c> a</c><00:09:30.880><c> population</c><00:09:31.680><c> with</c><00:09:32.000><c> tens</c>

00:09:32.390 --> 00:09:32.400 align:start position:0%
perturbing from a population with tens
 

00:09:32.400 --> 00:09:34.310 align:start position:0%
perturbing from a population with tens
of<00:09:32.560><c> thousands</c><00:09:33.000><c> of</c><00:09:33.120><c> models.</c><00:09:33.680><c> But</c><00:09:33.840><c> this</c><00:09:34.000><c> paper</c>

00:09:34.310 --> 00:09:34.320 align:start position:0%
of thousands of models. But this paper
 

00:09:34.320 --> 00:09:37.190 align:start position:0%
of thousands of models. But this paper
only<00:09:34.640><c> used</c><00:09:34.920><c> a</c><00:09:35.000><c> population</c><00:09:35.680><c> size</c><00:09:36.120><c> of</c><00:09:36.360><c> just</c><00:09:36.800><c> 30</c>

00:09:37.190 --> 00:09:37.200 align:start position:0%
only used a population size of just 30
 

00:09:37.200 --> 00:09:39.510 align:start position:0%
only used a population size of just 30
models<00:09:37.720><c> to achieve</c><00:09:38.080><c> competitive</c><00:09:38.800><c> performance.</c>

00:09:39.510 --> 00:09:39.520 align:start position:0%
models to achieve competitive performance.
 

00:09:39.520 --> 00:09:41.270 align:start position:0%
models to achieve competitive performance.
This<00:09:39.640><c> is</c><00:09:39.720><c> like</c><00:09:39.880><c> a</c><00:09:39.960><c> 300</c><00:09:40.560><c> times</c><00:09:40.840><c> compute</c>

00:09:41.270 --> 00:09:41.280 align:start position:0%
This is like a 300 times compute
 

00:09:41.280 --> 00:09:43.070 align:start position:0%
This is like a 300 times compute
reduction.<00:09:41.880><c> But</c><00:09:41.960><c> the</c><00:09:42.040><c> reason</c><00:09:42.400><c> why</c><00:09:42.560><c> this</c><00:09:42.800><c> works</c>

00:09:43.070 --> 00:09:43.080 align:start position:0%
reduction. But the reason why this works
 

00:09:43.080 --> 00:09:45.190 align:start position:0%
reduction. But the reason why this works
with<00:09:43.280><c> only</c><00:09:43.560><c> a</c><00:09:43.600><c> population</c><00:09:44.200><c> of</c><00:09:44.360><c> 30</c><00:09:44.760><c> is</c><00:09:44.960><c> that</c>

00:09:45.190 --> 00:09:45.200 align:start position:0%
with only a population of 30 is that
 

00:09:45.200 --> 00:09:46.830 align:start position:0%
with only a population of 30 is that
even<00:09:45.440><c> though</c><00:09:45.560><c> the</c><00:09:45.680><c> model</c><00:09:46.040><c> has</c><00:09:46.320><c> billions</c><00:09:46.720><c> of</c>

00:09:46.830 --> 00:09:46.840 align:start position:0%
even though the model has billions of
 

00:09:46.840 --> 00:09:48.670 align:start position:0%
even though the model has billions of
parameters,<00:09:47.560><c> the</c><00:09:47.720><c> useful</c><00:09:48.080><c> directions</c><00:09:48.520><c> for</c>

00:09:48.670 --> 00:09:48.680 align:start position:0%
parameters, the useful directions for
 

00:09:48.680 --> 00:09:50.350 align:start position:0%
parameters, the useful directions for
improvement<00:09:49.400><c> are</c><00:09:49.600><c> in</c><00:09:49.800><c> much</c><00:09:50.080><c> lower</c>

00:09:50.350 --> 00:09:50.360 align:start position:0%
improvement are in much lower
 

00:09:50.360 --> 00:09:52.070 align:start position:0%
improvement are in much lower
dimensions.<00:09:51.080><c> Think</c><00:09:51.280><c> of</c><00:09:51.400><c> it</c><00:09:51.520><c> like</c><00:09:51.720><c> this.</c>

00:09:52.070 --> 00:09:52.080 align:start position:0%
dimensions. Think of it like this.
 

00:09:52.080 --> 00:09:53.390 align:start position:0%
dimensions. Think of it like this.
Imagine<00:09:52.440><c> you</c><00:09:52.560><c> are</c><00:09:52.640><c> standing</c><00:09:52.960><c> on</c><00:09:53.040><c> a</c><00:09:53.120><c> huge</c>

00:09:53.390 --> 00:09:53.400 align:start position:0%
Imagine you are standing on a huge
 

00:09:53.400 --> 00:09:54.910 align:start position:0%
Imagine you are standing on a huge
mountain<00:09:53.840><c> with</c><00:09:54.000><c> billions</c><00:09:54.360><c> of</c><00:09:54.480><c> possible</c>

00:09:54.910 --> 00:09:54.920 align:start position:0%
mountain with billions of possible
 

00:09:54.920 --> 00:09:56.830 align:start position:0%
mountain with billions of possible
directions<00:09:55.520><c> you</c><00:09:55.640><c> could</c><00:09:55.800><c> step</c><00:09:56.160><c> in.</c><00:09:56.520><c> But</c><00:09:56.640><c> in</c>

00:09:56.830 --> 00:09:56.840 align:start position:0%
directions you could step in. But in
 

00:09:56.840 --> 00:09:58.630 align:start position:0%
directions you could step in. But in
reality,<00:09:57.680><c> only</c><00:09:57.920><c> a</c><00:09:57.960><c> small</c><00:09:58.240><c> number</c><00:09:58.520><c> of</c>

00:09:58.630 --> 00:09:58.640 align:start position:0%
reality, only a small number of
 

00:09:58.640 --> 00:10:01.190 align:start position:0%
reality, only a small number of
directions<00:09:59.280><c> actually</c><00:09:59.720><c> lead</c><00:10:00.200><c> uphill</c><00:10:00.720><c> as</c><00:10:00.920><c> most</c>

00:10:01.190 --> 00:10:01.200 align:start position:0%
directions actually lead uphill as most
 

00:10:01.200 --> 00:10:03.190 align:start position:0%
directions actually lead uphill as most
directions<00:10:01.760><c> are</c><00:10:01.880><c> either</c><00:10:02.120><c> flat</c><00:10:02.480><c> or</c><00:10:02.680><c> clearly</c>

00:10:03.190 --> 00:10:03.200 align:start position:0%
directions are either flat or clearly
 

00:10:03.200 --> 00:10:05.270 align:start position:0%
directions are either flat or clearly
downhill.<00:10:03.800><c> So,</c><00:10:03.920><c> if</c><00:10:04.080><c> you</c><00:10:04.200><c> randomly</c><00:10:04.600><c> try</c><00:10:04.840><c> 30</c>

00:10:05.270 --> 00:10:05.280 align:start position:0%
downhill. So, if you randomly try 30
 

00:10:05.280 --> 00:10:07.230 align:start position:0%
downhill. So, if you randomly try 30
small<00:10:05.600><c> steps</c><00:10:06.000><c> in</c><00:10:06.160><c> different</c><00:10:06.440><c> directions,</c><00:10:07.160><c> a</c>

00:10:07.230 --> 00:10:07.240 align:start position:0%
small steps in different directions, a
 

00:10:07.240 --> 00:10:09.390 align:start position:0%
small steps in different directions, a
few<00:10:07.440><c> of</c><00:10:07.560><c> them</c><00:10:07.920><c> will</c><00:10:08.080><c> likely</c><00:10:08.520><c> tilt</c><00:10:08.960><c> slightly</c>

00:10:09.390 --> 00:10:09.400 align:start position:0%
few of them will likely tilt slightly
 

00:10:09.400 --> 00:10:11.430 align:start position:0%
few of them will likely tilt slightly
uphill.<00:10:10.000><c> And</c><00:10:10.120><c> when</c><00:10:10.280><c> you</c><00:10:10.440><c> average</c><00:10:10.840><c> those,</c><00:10:11.320><c> the</c>

00:10:11.430 --> 00:10:11.440 align:start position:0%
uphill. And when you average those, the
 

00:10:11.440 --> 00:10:13.470 align:start position:0%
uphill. And when you average those, the
downhill<00:10:11.880><c> noise</c><00:10:12.200><c> cancels</c><00:10:12.840><c> out</c><00:10:13.080><c> and</c><00:10:13.280><c> the</c>

00:10:13.470 --> 00:10:13.480 align:start position:0%
downhill noise cancels out and the
 

00:10:13.480 --> 00:10:15.790 align:start position:0%
downhill noise cancels out and the
uphill<00:10:13.840><c> signals</c><00:10:14.480><c> reinforce.</c><00:10:15.440><c> And</c><00:10:15.560><c> this</c><00:10:15.720><c> is</c>

00:10:15.790 --> 00:10:15.800 align:start position:0%
uphill signals reinforce. And this is
 

00:10:15.800 --> 00:10:17.270 align:start position:0%
uphill signals reinforce. And this is
thanks<00:10:16.080><c> to</c><00:10:16.200><c> the</c><00:10:16.280><c> special</c><00:10:16.680><c> attributes</c><00:10:17.120><c> of</c>

00:10:17.270 --> 00:10:17.280 align:start position:0%
thanks to the special attributes of
 

00:10:17.280 --> 00:10:19.230 align:start position:0%
thanks to the special attributes of
extremely<00:10:17.840><c> large</c><00:10:18.160><c> neural</c><00:10:18.440><c> networks</c><00:10:18.960><c> because</c>

00:10:19.230 --> 00:10:19.240 align:start position:0%
extremely large neural networks because
 

00:10:19.240 --> 00:10:21.230 align:start position:0%
extremely large neural networks because
one,<00:10:19.680><c> they</c><00:10:19.840><c> behave</c><00:10:20.200><c> more</c><00:10:20.400><c> smoothly</c><00:10:20.960><c> than</c>

00:10:21.230 --> 00:10:21.240 align:start position:0%
one, they behave more smoothly than
 

00:10:21.240 --> 00:10:22.910 align:start position:0%
one, they behave more smoothly than
people<00:10:21.520><c> expect.</c><00:10:21.960><c> So,</c><00:10:22.080><c> when</c><00:10:22.240><c> you</c><00:10:22.360><c> are</c><00:10:22.600><c> only</c>

00:10:22.910 --> 00:10:22.920 align:start position:0%
people expect. So, when you are only
 

00:10:22.920 --> 00:10:24.790 align:start position:0%
people expect. So, when you are only
adding<00:10:23.280><c> a</c><00:10:23.320><c> very</c><00:10:23.600><c> small</c><00:10:23.880><c> Gaussian</c><00:10:24.360><c> noise,</c>

00:10:24.790 --> 00:10:24.800 align:start position:0%
adding a very small Gaussian noise,
 

00:10:24.800 --> 00:10:26.630 align:start position:0%
adding a very small Gaussian noise,
you're<00:10:25.040><c> actually</c><00:10:25.320><c> not</c><00:10:25.720><c> jumping</c><00:10:26.080><c> around,</c><00:10:26.480><c> but</c>

00:10:26.630 --> 00:10:26.640 align:start position:0%
you're actually not jumping around, but
 

00:10:26.640 --> 00:10:28.790 align:start position:0%
you're actually not jumping around, but
are<00:10:26.760><c> basically</c><00:10:27.240><c> sampling</c><00:10:27.720><c> in</c><00:10:27.840><c> a</c><00:10:27.920><c> local</c><00:10:28.400><c> region</c>

00:10:28.790 --> 00:10:28.800 align:start position:0%
are basically sampling in a local region
 

00:10:28.800 --> 00:10:30.390 align:start position:0%
are basically sampling in a local region
defined<00:10:29.160><c> by</c><00:10:29.280><c> the</c><00:10:29.400><c> Gaussian</c><00:10:29.840><c> noise,</c><00:10:30.160><c> which</c>

00:10:30.390 --> 00:10:30.400 align:start position:0%
defined by the Gaussian noise, which
 

00:10:30.400 --> 00:10:32.110 align:start position:0%
defined by the Gaussian noise, which
maps<00:10:30.760><c> out</c><00:10:30.920><c> the</c><00:10:31.000><c> surroundings.</c><00:10:31.720><c> Therefore,</c>

00:10:32.110 --> 00:10:32.120 align:start position:0%
maps out the surroundings. Therefore,
 

00:10:32.120 --> 00:10:33.750 align:start position:0%
maps out the surroundings. Therefore,
you<00:10:32.240><c> can</c><00:10:32.440><c> find</c><00:10:32.640><c> the</c><00:10:32.720><c> uphill</c><00:10:33.000><c> directions</c><00:10:33.520><c> very</c>

00:10:33.750 --> 00:10:33.760 align:start position:0%
you can find the uphill directions very
 

00:10:33.760 --> 00:10:35.870 align:start position:0%
you can find the uphill directions very
easily.<00:10:34.160><c> And</c><00:10:34.320><c> second,</c><00:10:34.880><c> the</c><00:10:35.000><c> reward</c><00:10:35.360><c> signal</c><00:10:35.720><c> in</c>

00:10:35.870 --> 00:10:35.880 align:start position:0%
easily. And second, the reward signal in
 

00:10:35.880 --> 00:10:38.190 align:start position:0%
easily. And second, the reward signal in
RL-style<00:10:36.480><c> fine-tuning</c><00:10:37.080><c> is</c><00:10:37.240><c> very</c><00:10:37.520><c> coarse.</c><00:10:38.040><c> You</c>

00:10:38.190 --> 00:10:38.200 align:start position:0%
RL-style fine-tuning is very coarse. You
 

00:10:38.200 --> 00:10:40.190 align:start position:0%
RL-style fine-tuning is very coarse. You
are<00:10:38.360><c> not</c><00:10:38.640><c> trying</c><00:10:39.000><c> to</c><00:10:39.120><c> fine-tune</c><00:10:39.560><c> every</c><00:10:39.840><c> token</c>

00:10:40.190 --> 00:10:40.200 align:start position:0%
are not trying to fine-tune every token
 

00:10:40.200 --> 00:10:42.430 align:start position:0%
are not trying to fine-tune every token
perfectly.<00:10:41.120><c> What</c><00:10:41.320><c> you</c><00:10:41.440><c> are</c><00:10:41.560><c> doing</c><00:10:41.840><c> instead</c><00:10:42.240><c> is</c>

00:10:42.430 --> 00:10:42.440 align:start position:0%
perfectly. What you are doing instead is
 

00:10:42.440 --> 00:10:44.310 align:start position:0%
perfectly. What you are doing instead is
trying<00:10:42.760><c> to</c><00:10:42.880><c> move</c><00:10:43.120><c> the</c><00:10:43.240><c> model</c><00:10:43.560><c> in</c><00:10:43.720><c> a</c><00:10:43.760><c> direction</c>

00:10:44.310 --> 00:10:44.320 align:start position:0%
trying to move the model in a direction
 

00:10:44.320 --> 00:10:46.830 align:start position:0%
trying to move the model in a direction
that<00:10:44.520><c> increases</c><00:10:45.280><c> overall</c><00:10:45.800><c> outcome</c><00:10:46.200><c> quality.</c>

00:10:46.830 --> 00:10:46.840 align:start position:0%
that increases overall outcome quality.
 

00:10:46.840 --> 00:10:48.990 align:start position:0%
that increases overall outcome quality.
So,<00:10:47.000><c> that</c><00:10:47.200><c> global</c><00:10:47.520><c> signal</c><00:10:47.960><c> is</c><00:10:48.200><c> often</c><00:10:48.600><c> aligned</c>

00:10:48.990 --> 00:10:49.000 align:start position:0%
So, that global signal is often aligned
 

00:10:49.000 --> 00:10:50.990 align:start position:0%
So, that global signal is often aligned
across<00:10:49.400><c> many</c><00:10:49.640><c> parameters,</c><00:10:50.360><c> which</c><00:10:50.600><c> means</c><00:10:50.880><c> when</c>

00:10:50.990 --> 00:10:51.000 align:start position:0%
across many parameters, which means when
 

00:10:51.000 --> 00:10:53.110 align:start position:0%
across many parameters, which means when
a<00:10:51.040><c> perturbation</c><00:10:51.640><c> improves</c><00:10:52.160><c> performance,</c><00:10:52.920><c> it</c>

00:10:53.110 --> 00:10:53.120 align:start position:0%
a perturbation improves performance, it
 

00:10:53.120 --> 00:10:55.350 align:start position:0%
a perturbation improves performance, it
tends<00:10:53.400><c> to</c><00:10:53.520><c> do</c><00:10:53.720><c> so</c><00:10:53.960><c> in</c><00:10:54.080><c> a</c><00:10:54.160><c> coordinated</c><00:10:55.080><c> way.</c><00:10:55.160><c> So,</c>

00:10:55.350 --> 00:10:55.360 align:start position:0%
tends to do so in a coordinated way. So,
 

00:10:55.360 --> 00:10:57.430 align:start position:0%
tends to do so in a coordinated way. So,
the<00:10:55.480><c> signal</c><00:10:55.840><c> shows</c><00:10:56.160><c> up</c><00:10:56.360><c> clearly</c><00:10:56.840><c> even</c><00:10:57.200><c> with</c><00:10:57.400><c> a</c>

00:10:57.430 --> 00:10:57.440 align:start position:0%
the signal shows up clearly even with a
 

00:10:57.440 --> 00:10:59.550 align:start position:0%
the signal shows up clearly even with a
small<00:10:57.720><c> population.</c><00:10:58.400><c> To</c><00:10:58.520><c> sum</c><00:10:58.800><c> that</c><00:10:59.080><c> up,</c><00:10:59.400><c> the</c>

00:10:59.550 --> 00:10:59.560 align:start position:0%
small population. To sum that up, the
 

00:10:59.560 --> 00:11:02.110 align:start position:0%
small population. To sum that up, the
key<00:10:59.880><c> idea</c><00:11:00.440><c> is</c><00:11:00.960><c> you</c><00:11:01.120><c> don't</c><00:11:01.400><c> need</c><00:11:01.640><c> to</c><00:11:01.760><c> explore</c>

00:11:02.110 --> 00:11:02.120 align:start position:0%
key idea is you don't need to explore
 

00:11:02.120 --> 00:11:04.270 align:start position:0%
key idea is you don't need to explore
the<00:11:02.320><c> entire</c><00:11:02.920><c> billion-dimensional</c><00:11:03.839><c> space.</c>

00:11:04.270 --> 00:11:04.280 align:start position:0%
the entire billion-dimensional space.
 

00:11:04.280 --> 00:11:06.070 align:start position:0%
the entire billion-dimensional space.
You<00:11:04.440><c> only</c><00:11:04.680><c> need</c><00:11:04.960><c> enough</c><00:11:05.200><c> random</c><00:11:05.480><c> directions</c>

00:11:06.070 --> 00:11:06.080 align:start position:0%
You only need enough random directions
 

00:11:06.080 --> 00:11:08.230 align:start position:0%
You only need enough random directions
to<00:11:06.240><c> estimate</c><00:11:06.680><c> the</c><00:11:06.800><c> local</c><00:11:07.320><c> uphill</c><00:11:07.680><c> direction,</c>

00:11:08.230 --> 00:11:08.240 align:start position:0%
to estimate the local uphill direction,
 

00:11:08.240 --> 00:11:10.070 align:start position:0%
to estimate the local uphill direction,
which<00:11:08.440><c> makes</c><00:11:08.720><c> evolution</c><00:11:09.120><c> strategies</c><00:11:09.680><c> a</c><00:11:09.760><c> lot</c>

00:11:10.070 --> 00:11:10.080 align:start position:0%
which makes evolution strategies a lot
 

00:11:10.080 --> 00:11:11.470 align:start position:0%
which makes evolution strategies a lot
more<00:11:10.240><c> feasible</c><00:11:10.800><c> as</c><00:11:11.000><c> it</c><00:11:11.120><c> is</c><00:11:11.280><c> now</c>

00:11:11.470 --> 00:11:11.480 align:start position:0%
more feasible as it is now
 

00:11:11.480 --> 00:11:13.710 align:start position:0%
more feasible as it is now
memory-efficient<00:11:12.360><c> and</c><00:11:12.520><c> can</c><00:11:12.720><c> be</c><00:11:12.880><c> parallelized</c>

00:11:13.710 --> 00:11:13.720 align:start position:0%
memory-efficient and can be parallelized
 

00:11:13.720 --> 00:11:15.990 align:start position:0%
memory-efficient and can be parallelized
across<00:11:14.120><c> GPUs</c><00:11:14.839><c> while</c><00:11:15.080><c> still</c><00:11:15.360><c> only</c><00:11:15.600><c> requiring</c>

00:11:15.990 --> 00:11:16.000 align:start position:0%
across GPUs while still only requiring
 

00:11:16.000 --> 00:11:17.590 align:start position:0%
across GPUs while still only requiring
inference<00:11:16.600><c> as</c><00:11:16.760><c> it</c><00:11:16.880><c> does</c><00:11:17.080><c> not</c><00:11:17.280><c> require</c>

00:11:17.590 --> 00:11:17.600 align:start position:0%
inference as it does not require
 

00:11:17.600 --> 00:11:19.750 align:start position:0%
inference as it does not require
back-propagation.<00:11:18.720><c> Crazy,</c><00:11:19.200><c> right?</c><00:11:19.600><c> But,</c>

00:11:19.750 --> 00:11:19.760 align:start position:0%
back-propagation. Crazy, right? But,
 

00:11:19.760 --> 00:11:21.030 align:start position:0%
back-propagation. Crazy, right? But,
even<00:11:19.960><c> though</c><00:11:20.080><c> this</c><00:11:20.360><c> makes</c><00:11:20.600><c> evolution</c>

00:11:21.030 --> 00:11:21.040 align:start position:0%
even though this makes evolution
 

00:11:21.040 --> 00:11:22.990 align:start position:0%
even though this makes evolution
strategies<00:11:21.600><c> statistically</c><00:11:22.400><c> feasible</c><00:11:22.839><c> with</c><00:11:22.960><c> a</c>

00:11:22.990 --> 00:11:23.000 align:start position:0%
strategies statistically feasible with a
 

00:11:23.000 --> 00:11:25.390 align:start position:0%
strategies statistically feasible with a
population<00:11:23.560><c> of</c><00:11:23.680><c> 30,</c><00:11:24.200><c> there</c><00:11:24.440><c> is</c><00:11:24.600><c> still</c>

00:11:25.390 --> 00:11:25.400 align:start position:0%
population of 30, there is still
 

00:11:25.400 --> 00:11:27.110 align:start position:0%
population of 30, there is still
a practical<00:11:25.839><c> problem.</c><00:11:26.280><c> The</c><00:11:26.400><c> method</c><00:11:26.760><c> so</c><00:11:26.920><c> far</c>

00:11:27.110 --> 00:11:27.120 align:start position:0%
a practical problem. The method so far
 

00:11:27.120 --> 00:11:29.510 align:start position:0%
a practical problem. The method so far
still<00:11:27.400><c> requires</c><00:11:27.880><c> you</c><00:11:28.000><c> to</c><00:11:28.240><c> run</c><00:11:28.560><c> 30</c><00:11:29.120><c> full</c>

00:11:29.510 --> 00:11:29.520 align:start position:0%
still requires you to run 30 full
 

00:11:29.520 --> 00:11:31.430 align:start position:0%
still requires you to run 30 full
forward<00:11:29.880><c> passes</c><00:11:30.360><c> of</c><00:11:30.480><c> a</c><00:11:30.560><c> billion</c><00:11:30.960><c> parameter</c>

00:11:31.430 --> 00:11:31.440 align:start position:0%
forward passes of a billion parameter
 

00:11:31.440 --> 00:11:33.550 align:start position:0%
forward passes of a billion parameter
model<00:11:31.800><c> for</c><00:11:32.160><c> every</c><00:11:32.480><c> update.</c><00:11:32.960><c> And</c><00:11:33.080><c> not</c><00:11:33.240><c> just</c>

00:11:33.550 --> 00:11:33.560 align:start position:0%
model for every update. And not just
 

00:11:33.560 --> 00:11:35.750 align:start position:0%
model for every update. And not just
once,<00:11:34.000><c> you</c><00:11:34.200><c> need</c><00:11:34.400><c> to</c><00:11:34.480><c> do</c><00:11:34.640><c> this</c><00:11:34.880><c> over</c><00:11:35.240><c> and</c><00:11:35.440><c> over</c>

00:11:35.750 --> 00:11:35.760 align:start position:0%
once, you need to do this over and over
 

00:11:35.760 --> 00:11:37.790 align:start position:0%
once, you need to do this over and over
again<00:11:36.160><c> for</c><00:11:36.360><c> however</c><00:11:36.800><c> many</c><00:11:37.120><c> iterations</c><00:11:37.680><c> you</c>

00:11:37.790 --> 00:11:37.800 align:start position:0%
again for however many iterations you
 

00:11:37.800 --> 00:11:40.230 align:start position:0%
again for however many iterations you
set.<00:11:38.240><c> At</c><00:11:38.360><c> the</c><00:11:38.560><c> LLM</c><00:11:38.880><c> scale,</c><00:11:39.440><c> doing</c><00:11:39.760><c> this</c><00:11:39.960><c> many</c>

00:11:40.230 --> 00:11:40.240 align:start position:0%
set. At the LLM scale, doing this many
 

00:11:40.240 --> 00:11:42.590 align:start position:0%
set. At the LLM scale, doing this many
forward<00:11:40.560><c> passes</c><00:11:41.240><c> is</c><00:11:41.480><c> extremely</c><00:11:42.000><c> expensive</c>

00:11:42.590 --> 00:11:42.600 align:start position:0%
forward passes is extremely expensive
 

00:11:42.600 --> 00:11:44.190 align:start position:0%
forward passes is extremely expensive
because<00:11:42.880><c> compared</c><00:11:43.400><c> to</c><00:11:43.560><c> standard</c><00:11:43.880><c> gradient</c>

00:11:44.190 --> 00:11:44.200 align:start position:0%
because compared to standard gradient
 

00:11:44.200 --> 00:11:46.150 align:start position:0%
because compared to standard gradient
training,<00:11:44.680><c> which</c><00:11:44.880><c> does</c><00:11:45.120><c> one</c><00:11:45.440><c> forward</c><00:11:45.680><c> and</c><00:11:45.880><c> one</c>

00:11:46.150 --> 00:11:46.160 align:start position:0%
training, which does one forward and one
 

00:11:46.160 --> 00:11:48.230 align:start position:0%
training, which does one forward and one
backward<00:11:46.600><c> pass,</c><00:11:47.080><c> this</c><00:11:47.280><c> can</c><00:11:47.440><c> be</c><00:11:47.600><c> slower</c><00:11:48.040><c> or</c>

00:11:48.230 --> 00:11:48.240 align:start position:0%
backward pass, this can be slower or
 

00:11:48.240 --> 00:11:50.310 align:start position:0%
backward pass, this can be slower or
more<00:11:48.520><c> costly</c><00:11:49.040><c> depending</c><00:11:49.440><c> on</c><00:11:49.520><c> the</c><00:11:49.600><c> setup.</c><00:11:50.160><c> So,</c>

00:11:50.310 --> 00:11:50.320 align:start position:0%
more costly depending on the setup. So,
 

00:11:50.320 --> 00:11:52.150 align:start position:0%
more costly depending on the setup. So,
this<00:11:50.480><c> is</c><00:11:50.560><c> where</c><00:11:50.680><c> the</c><00:11:50.800><c> next</c><00:11:51.040><c> paper,</c><00:11:51.520><c> EGGROLL,</c>

00:11:52.150 --> 00:11:52.160 align:start position:0%
this is where the next paper, EGGROLL,
 

00:11:52.160 --> 00:11:53.870 align:start position:0%
this is where the next paper, EGGROLL,
short<00:11:52.520><c> for</c><00:11:52.760><c> evolution</c><00:11:53.120><c> strategies</c><00:11:53.720><c> at</c>

00:11:53.870 --> 00:11:53.880 align:start position:0%
short for evolution strategies at
 

00:11:53.880 --> 00:11:56.550 align:start position:0%
short for evolution strategies at
hyperscale,<00:11:54.640><c> published</c><00:11:55.120><c> in</c><00:11:55.240><c> November</c><00:11:55.760><c> 2025</c>

00:11:56.550 --> 00:11:56.560 align:start position:0%
hyperscale, published in November 2025
 

00:11:56.560 --> 00:11:58.950 align:start position:0%
hyperscale, published in November 2025
comes<00:11:56.880><c> in.</c><00:11:57.240><c> EGGROLL</c><00:11:57.720><c> addresses</c><00:11:58.280><c> the</c><00:11:58.400><c> systems</c>

00:11:58.950 --> 00:11:58.960 align:start position:0%
comes in. EGGROLL addresses the systems
 

00:11:58.960 --> 00:12:01.030 align:start position:0%
comes in. EGGROLL addresses the systems
bottleneck<00:11:59.520><c> of</c><00:11:59.839><c> evolution</c><00:12:00.320><c> strategies.</c><00:12:00.920><c> The</c>

00:12:01.030 --> 00:12:01.040 align:start position:0%
bottleneck of evolution strategies. The
 

00:12:01.040 --> 00:12:02.590 align:start position:0%
bottleneck of evolution strategies. The
core<00:12:01.320><c> idea</c><00:12:01.640><c> is</c><00:12:01.760><c> simple.</c><00:12:02.200><c> Instead</c><00:12:02.480><c> of</c>

00:12:02.590 --> 00:12:02.600 align:start position:0%
core idea is simple. Instead of
 

00:12:02.600 --> 00:12:04.550 align:start position:0%
core idea is simple. Instead of
perturbing<00:12:03.160><c> the</c><00:12:03.320><c> entire</c><00:12:03.720><c> weight</c><00:12:03.960><c> matrix</c><00:12:04.400><c> in</c><00:12:04.520><c> a</c>

00:12:04.550 --> 00:12:04.560 align:start position:0%
perturbing the entire weight matrix in a
 

00:12:04.560 --> 00:12:06.430 align:start position:0%
perturbing the entire weight matrix in a
fully<00:12:04.920><c> random</c><00:12:05.280><c> way,</c><00:12:05.640><c> they</c><00:12:05.880><c> structure</c><00:12:06.320><c> the</c>

00:12:06.430 --> 00:12:06.440 align:start position:0%
fully random way, they structure the
 

00:12:06.440 --> 00:12:08.950 align:start position:0%
fully random way, they structure the
perturbations<00:12:07.320><c> as</c><00:12:07.640><c> LoRA</c><00:12:08.040><c> updates.</c><00:12:08.640><c> So,</c><00:12:08.760><c> by</c>

00:12:08.950 --> 00:12:08.960 align:start position:0%
perturbations as LoRA updates. So, by
 

00:12:08.960 --> 00:12:11.350 align:start position:0%
perturbations as LoRA updates. So, by
making<00:12:09.400><c> perturbations</c><00:12:10.120><c> low</c><00:12:10.400><c> rank,</c><00:12:11.000><c> you</c><00:12:11.160><c> can</c>

00:12:11.350 --> 00:12:11.360 align:start position:0%
making perturbations low rank, you can
 

00:12:11.360 --> 00:12:13.510 align:start position:0%
making perturbations low rank, you can
batch<00:12:11.680><c> them</c><00:12:11.920><c> like</c><00:12:12.160><c> LoRA</c><00:12:12.560><c> adapters.</c><00:12:13.280><c> You</c>

00:12:13.510 --> 00:12:13.520 align:start position:0%
batch them like LoRA adapters. You
 

00:12:13.520 --> 00:12:15.230 align:start position:0%
batch them like LoRA adapters. You
basically<00:12:13.920><c> reuse</c><00:12:14.400><c> most</c><00:12:14.600><c> of</c><00:12:14.680><c> the</c><00:12:14.800><c> original</c>

00:12:15.230 --> 00:12:15.240 align:start position:0%
basically reuse most of the original
 

00:12:15.240 --> 00:12:17.390 align:start position:0%
basically reuse most of the original
computation<00:12:15.960><c> and</c><00:12:16.200><c> only</c><00:12:16.480><c> swap</c><00:12:16.760><c> the</c><00:12:16.880><c> LoRA</c><00:12:17.240><c> to</c>

00:12:17.390 --> 00:12:17.400 align:start position:0%
computation and only swap the LoRA to
 

00:12:17.400 --> 00:12:19.390 align:start position:0%
computation and only swap the LoRA to
evaluate<00:12:18.040><c> the</c><00:12:18.120><c> perturbations,</c><00:12:18.960><c> which</c><00:12:19.160><c> means</c>

00:12:19.390 --> 00:12:19.400 align:start position:0%
evaluate the perturbations, which means
 

00:12:19.400 --> 00:12:21.910 align:start position:0%
evaluate the perturbations, which means
instead<00:12:19.800><c> of</c><00:12:19.920><c> paying</c><00:12:20.360><c> the</c><00:12:20.480><c> full</c><00:12:20.800><c> cost</c><00:12:21.240><c> of</c><00:12:21.480><c> 30</c>

00:12:21.910 --> 00:12:21.920 align:start position:0%
instead of paying the full cost of 30
 

00:12:21.920 --> 00:12:24.070 align:start position:0%
instead of paying the full cost of 30
completely<00:12:22.480><c> separate</c><00:12:22.920><c> forward</c><00:12:23.240><c> passes,</c><00:12:23.920><c> you</c>

00:12:24.070 --> 00:12:24.080 align:start position:0%
completely separate forward passes, you
 

00:12:24.080 --> 00:12:26.230 align:start position:0%
completely separate forward passes, you
can<00:12:24.240><c> compute</c><00:12:24.600><c> just</c><00:12:25.000><c> one</c><00:12:25.200><c> forward</c><00:12:25.520><c> pass</c><00:12:26.080><c> and</c>

00:12:26.230 --> 00:12:26.240 align:start position:0%
can compute just one forward pass and
 

00:12:26.240 --> 00:12:28.750 align:start position:0%
can compute just one forward pass and
swap<00:12:26.560><c> in</c><00:12:26.680><c> different</c><00:12:26.960><c> LoRAs.</c><00:12:27.480><c> So,</c><00:12:27.880><c> EGGROLL</c><00:12:28.360><c> makes</c>

00:12:28.750 --> 00:12:28.760 align:start position:0%
swap in different LoRAs. So, EGGROLL makes
 

00:12:28.760 --> 00:12:30.710 align:start position:0%
swap in different LoRAs. So, EGGROLL makes
evolution<00:12:29.240><c> strategies</c><00:12:29.839><c> a</c><00:12:29.960><c> lot</c><00:12:30.480><c> more</c>

00:12:30.710 --> 00:12:30.720 align:start position:0%
evolution strategies a lot more
 

00:12:30.720 --> 00:12:32.390 align:start position:0%
evolution strategies a lot more
hardware-friendly.<00:12:31.680><c> Another</c><00:12:32.000><c> important</c>

00:12:32.390 --> 00:12:32.400 align:start position:0%
hardware-friendly. Another important
 

00:12:32.400 --> 00:12:33.790 align:start position:0%
hardware-friendly. Another important
thing<00:12:32.600><c> to</c><00:12:32.680><c> note</c><00:12:32.880><c> that</c><00:12:33.160><c> even</c><00:12:33.360><c> though</c><00:12:33.560><c> each</c>

00:12:33.790 --> 00:12:33.800 align:start position:0%
thing to note that even though each
 

00:12:33.800 --> 00:12:35.630 align:start position:0%
thing to note that even though each
perturbation<00:12:34.280><c> is</c><00:12:34.440><c> low</c><00:12:34.640><c> rank,</c><00:12:35.160><c> when</c><00:12:35.400><c> you</c>

00:12:35.630 --> 00:12:35.640 align:start position:0%
perturbation is low rank, when you
 

00:12:35.640 --> 00:12:37.630 align:start position:0%
perturbation is low rank, when you
average<00:12:36.080><c> many</c><00:12:36.320><c> of</c><00:12:36.440><c> them</c><00:12:36.560><c> together,</c><00:12:37.160><c> the</c><00:12:37.320><c> final</c>

00:12:37.630 --> 00:12:37.640 align:start position:0%
average many of them together, the final
 

00:12:37.640 --> 00:12:39.750 align:start position:0%
average many of them together, the final
update<00:12:38.160><c> is</c><00:12:38.320><c> not</c><00:12:38.520><c> actually</c><00:12:38.800><c> restricted</c><00:12:39.400><c> to</c><00:12:39.520><c> low</c>

00:12:39.750 --> 00:12:39.760 align:start position:0%
update is not actually restricted to low
 

00:12:39.760 --> 00:12:41.590 align:start position:0%
update is not actually restricted to low
rank.<00:12:40.160><c> So,</c><00:12:40.440><c> you</c><00:12:40.600><c> still</c><00:12:40.839><c> get</c><00:12:41.000><c> a</c><00:12:41.080><c> rich</c><00:12:41.400><c> and</c>

00:12:41.590 --> 00:12:41.600 align:start position:0%
rank. So, you still get a rich and
 

00:12:41.600 --> 00:12:43.590 align:start position:0%
rank. So, you still get a rich and
high-dimensional<00:12:42.440><c> update,</c><00:12:42.839><c> but</c><00:12:43.000><c> you</c><00:12:43.200><c> compute</c>

00:12:43.590 --> 00:12:43.600 align:start position:0%
high-dimensional update, but you compute
 

00:12:43.600 --> 00:12:45.470 align:start position:0%
high-dimensional update, but you compute
it<00:12:43.800><c> in</c><00:12:44.000><c> a</c><00:12:44.080><c> much</c><00:12:44.480><c> cheaper</c><00:12:44.880><c> way.</c><00:12:45.280><c> And</c><00:12:45.400><c> the</c>

00:12:45.470 --> 00:12:45.480 align:start position:0%
it in a much cheaper way. And the
 

00:12:45.480 --> 00:12:47.030 align:start position:0%
it in a much cheaper way. And the
performance<00:12:46.040><c> is</c><00:12:46.240><c> broadly</c><00:12:46.640><c> similar</c><00:12:46.920><c> to</c>

00:12:47.030 --> 00:12:47.040 align:start position:0%
performance is broadly similar to
 

00:12:47.040 --> 00:12:48.670 align:start position:0%
performance is broadly similar to
standard<00:12:47.480><c> evolution</c><00:12:47.839><c> strategies,</c><00:12:48.440><c> but</c><00:12:48.560><c> the</c>

00:12:48.670 --> 00:12:48.680 align:start position:0%
standard evolution strategies, but the
 

00:12:48.680 --> 00:12:51.710 align:start position:0%
standard evolution strategies, but the
compute<00:12:49.080><c> cost</c><00:12:49.520><c> is</c><00:12:49.800><c> reduced</c><00:12:50.400><c> by</c><00:12:50.600><c> so</c><00:12:51.080><c> much.</c><00:12:51.600><c> So,</c>

00:12:51.710 --> 00:12:51.720 align:start position:0%
compute cost is reduced by so much. So,
 

00:12:51.720 --> 00:12:53.710 align:start position:0%
compute cost is reduced by so much. So,
with<00:12:51.880><c> how</c><00:12:52.120><c> EGGROLL</c><00:12:52.480><c> is</c><00:12:52.640><c> making</c><00:12:52.920><c> models</c><00:12:53.280><c> only</c><00:12:53.520><c> need</c>

00:12:53.710 --> 00:12:53.720 align:start position:0%
with how EGGROLL is making models only need
 

00:12:53.720 --> 00:12:55.990 align:start position:0%
with how EGGROLL is making models only need
to<00:12:53.839><c> run</c><00:12:54.080><c> with</c><00:12:54.320><c> inference</c><00:12:54.760><c> mode</c><00:12:55.400><c> while</c><00:12:55.680><c> keeping</c>

00:12:55.990 --> 00:12:56.000 align:start position:0%
to run with inference mode while keeping
 

00:12:56.000 --> 00:12:57.790 align:start position:0%
to run with inference mode while keeping
performance<00:12:56.480><c> roughly</c><00:12:56.880><c> on</c><00:12:57.080><c> par</c><00:12:57.320><c> with</c><00:12:57.400><c> the</c><00:12:57.480><c> best</c>

00:12:57.790 --> 00:12:57.800 align:start position:0%
performance roughly on par with the best
 

00:12:57.800 --> 00:12:59.790 align:start position:0%
performance roughly on par with the best
evolution<00:12:58.280><c> strategies</c><00:12:58.839><c> baselines,</c><00:12:59.520><c> when</c><00:12:59.680><c> you</c>

00:12:59.790 --> 00:12:59.800 align:start position:0%
evolution strategies baselines, when you
 

00:12:59.800 --> 00:13:02.430 align:start position:0%
evolution strategies baselines, when you
compare<00:13:00.240><c> them</c><00:13:00.520><c> on</c><00:13:00.839><c> raw</c><00:13:01.040><c> training</c><00:13:01.400><c> speed,</c><00:13:02.000><c> EGGROLL</c>

00:13:02.430 --> 00:13:02.440 align:start position:0%
compare them on raw training speed, EGGROLL
 

00:13:02.440 --> 00:13:05.910 align:start position:0%
compare them on raw training speed, EGGROLL
is<00:13:02.760><c> at</c><00:13:02.920><c> around</c><00:13:03.280><c> 91,</c><00:13:04.000><c> PPO</c><00:13:04.520><c> is</c><00:13:04.760><c> at</c><00:13:04.920><c> 34,</c><00:13:05.680><c> and</c>

00:13:05.910 --> 00:13:05.920 align:start position:0%
is at around 91, PPO is at 34, and
 

00:13:05.920 --> 00:13:09.870 align:start position:0%
is at around 91, PPO is at 34, and
OpenES<00:13:06.600><c> is</c><00:13:06.880><c> at</c><00:13:07.080><c> 0.41.</c><00:13:08.400><c> PPO</c><00:13:08.880><c> is</c><00:13:09.120><c> not</c><00:13:09.400><c> slow</c><00:13:09.720><c> in</c>

00:13:09.870 --> 00:13:09.880 align:start position:0%
OpenES is at 0.41. PPO is not slow in
 

00:13:09.880 --> 00:13:11.310 align:start position:0%
OpenES is at 0.41. PPO is not slow in
general,<00:13:10.320><c> but</c><00:13:10.520><c> it's</c><00:13:10.680><c> that</c><00:13:10.920><c> evolution</c>

00:13:11.310 --> 00:13:11.320 align:start position:0%
general, but it's that evolution
 

00:13:11.320 --> 00:13:13.829 align:start position:0%
general, but it's that evolution
strategy<00:13:11.760><c> training</c><00:13:12.160><c> can</c><00:13:12.520><c> be</c><00:13:12.760><c> extremely</c><00:13:13.360><c> fast</c>

00:13:13.829 --> 00:13:13.839 align:start position:0%
strategy training can be extremely fast
 

00:13:13.839 --> 00:13:15.430 align:start position:0%
strategy training can be extremely fast
once<00:13:14.120><c> you</c><00:13:14.280><c> structure</c><00:13:14.680><c> perturbations</c><00:13:15.320><c> to</c>

00:13:15.430 --> 00:13:15.440 align:start position:0%
once you structure perturbations to
 

00:13:15.440 --> 00:13:17.790 align:start position:0%
once you structure perturbations to
match<00:13:15.680><c> GPU</c><00:13:16.080><c> MatMul</c><00:13:16.480><c> hardware.</c><00:13:17.040><c> In</c><00:13:17.200><c> some</c><00:13:17.440><c> LLM</c>

00:13:17.790 --> 00:13:17.800 align:start position:0%
match GPU MatMul hardware. In some LLM
 

00:13:17.800 --> 00:13:19.750 align:start position:0%
match GPU MatMul hardware. In some LLM
settings,<00:13:18.320><c> it</c><00:13:18.520><c> also</c><00:13:18.839><c> beats</c><00:13:19.240><c> popular</c>

00:13:19.750 --> 00:13:19.760 align:start position:0%
settings, it also beats popular
 

00:13:19.760 --> 00:13:21.390 align:start position:0%
settings, it also beats popular
reinforcement<00:13:20.480><c> learning</c><00:13:20.839><c> fine-tuning</c>

00:13:21.390 --> 00:13:21.400 align:start position:0%
reinforcement learning fine-tuning
 

00:13:21.400 --> 00:13:23.630 align:start position:0%
reinforcement learning fine-tuning
methods.<00:13:21.920><c> For</c><00:13:22.080><c> instance,</c><00:13:22.640><c> on</c><00:13:22.839><c> LLM</c><00:13:23.280><c> reasoning</c>

00:13:23.630 --> 00:13:23.640 align:start position:0%
methods. For instance, on LLM reasoning
 

00:13:23.640 --> 00:13:26.150 align:start position:0%
methods. For instance, on LLM reasoning
fine-tuning<00:13:24.200><c> comparisons</c><00:13:25.000><c> against</c><00:13:25.280><c> GRPO,</c>

00:13:26.150 --> 00:13:26.160 align:start position:0%
fine-tuning comparisons against GRPO,
 

00:13:26.160 --> 00:13:28.630 align:start position:0%
fine-tuning comparisons against GRPO,
they<00:13:26.360><c> fine-tuned</c><00:13:26.960><c> RWKV-7</c><00:13:28.040><c> models</c><00:13:28.440><c> on</c>

00:13:28.630 --> 00:13:28.640 align:start position:0%
they fine-tuned RWKV-7 models on
 

00:13:28.640 --> 00:13:31.110 align:start position:0%
they fine-tuned RWKV-7 models on
Countdown<00:13:29.240><c> and</c><00:13:29.400><c> GSM8K</c><00:13:30.280><c> and</c><00:13:30.400><c> report</c><00:13:30.839><c> that</c>

00:13:31.110 --> 00:13:31.120 align:start position:0%
Countdown and GSM8K and report that
 

00:13:31.120 --> 00:13:33.110 align:start position:0%
Countdown and GSM8K and report that
under<00:13:31.400><c> the</c><00:13:31.520><c> same</c><00:13:31.760><c> hardware</c><00:13:32.280><c> and</c><00:13:32.520><c> wall</c><00:13:32.760><c> clock</c>

00:13:33.110 --> 00:13:33.120 align:start position:0%
under the same hardware and wall clock
 

00:13:33.120 --> 00:13:35.790 align:start position:0%
under the same hardware and wall clock
time,<00:13:33.640><c> EGGROLL</c><00:13:34.000><c> reaches</c><00:13:34.400><c> 35%</c><00:13:35.240><c> validation</c>

00:13:35.790 --> 00:13:35.800 align:start position:0%
time, EGGROLL reaches 35% validation
 

00:13:35.800 --> 00:13:38.790 align:start position:0%
time, EGGROLL reaches 35% validation
accuracy<00:13:36.360><c> versus</c><00:13:36.800><c> 23%</c><00:13:37.680><c> for</c><00:13:37.839><c> GRPO</c><00:13:38.560><c> on</c><00:13:38.720><c> the</c>

00:13:38.790 --> 00:13:38.800 align:start position:0%
accuracy versus 23% for GRPO on the
 

00:13:38.800 --> 00:13:40.630 align:start position:0%
accuracy versus 23% for GRPO on the
Countdown<00:13:39.240><c> benchmark.</c><00:13:39.920><c> For</c><00:13:40.000><c> the</c><00:13:40.120><c> benchmark</c>

00:13:40.630 --> 00:13:40.640 align:start position:0%
Countdown benchmark. For the benchmark
 

00:13:40.640 --> 00:13:45.070 align:start position:0%
Countdown benchmark. For the benchmark
GSM8K<00:13:41.560><c> with</c><00:13:41.800><c> RWKV-7</c><00:13:42.960><c> 7B</c><00:13:43.600><c> on</c><00:13:43.880><c> eight</c><00:13:44.040><c> GPUs,</c><00:13:44.920><c> they</c>

00:13:45.070 --> 00:13:45.080 align:start position:0%
GSM8K with RWKV-7 7B on eight GPUs, they
 

00:13:45.080 --> 00:13:48.510 align:start position:0%
GSM8K with RWKV-7 7B on eight GPUs, they
show<00:13:45.320><c> that</c><00:13:45.600><c> EGGROLL</c><00:13:46.080><c> can</c><00:13:46.480><c> run</c><00:13:46.839><c> 8,192</c>

00:13:48.510 --> 00:13:48.520 align:start position:0%
show that EGGROLL can run 8,192
 

00:13:48.520 --> 00:13:51.150 align:start position:0%
show that EGGROLL can run 8,192
parallel<00:13:49.000><c> generations</c><00:13:49.880><c> while</c><00:13:50.240><c> GRPO</c><00:13:50.839><c> runs</c>

00:13:51.150 --> 00:13:51.160 align:start position:0%
parallel generations while GRPO runs
 

00:13:51.160 --> 00:13:54.310 align:start position:0%
parallel generations while GRPO runs
only<00:13:51.520><c> 256.</c><00:13:52.560><c> So,</c><00:13:52.839><c> EGGROLL</c><00:13:53.240><c> can</c><00:13:53.520><c> run</c><00:13:53.800><c> far</c><00:13:54.120><c> more</c>

00:13:54.310 --> 00:13:54.320 align:start position:0%
only 256. So, EGGROLL can run far more
 

00:13:54.320 --> 00:13:56.790 align:start position:0%
only 256. So, EGGROLL can run far more
parallel<00:13:54.760><c> generations</c><00:13:55.520><c> than</c><00:13:55.560><c> GRPO</c><00:13:56.400><c> under</c><00:13:56.680><c> the</c>

00:13:56.790 --> 00:13:56.800 align:start position:0%
parallel generations than GRPO under the
 

00:13:56.800 --> 00:13:59.070 align:start position:0%
parallel generations than GRPO under the
same<00:13:57.040><c> hardware,</c><00:13:57.560><c> which</c><00:13:57.760><c> means</c><00:13:58.120><c> EGGROLL</c><00:13:58.640><c> is</c><00:13:58.880><c> more</c>

00:13:59.070 --> 00:13:59.080 align:start position:0%
same hardware, which means EGGROLL is more
 

00:13:59.080 --> 00:14:01.310 align:start position:0%
same hardware, which means EGGROLL is more
efficient<00:13:59.720><c> in</c><00:14:00.000><c> wall</c><00:14:00.280><c> clock</c><00:14:00.640><c> throughput</c><00:14:01.160><c> and</c>

00:14:01.310 --> 00:14:01.320 align:start position:0%
efficient in wall clock throughput and
 

00:14:01.320 --> 00:14:03.190 align:start position:0%
efficient in wall clock throughput and
memory.<00:14:01.800><c> Another</c><00:14:02.080><c> example</c><00:14:02.560><c> is</c><00:14:02.720><c> like</c><00:14:02.960><c> this</c>

00:14:03.190 --> 00:14:03.200 align:start position:0%
memory. Another example is like this
 

00:14:03.200 --> 00:14:06.310 align:start position:0%
memory. Another example is like this
RWKV-7<00:14:04.320><c> 14B</c><00:14:04.920><c> trained</c><00:14:05.240><c> for</c><00:14:05.400><c> 12</c><00:14:05.760><c> hours</c><00:14:06.120><c> with</c>

00:14:06.310 --> 00:14:06.320 align:start position:0%
RWKV-7 14B trained for 12 hours with
 

00:14:06.320 --> 00:14:09.150 align:start position:0%
RWKV-7 14B trained for 12 hours with
EGGROLL<00:14:06.800><c> on</c><00:14:07.000><c> 32</c><00:14:07.440><c> GPUs.</c><00:14:08.280><c> They</c><00:14:08.480><c> were</c><00:14:08.720><c> able</c><00:14:08.920><c> to</c><00:14:09.000><c> get</c>

00:14:09.150 --> 00:14:09.160 align:start position:0%
EGGROLL on 32 GPUs. They were able to get
 

00:14:09.160 --> 00:14:12.550 align:start position:0%
EGGROLL on 32 GPUs. They were able to get
improvements<00:14:09.800><c> such</c><00:14:10.000><c> as</c><00:14:10.160><c> plus</c><00:14:10.400><c> 17%</c><00:14:11.440><c> on</c><00:14:11.680><c> AIME</c><00:14:11.920><c> 24</c>

00:14:12.550 --> 00:14:12.560 align:start position:0%
improvements such as plus 17% on AIME 24
 

00:14:12.560 --> 00:14:15.949 align:start position:0%
improvements such as plus 17% on AIME 24
and<00:14:12.800><c> plus</c><00:14:13.079><c> 26</c><00:14:13.680><c> on</c><00:14:13.959><c> AIME</c><00:14:14.280><c> 25.</c><00:14:15.000><c> They</c><00:14:15.240><c> also</c><00:14:15.520><c> reported</c>

00:14:15.949 --> 00:14:15.959 align:start position:0%
and plus 26 on AIME 25. They also reported
 

00:14:15.959 --> 00:14:19.550 align:start position:0%
and plus 26 on AIME 25. They also reported
that<00:14:16.240><c> EGGROLL</c><00:14:16.880><c> outperforms</c><00:14:17.520><c> GRPO</c><00:14:18.440><c> on</c><00:14:18.680><c> GSM8K</c>

00:14:19.550 --> 00:14:19.560 align:start position:0%
that EGGROLL outperforms GRPO on GSM8K
 

00:14:19.560 --> 00:14:20.990 align:start position:0%
that EGGROLL outperforms GRPO on GSM8K
fine-tuning.<00:14:20.280><c> While</c><00:14:20.520><c> I</c><00:14:20.560><c> know</c><00:14:20.720><c> I</c><00:14:20.800><c> just</c>

00:14:20.990 --> 00:14:21.000 align:start position:0%
fine-tuning. While I know I just
 

00:14:21.000 --> 00:14:22.550 align:start position:0%
fine-tuning. While I know I just
bombarded<00:14:21.600><c> you</c><00:14:21.720><c> with</c><00:14:21.880><c> a</c><00:14:21.920><c> lot</c><00:14:22.200><c> of</c><00:14:22.320><c> great</c>

00:14:22.550 --> 00:14:22.560 align:start position:0%
bombarded you with a lot of great
 

00:14:22.560 --> 00:14:24.190 align:start position:0%
bombarded you with a lot of great
performance<00:14:23.079><c> reports,</c><00:14:23.720><c> it</c><00:14:23.880><c> does</c><00:14:24.079><c> not</c>

00:14:24.190 --> 00:14:24.200 align:start position:0%
performance reports, it does not
 

00:14:24.200 --> 00:14:26.310 align:start position:0%
performance reports, it does not
necessarily<00:14:24.839><c> mean</c><00:14:25.240><c> EGGROLL</c><00:14:25.760><c> is</c><00:14:25.959><c> just</c><00:14:26.120><c> better</c>

00:14:26.310 --> 00:14:26.320 align:start position:0%
necessarily mean EGGROLL is just better
 

00:14:26.320 --> 00:14:28.190 align:start position:0%
necessarily mean EGGROLL is just better
than<00:14:26.440><c> GRPO.</c><00:14:27.079><c> It's</c><00:14:27.240><c> just</c><00:14:27.440><c> that</c><00:14:27.680><c> EGGROLL's</c>

00:14:28.190 --> 00:14:28.200 align:start position:0%
than GRPO. It's just that EGGROLL's
 

00:14:28.200 --> 00:14:30.670 align:start position:0%
than GRPO. It's just that EGGROLL's
advantage<00:14:28.800><c> so</c><00:14:29.000><c> far</c><00:14:29.320><c> can</c><00:14:29.760><c> be</c><00:14:29.920><c> compensated</c><00:14:30.560><c> by</c>

00:14:30.670 --> 00:14:30.680 align:start position:0%
advantage so far can be compensated by
 

00:14:30.680 --> 00:14:32.990 align:start position:0%
advantage so far can be compensated by
being<00:14:30.920><c> much</c><00:14:31.160><c> faster</c><00:14:31.640><c> per</c><00:14:32.000><c> unit</c><00:14:32.400><c> wall</c><00:14:32.600><c> clock</c>

00:14:32.990 --> 00:14:33.000 align:start position:0%
being much faster per unit wall clock
 

00:14:33.000 --> 00:14:34.829 align:start position:0%
being much faster per unit wall clock
and<00:14:33.200><c> lighter</c><00:14:33.560><c> on</c><00:14:33.720><c> memory,</c><00:14:34.200><c> so</c><00:14:34.320><c> it</c><00:14:34.400><c> can</c><00:14:34.520><c> afford</c>

00:14:34.829 --> 00:14:34.839 align:start position:0%
and lighter on memory, so it can afford
 

00:14:34.839 --> 00:14:37.470 align:start position:0%
and lighter on memory, so it can afford
more<00:14:35.000><c> exploration</c><00:14:35.680><c> than</c><00:14:35.839><c> GRPO.</c><00:14:36.720><c> I</c><00:14:36.839><c> personally</c>

00:14:37.470 --> 00:14:37.480 align:start position:0%
more exploration than GRPO. I personally
 

00:14:37.480 --> 00:14:39.190 align:start position:0%
more exploration than GRPO. I personally
think<00:14:37.760><c> more</c><00:14:37.959><c> experiments</c><00:14:38.600><c> are</c><00:14:38.760><c> needed</c><00:14:39.079><c> to</c>

00:14:39.190 --> 00:14:39.200 align:start position:0%
think more experiments are needed to
 

00:14:39.200 --> 00:14:41.630 align:start position:0%
think more experiments are needed to
draw<00:14:39.480><c> a</c><00:14:39.560><c> better</c><00:14:39.839><c> comparison</c><00:14:40.360><c> to</c><00:14:40.480><c> GRPO</c><00:14:41.120><c> or</c><00:14:41.360><c> even</c>

00:14:41.630 --> 00:14:41.640 align:start position:0%
draw a better comparison to GRPO or even
 

00:14:41.640 --> 00:14:43.710 align:start position:0%
draw a better comparison to GRPO or even
existing<00:14:42.120><c> RL</c><00:14:42.400><c> methods,</c><00:14:42.880><c> but</c><00:14:43.040><c> I</c><00:14:43.120><c> do</c><00:14:43.400><c> think</c>

00:14:43.710 --> 00:14:43.720 align:start position:0%
existing RL methods, but I do think
 

00:14:43.720 --> 00:14:45.829 align:start position:0%
existing RL methods, but I do think
evolution<00:14:44.160><c> strategies</c><00:14:44.760><c> is</c><00:14:45.079><c> really</c><00:14:45.320><c> promising</c>

00:14:45.829 --> 00:14:45.839 align:start position:0%
evolution strategies is really promising
 

00:14:45.839 --> 00:14:47.870 align:start position:0%
evolution strategies is really promising
from<00:14:46.079><c> reading</c><00:14:46.360><c> these</c><00:14:46.560><c> few</c><00:14:46.760><c> papers.</c><00:14:47.280><c> So,</c><00:14:47.600><c> I'm</c>

00:14:47.870 --> 00:14:47.880 align:start position:0%
from reading these few papers. So, I'm
 

00:14:47.880 --> 00:14:49.430 align:start position:0%
from reading these few papers. So, I'm
really<00:14:48.040><c> excited</c><00:14:48.400><c> to</c><00:14:48.480><c> see</c><00:14:48.680><c> how</c><00:14:48.880><c> it</c><00:14:49.000><c> develops</c>

00:14:49.430 --> 00:14:49.440 align:start position:0%
really excited to see how it develops
 

00:14:49.440 --> 00:14:50.949 align:start position:0%
really excited to see how it develops
over<00:14:49.720><c> time.</c><00:14:50.000><c> What</c><00:14:50.120><c> do</c><00:14:50.200><c> you</c><00:14:50.360><c> think?</c><00:14:50.800><c> Let</c><00:14:50.880><c> me</c>

00:14:50.949 --> 00:14:50.959 align:start position:0%
over time. What do you think? Let me
 

00:14:50.959 --> 00:14:52.510 align:start position:0%
over time. What do you think? Let me
know<00:14:51.120><c> down</c><00:14:51.320><c> in</c><00:14:51.400><c> comments.</c><00:14:52.040><c> So,</c><00:14:52.160><c> yeah,</c><00:14:52.320><c> that's</c>

00:14:52.510 --> 00:14:52.520 align:start position:0%
know down in comments. So, yeah, that's
 

00:14:52.520 --> 00:14:53.990 align:start position:0%
know down in comments. So, yeah, that's
it<00:14:52.600><c> for</c><00:14:52.720><c> this</c><00:14:52.880><c> video.</c><00:14:53.280><c> And</c><00:14:53.400><c> if</c><00:14:53.520><c> you</c><00:14:53.640><c> like</c><00:14:53.800><c> how</c><00:14:53.920><c> I</c>

00:14:53.990 --> 00:14:54.000 align:start position:0%
it for this video. And if you like how I
 

00:14:54.000 --> 00:14:55.670 align:start position:0%
it for this video. And if you like how I
explained<00:14:54.360><c> the</c><00:14:54.520><c> AI</c><00:14:54.680><c> concepts</c><00:14:55.160><c> today,</c><00:14:55.520><c> you</c>

00:14:55.670 --> 00:14:55.680 align:start position:0%
explained the AI concepts today, you
 

00:14:55.680 --> 00:14:56.870 align:start position:0%
explained the AI concepts today, you
should<00:14:55.839><c> definitely</c><00:14:56.120><c> check</c><00:14:56.320><c> out</c><00:14:56.440><c> my</c><00:14:56.560><c> latest</c>

00:14:56.870 --> 00:14:56.880 align:start position:0%
should definitely check out my latest
 

00:14:56.880 --> 00:14:58.949 align:start position:0%
should definitely check out my latest
project,<00:14:57.360><c> intuitiveai.academy,</c>

00:14:58.949 --> 00:14:58.959 align:start position:0%
project, intuitiveai.academy,
 

00:14:58.959 --> 00:15:00.510 align:start position:0%
project, intuitiveai.academy,
where<00:14:59.079><c> it</c><00:14:59.160><c> contains</c><00:14:59.720><c> an</c><00:14:59.880><c> intuitive</c>

00:15:00.510 --> 00:15:00.520 align:start position:0%
where it contains an intuitive
 

00:15:00.520 --> 00:15:02.829 align:start position:0%
where it contains an intuitive
explanation<00:15:01.160><c> of</c><00:15:01.400><c> all</c><00:15:01.560><c> modern</c><00:15:01.959><c> LLMs</c><00:15:02.520><c> from</c><00:15:02.760><c> the</c>

00:15:02.829 --> 00:15:02.839 align:start position:0%
explanation of all modern LLMs from the
 

00:15:02.839 --> 00:15:04.390 align:start position:0%
explanation of all modern LLMs from the
ground<00:15:03.160><c> up,</c><00:15:03.440><c> ranging</c><00:15:03.760><c> from</c><00:15:04.000><c> LLM</c>

00:15:04.390 --> 00:15:04.400 align:start position:0%
ground up, ranging from LLM
 

00:15:04.400 --> 00:15:07.310 align:start position:0%
ground up, ranging from LLM
architectures,<00:15:05.360><c> LoRA,</c><00:15:05.920><c> to</c><00:15:06.120><c> how</c><00:15:06.360><c> MoEs</c><00:15:06.920><c> work.</c><00:15:07.280><c> A</c>

00:15:07.310 --> 00:15:07.320 align:start position:0%
architectures, LoRA, to how MoEs work. A
 

00:15:07.320 --> 00:15:09.510 align:start position:0%
architectures, LoRA, to how MoEs work. A
total<00:15:07.720><c> of</c><00:15:07.920><c> 24</c><00:15:08.480><c> chapters</c><00:15:09.000><c> are</c><00:15:09.160><c> currently</c>

00:15:09.510 --> 00:15:09.520 align:start position:0%
total of 24 chapters are currently
 

00:15:09.520 --> 00:15:11.670 align:start position:0%
total of 24 chapters are currently
available<00:15:10.079><c> and</c><00:15:10.320><c> will</c><00:15:10.520><c> be</c><00:15:10.720><c> updated</c><00:15:11.200><c> monthly.</c>

00:15:11.670 --> 00:15:11.680 align:start position:0%
available and will be updated monthly.
 

00:15:11.680 --> 00:15:13.230 align:start position:0%
available and will be updated monthly.
This<00:15:11.839><c> is</c><00:15:11.920><c> the</c><00:15:12.000><c> start</c><00:15:12.240><c> of</c><00:15:12.360><c> a</c><00:15:12.400><c> series</c><00:15:12.839><c> where</c><00:15:13.079><c> I'll</c>

00:15:13.230 --> 00:15:13.240 align:start position:0%
This is the start of a series where I'll
 

00:15:13.240 --> 00:15:15.310 align:start position:0%
This is the start of a series where I'll
break<00:15:13.440><c> down</c><00:15:13.680><c> AI</c><00:15:13.839><c> topics</c><00:15:14.240><c> intuitively</c><00:15:15.040><c> because</c>

00:15:15.310 --> 00:15:15.320 align:start position:0%
break down AI topics intuitively because
 

00:15:15.320 --> 00:15:16.870 align:start position:0%
break down AI topics intuitively because
I<00:15:15.440><c> genuinely</c><00:15:16.160><c> think</c><00:15:16.400><c> anyone</c><00:15:16.720><c> could</c>

00:15:16.870 --> 00:15:16.880 align:start position:0%
I genuinely think anyone could
 

00:15:16.880 --> 00:15:18.829 align:start position:0%
I genuinely think anyone could
understand<00:15:17.440><c> them</c><00:15:17.680><c> no</c><00:15:17.839><c> matter</c><00:15:18.200><c> how</c><00:15:18.360><c> difficult</c>

00:15:18.829 --> 00:15:18.839 align:start position:0%
understand them no matter how difficult
 

00:15:18.839 --> 00:15:20.350 align:start position:0%
understand them no matter how difficult
it<00:15:18.920><c> may</c><00:15:19.079><c> seem.</c><00:15:19.480><c> So,</c><00:15:19.600><c> for</c><00:15:19.760><c> those</c><00:15:20.000><c> who</c><00:15:20.079><c> want</c><00:15:20.280><c> to</c>

00:15:20.350 --> 00:15:20.360 align:start position:0%
it may seem. So, for those who want to
 

00:15:20.360 --> 00:15:22.630 align:start position:0%
it may seem. So, for those who want to
get<00:15:20.520><c> into</c><00:15:20.760><c> AI</c><00:15:21.040><c> or</c><00:15:21.240><c> LLMs,</c><00:15:21.880><c> this</c><00:15:22.079><c> should</c><00:15:22.360><c> be</c><00:15:22.520><c> the</c>

00:15:22.630 --> 00:15:22.640 align:start position:0%
get into AI or LLMs, this should be the
 

00:15:22.640 --> 00:15:24.190 align:start position:0%
get into AI or LLMs, this should be the
perfect<00:15:23.000><c> place</c><00:15:23.200><c> for</c><00:15:23.360><c> you</c><00:15:23.480><c> to</c><00:15:23.600><c> dive</c><00:15:23.839><c> into</c><00:15:24.079><c> the</c>

00:15:24.190 --> 00:15:24.200 align:start position:0%
perfect place for you to dive into the
 

00:15:24.200 --> 00:15:26.270 align:start position:0%
perfect place for you to dive into the
technical<00:15:24.640><c> parts</c><00:15:25.000><c> without</c><00:15:25.280><c> being</c><00:15:25.880><c> afraid</c><00:15:26.160><c> of</c>

00:15:26.270 --> 00:15:26.280 align:start position:0%
technical parts without being afraid of
 

00:15:26.280 --> 00:15:28.430 align:start position:0%
technical parts without being afraid of
crazy-looking<00:15:27.120><c> maths.</c><00:15:27.680><c> And</c><00:15:27.800><c> right</c><00:15:27.959><c> now,</c><00:15:28.200><c> I</c><00:15:28.280><c> am</c>

00:15:28.430 --> 00:15:28.440 align:start position:0%
crazy-looking maths. And right now, I am
 

00:15:28.440 --> 00:15:30.150 align:start position:0%
crazy-looking maths. And right now, I am
also<00:15:28.680><c> putting</c><00:15:28.920><c> out</c><00:15:29.040><c> a</c><00:15:29.120><c> new</c><00:15:29.360><c> launch</c><00:15:29.720><c> discount</c>

00:15:30.150 --> 00:15:30.160 align:start position:0%
also putting out a new launch discount
 

00:15:30.160 --> 00:15:32.310 align:start position:0%
also putting out a new launch discount
for<00:15:30.280><c> 2026,</c><00:15:31.000><c> so</c><00:15:31.079><c> you</c><00:15:31.160><c> can</c><00:15:31.280><c> use</c><00:15:31.440><c> the</c><00:15:31.520><c> code</c><00:15:31.920><c> early</c>

00:15:32.310 --> 00:15:32.320 align:start position:0%
for 2026, so you can use the code early
 

00:15:32.320 --> 00:15:34.470 align:start position:0%
for 2026, so you can use the code early
for<00:15:32.520><c> 40%</c><00:15:33.160><c> off</c><00:15:33.360><c> a</c><00:15:33.400><c> yearly</c><00:15:33.800><c> plan.</c><00:15:34.200><c> And</c><00:15:34.320><c> thank</c><00:15:34.400><c> you</c>

00:15:34.470 --> 00:15:34.480 align:start position:0%
for 40% off a yearly plan. And thank you
 

00:15:34.480 --> 00:15:36.069 align:start position:0%
for 40% off a yearly plan. And thank you
guys<00:15:34.640><c> for</c><00:15:34.760><c> watching.</c><00:15:35.360><c> A</c><00:15:35.400><c> big</c><00:15:35.600><c> shout-out</c><00:15:36.000><c> to</c>

00:15:36.069 --> 00:15:36.079 align:start position:0%
guys for watching. A big shout-out to
 

00:15:36.079 --> 00:15:39.670 align:start position:0%
guys for watching. A big shout-out to
Spam<00:15:36.440><c> Maj,</c><00:15:37.200><c> Chris</c><00:15:37.480><c> Ladue,</c><00:15:38.400><c> Deegan,</c><00:15:39.360><c> Robert</c>

00:15:39.670 --> 00:15:39.680 align:start position:0%
Spam Maj, Chris Ladue, Deegan, Robert
 

00:15:39.680 --> 00:15:42.550 align:start position:0%
Spam Maj, Chris Ladue, Deegan, Robert
Zaviassa,<00:15:40.760><c> Marcelo</c><00:15:41.200><c> Ferreria,</c><00:15:42.200><c> Proof</c><00:15:42.400><c> and</c>

00:15:42.550 --> 00:15:42.560 align:start position:0%
Zaviassa, Marcelo Ferreria, Proof and
 

00:15:42.560 --> 00:15:46.030 align:start position:0%
Zaviassa, Marcelo Ferreria, Proof and
Inu,<00:15:43.320><c> DX</c><00:15:43.680><c> Research</c><00:15:44.040><c> Group,</c><00:15:44.720><c> Alex,</c><00:15:45.680><c> Midwest</c>

00:15:46.030 --> 00:15:46.040 align:start position:0%
Inu, DX Research Group, Alex, Midwest
 

00:15:46.040 --> 00:15:47.870 align:start position:0%
Inu, DX Research Group, Alex, Midwest
Maker,<00:15:46.760><c> and</c><00:15:46.880><c> many</c><00:15:47.079><c> others</c><00:15:47.320><c> that</c><00:15:47.440><c> support</c><00:15:47.720><c> me</c>

00:15:47.870 --> 00:15:47.880 align:start position:0%
Maker, and many others that support me
 

00:15:47.880 --> 00:15:49.630 align:start position:0%
Maker, and many others that support me
through<00:15:48.079><c> Patreon</c><00:15:48.440><c> or</c><00:15:48.520><c> YouTube.</c><00:15:49.120><c> Follow</c><00:15:49.360><c> me</c><00:15:49.520><c> on</c>

00:15:49.630 --> 00:15:49.640 align:start position:0%
through Patreon or YouTube. Follow me on
 

00:15:49.640 --> 00:15:51.310 align:start position:0%
through Patreon or YouTube. Follow me on
Twitter<00:15:49.839><c> if</c><00:15:49.959><c> you</c><00:15:50.040><c> haven't,</c><00:15:50.440><c> and</c><00:15:50.560><c> I'll</c><00:15:50.720><c> see</c><00:15:50.959><c> you</c>

00:15:51.310 --> 00:15:51.320 align:start position:0%
Twitter if you haven't, and I'll see you
 

00:15:51.320 --> 00:15:54.280 align:start position:0%
Twitter if you haven't, and I'll see you
in<00:15:51.440><c> the</c><00:15:51.560><c> next</c><00:15:51.880><c> one.</c>

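NOTE
Editor's sketch for the point made around 12:32 to 12:45, that averaging many low-rank perturbations gives an update that is not itself low rank.
This is a toy illustration under stated assumptions, not the method from the papers discussed in the video.
The names outer, matrix_rank, toy_fitness and es_update are hypothetical, the weights start at zero, and the objective is made up for the demo.
```python
import random
# Toy sketch only: every name here is hypothetical, not taken from any paper or library.
def outer(a, b):
    """Rank-1 matrix a b^T, stored as a list of rows."""
    return [[x * y for y in b] for x in a]
def matrix_rank(m, tol=1e-9):
    """Numerical rank via Gaussian elimination with row pivoting."""
    m = [row[:] for row in m]
    rows, cols, rank = len(m), len(m[0]), 0
    for c in range(cols):
        if rank == rows:
            break
        pivot = max(range(rank, rows), key=lambda r: abs(m[r][c]))
        if abs(m[pivot][c]) < tol:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, rows):
            f = m[r][c] / m[rank][c]
            for k in range(c, cols):
                m[r][k] -= f * m[rank][k]
        rank += 1
    return rank
def toy_fitness(w):
    """Assumed objective: negative squared distance to the identity matrix."""
    return -sum((w[i][j] - (i == j)) ** 2 for i in range(len(w)) for j in range(len(w)))
def es_update(dim=4, population=30, sigma=0.1, seed=0):
    """Average fitness-weighted rank-1 perturbations into one update matrix."""
    rng = random.Random(seed)
    update = [[0.0] * dim for _ in range(dim)]
    for _ in range(population):
        a = [rng.gauss(0, 1) for _ in range(dim)]
        b = [rng.gauss(0, 1) for _ in range(dim)]
        e = outer(a, b)  # each perturbation direction has rank 1
        score = toy_fitness([[sigma * v for v in row] for row in e])
        for i in range(dim):
            for j in range(dim):
                update[i][j] += score * e[i][j] / population
    return update
```
Calling matrix_rank on a single outer(a, b) and then on es_update() lets you compare the rank of one perturbation with the rank of the averaged update.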
