James Kelly, Senior Director of Product Management, Juniper Networks

Automating AI Cluster Network Design With Apstra and Terraform

A screenshot from the video of host James Kelly, Senior Director of Product Management, Juniper Networks as he speaks. Text appears over the image that says, “Automating AI Cluster Network Design With Apstra and Terraform.”

Putting cloud, automation, and AI together to work for you.

Watch this hands-on demo to see how to use terraform apply to import many AI cluster-design examples into Juniper Apstra™ intent-based networking software.


You’ll learn

  • About design types, such as sizes of clusters, GPU compute fabrics for model training, storage fabrics, and management fabrics

  • How the logical rack types follow NVIDIA best practices for rail-optimized design

Who is this for?

Network Professionals

Host

James Kelly
Senior Director of Product Management, Juniper Networks

Transcript

0:03 You've heard about clouds, you've heard about automation, and you've definitely heard a lot about AI recently. We're going to be putting all of those things together today in this quick demo. I'm James Kelly from Juniper Networks, and I'm going to be talking about setting up and automating the design of AI clusters with Juniper Apstra and the new Terraform provider for Apstra. Let's get into it.

0:26 All right, let's start in our browser by looking at this repository on GitHub. This is the terraform-apstra-examples repository under the Juniper organization, and it's open and accessible to everyone. Here you'll find many examples, and a growing list of them, of things that you can automate inside of Juniper Apstra, which is, I should probably explain, Juniper Networks' intent-based management tool for data center fabrics. It's also multivendor and very well known for that.

0:59 Terraform is, of course, an infrastructure-as-code automation tool, and the Terraform provider for Apstra allows you to drive Apstra, and hence your data centers, through Terraform. One of the kinds of data centers that I mentioned in the opening is AI clusters, and in AI clusters there are some differences in the fabric design. Apstra, of course, lets you customize fabric design, but rather than, say, a customer having to start out by designing their data centers from scratch, we have many examples that we put together for different sizes of AI clusters, which you can apply to Apstra in a matter of a few seconds with Terraform. That's what I'm going to be showing you today.

1:40 One of the subfolders of this repository is AI clusters. A few people, including myself, have put together this automation to make these examples, various sizes of clusters, easily creatable inside of Apstra in terms of the design. You'll find a whole bunch of things here covering all of the steps that you would need. One of the very first things you'd want to do is, of course, install Terraform if you don't have it on, say, your laptop. There are also ways of using Terraform Cloud that are explained in some of the other examples; I'm not going to show that today.

2:15 Beyond that, you also need an instance of Apstra. Now, you might already have Apstra running in your data center, but rather than having to install or set up an instance yourself, one of the easy ways to access Apstra is through Apstra Cloud Labs. You can start a topology here for free; this is open and accessible to everyone. When you do that, you can pick an expiration time, and after a few minutes it will spin up an instance of Apstra and a topology. In my specific sandbox in Apstra Cloud Labs, I elected to do an Apstra-only instance, so there are no actual physical devices. Since I'm just going to be showing the automation of the logical design today, I don't need any physical boxes. You can see that there's a simple button here to open a new tab that'll allow you to log in. So I'm going to log in with the password that it provided here, and the username admin, of course.

3:13 Now, this is a fresh instance of Apstra that I really haven't done anything to, and the welcome screen here actually talks about building racks, designing the networks, and then creating and deploying a blueprint. This order is relevant, because it's the order in which you would typically design and then deploy a data center blueprint. As I mentioned, we're going to be automating the design of AI clusters, and that turns out to be all about the logical devices, the racks, and the templates. The template is then used to deploy the blueprint, and you can stamp those out again and again. So, as I mentioned, the design phase is certainly customizable, but rather than having to point and click your way around it, you can automate things with Terraform, and we've automated all of these examples for you.

4:02 So let's have a look at that. Let me just go into racks first of all and assure you that this is a standard, out-of-the-box Apstra instance; there's nothing in here, nor in any of the templates, that doesn't come out of the box with Apstra. What we're going to see is that there's a whole bunch of templates related to AI cluster networks, and we'll talk about some of the nuances and differences in that design and how they matter to AI use cases such as model training. Now that we've got this up and running, one of the things we need to do is actually get this example Terraform HCL configuration onto the laptop. I'm going to use GitHub Desktop, because it's a nice visual tool and makes sense to demo from. I've already logged into my GitHub Desktop instance, and if I just start typing, you can see I can easily choose to clone the repository that I was just showing you in the browser.

5:03 So now all of that is downloaded to my laptop. I happen to have Visual Studio Code installed, and this handy button will open it up, just like that. I'm going to make this instance of Visual Studio Code a little bit bigger to match my video. You've got all of the examples, but we're only going to need the AI clusters part of this. Now, I'm not going to go into all of the HCL configuration. The one thing you do need to see, and actually change, is where you're going to point Terraform: we need to change the username and password to match what we got from Apstra Cloud Labs, and we need to change the Apstra URL as well. Let's go back to our instance of Apstra here, copy the IP address and port number, and then go back into Visual Studio Code and replace that Apstra URL placeholder with it.
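The provider configuration being edited here looks roughly like the following sketch. This is illustrative, not the exact file from the examples repository; the URL and credentials are placeholders you'd replace with the values from your own Apstra Cloud Labs sandbox:

```hcl
terraform {
  required_providers {
    apstra = {
      source = "Juniper/apstra"
    }
  }
}

# Point the provider at your Apstra instance. With Apstra Cloud Labs,
# copy the IP address and port shown in your sandbox. Credentials are
# embedded in the URL as user:password@host.
provider "apstra" {
  url                     = "https://admin:password@203.0.113.10:21443" # placeholder
  tls_validation_disabled = true # lab instances typically use self-signed certificates
}
```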

5:58 Okay, so from there I'm just going to save the file and close this down. One of the things you can do in Visual Studio Code is open up a terminal. Let's go into the AI cluster subfolder. From here I'm going to do a terraform init; that'll just make sure I have the most recent version of the Terraform provider for Apstra. After that, if you would like, you can do a terraform plan; it's an optional step before you apply.
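The command-line workflow shown so far boils down to something like the following sketch (the subfolder name is a placeholder; check the repository for the actual path, and the demo clones via GitHub Desktop rather than the git CLI):

```shell
# Clone the examples repository (the demo uses GitHub Desktop instead)
git clone https://github.com/Juniper/terraform-apstra-examples.git
cd terraform-apstra-examples/<ai-clusters-subfolder>   # placeholder path

terraform init    # download the Apstra provider plugin
terraform plan    # optional: preview the resources that would be created
terraform apply   # push the designs into Apstra (answer "yes" to confirm)
```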

6:33 And it looks like I have an error in my provider file. Right, so one of the things I happen to have in my password, which I need to change, is that little symbol there; it's got to be URL-encoded. I'm going to go back in here and just resave the file. I didn't expect that, but that's a live demo for you. Okay, after that it's happy, and it says that it has found 53 resources to add, nothing to change, and nothing to destroy, which makes sense because we haven't applied anything yet. So that all looks good.
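Because the credentials are embedded directly in the Apstra URL, any reserved characters in the password have to be percent-encoded before they'll parse correctly. As one way to do that, Python's standard library can produce the encoded form (the password shown here is made up):

```python
from urllib.parse import quote

# Percent-encode a password so it can be embedded in a URL such as
# https://admin:<encoded-password>@<apstra-host>:<port>
password = "p@ss#word!"             # made-up example password
encoded = quote(password, safe="")  # encode every reserved character

print(encoded)  # p%40ss%23word%21
```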

7:09 Now let's do a terraform apply, and while we're doing this, what I want to show is how fast things happen in Apstra at the same time, so I've got these templates up in the back here. Watch those templates all change as many templates are put into Apstra with this terraform apply. I just have to answer yes, I'm ready to go, and boom, you can see all of that happening, and in the browser over here you can see how quickly all of these things appeared. Now, what's pretty neat about these examples is that there are, for instance, different sizes of GPU clusters, some quite large: we've got 256 DGX here, those being the NVIDIA servers with the H100 or A100 GPUs. What you'll see in these examples, and the ones like them at smaller sizes, like that one, and that one, and that one, is that these are the back-end training fabrics used for model training. When you're running an Ethernet training fabric, you've got RDMA over Converged Ethernet, RoCE, pronounced "rocky" for short, and that RoCE fabric has a very special design recommended by NVIDIA to drive maximum performance and job completion time.

8:29 Networking performance drives job completion time, of course. This special design is called a rail-optimized design, where rail-local traffic can go over fewer hops, because these 64 DGX servers actually have an internal switch between the GPUs. So you don't need to go, for example, from GPU 1 to GPU 2 within the same server across the top-of-rack leaf switch; the server can do that internally. There's also a special feature of the rail-optimized design in the newer versions of NVIDIA NCCL, called PXN, which allows even further rail-local optimizations, such that when two different servers are talking to each other, you'll have traffic that doesn't have to go over the spine network. We can explain that in just a second, but I wanted to go into one of these clusters, show you what it looks like, and then explain some of the rail-optimized design. From there you can see the option to expand things and whatnot.

9:39 This happens to be using a certain rack type. Rather than just click that, I'm going to go into all of the racks and show you the different types of racks here for the storage fabrics and the management front-end fabrics. The rack I was just looking at is this one right here. Now, this does not look like a typical data center rack, and in fact it is not a physical rack design. In this case, in order to accommodate NVIDIA's rail-optimized design, we've built a custom rack type inside of Apstra for what we call a stripe, in other words, a group of eight leaf switches. Why eight? Because you have eight GPUs on the server. And like I said, those servers that you see here, the 16 of them, could be DGX, or they could be HGX-based, which just means you can get them from, say, a Dell or a Supermicro or someone else; they follow the same pattern of having that internal switch and eight GPUs, and effectively the same build-out of hardware.
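To put rough numbers on this: with one leaf per GPU "rail", a stripe here has 8 leaf switches and 16 servers, per the custom rack type in the demo. A quick back-of-the-envelope sketch (these constants come from the demo; other cluster templates may size stripes differently):

```python
# Back-of-the-envelope sizing for rail-optimized stripes as described in
# the demo: one leaf switch per GPU rail, 16 servers per stripe.
GPUS_PER_SERVER = 8      # H100/A100 DGX- or HGX-class servers
LEAVES_PER_STRIPE = 8    # one leaf per rail
SERVERS_PER_STRIPE = 16  # per the custom Apstra rack type in the demo

def stripes_needed(num_servers: int) -> int:
    """Number of stripes needed for a given server count (ceiling division)."""
    return -(-num_servers // SERVERS_PER_STRIPE)

# Example: the 256-DGX template shown in the demo
servers = 256
stripes = stripes_needed(servers)
print(stripes)                       # 16 stripes
print(stripes * LEAVES_PER_STRIPE)   # 128 leaf switches
print(servers * GPUS_PER_SERVER)     # 2048 GPUs
```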

10:41 Now, what you'll see in this design called rail-optimized is that, in this special network for the GPUs to interconnect, again, this is for model training, sometimes called the back-end network, the eight different GPUs inside of the server are individually cabled up to the eight different leafs inside of this stripe. This is, of course, not a typical data center design. In a typical data center design, you'd expect probably many fewer ports on each server, and you'd expect those ports to go up to one top-of-rack switch, or sometimes be aggregated to a pair of switches that might be connected with ESI-LAG. So this is very different. In these IP fabrics, what that PXN feature I mentioned is able to do is this: when you have traffic going, say, from GPU 1 of this server to GPU 1 of a second server, they'll of course be able to reach each other directly, with just a single hop across that top-of-rack leaf switch, since they're cabled up to the same leaf. Now, what if, say, GPU 1 of this server needs to talk to GPU 2 of the other server? Well, you'd probably expect to go from this leaf switch here up to a spine switch and then down to the leaf switch to which GPU 2 is connected. That might make normal sense, but as I mentioned, this special NCCL feature from NVIDIA called PXN allows this server to understand the overall topology, and it will use the internal switch in the server to pass the traffic from GPU 1 to GPU 2 locally, then send it to the top-of-rack switch representing all of the GPU 2s, and then down to the second server. The cool thing about that is that it really optimizes for latency, and performance is really key in these use cases.
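The forwarding behavior just described can be sketched as a toy model. This is purely illustrative of the rail-optimized idea, not how NCCL actually computes routes: endpoints are (server, gpu) pairs within one stripe, a GPU's index determines which rail (and hence which leaf) it attaches to, and PXN first moves cross-rail traffic to the matching rail over the server-internal switch.

```python
# Toy model of hop sequences in a rail-optimized fabric with PXN.
# Endpoints are (server, gpu) pairs within one stripe; GPU index == rail
# index, and each rail attaches to its own leaf switch. Illustrative only.

def path(src, dst):
    """Return the sequence of switch hops from src to dst, both (server, gpu)."""
    src_server, src_gpu = src
    dst_server, dst_gpu = dst
    if src_server == dst_server:
        # Same server: traffic stays on the internal (NVSwitch-style) switch.
        return ["internal-switch"]
    if src_gpu == dst_gpu:
        # Same rail, different servers: one hop across the shared leaf.
        return [f"leaf-{src_gpu}"]
    # Different rails: PXN moves the traffic to the destination rail inside
    # the source server, then crosses that rail's leaf. Without PXN this
    # would instead be leaf -> spine -> leaf.
    return ["internal-switch", f"leaf-{dst_gpu}"]

print(path((0, 1), (0, 2)))  # ['internal-switch']
print(path((0, 1), (1, 1)))  # ['leaf-1']
print(path((0, 1), (1, 2)))  # ['internal-switch', 'leaf-2']
```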

12:45 These servers are, of course, very expensive, hundreds of thousands of dollars each, and each GPU is expensive in its own right, tens of thousands of dollars. To the extent that your GPUs are sitting idle, wasting time waiting for the network, you're losing return on investment. The network may not seem like the most important thing in an AI cluster; you'd think it's got to be the GPUs and all of those servers, since they're such a big expense. But if your network is holding back the performance of your model training and your GPUs, what good are your GPUs to you? This is why network performance is really key when you're designing AI clusters, and with these examples you'll be following the best practices, starting from a strong foundation, to assure the best performance possible.

13:37 When would traffic be going through the spine switches, you might ask? Well, let's go back and look at the template again, and I'll just pick on this smaller cluster because it's a little bit easier to see. You'll see these different groupings here, which we called stripes, each with 16 servers, or eight leafs. If traffic has to go from one of these servers to a server in a different stripe, then of course the traffic will cross over the spine network.

14:10 And when it's crossing the spine network, again, there are things you want to design for in the spine network, such as dynamic load balancing and network congestion management protocols like DCQCN, which is a combination of ECN and PFC. You can go and read about these things in glorious detail in the Juniper Junos documentation, and we'll talk about automating some of those configurations in another video.

14:35 But for this video, I think I'm done explaining the rail-optimized design that is recommended as a best practice by NVIDIA. You've also seen how simple it is to use Terraform to take all of these examples, which I hope you dig into yourself, and apply them into an instance of Apstra that is really accessible to anyone. So, thank you for your time and interest in joining me on this quick demo journey. Again, this is James Kelly from Juniper Networks. Reach out to us at Juniper if you're building AI clusters; we'd love to help you.
