James Kelly, Product Management, Sr. Director, Juniper Networks

Don’t Let Your AI Get Caught in Traffic

Network Automation, AI & ML
A screenshot of the video showing host James Kelly, Product Management, Sr. Director, Juniper Networks. Text over the screenshot says, “Don’t Let Your AI Get Caught in Traffic: Speed and Juniper Apstra and Terraform.”

Demo: Automate Juniper Apstra software using the Apstra Terraform provider for AI clusters.

In this hands-on demo, James Kelly shows step-by-step how to use Juniper Apstra® software to optimize AI networking fabrics to avoid the slowing effects of congestion and packet drops.


You’ll learn

  • How to set up a working AI cluster with three different fabrics in Apstra software

  • How dynamic load balancing helps alleviate congestion vs. regular ECMP load balancing

Who is this for?

Network Professionals

Host

James Kelly
Product Management, Sr. Director, Juniper Networks

Transcript

0:00 don't get caught in traffic don't play

0:02 in traffic and stay in your lane today

0:05 we're going to be paying attention to

0:07 exactly none of that and doing the exact

0:10 opposite I'm James Kelly from Juniper

0:12 Networks we're going to be talking about

0:14 playing in traffic at high speeds in AI

0:17 clusters and exactly how you manage that

0:19 let's get into it all right let's take a

0:21 look at our handy Terraform Apstra

0:24 examples repository if you were with me

0:27 in the last demo about AI topology

0:31 design you'll know that this demo is

0:33 going to show you how to automate Apstra

0:36 with our Apstra Terraform provider for

0:39 AI clusters last time we were talking

0:41 about rail-optimized design an NVIDIA

0:43 prescribed feature and how we make that

0:45 easy with Juniper Apstra our intent

0:48 based multivendor fabric manager at

0:50 Juniper Networks today we're going to be

0:52 talking about a different part of our

0:56 best practice recommended design and

0:58 last time we were looking at this

0:59 subfolder of the examples I'm going to

1:01 go into this example lab of a real AI

1:06 cluster that we've set up with three

1:08 different fabrics or three different

1:10 so-called blueprints in Apstra and I've

1:15 actually got an instance of Apstra spun

1:17 up in cloud Labs you'll know that I

1:19 mentioned last time too that uh you can

1:21 delete and create your own topologies

1:23 here and uh just at a few minutes after

1:28 a click of a button you'll have access

1:29 to Apstra and be able to log in

1:32 so when you open it up in a new tab you

1:34 basically have this and it asks you to

1:36 log in and it gives you the credentials

1:37 right

1:38 there also like last time I'm going to

1:42 start off by uh showing you my GitHub

1:45 desktop I've just pulled down the

1:48 repository that I was showing you in my

1:49 browser Terraform Apstra examples that

1:52 has all of the Juniper examples in there

1:56 and we're going to be using that one of

1:57 them of course so I'm going to open this

1:59 up in Visual Studio

2:02 code like last time the first thing that

2:04 I have to do is actually point my

2:07 provider at my instance of

2:12 Apstra and I'm going to just copy and

2:15 paste this I had a quick note outside of

2:18 the screen recording that's how I did

2:20 that so easily but admin is the username

2:23 amazing catv your dollar sign is the

2:25 password and this is the IP address that

2:27 you saw in the last

2:30 browser tab let me just flash back there

2:33 so

2:34 that you know you can see exactly what

2:37 I'm talking about here's the IP address

2:40 here's the credentials all right
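The dollar sign in the password matters if the credentials are embedded in the provider URL. A minimal Python sketch, assuming the Apstra Terraform provider accepts a URL of the form `https://user:password@host` and that special characters in the credentials need percent-encoding. The function name `build_apstra_url`, the password `example$pass`, and the address `192.0.2.10` are placeholders for illustration, not values from the demo:

```python
from urllib.parse import quote


def build_apstra_url(user: str, password: str, host: str) -> str:
    """Return an https URL with percent-encoded credentials.

    Assumption: the provider reads credentials from the URL itself.
    quote(..., safe="") encodes every reserved character, including "$".
    """
    return f"https://{quote(user, safe='')}:{quote(password, safe='')}@{host}"


if __name__ == "__main__":
    # Placeholder credentials and a documentation-range IP address.
    print(build_apstra_url("admin", "example$pass", "192.0.2.10"))
    # prints https://admin:example%24pass@192.0.2.10
```

The resulting string is what would be pasted into the provider's URL setting. Keeping real credentials out of version control is a separate concern that this sketch does not address.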

2:44 so back to this I'm just going to now uh

2:49 save this

2:50 file and take a look at uh what we're

2:54 talking about in our blog was dynamic

2:57 load balancing and how it helps

2:59 alleviate congestion over regular ECMP

3:03 load balancing and the possibilities of

3:06 using some packet spraying in the future

3:08 and other forms of load balancing um as

3:11 part of Apstra reference designs those

3:14 intent-based very nicely validated

3:17 designs that are laid down as a data

3:20 center Network fabric um we're going to

3:23 be using the layer 3 routed only designs

3:27 and as part of that it doesn't include

3:30 the configuration in Junos for dynamic

3:33 load balancing if you want to learn more

3:35 about Dynamic load balancing I'm going

3:37 to keep the video short and ask you to

3:39 go and read the details in the blog it's

3:41 also a little bit of self promotion

3:42 since I wrote the blog I would encourage

3:45 you to read it I would hope um this is

3:47 the actual very simple configuration in

3:49 Junos and if you're familiar with Junos

3:51 the stanza sort of syntax should

3:54 look very familiar to you you can change

3:57 the inactivity timer in microseconds or you

4:00 can just leave it out completely and

4:01 it'll default to

4:03 256 I wanted to show that to you um also

4:07 unlike last time where we only looked at

4:09 designs I've got a blueprints file here

4:12 and this really creates the blueprints

4:15 creates some of the resources that are

4:16 necessary for the blueprints and I'm

4:18 going to be applying as I described in

4:21 the blog the configlet that was created

4:25 in that other file to two different

4:27 fabrics one for my back-end GPU-to-GPU

4:31 fabric and one for my storage fabric

4:34 both of those are the RoCE very high

4:36 bandwidth fairly low flow count fabrics

4:40 that'll highly benefit from Dynamic load

4:42 balancing as compared to you know other

4:44 kinds of load balancing and as an

4:47 example you could apply this to all

4:48 devices we've just you know given this

4:50 little condition in here if you only

4:52 wanted to apply it to the leaves for

4:54 example if you had let's say single

4:56 links between your leaves and spines

4:59 dynamic balancing really wouldn't help

5:01 you at all on the way back from the

5:03 spine down to the leaf that would be an

5:05 example of how you would do that all

5:07 right so I'm not going to go into any

5:09 more about the configlet I'll let you

5:11 check that out at your own leisure what

5:13 I will show you though is that coming in

5:17 here like last time we can before I do

5:20 this I need to change into the right

5:25 folder and I'll do a terraform init we'll

5:29 just make sure that I have the right

5:31 version installed the most recent

5:32 version of the Apstra Terraform

5:36 provider and if I do a terraform plan

5:39 this is you know a blank slate in terms of

5:42 my instance of Apstra that I spun up it'll

5:44 say that there's 58 different resources

5:47 to

5:48 add and I will do a terraform

5:52 apply and before I type yes what I can

5:56 do I suppose is perhaps this just kind

5:58 of split the screen here and let's

6:01 look at

6:03 Apstra and you'll remember how all of

6:05 the designs showed up for example under

6:08 the racks and the templates in the last

6:10 demo that I did this time if we go into

6:13 blueprints you see that there's

6:14 absolutely no blueprints here right

6:18 now as soon as I say yes here it's going

6:21 to start firing away at the Apstra API

6:25 the Terraform provider for Apstra is

6:27 implemented in Go and that uses a Go SDK

6:30 for Apstra that was also um open sourced and

6:34 all of this stuff is going just in a

6:36 matter of seconds creating three

6:39 different fabric configurations inside

6:42 of Apstra now what this Terraform

6:45 actually doesn't do that we'll be adding

6:47 in the future as part of my next demo as

6:49 I'm going step by step through this is

6:52 additional configlets I talked in the

6:54 blog about you know hinting at DCQCN in

6:56 the next and then looking at some

6:58 analytics um so this isn't actually

7:01 going to commit anything to the Juniper

7:03 Junos devices we'll save that for a later

7:05 demo but you can see basically now that

7:07 the Terraform stuff is all done it's

7:10 gone and created the back-end GPU fabric

7:13 back-end storage fabric and let's now

7:18 make this window a bit

7:19 bigger the front-end management fabric

7:22 and if you look inside of any one of

7:24 these you'll have to come into the staged

7:27 Tab and you can see that there's certain

7:29 types of resources that were allocated

7:31 this was done in that blueprints uh .tf

7:34 file and there's certain things for

7:36 example like the spine and the leaf that

7:39 aren't yet actually allocated right so

7:43 if you wanted to for example go and you

7:46 know allocate those here you could do

7:48 that manually as I said we're going to

7:49 save all of that automation for a later

7:52 demo and show that to you as well and of

7:55 course none of this is committed because

7:57 for it to be committed we have to be

7:58 using real devices as I'm putting this

8:00 together quickly uh like I said one step

8:02 at a time so we will get there in later

8:04 demos stay tuned for how this actually

8:07 shows up at the device level and then

8:10 you can kind of see for example when you

8:12 see one of these Juniper devices all of

8:14 the configuration for it and we'll take

8:17 a look at how that configlet gets

8:18 applied here in the next demo in the

8:21 meantime you can see that uh this is the

8:23 front-end fabric where we actually

8:24 didn't apply the configlet so I should look

8:26 at let's say the storage or the other

8:28 one and when I go in there if I go into

8:32 the configlets here excuse me you'll

8:34 see this configlet here DLB for AI leaves

8:39 that was the configlet that we provide uh

8:42 provisioned excuse me using

8:46 Terraform all right that concludes the

8:48 demo I hope you'll go and check out this

8:50 repository you can easily pull it down

8:52 and walk through these things using

8:54 Apstra Cloud Labs and you can see how

8:56 easy it is to create and automate stuff

8:58 in Apstra

9:00 and stay tuned for the next step on the

9:03 journey to creating and automating AI

9:06 training clusters with Juniper Networks

9:09 I'm James Kelly thank you
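
The contrast the demo draws between regular ECMP and dynamic load balancing on high-bandwidth, low-flow-count fabrics can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how Junos implements DLB: static ECMP is modeled as a CRC32 hash of the flow tuple, DLB is approximated as "place each new flow on the currently least-loaded link", every flow is the same size, and the link count, addresses, and function names are all hypothetical. The only real-world constant is UDP port 4791, which RoCEv2 uses.

```python
import zlib

LINKS = 4          # assumed number of leaf uplinks
FLOW_GBPS = 100    # assumed: every flow has the same bandwidth


def make_flows(count: int) -> list[tuple]:
    """Build a small set of RoCEv2-like flows (src, dst, proto, sport, dport)."""
    return [(f"10.0.0.{i}", f"10.0.1.{i}", 17, 49152 + i, 4791)
            for i in range(count)]


def place_ecmp(flows: list[tuple], links: int = LINKS) -> list[int]:
    """Static ECMP: a hash of the flow tuple picks the link, ignoring load."""
    load = [0] * links
    for flow in flows:
        key = "|".join(map(str, flow)).encode()
        load[zlib.crc32(key) % links] += FLOW_GBPS
    return load


def place_dlb(flows: list[tuple], links: int = LINKS) -> list[int]:
    """DLB-like: each flow goes to whichever link is least loaded right now."""
    load = [0] * links
    for _ in flows:
        load[load.index(min(load))] += FLOW_GBPS
    return load


if __name__ == "__main__":
    flows = make_flows(8)
    print("ECMP per-link load:", place_ecmp(flows))
    # With 8 equal flows and 4 links, the DLB-like placement is 2 flows per link.
    print("DLB  per-link load:", place_dlb(flows))
```

With only a handful of large flows, a hash has few chances to even out, so some links can end up carrying several flows while others sit idle. A load-aware choice avoids that in this model by construction. Real DLB also considers queue depth and reassigns flowlets after the inactivity timer expires, which this sketch leaves out.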
