1
00:00:02.429 --> 00:00:06.810
Brian Vanderwende: Okay, there we go, so the recording of this meeting will be available.
2
00:00:08.250 --> 00:00:10.230
Brian Vanderwende: You know, as soon as we get it processed and uploaded.
3
00:00:11.400 --> 00:00:24.150
Brian Vanderwende: But the basic agenda today is pretty simple. DJ and John, I believe, are going to do sort of a tag-team presentation, as part of the Community of Practice series of presentations that we've had in these meetings,
4
00:00:25.230 --> 00:00:28.500
Brian Vanderwende: and talk a little bit about, you know, their AI/ML work.
5
00:00:29.730 --> 00:00:41.520
Brian Vanderwende: And then I'll give just a very brief summary of some GPU training that CSG, in collaboration with TDD, has sort of been designing behind the scenes.
6
00:00:42.990 --> 00:00:43.860
Brian Vanderwende: If you've seen.
7
00:00:45.270 --> 00:00:55.890
Brian Vanderwende: Daniel Howard give talks on this, in like a WIP talk and some other venues, so I'll just give a brief update on that, and then we can do a short roundtable and we'll wrap up.
8
00:00:57.060 --> 00:01:05.610
Brian Vanderwende: So I believe I saw DJ and John are in here, so if you both are ready, I can hand over ownership of the screen.
9
00:01:07.800 --> 00:01:14.730
David John Gagne: Yes, thank you, Brian. If you get it over to me, I can share from here, okay.
10
00:01:15.480 --> 00:01:16.260
let's see.
11
00:01:23.880 --> 00:01:24.660
I think.
12
00:01:29.640 --> 00:01:31.320
Brian Vanderwende: Do you have the button now?
13
00:01:31.620 --> 00:01:35.760
David John Gagne: I have the button, but it says "participant disabled screen sharing," okay,
14
00:01:36.960 --> 00:01:38.940
David John Gagne: or "host disabled participant screen sharing."
15
00:01:39.630 --> 00:01:43.230
Brian Vanderwende: Okay, let's flip that. Let's see, all participants.
16
00:01:43.950 --> 00:01:50.340
David John Gagne: Can you make me a co-host? Then I can fix it. Or "all participants" also works, yeah.
17
00:01:51.600 --> 00:01:55.470
Brian Vanderwende: Yeah, yeah, I don't think I can make you co-host, actually, but there we go.
18
00:02:01.860 --> 00:02:02.730
David John Gagne: Right so.
19
00:02:03.330 --> 00:02:04.440
John and I will be
20
00:02:05.580 --> 00:02:10.650
David John Gagne: kind of talking about machine learning challenges and opportunities on NCAR HPC.
21
00:02:11.130 --> 00:02:20.730
David John Gagne: I'm going to start off and give sort of a big-picture overview, and talk about some areas where I think GPUs could be used, but
22
00:02:21.450 --> 00:02:32.250
David John Gagne: that need more time investment in order to do so. And then John will talk about some of the ways we're currently using GPUs on Casper for some of our different machine learning tasks.
23
00:02:33.960 --> 00:02:34.530
David John Gagne: So.
24
00:02:36.150 --> 00:02:37.110
David John Gagne: to kick things off.
25
00:02:39.750 --> 00:02:44.370
David John Gagne: If you've been following anything related to HPC a
26
00:02:45.390 --> 00:02:49.290
David John Gagne: lot, there's always lots of discussion about how machine learning is a really fast-growing usage area.
27
00:02:49.890 --> 00:02:51.660
David John Gagne: And a lot of hype around it.
28
00:02:51.750 --> 00:02:54.840
David John Gagne: But also, I think, a fair bit of substance too. So
29
00:02:56.610 --> 00:02:56.790
David John Gagne: It.
30
00:02:59.250 --> 00:03:06.630
David John Gagne: there's a bit of both hype and substance. A lot of the machine learning GPU usage discussion tends to focus on things like:
31
00:03:07.320 --> 00:03:12.570
David John Gagne: how fast can you train the model, or how fast can you run it, on whatever GPU or other server architecture.
32
00:03:13.110 --> 00:03:29.370
David John Gagne: As an example, here's a chart from NVIDIA showing the performance of NVIDIA AI on different systems versus other GPUs or other specialized chips.
33
00:03:30.690 --> 00:03:43.050
David John Gagne: You know, models like ResNet or 3D U-Net or BERT or whatever, which are linked to different, usually image-processing or language-model, kinds of tasks.
34
00:03:46.230 --> 00:04:00.690
David John Gagne: And things like this. So it's useful to keep track of all these benchmarks, and I think they're representative tasks for a subset of computationally intensive machine learning and deep learning tasks in general.
35
00:04:03.000 --> 00:04:09.870
David John Gagne: But they really only represent a fairly small part of the possible usage of GPUs in machine learning,
36
00:04:12.690 --> 00:04:27.720
David John Gagne: and also a very small part of the machine learning pipeline. In fact, I'll talk about this more in a moment: within the machine learning pipeline, the training part is actually a relatively small part of the process. There's a lot that goes on before, and a lot that happens afterward,
37
00:04:28.740 --> 00:04:35.010
David John Gagne: that requires a lot of compute and attention, especially people time.
38
00:04:36.660 --> 00:04:45.330
David John Gagne: So I want to suggest a few areas to facilitate broader GPU usage for machine learning on NCAR HPC. I think we should try to encourage some
39
00:04:45.360 --> 00:04:49.380
David John Gagne: more GPU use across our pipeline, especially after Derecho is deployed and we have more
40
00:04:49.380 --> 00:04:50.430
David John Gagne: GPUs available.
41
00:04:51.540 --> 00:04:58.740
David John Gagne: Right now, with Casper, I think some of it could be justified, but if it's overly encouraged, then we
42
00:05:00.030 --> 00:05:08.010
David John Gagne: may run into the problem of GPUs being over-utilized, and long queue wait times that would
43
00:05:09.330 --> 00:05:13.410
David John Gagne: kind of slow down the productivity of the researchers.
44
00:05:15.810 --> 00:05:20.130
David John Gagne: So, to give everyone a broad sense of what the machine learning pipeline looks like:
45
00:05:22.230 --> 00:05:24.450
David John Gagne: the first part is
46
00:05:25.140 --> 00:05:27.090
David John Gagne: Defining the problem, and this is.
47
00:05:27.570 --> 00:05:31.260
David John Gagne: a small-compute, big-people-time
48
00:05:32.370 --> 00:05:40.950
David John Gagne: aspect of the exercise. It's also a lot of meetings between machine learning people and domain experts, sketching out:
49
00:05:42.960 --> 00:05:53.280
David John Gagne: what is the data, what is the scientific outcome or hypothesis we're trying to test, what data, baselines, and metrics should we be using.
50
00:05:55.110 --> 00:05:58.680
David John Gagne: And then there's a whole process of data gathering and data processing that
51
00:05:59.070 --> 00:06:05.220
David John Gagne: often takes up about 60 to 70% of a machine learning person's time.
52
00:06:06.660 --> 00:06:17.070
David John Gagne: Just because getting the data into the right format is critical: you often don't get a successful machine learning model unless the data is properly structured, you're using the right variables, and you're predicting the right thing.
53
00:06:18.870 --> 00:06:21.030
David John Gagne: Then there's the compute-intensive, but
54
00:06:21.060 --> 00:06:21.600
Matthias Rempel: Maybe less.
55
00:06:21.810 --> 00:06:24.390
David John Gagne: people-time-intensive, part of model selection and training.
56
00:06:26.760 --> 00:06:34.830
David John Gagne: And on top of that, usually once you have a set of trained models, you want to run a lot of evaluation to see how well they're performing, and
57
00:06:35.250 --> 00:06:43.470
David John Gagne: whether they have any issues, and to get some scientific justification. So there's an interpretation process that goes along with this, too.
58
00:06:44.220 --> 00:06:54.030
David John Gagne: And once you've gone through all of these steps, then usually we will want to eventually deploy this in some form in a research context, right? Like,
59
00:06:54.060 --> 00:06:59.460
David John Gagne: if we have a system that works really well, we want to hand it off to someone like NOAA or a private company, or
60
00:07:01.350 --> 00:07:12.960
David John Gagne: or even run it ourselves in sort of quasi-real-time, to learn how well it operates. There may also be paths to operations for, say,
61
00:07:14.400 --> 00:07:19.710
David John Gagne: non-weather kinds of stuff; say, the HPC system itself. I know there's been a number of projects
62
00:07:19.710 --> 00:07:20.160
Matthias Rempel: Doing.
63
00:07:20.640 --> 00:07:22.080
David John Gagne: trying to do machine learning to help
64
00:07:22.110 --> 00:07:40.740
David John Gagne: optimize things like, how can we efficiently subset data, or come up with keywords for the data archive to populate less-well-described datasets, or
65
00:07:41.100 --> 00:07:57.090
David John Gagne: maybe there are some machine learning systems to manage the data center in Cheyenne and its heating and cooling stuff. I know there are people working on that for other data centers, but
66
00:07:58.380 --> 00:07:59.910
David John Gagne: the main point is I think there are
67
00:08:00.180 --> 00:08:04.470
David John Gagne: other opportunities out there; it's not just machine learning for weather and climate that GPUs can be used for, as
68
00:08:04.470 --> 00:08:04.860
well.
69
00:08:07.740 --> 00:08:11.640
David John Gagne: I have a list here of some of what I'd call data pre-processing HPC challenges.
70
00:08:13.110 --> 00:08:15.510
David John Gagne: I think some of these are issues that are
71
00:08:15.750 --> 00:08:17.250
David John Gagne: software based; some of them are ones where
72
00:08:17.640 --> 00:08:18.780
David John Gagne: GPUs could benefit.
73
00:08:20.520 --> 00:08:24.060
David John Gagne: One area that's always a challenge is getting access to all the data.
74
00:08:26.580 --> 00:08:30.030
David John Gagne: For a machine learning problem, versus a simulation problem:
75
00:08:30.990 --> 00:08:46.200
David John Gagne: with machine learning, we often need access to a large archive of data, and not just, say, a single event for initializing a single simulation. Now we want, you know, 50 years of storm tracks, or
76
00:08:47.310 --> 00:08:52.560
David John Gagne: ERA5: we want to do something on all of ERA5, or some broad subset of it.
77
00:08:54.930 --> 00:08:56.880
David John Gagne: So there's a lot more
78
00:08:56.970 --> 00:08:57.420
David John Gagne: kind of.
79
00:08:57.510 --> 00:09:04.200
David John Gagne: up-front data downloading and processing that's often required to do any kind of machine learning task.
80
00:09:05.430 --> 00:09:12.000
David John Gagne: In some areas we've gotten more established working with the data, so this becomes less of an issue, but
81
00:09:13.320 --> 00:09:14.520
David John Gagne: Still definitely in the.
82
00:09:16.050 --> 00:09:26.820
David John Gagne: like, lots of people wanting to apply machine learning to their own problems and figuring out that step. So there's a lot of piloting and building up of this data infrastructure, making things analysis-ready, and
83
00:09:28.110 --> 00:09:28.830
David John Gagne: And whatnot.
84
00:09:30.750 --> 00:09:46.860
David John Gagne: One challenge of this also depends on where the data is located. If it's on GLADE, then life is relatively easy for us, but there are some datasets that are entirely in the cloud, and maybe we don't want to download the entire dataset off the cloud onto
85
00:09:46.890 --> 00:09:48.390
David John Gagne: GLADE; we want to be able to
86
00:09:48.540 --> 00:09:55.350
David John Gagne: stream it as needed, and figuring out how to efficiently do that is an ongoing thing we're working on.
87
00:09:56.280 --> 00:10:11.760
David John Gagne: There's also data on external data services, you know, like Earth System Grid, or NASA tape drives of satellite data, where there's a big bottleneck in getting the data from its original source.
88
00:10:14.790 --> 00:10:19.830
David John Gagne: I know there are efforts outside what we're doing to help with that, but
89
00:10:20.520 --> 00:10:24.870
David John Gagne: those efforts will also help enable machine learning tasks being done on that kind of data.
90
00:10:26.310 --> 00:10:31.380
David John Gagne: There's also a lot of work on data conversion that we often have to do. This is things like interpolation,
91
00:10:31.380 --> 00:10:35.340
David John Gagne: regridding, converting from, say, a raw
92
00:10:36.480 --> 00:10:51.990
David John Gagne: domain-specific format to something that's more portable like netCDF or Zarr; subsetting relevant data to grab what we need for the machine learning problem. Maybe sometimes we need to shuffle, or overlay, or
93
00:10:54.270 --> 00:11:02.100
David John Gagne: reshape the data. The data access patterns can sometimes be a bit different for some machine learning tasks than for
94
00:11:03.390 --> 00:11:07.290
David John Gagne: a simulation task, because we may need to grab, say, every storm in a
95
00:11:08.700 --> 00:11:13.620
David John Gagne: climate simulation, so there's an object-orientation aspect to that,
96
00:11:14.850 --> 00:11:18.780
David John Gagne: or needing to do random access across multiple years to build a big
97
00:11:18.840 --> 00:11:21.000
David John Gagne: batch of data to go into a neural network.
98
00:11:23.070 --> 00:11:23.490
David John Gagne: and
99
00:11:23.820 --> 00:11:28.020
David John Gagne: And when doing that at scale, there are certain bottlenecks to work around there.
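[The random-access batching pattern just described can be sketched in a few lines. This is an illustrative toy, not NCAR code: each "year" is a stand-in for an archive of files, and a batch is drawn by random access across all of them rather than by sequential reads.]

```python
import random

# Hypothetical sketch: each "year" holds a list of samples (stand-ins for
# files on disk); a training batch is built by random access across all years.
def build_batch(years, batch_size, rng):
    """Draw batch_size distinct samples uniformly across every year's archive."""
    # Flatten the index space into (year, position) pairs for every sample.
    index = [(y, i) for y, samples in years.items() for i in range(len(samples))]
    picks = rng.sample(index, batch_size)  # random access, not sequential reads
    return [years[y][i] for y, i in picks]

years = {yr: [f"{yr}-sample{i}" for i in range(100)] for yr in range(1979, 1984)}
batch = build_batch(years, 32, random.Random(0))
```

[In practice each lookup here would be a seek into a different file, which is exactly the I/O bottleneck being described.]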
100
00:11:30.510 --> 00:11:35.430
David John Gagne: Another task that's often done in the pre-processing step is this kind of scaling or standardization, where we're having to
101
00:11:37.830 --> 00:11:39.870
David John Gagne: calculate, like, mean and standard deviation, and
102
00:11:41.460 --> 00:11:50.880
David John Gagne: convert the data into that format from its original raw values, or do a log transform or some other kind, or calculate derived variables.
103
00:11:53.040 --> 00:12:00.450
David John Gagne: And then we have to keep all of that around, or figure out: do we need another set of pre-processed data to work with?
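[A minimal sketch of the scaling step being described: fit the mean and standard deviation once over the training archive, then standardize (or log-transform) values as they are read. Function names are illustrative, not from any NCAR tool.]

```python
import math
import statistics

def fit_scaler(values):
    # fit statistics once over the (training) archive
    return statistics.fmean(values), statistics.pstdev(values)

def standardize(x, mean, std):
    # zero mean, unit variance
    return (x - mean) / std

def log_transform(x, eps=1e-6):
    # common for positive, heavy-tailed fields; eps guards against log(0)
    return math.log(x + eps)

raw = [2.0, 4.0, 6.0, 8.0]
mean, std = fit_scaler(raw)
scaled = [standardize(x, mean, std) for x in raw]
```

[The "keep it all around" question is whether `scaled` gets written back to disk as a second copy of the dataset, or recomputed on the fly each time.]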
104
00:12:04.530 --> 00:12:06.300
David John Gagne: And within that, how can the gpu.
105
00:12:06.300 --> 00:12:06.630
help.
106
00:12:08.010 --> 00:12:13.140
David John Gagne: Right now, most of what I've described is pretty CPU-based for the most part.
107
00:12:13.800 --> 00:12:19.680
David John Gagne: There are a few libraries that are coming out. NVIDIA has a RAPIDS library that
108
00:12:20.160 --> 00:12:30.630
David John Gagne: has, like, cuDF, that's supposed to replicate the functionality of NumPy and pandas, and some of that might help with some of these tasks that are more compute-intensive rather than I/O-intensive.
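[The appeal of cuDF is that it mirrors much of the pandas API, so dataframe code can be written once and fall back to pandas when no GPU (or no RAPIDS install) is present. A hedged sketch of that drop-in pattern, with made-up station data:]

```python
# Drop-in idea: use GPU dataframes when available, pandas otherwise.
try:
    import cudf as xdf      # RAPIDS GPU dataframes, if installed
except ImportError:
    import pandas as xdf    # CPU fallback with a near-identical API

df = xdf.DataFrame({"station": ["A", "A", "B", "B"],
                    "temp_c": [10.0, 14.0, 20.0, 22.0]})
# groupby/aggregate is the kind of compute-bound step that benefits on GPU
means = df.groupby("station")["temp_c"].mean()
```

[Not every pandas operation is covered by cuDF, so the fallback also serves as a correctness check.]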
109
00:12:33.090 --> 00:12:37.740
David John Gagne: And so I think there could be some additional benefit where GPUs can be used, if available.
110
00:12:38.430 --> 00:12:41.520
David John Gagne: The only concern, of course, is the I/O bottlenecks that come along with
111
00:12:41.610 --> 00:12:51.180
David John Gagne: needing to load the data. It's already a pretty I/O-heavy operation loading the data from disk to memory, or downloading from the cloud to disk to memory.
112
00:12:52.740 --> 00:12:58.590
David John Gagne: So adding another step: how much of an overhead cost is that?
113
00:13:00.090 --> 00:13:07.110
David John Gagne: Numba also has some ability to work on the GPU and help convert Python code to more GPU-runnable code with-
114
00:13:09.570 --> 00:13:11.280
David John Gagne: out too much work, I think.
115
00:13:12.750 --> 00:13:21.690
David John Gagne: So that would be another thing to look into supporting, but currently we're not using any of these, like, for
116
00:13:21.900 --> 00:13:24.540
David John Gagne: their GPU versions at this time.
117
00:13:27.360 --> 00:13:32.340
David John Gagne: On the other end of the spectrum is evaluation and interpretation. So,
118
00:13:35.250 --> 00:13:44.970
David John Gagne: in terms of HPC challenges: when you're training a model, you visualize how it's doing over time to see if you need to stop training, or
119
00:13:45.480 --> 00:13:56.220
David John Gagne: make sure it's actually converging toward a lower value. Sometimes, if you have something set up wrong, it will flatline, which can be a problem.
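[The "watch the curve and stop" check being described can be expressed as a small early-stopping rule. This is an illustrative sketch (not from any NCAR tool): stop when the validation loss hasn't meaningfully improved for `patience` consecutive epochs.]

```python
# Stop when the loss curve has flatlined: no improvement of at least
# min_delta over the best earlier value, for `patience` recent epochs.
def should_stop(losses, patience=3, min_delta=1e-3):
    if len(losses) <= patience:
        return False
    best_before = min(losses[:-patience])
    recent_best = min(losses[-patience:])
    return recent_best > best_before - min_delta

converging = [1.0, 0.8, 0.6, 0.5, 0.45, 0.41]           # keep training
flatlined  = [1.0, 0.7, 0.7001, 0.7002, 0.7001, 0.7002]  # stop
```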
120
00:13:57.600 --> 00:14:00.210
David John Gagne: We also are doing large amounts of machine learning inference.
121
00:14:01.470 --> 00:14:06.780
David John Gagne: as part of the evaluation and interpretation process, so making this efficient is quite useful.
122
00:14:09.750 --> 00:14:23.610
David John Gagne: Where things start to run into trouble is with some of the bigger deep learning models, things like U-Nets, when generating predictions or outputs like heat maps and saliency maps and things of that nature.
123
00:14:25.140 --> 00:14:32.850
David John Gagne: That exercise itself creates a large amount of data, so now we need to figure out how to best manage the data we're generating, not just the data we're reading in.
124
00:14:36.240 --> 00:14:44.430
David John Gagne: There are always ongoing challenges of doing our offline evaluations versus putting the machine learning model in the climate model or the weather model and running that,
125
00:14:44.910 --> 00:15:00.150
David John Gagne: and potentially even doing a fully integrated approach: training the model using simulation runs as part of the training pipeline. We're not there yet in terms of doing that, because there's still more integration work to be done.
126
00:15:01.680 --> 00:15:09.780
David John Gagne: But people are definitely moving in that direction, and it's a use case we'll expect to see running on Derecho
127
00:15:10.860 --> 00:15:14.400
David John Gagne: easily within Derecho's lifetime, if not before then.
128
00:15:15.720 --> 00:15:21.480
David John Gagne: And finally, I think one challenge in everything we're doing, that's always a challenge, is
129
00:15:22.200 --> 00:15:31.170
David John Gagne: implementing scoring and interpretation functions in, like, TensorFlow and PyTorch versus plain NumPy or regular Python. Having to use
130
00:15:31.950 --> 00:15:39.120
David John Gagne: a subtly different API causes lots of productivity issues and friction and annoyances,
131
00:15:39.840 --> 00:15:57.990
David John Gagne: because both TensorFlow and PyTorch each have their own array system under the hood that is mostly like NumPy, but not exactly NumPy-compliant. TensorFlow has exposed an experimental NumPy API, and PyTorch works more friendly with NumPy, but not perfectly. So,
132
00:15:59.010 --> 00:15:59.550
David John Gagne: The.
133
00:16:01.200 --> 00:16:15.780
David John Gagne: standardizing APIs in general, I think, helps a lot with some of these issues. So if you can make drop-in GPU support as seamless as possible, then that increases human productivity quite a bit.
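[One way to illustrate the "standardize on one array API" point: write scoring functions against operations that NumPy arrays, torch tensors, and `tf.experimental.numpy` arrays all expose, instead of a framework-specific API. A hedged sketch, demonstrated here with NumPy only:]

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    # abs() and .mean() are duck-typed: NumPy arrays and torch tensors both
    # provide them, so one function can serve offline and in-framework evaluation
    return abs(y_true - y_pred).mean()

score = mean_absolute_error(np.array([1.0, 2.0, 3.0]),
                            np.array([1.5, 2.0, 2.5]))
```

[The friction the talk describes appears exactly where this duck typing breaks down, e.g. fancy indexing or in-place ops that differ between the array systems.]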
134
00:16:17.130 --> 00:16:20.100
David John Gagne: And i'll turn it over to john to kind of talk about.
135
00:16:21.660 --> 00:16:25.380
David John Gagne: the existing stuff going on.
136
00:16:27.240 --> 00:16:28.770
David John Gagne: So John, ready to take it over?
137
00:16:29.340 --> 00:16:33.480
John Schreck: Yeah, DJ, do you want me to take the screen from you?
138
00:16:34.260 --> 00:16:36.120
David John Gagne: You can just tell me "next slide,"
139
00:16:37.320 --> 00:16:37.740
David John Gagne: that's fine.
140
00:16:40.920 --> 00:16:44.700
John Schreck: yeah that's fine um yeah so uh.
141
00:16:46.260 --> 00:16:54.840
John Schreck: I apologize in advance: some of these slides may be a little bit of a rehash if you were at the WIP talk a couple months ago, but I'm trying to mostly focus on what we're doing with GPUs.
142
00:16:55.620 --> 00:17:03.270
John Schreck: So just to start off a little bit here: one of the big things that we have to worry about is selecting a model.
143
00:17:04.500 --> 00:17:09.270
John Schreck: So just to kind of give you sort of a generic idea of neural nets.
144
00:17:10.350 --> 00:17:17.490
John Schreck: The top diagram here kind of shows you on the Left sort of like the inputs to a neural network model Okay, and it could be.
145
00:17:18.000 --> 00:17:26.040
John Schreck: Numbers, like floating point numbers or real numbers; it could be images, like two-dimensional arrays; and it can be sequential data, right, like
146
00:17:26.460 --> 00:17:31.530
John Schreck: Like like sentences, for instance, or like a sequence of floating point numbers okay.
147
00:17:32.010 --> 00:17:38.520
John Schreck: um so that gets passed through these neural network architectures and you know i'm kind of spare you the details just sticking with the illustration here.
148
00:17:38.820 --> 00:17:49.050
John Schreck: Um, as you can see, there are three main parts in this simplistic diagram: the input, the sort of in-between area of hidden layers, and then there's this model output. Okay, and sort of,
149
00:17:50.280 --> 00:18:03.510
John Schreck: roughly, or generically: the data on the left gets passed through the model, and the model sort of learns something about this data set. Okay, and it's not necessarily clear to us yet what these models are doing.
150
00:18:04.320 --> 00:18:07.560
John Schreck: But nevertheless, it will oftentimes perform well.
151
00:18:08.010 --> 00:18:11.970
John Schreck: For instance, what's coming out on the right side of this diagram here.
152
00:18:12.210 --> 00:18:16.590
John Schreck: Okay, and just to kind of like remind you, the output could also be very similar to what the input was.
153
00:18:16.770 --> 00:18:26.250
John Schreck: Right, it could just be a single output, like a number you're trying to predict, like, I don't know, a mass or something like that, or some numerical floating point number. It could also be an
154
00:18:27.900 --> 00:18:33.000
John Schreck: integer number, such as a one or two or three, which you should think of as
155
00:18:33.510 --> 00:18:42.780
John Schreck: A label, for instance, so like the input might be like an image of a giraffe right and the output is you want to pick the label that refers to giraffe and not all the other labels, you could have chosen from.
156
00:18:43.320 --> 00:18:49.800
John Schreck: Okay, so that's sort of an example of a classification problem, right? Um, but, you know, above all of this, really,
57
00:18:50.430 --> 00:18:59.430
John Schreck: independent of what you're feeding into these models, what's going on inside the models, and what's coming out of the models, you have to worry about certain settings in order to get the model to actually perform for you.
158
00:18:59.910 --> 00:19:09.390
John Schreck: Okay, and the generic term for these settings is hyperparameters. Alright, so in this example, it could be: well, how many hidden layers do you want?
159
00:19:09.870 --> 00:19:20.760
John Schreck: Okay, you have to choose that. Or, if you have three hidden layers, well, how many nodes (the green dots) do you want in each of the hidden layers? It doesn't have to be the same as it's drawn here. Okay, so
160
00:19:21.390 --> 00:19:35.730
John Schreck: a "bad choice," and I put quotes around that, basically leads you to a poor model; sometimes that's what a bad choice is, right? So how do you choose wisely, I guess? Okay, so I'll get into that, and what we're doing with the GPUs, mainly here on Casper.
161
00:19:36.930 --> 00:19:39.960
John Schreck: yeah thanks a lot so uh.
162
00:19:40.380 --> 00:19:47.070
John Schreck: For the most part, a lot of the neural nets that we're dealing with tend to be growing in size, and David John kind of pointed out, you know, that
163
00:19:47.130 --> 00:19:52.800
John Schreck: a lot of the benchmarks that have been reported are using certain models that are different sizes and doing different things.
164
00:19:53.730 --> 00:19:59.700
John Schreck: So, for the most part, at some point you're going to have to wind up using GPUs, because training these models involves loads of
165
00:20:00.240 --> 00:20:06.360
John Schreck: numerical calculations that need to be done on the GPU. Okay, so I've pointed out two cases here, really.
166
00:20:07.350 --> 00:20:14.610
John Schreck: The first one is sort of one I'm not really going to get into, which is: your model and your data don't fit onto one GPU, so you need more than one GPU.
167
00:20:14.970 --> 00:20:20.040
John Schreck: Okay. PyTorch and TensorFlow (and there are other libraries out there) are sort of the two most popular ones.
168
00:20:20.460 --> 00:20:31.320
John Schreck: They have been working a lot lately to really make this a lot easier. For instance, you can break up your data, the batches of data that you're passing in on the left in this diagram, coming out on the right, and then fitting to the truth.
169
00:20:32.130 --> 00:20:41.010
John Schreck: That can actually be broken up and handed off to a bunch of different GPUs, and that data can then be recombined to perform a single weight update on the model.
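[The split-and-recombine scheme being described can be simulated in a few lines of NumPy: the batch is split into shards, one per "GPU"; each shard computes a gradient; the averaged gradient drives one weight update. A toy sketch with a one-parameter linear model, not framework code:]

```python
import numpy as np

def grad_mse_linear(w, x, y):
    # gradient of mean((w*x - y)^2) with respect to the scalar weight w
    return (2.0 * (w * x - y) * x).mean()

def data_parallel_step(w, x, y, n_devices, lr=0.1):
    # shard the batch across "devices", compute per-shard gradients,
    # then average them into a single weight update
    x_shards = np.array_split(x, n_devices)
    y_shards = np.array_split(y, n_devices)
    grads = [grad_mse_linear(w, xs, ys) for xs, ys in zip(x_shards, y_shards)]
    return w - lr * np.mean(grads)

x = np.arange(1.0, 9.0)   # 8 samples, true relation y = 3x
y = 3.0 * x
w = data_parallel_step(0.0, x, y, n_devices=4)
```

[With equal-sized shards, the averaged gradient equals the full-batch gradient, so the four-"device" step matches a single-device step exactly.]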
170
00:20:41.700 --> 00:20:49.290
John Schreck: The latter case is what's more interesting to me in some senses: okay, the model and the data fit on one GPU, but how much of those resources are left over?
171
00:20:49.680 --> 00:21:03.870
John Schreck: Okay, because right now, the way the system works on Casper is that if you ask for a GPU, you get all of the memory that comes with it. Okay, but it's very often the case that people use GPUs and they're not using all 32 gigs, for instance, on some of our nodes.
172
00:21:04.920 --> 00:21:09.960
John Schreck: So can we use those leftover resources next slide please.
173
00:21:12.030 --> 00:21:19.680
John Schreck: Alright, so just to go back to the first part: how do we pick the right hyperparameters so we get a good machine learning model? I'll try to
174
00:21:20.190 --> 00:21:28.410
John Schreck: go over some of this stuff relatively quickly. Um, there are a few main steps involved here, and the diagram on the left illustrates them to some extent.
175
00:21:29.040 --> 00:21:37.920
John Schreck: You have some objective when you're training a model that you're trying to achieve: you want to minimize some quantity, and the example here is the mean absolute error. Okay.
176
00:21:39.060 --> 00:21:45.960
John Schreck: When you have to choose hyperparameters, I'm highlighting, for instance, the learning rate and the number of neurons. So you might just pick, I don't know, a learning rate of
177
00:21:46.440 --> 00:21:53.070
John Schreck: one tenth and 15 neurons; you train the model, you get a mean absolute error. That's a trial.
178
00:21:53.580 --> 00:22:00.810
John Schreck: Okay, so a study, then, is when we do that a whole bunch of times. We just pick different combinations of learning rate and neurons, we train them, and get an
179
00:22:01.380 --> 00:22:09.420
John Schreck: optimization objective value at the end of the day. Okay, so one way that you can sample these, for instance, is by just guessing, which,
180
00:22:09.990 --> 00:22:13.710
John Schreck: if you're not very well informed on how the neural net is going to work, is random search.
181
00:22:14.100 --> 00:22:19.380
John Schreck: Um, I will show you in a few slides that we use a combination of random searching, and then we use what
182
00:22:19.710 --> 00:22:31.890
John Schreck: I'm just going to refer to as an informed search, which is really a Gaussian mixture model based on Bayesian statistics that tries to leverage observations that you've already made. That's basically:
183
00:22:33.000 --> 00:22:38.520
John Schreck: What was a trial, what was the outcome and then tries to use that to pick a better next set.
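[The trial/study loop just described, in its simplest random-search form. This is an illustrative sketch, not ECHO code: the `objective` is a made-up stand-in for "train the model, report the mean absolute error," with its best point placed arbitrarily near a learning rate of 0.01 and 64 neurons.]

```python
import math
import random

def objective(learning_rate, n_neurons):
    # hypothetical response surface standing in for a real training run
    return (math.log10(learning_rate) + 2.0) ** 2 + ((n_neurons - 64) / 64.0) ** 2

def random_search(n_trials, rng):
    """Each iteration is one 'trial'; the whole loop is one 'study'."""
    best = None
    for _ in range(n_trials):
        lr = 10.0 ** rng.uniform(-4.0, -1.0)   # sample learning rate on a log scale
        neurons = rng.randrange(8, 257)        # 8..256 neurons
        trial = (objective(lr, neurons), lr, neurons)
        if best is None or trial[0] < best[0]:
            best = trial
    return best

best_loss, best_lr, best_neurons = random_search(50, random.Random(42))
```

[An "informed" search would replace the uniform sampling with a model fit to the (trial, outcome) history, as the talk describes.]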
184
00:22:40.680 --> 00:22:41.370
John Schreck: Next slide please.
185
00:22:43.380 --> 00:22:53.160
John Schreck: Alright, so, um, we've been working on a little package here called ECHO; I think I've presented this before. It's called Earth Computing Hyperparameter Optimization; it's a distributed, multi-GPU approach to hyperparameter optimization.
186
00:22:54.120 --> 00:22:59.760
John Schreck: There's my GitHub for it. It's a little bit of a mess right now; we're trying to get an update out pretty quickly here.
187
00:23:00.630 --> 00:23:05.880
John Schreck: Overall, we're trying to make it pretty easy to use; in some sense it's still a little hard to use.
188
00:23:06.510 --> 00:23:15.600
John Schreck: The new updates try to make that a lot easier for other people at NCAR who want to do machine learning but aren't necessarily doing it every day, and are just trying to get into it.
189
00:23:16.020 --> 00:23:23.010
John Schreck: So for now there are some dependencies I'm not going to get into too much. Later on, I hope to give you an option where you can actually just upload your data set,
190
00:23:23.400 --> 00:23:28.380
John Schreck: or, you know, point your data set at ECHO, and it might even possibly be able to suggest models for you.
191
00:23:29.370 --> 00:23:35.280
John Schreck: Right now, you have to pick a model that you want to optimize, and then
192
00:23:35.580 --> 00:23:44.040
John Schreck: ECHO will take over from there. But it's still up to you: if it's a language processing model, you still have to pick the appropriate model. Next slide, please.
193
00:23:46.350 --> 00:23:56.580
John Schreck: Alright, so the way that this works here at NCAR: for the most part we have Casper and Cheyenne. Casper is the one that has GPUs, but sometimes models are right at the cusp, or you could use both.
194
00:23:57.660 --> 00:24:04.650
John Schreck: So the way that we kind of distribute all this is by initiating a database entry for a study.
195
00:24:05.220 --> 00:24:14.130
John Schreck: Okay, so I want to optimize a model; I'm going to do a bunch of trials, so I will save all of that data in a database as part of the trial's, or the study's,
196
00:24:14.430 --> 00:24:16.440
John Schreck: record. Okay, so
197
00:24:16.740 --> 00:24:27.330
John Schreck: that means all I really need to be able to do is write to this database, which means I don't have to have everything on the same computer. I could have stuff running in, you know, South America if I really wanted to, as long as it can reach the database.
198
00:24:28.110 --> 00:24:36.570
John Schreck: So that basically means I can run as many trials as I want simultaneously and max out my resources. Okay, and that's the objective here, and that's what the drawing is trying to show you.
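[The shared-database coordination pattern just described can be sketched with SQLite: every worker, wherever it runs, records its trials in one study table and reads others' results back. ECHO's actual storage layer may differ; the table and function names here are illustrative only.]

```python
import sqlite3

# A real study would point at a shared server or networked file;
# in-memory is enough to show the pattern.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE trials
                (study TEXT, worker TEXT, learning_rate REAL, loss REAL)""")

def report_trial(study, worker, lr, loss):
    # any worker, on any machine, only needs write access to this table
    conn.execute("INSERT INTO trials VALUES (?, ?, ?, ?)",
                 (study, worker, lr, loss))

def best_trial(study):
    # workers (or an informed sampler) read the shared history back
    return conn.execute(
        "SELECT worker, learning_rate, loss FROM trials "
        "WHERE study = ? ORDER BY loss LIMIT 1", (study,)).fetchone()

report_trial("demo", "casper-gpu-1", 0.01, 0.42)
report_trial("demo", "cheyenne-17", 0.10, 0.77)
best = best_trial("demo")
```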
199
00:24:36.990 --> 00:24:44.670
John Schreck: Alright, so I'm just showing you workers, and I use that to mean a node, okay, with GPUs, or, you know, whatever you ask for.
200
00:24:45.210 --> 00:24:55.170
John Schreck: On a node you've got to specify what you want, right: you want one GPU, you want multiple GPUs, whatever. But when I say worker, that's what I have in mind, just for the purpose of how I made my slides. Next slide, please.
201
00:24:57.060 --> 00:25:04.350
John Schreck: So within each worker or node there's going to be a GPU. Let's just say there's one GPU; I just asked for one GPU per worker.
202
00:25:04.620 --> 00:25:09.450
John Schreck: Okay, and let's just suppose that the memory footprint, you know, the amount of data plus the model
203
00:25:09.870 --> 00:25:24.390
John Schreck: that's going to be mounted on the GPU at once is only ever going to be, let's just say, a quarter of the total memory available. Well, that means I can put four models and the data, basically copying it four times, onto the GPU, okay? And that's what ECHO does.
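The memory arithmetic John walks through (each model-plus-data copy fits in a fraction of GPU memory, so several copies fit at once) amounts to one division. The GPU and per-trial sizes below are illustrative, not numbers from the talk.

```python
def replicas_per_gpu(gpu_memory_gb, per_trial_gb):
    """How many model+data copies fit on one GPU: if each trial uses at most
    a quarter of the memory, four copies fit."""
    return int(gpu_memory_gb // per_trial_gb)

# Illustrative numbers: a 40 GB GPU, 10 GB per model+data copy.
copies = replicas_per_gpu(40, 10)   # four copies per GPU
concurrent = 3 * copies             # three workers, so 12 trials at once
```

The hard part, as John notes next, is estimating `per_trial_gb` up front; ECHO currently leaves that choice to the user.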
204
00:25:25.470 --> 00:25:29.520
John Schreck: So in this example here, it's kind of trivial math, right? It's like, alright, cool, I have
205
00:25:30.420 --> 00:25:40.230
John Schreck: three workers with four models per GPU each, so now I have 12 things going at once. Okay, the main problem here really is that it's tough, for me at least,
206
00:25:41.220 --> 00:25:46.230
John Schreck: to really accurately estimate how many resources you're going to need up front, right. And
207
00:25:46.710 --> 00:25:53.250
John Schreck: right now the approach is just to leave it to you, the user: well, how many copies of your model do you want to try on the same GPU?
208
00:25:53.520 --> 00:26:02.010
John Schreck: Later on it would be ideal for you not to have to do that. At the end of the day I really want ECHO to be as simple and as user-friendly as possible, because
209
00:26:02.310 --> 00:26:11.760
John Schreck: most people don't care about these nitty-gritty GPU tricks and things like that; you don't need to know that. You just need to get the best model that you can get.
210
00:26:12.180 --> 00:26:18.390
John Schreck: Okay, and obviously that's ultimately our objective. I just hit next slide on my keyboard, but I guess that's fine.
211
00:26:19.560 --> 00:26:28.170
John Schreck: Alright, so I'll try to go through my example here. HOLODEC is probably the project of ours that really uses the GPUs the most, to a certain degree. So HOLODEC is a hologram
212
00:26:29.640 --> 00:26:39.180
John Schreck: detector, a holographic detector. Our collaborators here are Aaron Bansemer and Matt Hayman over in EOL. You can see here the detector mounts to a plane.
213
00:26:39.810 --> 00:26:47.640
John Schreck: It's been mounted on multiple planes, research planes, and you basically fly this thing around clouds. Okay, and what it does, as it's
214
00:26:47.970 --> 00:26:55.440
John Schreck: written up at the top, is determine the size, the two-dimensional shape, and the three-dimensional position of hydrometeors,
215
00:26:55.860 --> 00:27:04.320
John Schreck: for instance, in these clouds. Okay, so the main things we're interested in are small liquid water particles, but I'll show you a few examples at the end of, like, other stuff
216
00:27:04.470 --> 00:27:11.910
John Schreck: in these holograms, and some of the things we don't even really know what they are. Okay, so next slide I'll give you a couple of examples of what this looks like.
217
00:27:12.510 --> 00:27:19.140
John Schreck: So the image on the left is a real example of a hologram that came from HOLODEC; I just refer to them as the HOLODEC holograms.
218
00:27:19.440 --> 00:27:25.980
John Schreck: And I zoom in on a particle there. The particle is not in focus here; it's some distance away, like into the page.
219
00:27:26.430 --> 00:27:32.340
John Schreck: Okay, on the right is a simulated hologram. You can see it's a lot better looking, right?
220
00:27:32.940 --> 00:27:44.520
John Schreck: Even though that particle on the right is not in focus, it's pretty obvious compared to the one on the left, which is kind of fading. And if you look very carefully on the left there's weird stuff, noise and other things, that are...
221
00:27:45.840 --> 00:27:48.300
John Schreck: I don't want to say that they're artifacts, but they're there.
222
00:27:49.440 --> 00:27:53.100
John Schreck: These holograms are very large; note that they're megapixels in size.
223
00:27:53.370 --> 00:27:58.470
John Schreck: Okay, and the little squares I've actually drawn as little insets are 512-by-512 subsamples,
224
00:27:58.680 --> 00:28:09.420
John Schreck: which is actually the size we're going to feed into a neural net, because even that's kind of big. There's no chance right now that you could just take this giant image, feed it through a neural net, and not run out of GPU memory real quick.
225
00:28:10.500 --> 00:28:24.750
John Schreck: Among other things. So the current way you extract these particles is a program called HoloSuite. It does not involve machine learning; it's based on physics calculations. So our main question is: can we get a neural net to do better than that? Next slide, please.
226
00:28:26.730 --> 00:28:37.560
John Schreck: So in that example I showed before, it's clear that that's a plane, right, the x-y plane, and if you're able to locate where the particle is, you've got x and y.
227
00:28:37.950 --> 00:28:44.640
John Schreck: Okay, but z kind of stymied us for about a year, and we actually wound up taking a bit of a
228
00:28:45.240 --> 00:28:54.750
John Schreck: play out of the HoloSuite playbook, which is to take advantage of something called wave propagation. All that really means is that the hologram is an electromagnetic field that we're looking at; it's just been
229
00:28:55.590 --> 00:28:59.250
John Schreck: processed in a way that we can actually make sense of it with our eyes.
230
00:29:00.000 --> 00:29:12.240
John Schreck: But as such it's governed by the laws of physics, and we can take advantage of those to take that image and reconstruct the hologram at some other distance z into the page, okay?
231
00:29:12.810 --> 00:29:19.740
John Schreck: So at certain distances the particles come into focus. Now, in order for us to take advantage of that, to
232
00:29:20.580 --> 00:29:33.450
John Schreck: wave propagate to some z and say, oh yeah, the particle's in focus there, so that's where it is in z, we have to use Fourier transforms. In fact we have to do one, and then an inverse Fourier transform after that, in order to do the wave prop.
233
00:29:34.590 --> 00:29:40.680
John Schreck: The pictures just show you two different ways the waves can come in, and we take advantage of the one on the left, where the
234
00:29:40.920 --> 00:29:47.460
John Schreck: hologram is just shooting out plane incident waves; we don't have a radially emanating detector like the one on the right.
235
00:29:48.600 --> 00:30:01.380
John Schreck: So I wanted to note that as of PyTorch 1.9 they started supporting fast Fourier transforms. We had written it ourselves, but it's a lot easier to just call their method, and it was already natively GPU-
236
00:30:02.820 --> 00:30:11.790
John Schreck: oriented. In other words, it's done on the GPU. Okay, and this is in some sense part of data preprocessing, not so much the model yet.
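The wave-propagation step John describes, one FFT, a phase factor, then an inverse FFT, can be sketched with the angular spectrum method. The talk's pipeline does this with `torch.fft` on the GPU; `numpy.fft` has the same call shape, and the wavelength and grid spacing below are illustrative, not HOLODEC's actual parameters.

```python
import numpy as np

def angular_spectrum_propagate(field, z, wavelength, dx):
    """Reconstruct a hologram plane at distance z: FFT the field, multiply by
    the free-space transfer function exp(i*kz*z), then inverse FFT back."""
    ny, nx = field.shape
    k = 2 * np.pi / wavelength
    kx = 2 * np.pi * np.fft.fftfreq(nx, d=dx)
    ky = 2 * np.pi * np.fft.fftfreq(ny, d=dx)
    # Cast to complex so evanescent modes (k_z imaginary) are handled too.
    kz = np.sqrt((k**2 - kx[None, :]**2 - ky[:, None]**2).astype(complex))
    spectrum = np.fft.fft2(field)           # forward FFT
    spectrum *= np.exp(1j * kz * z)         # propagate each plane wave by z
    return np.fft.ifft2(spectrum)           # inverse FFT to image space
```

Propagating forward by `z` and then back by `-z` recovers the original field, which is a handy sanity check on an implementation like this.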
237
00:30:12.930 --> 00:30:13.710
John Schreck: Next slide please.
238
00:30:15.570 --> 00:30:18.600
John Schreck: Let me know, by the way, Brian if I run out of time or you need me to stop.
239
00:30:19.680 --> 00:30:27.540
John Schreck: So here's the model, okay. The diagram on the left, I just want to show you, is the same little inset pictures I showed previously, those out-of-focus particles.
240
00:30:27.870 --> 00:30:32.880
John Schreck: And then the little sub-panel on the right of that shows me doing that wave-prop calculation on the GPU
241
00:30:33.390 --> 00:30:40.980
John Schreck: to the z where the particle's in focus, and when it is, it's just this nice little dark dot, right? And that's the particle's diameter that you're looking at.
242
00:30:41.460 --> 00:30:47.970
John Schreck: Okay, so that's what we want to estimate. We want to be able to get the x, y in the plane, and we want to be able to wave prop to a z that's pretty close,
243
00:30:48.180 --> 00:30:53.460
John Schreck: right where the particle looks like it's in focus, and then we can get an estimate of that diameter. Okay, so
244
00:30:54.120 --> 00:31:02.940
John Schreck: these are the things that go into a neural net. I'm not going to give you any details other than it's large; it's a U-Net, if you know what that is. U-Nets and other types of models can output
245
00:31:03.630 --> 00:31:06.900
John Schreck: basically image-type outputs, which I'm going to refer to as a mask here.
246
00:31:07.200 --> 00:31:15.750
John Schreck: And that mask on the right side is basically binary, right: it's predicting zeros where there's no particle in focus, and ones where it is in focus. So it's a little
247
00:31:16.230 --> 00:31:29.370
John Schreck: circle I'm trying to scratch in, basically where the particle is. So once I have a good mask prediction on the right, I can back out the diameter from it. Next slide, please.
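One common way to back a diameter out of a binary in-focus mask is the equivalent-area diameter: count the "one" pixels, treat that area as a circle, and invert the area formula. The talk doesn't give HOLODEC's exact formula, so this is a sketch of the idea.

```python
import numpy as np

def equivalent_diameter(mask, dx=1.0):
    """Diameter of the circle whose area matches the masked pixel area:
    solve area = pi * (d/2)**2 for d. dx is the pixel size."""
    area = mask.sum() * dx * dx
    return 2.0 * np.sqrt(area / np.pi)

# Example: rasterize a disk of radius 10 pixels and recover roughly 20
# for the diameter (rasterization makes it approximate).
yy, xx = np.mgrid[:64, :64]
disk = ((xx - 32)**2 + (yy - 32)**2 <= 10**2).astype(np.uint8)
d = equivalent_diameter(disk)
```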
248
00:31:31.200 --> 00:31:44.850
John Schreck: Alright, so here's my ECHO optimization of this model. I'll spare you the details: I tried to optimize quite a number of hyperparameters, I tried a bunch of different types of objective loss functions,
249
00:31:45.330 --> 00:31:55.710
John Schreck: I tried a whole range of segmentation models, and I varied the types of pre-trained layers in the encoder; that was a hyperparameter as well.
250
00:31:56.940 --> 00:32:08.160
John Schreck: Details aside, you can see that I drew a vertical green dashed line that basically separates the random-selection phase of hyperparameters from the informed selection.
251
00:32:08.520 --> 00:32:22.380
John Schreck: It's pretty obvious, right: to the left the blue dots are just all over the place, and then, as soon as it starts trying to take advantage of what it's already seen (not the model, but our optimization package) trying to
252
00:32:23.430 --> 00:32:30.780
John Schreck: leverage the trials it's already seen to make better choices, that's pretty obvious too; look at how the blue dots collapse,
253
00:32:31.140 --> 00:32:36.810
John Schreck: especially around trial 130. It keeps going down a little, but we didn't see too much improvement after about trial 250. So
254
00:32:36.990 --> 00:32:50.400
John Schreck: this is not cheap, right. This is already taking advantage of that 4x, four models per GPU, and there are more than 400 trials here; actually more than that, I'm not showing all of them. I don't know how many GPU hours this took. I did it when you all weren't using them.
255
00:32:51.570 --> 00:32:52.470
John Schreck: Next slide please.
256
00:32:54.000 --> 00:33:02.280
John Schreck: So how do I actually use this thing once it's trained? I noted that you have to wave prop to a z, but we don't know where the particles are, so we have to just wave prop to a whole bunch of different
257
00:33:02.580 --> 00:33:09.030
John Schreck: z's and try to figure out if there's a particle there. Okay, so that's a choice: the number of z planes at which to reconstruct the hologram.
258
00:33:09.390 --> 00:33:16.410
John Schreck: Every time I do that, I have to break down that large hologram into those little 512-by-512 cells and pass each one to the model after performing the reconstruction.
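Breaking the full hologram into 512-by-512 cells and stitching the per-cell model outputs back into a full-size mask can be sketched as follows. This assumes image dimensions that divide evenly by the tile size; the real pipeline would also need padding or overlap handling, which John says he left out.

```python
import numpy as np

def tile_hologram(image, tile=512):
    """Break a large hologram into tile x tile cells for the neural net,
    remembering each cell's origin so predictions can be reassembled."""
    ny, nx = image.shape
    tiles, origins = [], []
    for y in range(0, ny, tile):
        for x in range(0, nx, tile):
            tiles.append(image[y:y + tile, x:x + tile])
            origins.append((y, x))
    return tiles, origins

def reassemble(tiles, origins, shape):
    """Stitch per-tile model outputs back into a full-size array."""
    out = np.zeros(shape, dtype=tiles[0].dtype)
    for t, (y, x) in zip(tiles, origins):
        out[y:y + t.shape[0], x:x + t.shape[1]] = t
    return out
```

Because each tile (and each z plane) is independent, they can be dispatched to as many workers as are available, which is the scalability point John makes next.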
259
00:33:16.650 --> 00:33:24.690
John Schreck: And then I wind up with a full-size mask prediction at each z value. Okay, I've left out all the details of how I
260
00:33:24.900 --> 00:33:31.140
John Schreck: handle that: I don't shrink down the full-size image, I subsample it, and then I reassemble the model outputs back into the full
261
00:33:31.620 --> 00:33:34.710
John Schreck: size. I don't really do much on the GPU with that, so I just left it out.
262
00:33:35.580 --> 00:33:41.280
John Schreck: So the point with how this pipeline is processed is that all the z planes going into the model are independent. I
263
00:33:41.490 --> 00:33:49.140
John Schreck: mean it doesn't matter what order they go in, right; any plane can go first and any other can go later. That just means I can do them all at once if I have the resources.
264
00:33:49.440 --> 00:33:56.520
John Schreck: Okay, I did this on purpose, so that the algorithm is scalable, because HoloSuite cannot scale, not the way it was written. So
265
00:33:57.420 --> 00:34:05.130
John Schreck: if I have all the GPUs, especially when Derecho comes around and I have even more resources, I can really push how fast I can process holograms.
266
00:34:05.760 --> 00:34:11.850
John Schreck: So just to say right now, it takes about two to five minutes per hologram because it's kind of a lot of processing.
267
00:34:12.270 --> 00:34:21.930
John Schreck: As David John kind of noted, there's a ton of data generated in intermediary phases here, things going into the model, versus just having a nice clean output list of
268
00:34:22.440 --> 00:34:29.700
John Schreck: particles and their locations, which is ultimately what we want. So there's a lot of difficulty there, and, you know, it's really more of a...
269
00:34:30.390 --> 00:34:35.460
John Schreck: I don't really know what to say about it; there's a lot of optimization that could still probably be done
270
00:34:35.910 --> 00:34:46.410
John Schreck: that gets into sort of interesting usages of GPUs and so on. And I'll note too that I take advantage of my little trick of mounting more than one model to a GPU:
271
00:34:46.770 --> 00:34:55.860
John Schreck: when I do this, I try to take advantage of every possible resource I can before I crash the node, okay. Next slide, please.
272
00:34:57.510 --> 00:35:09.870
John Schreck: So these are just some predictions of the x, y, z, and d coordinates. Now, when I run all those planes through the model at different z's, we actually have to do another post-processing
273
00:35:11.130 --> 00:35:15.960
John Schreck: calculation that I'm not going to show you, because it gets into too many details and it's already past 2:30.
274
00:35:16.320 --> 00:35:21.600
John Schreck: So just to say, we perform a clustering routine, and we use a distance threshold in order to perform that clustering.
275
00:35:22.080 --> 00:35:31.020
John Schreck: That allows me in some sense to toggle how many particles we actually predict, and then we can line them up with the true particles; this data is the simulated holograms.
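The distance-threshold matching John alludes to, where loosening the threshold toggles how many predicted particles count as matched against truth, might look like the greedy nearest-neighbor pairing below. The actual clustering routine isn't shown in the talk, so treat the function names and matching strategy as illustrative.

```python
import numpy as np

def match_particles(pred, truth, threshold):
    """Greedily pair each predicted particle (rows of x, y, z) with its
    nearest unmatched true particle, accepting the pair only if the
    distance is within the threshold."""
    dists = np.linalg.norm(pred[:, None, :] - truth[None, :, :], axis=2)
    matches, used = [], set()
    for i in np.argsort(dists.min(axis=1)):      # closest predictions first
        for j in np.argsort(dists[i]):
            if dists[i, j] > threshold:
                break                            # nothing closer remains
            if j not in used:
                matches.append((i, j))
                used.add(j)
                break
    return matches

def match_rate(pred, truth, threshold):
    """Fraction of true particles matched, e.g. the talk's ~86%."""
    return len(match_particles(pred, truth, threshold)) / len(truth)
```

Raising the threshold admits more (looser) matches; lowering it trades match rate for positional accuracy, which is the tuning knob John mentions.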
276
00:35:31.950 --> 00:35:41.070
John Schreck: And overall you can see pretty strong agreement; it's about an 86% match rate here. I picked a threshold value that gave me that on purpose, and I'll get into that next slide.
277
00:35:41.550 --> 00:35:47.040
John Schreck: You can fiddle with that distance threshold a little to toggle the performance that you want.
278
00:35:48.060 --> 00:35:54.240
John Schreck: So just to give you a little example here: Gabrielle Gantos wrote a really nice visualization script to help us
279
00:35:54.750 --> 00:36:03.420
John Schreck: visualize a very large number of particles in a 3D visualizer. On the left is the truth and on the right is the prediction; this is just one random hologram that I picked.
280
00:36:04.110 --> 00:36:09.780
John Schreck: If you look very carefully you can see some differences. It's not perfect, but then it was only an 86% match in this case.
281
00:36:11.010 --> 00:36:19.890
John Schreck: But it's pretty good, right? In fact, I didn't point this out in the previous example (yeah, thanks David John), but if you look at the bottom right, the d (diameter) histogram:
282
00:36:20.250 --> 00:36:26.970
John Schreck: we mostly just suffer a little at predicting the smallest particles, and I'm going to tell you that I kind of had to do that on purpose.
283
00:36:27.210 --> 00:36:33.510
John Schreck: I had to introduce noise into the training in order to get the neural nets to perform better on the real data, which had noise in it.
284
00:36:33.990 --> 00:36:44.370
John Schreck: That noise actually makes it a little harder to predict the smallest particles. I'm not going to get into details; we can talk about that more another time. DJ, can you go two slides ahead, please?
285
00:36:46.410 --> 00:36:51.270
John Schreck: So I just want to point out this last table here. It's a lot of numbers; I'm not going to make you look at all of them.
286
00:36:51.510 --> 00:37:01.440
John Schreck: This is a comparison of what the neural net is predicting on the real holograms. All the results I showed you before were on those perfect synthetic holograms; now here's a
287
00:37:01.800 --> 00:37:14.460
John Schreck: head-to-head comparison against HoloSuite. We had to manually label these examples. The way we did that was: we took all the predictions from HoloSuite and all the predictions from the neural net, and then myself,
288
00:37:15.240 --> 00:37:24.330
John Schreck: Matt Hayman, Aaron Bansemer, and Gabrielle Gantos manually labeled them. The label was a one or a zero, and a one basically meant: the particle is in focus, in my opinion.
289
00:37:24.870 --> 00:37:31.110
John Schreck: And that opinion came with a ranking, one to five, in confidence. So we're not trying to do the x, y, z, d here; we're just trying to say
290
00:37:31.680 --> 00:37:43.320
John Schreck: which one is right, because HoloSuite is not the truth here. And at the very bottom, the boldface numbers are the ones I want you to look at. Just look at the accuracy: it's 88% to 69%, so we beat HoloSuite by
291
00:37:44.580 --> 00:37:45.510
John Schreck: 19 points.
292
00:37:46.770 --> 00:37:47.700
John Schreck: Next slide please.
293
00:37:49.470 --> 00:37:53.940
John Schreck: So these are just a couple of examples of who's getting what wrong. The top
294
00:37:54.780 --> 00:38:03.690
John Schreck: row shows what HoloSuite is getting wrong. The first two examples here: one of them is called a wave reflection, just something that bounced off, I think, the detector.
295
00:38:03.990 --> 00:38:10.200
John Schreck: And these goofy, funky-looking patterns: the top-left one is not a true particle, and the one right next to it is actually an artifact.
296
00:38:10.530 --> 00:38:17.400
John Schreck: You know how there's a little bright spot and, next to it, a dark one? I don't know what it is; I was informed by the real experts,
297
00:38:17.670 --> 00:38:24.570
John Schreck: Matt Hayman and Aaron, who are the real hologram guys here. They know what these things are, and that is not a real particle that we're interested in.
298
00:38:25.140 --> 00:38:30.570
John Schreck: The two after that show particles that we labeled as true that HoloSuite just didn't get right for some reason.
299
00:38:31.200 --> 00:38:33.420
John Schreck: The bottom row shows the examples of the neural net.
300
00:38:33.810 --> 00:38:40.650
John Schreck: So the two there on the left are these blurry, half-moon-looking ones, and we really weren't sure what to make of them, like, are they particles?
301
00:38:40.950 --> 00:38:46.710
John Schreck: There could have been a particle that's slightly out of focus, and these were examples that were near the edge of the holograms.
302
00:38:48.630 --> 00:38:54.840
John Schreck: Just to say, when you do manual labeling it ain't perfect, and this is why we had to associate a confidence score.
303
00:38:56.250 --> 00:38:59.880
John Schreck: Now the two to the right of that are just more examples from the neural net itself.
304
00:39:00.780 --> 00:39:06.210
John Schreck: The bigger one there is kind of blurry, right; it's not completely dark, there's a little bit of a bright spot in the middle.
305
00:39:06.660 --> 00:39:13.080
John Schreck: And the one way over there on the right still has more of a noisy pattern in it. You can kind of see these wavy-looking, I don't know what to call them, but
306
00:39:13.410 --> 00:39:19.080
John Schreck: it's a noise pattern, and it's a very small particle, so it's one we know the model is already going to have a hard time predicting.
307
00:39:20.130 --> 00:39:20.820
John Schreck: Next slide please.
308
00:39:22.410 --> 00:39:22.800
John Schreck: That.
309
00:39:24.210 --> 00:39:29.970
John Schreck: I hope I didn't take up too much time. I just want to thank everyone who was involved in all this. Yeah.
310
00:39:32.700 --> 00:39:35.430
Brian Vanderwende: Thanks, guys. Yes, perfect timing.
311
00:39:36.600 --> 00:39:41.640
Brian Vanderwende: Looks like we've got plenty of time if people have questions they want to field toward
312
00:39:43.470 --> 00:39:45.930
Brian Vanderwende: either of these aspects.
313
00:39:47.250 --> 00:39:48.240
Brian Vanderwende: John Dennis, go ahead.
314
00:39:49.920 --> 00:39:55.530
John Dennis (he/him): yeah I was curious you were talking about all these calculations that you're performing.
315
00:39:56.760 --> 00:40:01.680
John Dennis (he/him): I assume... does the code run on CPU as well, and do you have comparisons?
316
00:40:05.580 --> 00:40:10.470
John Schreck: If I ran it on the CPU? The models, you mean, like actually evaluating the neural nets?
317
00:40:11.160 --> 00:40:21.990
John Dennis (he/him): Well, I'm just curious: is the GPU saving you a factor of four, a factor of two, in execution time?
318
00:40:23.010 --> 00:40:31.410
John Schreck: I would say, overall, these models are so large, and even handling the data, these images, is not an easy thing to do.
319
00:40:32.370 --> 00:40:36.930
John Schreck: I mean, if I had no GPUs at all, the project would not be doable; you couldn't do it.
320
00:40:37.470 --> 00:40:51.030
John Schreck: You need the GPUs for the machine learning parts, for the neural net input/output. Other than that, for the most part, all the preprocessing and post-processing that is not wave propagation is done on the CPU.
321
00:40:54.540 --> 00:40:55.680
David John Gagne: To tack on to that:
322
00:40:57.690 --> 00:41:05.700
David John Gagne: it depends on the size of the model, but yeah, the GPU can provide, like, 100x.
323
00:41:07.350 --> 00:41:21.210
David John Gagne: A pretty big, orders-of-magnitude speedup over a single CPU. Obviously you could do distributed training across many CPUs and the difference would decrease a fair bit.
324
00:41:22.800 --> 00:41:26.850
David John Gagne: But then it does increase the complexity of the machine learning workflow,
325
00:41:28.740 --> 00:41:32.820
David John Gagne: and power consumption and stuff like that. So the GPU definitely has a
326
00:41:34.410 --> 00:41:51.300
David John Gagne: big advantage. And from the software perspective, with things like PyTorch and TensorFlow, you can run the exact same code, or nearly the exact same code, on CPU or GPU, so it's possible to do these kinds of comparisons.
327
00:41:52.560 --> 00:41:58.560
David John Gagne: We've done it for our GOES benchmark. It's been a while since I
328
00:41:59.850 --> 00:42:04.380
David John Gagne: looked at it, so I don't remember the exact numbers, but it is definitely in the...
329
00:42:07.110 --> 00:42:09.720
David John Gagne: probably one CPU to...
330
00:42:12.060 --> 00:42:22.140
David John Gagne: The training was a fairly basic convolutional neural network, one CPU versus one V100, and it was
331
00:42:23.520 --> 00:42:27.450
David John Gagne: going from like a couple hours to a minute so.
332
00:42:29.250 --> 00:42:36.030
David John Gagne: And then with multiple GPUs you can scale pretty well on a single node with distributed training.
333
00:42:39.630 --> 00:42:45.990
Brian Vanderwende: There was a follow-up question in the chat from Brian. He asks: what precision are the data and calculations in?
334
00:42:48.690 --> 00:42:49.710
David John Gagne: I think for.
335
00:42:51.120 --> 00:42:53.490
David John Gagne: all of our stuff it's float32.
336
00:42:54.840 --> 00:42:55.410
David John Gagne: But.
337
00:42:56.490 --> 00:43:01.980
David John Gagne: I know in TensorFlow and PyTorch there's support for automatic mixed precision, so you can...
338
00:43:02.880 --> 00:43:16.200
David John Gagne: It's one of the newer features; I don't know if it's turned on by default. But it can allow you to use reduced precision where it makes sense, like have TensorFlow or PyTorch figure out when to use reduced precision.
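A two-line illustration of why mixed precision needs framework support: in half precision, a small update to an order-one value vanishes outright, which is why automatic mixed precision implementations keep float32 master copies of weights and use loss scaling. This illustrates the numerics only, not the frameworks' actual code.

```python
import numpy as np

# Float16 resolves about 3 decimal digits near 1.0 (machine epsilon ~1e-3),
# so adding a 1e-4 "gradient update" to a weight of 1.0 is lost entirely.
# Float32 (epsilon ~1.2e-7) retains the same update.
w16 = np.float16(1.0) + np.float16(1e-4)   # rounds back to exactly 1.0
w32 = np.float32(1.0) + np.float32(1e-4)   # stays distinguishable from 1.0
```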
339
00:43:18.300 --> 00:43:22.530
David John Gagne: there's certainly a lot of people looking into ways to do that to maximize.
340
00:43:24.060 --> 00:43:29.790
David John Gagne: performance and reduce data storage and stuff like that. John, do you have anything to add on that?
341
00:43:30.030 --> 00:43:32.340
John Schreck: Yeah, actually, because of the output. When the model outputs
342
00:43:32.670 --> 00:43:43.800
John Schreck: these masks, you know, I said it outputs zeros and ones, but what it really does is output a number between zero and one. If it's less than, for instance, one half you label it zero; if it's greater than one half you label it one.
343
00:43:44.340 --> 00:43:58.830
John Schreck: But if I want to save all that data for, like, 1000 planes, and I want to save it at full precision, it's a tremendous amount; I filled up my quota multiple times. So actually, when I save that output I only keep three significant figures.
344
00:43:59.940 --> 00:44:06.180
John Schreck: And it's just sort of a choice, but it's something that I needed to do if I wasn't going to keep running out of space.
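The thresholding and reduced-precision saving John describes can be sketched like this. Rounding to three decimal places is used here as a simple stand-in for three significant figures, which is reasonable for probabilities in [0, 1]; the exact rounding scheme in the pipeline isn't shown in the talk.

```python
import numpy as np

# Raw model output: per-pixel probabilities in [0, 1].
probs = np.array([0.0123456, 0.51234, 0.987654, 0.499999], dtype=np.float32)

# Threshold at 0.5 to get the binary mask (0 = no particle, 1 = in focus).
binary_mask = (probs > 0.5).astype(np.uint8)

# Keep roughly three figures before writing to disk, trading a tiny amount
# of precision for a large reduction in storage, as described in the talk.
to_save = np.round(probs, 3)
```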
345
00:44:07.440 --> 00:44:19.710
Brian Dobbins: So just to follow up on that: that's for the data. For the calculations, is this something where you can go lower, or is that too low a precision? I'm just curious about the performance and the precision and how this all works.
346
00:44:20.250 --> 00:44:23.670
John Schreck: I suspect it probably wouldn't be too bad, really. Probably comparable.
347
00:44:24.150 --> 00:44:31.080
John Schreck: I originally wanted to use the original integer inputs, because the image was 0-to-255 pixel counts, but
348
00:44:31.350 --> 00:44:39.720
John Schreck: we needed to do preprocessing transformations, and I noted that I had to add noise into the data during training to get it to perform on the real stuff.
349
00:44:40.080 --> 00:44:49.350
John Schreck: And that throws off your ability to stick with integers, unless you round, which I haven't tried. But I've read some blogs and things out there where
350
00:44:49.890 --> 00:44:59.040
John Schreck: it seems like it's worth the trade-off to not use the floats, especially in the input to a model. But I haven't tested that yet with this particular project.
351
00:45:00.840 --> 00:45:02.430
David John Gagne: It may be worthwhile to do.
352
00:45:02.430 --> 00:45:03.300
David John Gagne: down the road as.
353
00:45:04.650 --> 00:45:07.650
David John Gagne: Well, maybe even test the automatic mixed precision kind of
354
00:45:08.670 --> 00:45:09.900
David John Gagne: frameworks and see if that.
355
00:45:11.670 --> 00:45:19.710
David John Gagne: I've seen in other settings that it can give you a pretty significant speedup just turning that on, without much of a loss in performance or anything.
356
00:45:21.870 --> 00:45:23.610
David John Gagne: Looks like Thomas has a question.
357
00:45:25.290 --> 00:45:37.980
Thomas Hauser: Yeah, John, I think you mentioned you're bottlenecked by GPU availability. Did I understand that correctly? And if you had more GPUs, how much would that speed up your work?
358
00:45:39.930 --> 00:45:53.700
John Schreck: In this case, for HOLODEC, probably a lot. Now, I kind of cherry-picked our HOLODEC project because it is our GPU workhorse project, I would say, for now; we'll be getting into heavier stuff later.
359
00:45:54.810 --> 00:45:56.880
John Schreck: I mean, in the operational case,
360
00:45:58.110 --> 00:46:06.090
John Schreck: the way the EOL folks might use this would be to do 1000 reconstructions with the wave prop. So if I had, for instance, 1000 GPUs,
361
00:46:06.630 --> 00:46:15.630
John Schreck: then everything is done in one go, right. And there's further parallelization that can happen, but I just haven't had enough time to get around to it.
362
00:46:17.790 --> 00:46:18.030
John Schreck: Like.
363
00:46:18.720 --> 00:46:25.680
John Schreck: It would be great to use all of Derecho to process holograms massively and try to tell the plane where to go in real time.
364
00:46:27.150 --> 00:46:31.200
Mick Coady: John, this is Mick. I missed what you said; how many more GPUs
365
00:46:32.070 --> 00:46:32.910
Mick Coady: did you just mention?
366
00:46:32.970 --> 00:46:39.720
John Schreck: In this case I'll usually say 500, because I can put two models and data on a GPU. But, I mean, you know,
367
00:46:40.860 --> 00:46:42.210
John Schreck: Amazon's got that many, right?
368
00:46:43.200 --> 00:46:44.760
John Schreck: But you know just to give you that idea.
369
00:46:44.790 --> 00:46:48.780
John Schreck: it would be like hundreds to maybe 1000 in this particular case.
370
00:46:49.020 --> 00:46:50.670
John Schreck: And i'm not like a.
371
00:46:51.720 --> 00:47:03.180
John Schreck: You know, I've been writing code for a while, but I'm not a software engineer by trade, really, so surely someone else could do a better job in certain areas of this code where they have more experience than me.
372
00:47:04.230 --> 00:47:07.200
John Schreck: Certainly, I think, data prep and data handling.
373
00:47:09.090 --> 00:47:10.920
John Schreck: I know I should probably be using like our.
374
00:47:12.900 --> 00:47:23.190
John Schreck: I won't get into it right now, but just to say: even with this pipeline, there are a number of cool things we could probably do to take advantage of more resources and parallelization.
375
00:47:25.380 --> 00:47:34.590
Brian Vanderwende: You referred to this in the project — you know, when you're starting an ML project and you have to make the decision whether the
376
00:47:35.790 --> 00:47:43.830
Brian Vanderwende: four components that can be run on the GPU — deciding whether to use CPU or GPU resources — is that more of a technical decision right now, or more of a domain decision?
377
00:47:45.840 --> 00:47:59.460
John Schreck: Well, I mean, in the case here, I can do the wave prop calculation on the CPU or on the GPU. It's quite fast on the GPU, but NumPy is still a pretty fast calculation too — it's C++ on the back end anyway. Um, I think —
378
00:48:01.380 --> 00:48:02.040
John Schreck: So, like.
379
00:48:03.300 --> 00:48:14.190
John Schreck: There are complications, for instance, where I'm trying to put too many things on the GPU and the GPU will be like, nah. And this is kind of what I was getting at earlier about me sometimes having a hard time estimating how many resources I need up front.
380
00:48:15.870 --> 00:48:26.220
John Schreck: So there I'll sacrifice performance, right — in some cases I'll just do the wave prop on the CPU and just wait, you know, and it's not that big of a deal.
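[Editor's illustration: the CPU-or-GPU choice John describes is often handled as a backend fallback. A minimal sketch, not his actual code — the FFT-based propagation step and array sizes are assumptions for illustration:]

```python
# Try the GPU backend (CuPy, if installed) and fall back to NumPy on
# the CPU, accepting the slower wave-prop run when no GPU is available.
try:
    import cupy as xp      # assumption: CuPy available alongside a GPU
except ImportError:
    import numpy as xp     # CPU fallback: same array API, just slower

def propagate(field, phase):
    """One FFT-based propagation step: transform, apply a phase
    factor in frequency space, transform back."""
    spectrum = xp.fft.fft2(field)
    return xp.fft.ifft2(spectrum * xp.exp(1j * phase))

field = xp.ones((64, 64), dtype=complex)
out = propagate(field, xp.zeros((64, 64)))  # zero phase leaves the field unchanged
```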
381
00:48:27.840 --> 00:48:33.960
John Schreck: There I'm specifically alluding to when I'm actually training a model — and you saw the result, where I was like, yeah, right, you know what I mean.
382
00:48:35.130 --> 00:48:42.180
John Schreck: I sped it up in a bunch of different other ways, and in some sense I was just trying to train this model once — when really I did it like 500 times to find the best.
383
00:48:42.810 --> 00:48:50.880
John Schreck: Right, and like — but in some sense I don't have to keep doing that over and over and over, right? So I just dealt with it in that case.
384
00:48:52.650 --> 00:48:53.850
John Schreck: Took the performance hit, yeah.
385
00:48:54.750 --> 00:48:55.530
Brian Vanderwende: That makes sense.
386
00:48:56.910 --> 00:48:59.040
David John Gagne: And in probably a more general context, it's —
387
00:49:00.660 --> 00:49:19.260
David John Gagne: The way — as I mentioned at the beginning — I think the best-supported GPU piece of machine learning is kind of the training and inference workflow. But there certainly are many more opportunities to use it in the other parts of the pipeline. It's just —
388
00:49:20.640 --> 00:49:32.850
David John Gagne: Part of the reason we haven't done it is because the software to support it is relatively new — like the RAPIDS code base; they announced cuNumeric two weeks ago, so we haven't had a chance to try it yet, yeah.
389
00:49:34.830 --> 00:49:36.060
David John Gagne: And and.
390
00:49:37.140 --> 00:49:44.280
David John Gagne: Kind of NumPy and pandas — like, the data processing on the CPU side is relatively straightforward, right, and —
391
00:49:45.390 --> 00:49:48.090
David John Gagne: Like, we haven't seen as much of a need for the speedup there, but —
392
00:49:48.900 --> 00:49:58.500
David John Gagne: Where I could see it being more useful is, like, interpolation, for instance, or regridding — that's a very intensive calculation, and it can be pretty slow on
393
00:49:59.400 --> 00:50:06.840
David John Gagne: CPU-based software. And some of it can be parallelized, I'm sure, and could benefit from what the GPU could do for it.
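[Editor's illustration: the regridding case DJ mentions is the kind of array arithmetic that ports between backends. A rough, made-up sketch of a vectorized bilinear regrid written against NumPy — swapping CuPy in for `np` would move the same calculation to the GPU (an assumption about portability, not a benchmarked claim):]

```python
import numpy as np

def regrid_bilinear(field, new_shape):
    """Interpolate a 2-D field onto a new regular grid (bilinear)."""
    ny, nx = field.shape
    yi = np.linspace(0, ny - 1, new_shape[0])   # fractional source rows
    xi = np.linspace(0, nx - 1, new_shape[1])   # fractional source cols
    y0 = np.floor(yi).astype(int); y1 = np.minimum(y0 + 1, ny - 1)
    x0 = np.floor(xi).astype(int); x1 = np.minimum(x0 + 1, nx - 1)
    wy = (yi - y0)[:, None]; wx = (xi - x0)[None, :]  # interp weights
    f00 = field[np.ix_(y0, x0)]; f01 = field[np.ix_(y0, x1)]
    f10 = field[np.ix_(y1, x0)]; f11 = field[np.ix_(y1, x1)]
    return ((1 - wy) * (1 - wx) * f00 + (1 - wy) * wx * f01
            + wy * (1 - wx) * f10 + wy * wx * f11)

coarse = np.arange(16.0).reshape(4, 4)
fine = regrid_bilinear(coarse, (8, 8))   # 4x4 grid upsampled to 8x8
```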
394
00:50:07.230 --> 00:50:15.810
David John Gagne: And it's something we're going to have to do for scaling up to bigger and bigger machine learning problems — like trying to run, say, you know, over the globe or something. Then —
395
00:50:19.650 --> 00:50:22.200
David John Gagne: I think it's the direction we need to look into at least.
396
00:50:23.700 --> 00:50:33.120
David John Gagne: The hard part, I think — I guess it's more of a people-support problem. It's just that these are additional things that people have to learn, and —
397
00:50:34.620 --> 00:50:37.050
David John Gagne: for us supporting it in our group, like —
398
00:50:39.300 --> 00:50:49.590
David John Gagne: We're learning a lot and picking up a lot of different skill sets. So it's like, what do we prioritize — learning more about the domain science, or more about, like, this other GPU
399
00:50:50.550 --> 00:51:03.360
David John Gagne: package or so. So from our perspective — and then if we want to support, say, other scientists out there at NCAR who aren't machine learning experts, who aren't, like, uber Python power users, then —
400
00:51:05.010 --> 00:51:23.250
David John Gagne: adding having to deal with the GPU, in a way that's not hidden under the hood by layers of abstraction, may or may not be a good sell — because of the extra overhead, and we were a little uncomfortable, yeah.
401
00:51:24.840 --> 00:51:29.010
Brian Vanderwende: I saw Cena put her hand up briefly — do you still have a comment or question?
402
00:51:32.790 --> 00:51:32.970
Cena: yeah.
403
00:51:34.230 --> 00:51:36.450
Cena: Maybe I missed it, but, um —
404
00:51:37.590 --> 00:51:41.190
Cena: For the data augmentation and the noise adding —
405
00:51:43.380 --> 00:51:45.750
Cena: So, was that on GPU or CPU?
406
00:51:47.100 --> 00:51:47.790
John Schreck: The noise?
407
00:51:51.390 --> 00:51:58.320
John Schreck: That I actually did on the CPU, and the reason why is because when I'm prepping the data, I parallelize that.
408
00:51:58.830 --> 00:52:04.800
John Schreck: And I need to do that on the CPU — if I do that on the GPU, with the model and also the other stuff, it's too many things to handle for me.
409
00:52:05.310 --> 00:52:16.980
John Schreck: So that's another area where I'll go, okay, if I did it on the GPU, it's worth maybe eight of the CPU workers I would have otherwise spawned — so I'll just ask for 16, you know, or something like that, to kind of balance that out.
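[Editor's illustration: the CPU-side augmentation step John describes — additive noise applied in parallel worker processes during data prep — might look roughly like this sketch. The shapes, noise level, worker count, and function names are illustrative assumptions, not his code:]

```python
import numpy as np
from multiprocessing import Pool

def add_noise(sample, sigma=0.05, seed=None):
    """Additive Gaussian noise augmentation for one sample."""
    rng = np.random.default_rng(seed)
    return sample + rng.normal(0.0, sigma, size=sample.shape)

def _augment(args):
    idx, sample = args
    return add_noise(sample, seed=idx)  # per-sample seed: reproducible

if __name__ == "__main__":
    samples = [np.zeros((32, 32)) for _ in range(8)]
    # e.g. 8-16 workers in practice, trading CPU cores against
    # keeping the GPU free for the model, as John describes.
    with Pool(processes=4) as pool:
        augmented = pool.map(_augment, list(enumerate(samples)))
```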
410
00:52:17.580 --> 00:52:22.290
John Schreck: It would be nice to be able to do everything on the GPU, though, right? It's just — going back to what DJ says —
411
00:52:22.860 --> 00:52:32.760
John Schreck: You know, machine learning pipelines get sort of complicated, especially in this case, and even just keeping track of what's going on is sometimes a bit of a challenge.
412
00:52:35.580 --> 00:52:39.120
Brian Vanderwende: And then I think we'll wrap up this topic with some brief —
413
00:52:41.190 --> 00:52:53.400
Supreeth Madapur Suresh: Yeah, thank you — thank you for the presentations. I have a quick follow-up for DJ. Maybe I misunderstood this — did you say you tried RAPIDS, or you haven't tried it, or you're planning to try?
414
00:52:54.420 --> 00:52:56.640
David John Gagne: We have — I haven't tried it yet.
415
00:52:59.580 --> 00:53:16.320
David John Gagne: I'd like to experiment with it, but we don't have, like, a defined plan — we're sort of dealing with a backlog of other stuff. But it's on my radar to try, or maybe get someone in the group — like, hey, we've got a couple of postdocs coming in, so —
416
00:53:17.580 --> 00:53:25.080
David John Gagne: maybe we could get them to mess around with some of this stuff and see if they can incorporate it into their workflows.
417
00:53:26.100 --> 00:53:28.950
David John Gagne: And if it works well for them, then we, you know, deploy it more widely.
418
00:53:31.170 --> 00:53:46.110
Supreeth Madapur Suresh: Okay — because Cena and I have been working with RAPIDS for a couple of years now. So if you need any help with that, or any advice, we'd be happy to work with you guys or the new postdocs.
419
00:53:46.710 --> 00:53:48.510
David John Gagne: Yeah, definitely appreciate that.
420
00:53:50.190 --> 00:53:53.250
David John Gagne: Which parts of RAPIDS have you been using?
421
00:53:54.150 --> 00:54:05.070
Supreeth Madapur Suresh: Mainly the data frames, and CuPy individually alongside that, but we always wanted to try some of the machine learning functions inside RAPIDS.
422
00:54:05.790 --> 00:54:16.950
Supreeth Madapur Suresh: We just didn't have any code yet. So if you have a small example, we could try it on our time — or if you have new postdocs coming in, we'll be happy to work with them.
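[Editor's illustration: the "small example" being discussed could be written once against pandas and, assuming cuDF is installed, rerun on the GPU by swapping the import — an assumption based on cuDF's documented pandas-style API; the column names and values below are made up:]

```python
import pandas as pd
# With RAPIDS available, `import cudf as pd` would run this same
# groupby on the GPU, since cuDF mirrors the pandas API.

df = pd.DataFrame({
    "station": ["A", "A", "B", "B"],
    "temp":    [10.0, 12.0, 8.0, 9.0],
})
means = df.groupby("station")["temp"].mean()  # per-station mean temp
```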
423
00:54:18.210 --> 00:54:26.100
David John Gagne: Yeah, certainly — I'll try to follow up on that. With the postdocs starting in early January, I'll try to follow up after the holidays, I think.
424
00:54:28.950 --> 00:54:34.680
David John Gagne: But with the data frames — how much of a speedup have you seen using that versus, like, pandas?
425
00:54:35.880 --> 00:54:40.860
David John Gagne: And were there any pain points or challenges for you in getting it to work?
426
00:54:42.630 --> 00:54:55.980
Supreeth Madapur Suresh: Actually, we tried this for the first time with a SIParCS student, and we have a presentation with nice detailed slides about the performance and also what we tried. I'd be happy to share that presentation with you after the call.
427
00:54:56.460 --> 00:54:57.240
David John Gagne: Yeah, that'd be great.
428
00:55:00.000 --> 00:55:04.830
Brian Vanderwende: All right, well — thanks, John; thanks, David John, for sharing the presentation.
429
00:55:06.540 --> 00:55:09.360
Brian Vanderwende: And I appreciate the questions from everybody in the group.
430
00:55:10.380 --> 00:55:30.210
Brian Vanderwende: So I think I'll skip the training topic right now — I think we'll have a lot more to say about that at the next meeting. But I just wanted to leave a couple minutes for a roundtable, if anybody has any other comments, either on the machine learning subject or on GPUs in general.
431
00:55:35.040 --> 00:55:37.590
Brian Vanderwende: And we're right up against our time limit, so that's fine too.
432
00:55:40.260 --> 00:55:41.160
Mick Coady: Well done, Brian.
433
00:55:41.640 --> 00:55:44.190
Brian Vanderwende: Yeah, yeah — right at time, too.
434
00:55:46.020 --> 00:55:55.770
Mick Coady: Thanks, DJ and John — I didn't get to catch much, but I plan to listen in on the recording, so —
435
00:55:56.790 --> 00:55:58.860
Mick Coady: I really appreciate the effort. It looks really good.
436
00:56:00.180 --> 00:56:02.310
David John Gagne: Thank you, Mick. Are your
437
00:56:03.900 --> 00:56:05.490
David John Gagne: plumbing issues under control?
438
00:56:06.840 --> 00:56:13.020
Mick Coady: They're close to fixed now — just got to figure out how to pay for it. I might not be able to retire now, I don't know.
439
00:56:15.840 --> 00:56:18.270
Brian Vanderwende: On that scary thought, we'll wrap up the meeting.
440
00:56:19.020 --> 00:56:19.380
Mick Coady: i'll just.
441
00:56:20.130 --> 00:56:28.350
Mick Coady: point out that our next meeting will be in January — it's set for January 6th, so —
442
00:56:29.400 --> 00:56:39.660
Mick Coady: I look forward to seeing everybody back here then, and hopefully my plumbing problems will be well in the past, so —
443
00:56:41.040 --> 00:56:42.690
Mick Coady: I appreciate everybody's time.
444
00:56:43.830 --> 00:56:44.730
Brian Vanderwende: Have a good holiday, everybody.
445
00:56:45.150 --> 00:56:46.290
Mick Coady: yeah take care.
446
00:56:52.710 --> 00:56:54.150
Brian Vanderwende: plan or in the background next.
447
00:56:56.520 --> 00:57:00.300
Mick Coady: He should be happy now. He was good — he was actually pretty good.
448
00:57:01.500 --> 00:57:01.980
Mick Coady: Very good.