1
00:00:02.429 --> 00:00:06.810
Brian Vanderwende: Okay, there we go, so the recording of this meeting will be available.
2
00:00:08.250 --> 00:00:10.230
Brian Vanderwende: You know, as soon as we get it processed and uploaded.
3
00:00:11.400 --> 00:00:24.150
Brian Vanderwende: But the basic agenda today is pretty simple. DJ and John, I believe, are going to do sort of a tag-team presentation, as part of the Community of Practice series of presentations that we've had in these meetings,
4
00:00:25.230 --> 00:00:28.500
Brian Vanderwende: and talk a little bit about, you know, their AI/ML work.
5
00:00:29.730 --> 00:00:41.520
Brian Vanderwende: And then I'll give just a very brief summary of some GPU training that CSG, in collaboration with TDD, has sort of been designing behind the scenes.
6
00:00:42.990 --> 00:00:43.860
Brian Vanderwende: If you've seen.
7
00:00:45.270 --> 00:00:55.890
Brian Vanderwende: Daniel Howard give talks on this, in like a WIP talk and some other venues, so I'll just give a brief update on that, and then we can do a short roundtable and we'll wrap up.
8
00:00:57.060 --> 00:01:05.610
Brian Vanderwende: So I believe I saw DJ and John are in here, so if you both are ready, I can hand over ownership of the screen.
9
00:01:07.800 --> 00:01:14.730
David John Gagne: Yes, thank you, Brian. If you get it over to me, I can share from here, okay.
10
00:01:15.480 --> 00:01:16.260
let's see.
11
00:01:23.880 --> 00:01:24.660
I think.
12
00:01:29.640 --> 00:01:31.320
Brian Vanderwende: Do you have the button now?
13
00:01:31.620 --> 00:01:35.760
David John Gagne: I have the button, but it says "participant disabled screen sharing," okay,
14
00:01:36.960 --> 00:01:38.940
David John Gagne: or "host disabled participant screen sharing."
15
00:01:39.630 --> 00:01:43.230
Brian Vanderwende: Okay, let's flip that. Let's see, all participants.
16
00:01:43.950 --> 00:01:50.340
David John Gagne: Can you make me a co-host? Then I can fix it. Or "all participants" also works, yeah.
17
00:01:51.600 --> 00:01:55.470
Brian Vanderwende: Yeah, yeah, I don't think I can make you co-host, actually, but there we go.
18
00:02:01.860 --> 00:02:02.730
David John Gagne: Right so.
19
00:02:03.330 --> 00:02:04.440
John and I will be
20
00:02:05.580 --> 00:02:10.650
David John Gagne: kind of talking about machine learning challenges and opportunities on NCAR HPC.
21
00:02:11.130 --> 00:02:20.730
David John Gagne: I'm going to start off and give sort of a big-picture overview, and talk about some areas where I think GPUs could be used, but
22
00:02:21.450 --> 00:02:32.250
David John Gagne: that need more time investment in order to do so. And then John will talk about some of the ways we're currently using GPUs on Casper for some of our different machine learning tasks.
23
00:02:33.960 --> 00:02:34.530
David John Gagne: So.
24
00:02:36.150 --> 00:02:37.110
David John Gagne: to kick things off.
25
00:02:39.750 --> 00:02:44.370
David John Gagne: If you've been following anything related to HPC a
26
00:02:45.390 --> 00:02:49.290
David John Gagne: lot, there's always lots of discussion about how machine learning is a really fast-growing usage area.
27
00:02:49.890 --> 00:02:51.660
David John Gagne: And a lot of hype around it.
28
00:02:51.750 --> 00:02:54.840
David John Gagne: But also, I think, a fair bit of substance too. So
29
00:02:56.610 --> 00:02:56.790
David John Gagne: It.
30
00:02:59.250 --> 00:03:06.630
David John Gagne: there's a bit of both hype and substance. A lot of the machine learning GPU usage discussion tends to focus on things like:
31
00:03:07.320 --> 00:03:12.570
David John Gagne: how fast can you train the model, or how fast can you run it, on whatever GPU or other server architecture.
32
00:03:13.110 --> 00:03:29.370
David John Gagne: As an example, here's a chart from NVIDIA showing the performance of NVIDIA AI on different systems versus other GPUs or other specialized chips.
33
00:03:30.690 --> 00:03:43.050
David John Gagne: You know, models like ResNet or 3D U-Net or BERT or whatever, which are linked to different, usually image-processing or language-model, kinds of tasks.
34
00:03:46.230 --> 00:04:00.690
David John Gagne: And things like this. So it's useful to keep track of all these benchmarks, and I think they're representative tasks for a subset of computationally intensive machine learning and deep learning tasks in general.
35
00:04:03.000 --> 00:04:09.870
David John Gagne: But they really only represent a fairly small part of the possible usage of GPUs in machine learning,
36
00:04:12.690 --> 00:04:27.720
David John Gagne: and also a very small part of the machine learning pipeline. In fact, I'll talk about this more in a moment: within the machine learning pipeline, the training part is actually a relatively small part of the process. There's a lot that goes on before, and a lot that happens afterward,
37
00:04:28.740 --> 00:04:35.010
David John Gagne: that requires a lot of compute and attention, especially people time.
38
00:04:36.660 --> 00:04:45.330
David John Gagne: So I want to suggest a few areas to facilitate broader GPU usage for machine learning on NCAR HPC. I think we should try to encourage some
39
00:04:45.360 --> 00:04:49.380
David John Gagne: more GPU use across our pipeline, especially after Derecho is deployed and we have more
40
00:04:49.380 --> 00:04:50.430
David John Gagne: GPUs available.
41
00:04:51.540 --> 00:04:58.740
David John Gagne: Right now, with Casper, I think some of it could be justified, but if it's overly encouraged, then we
42
00:05:00.030 --> 00:05:08.010
David John Gagne: may run into the problem of GPUs being over-utilized, and long queue wait times that would
43
00:05:09.330 --> 00:05:13.410
David John Gagne: kind of slow down the productivity of the researchers.
44
00:05:15.810 --> 00:05:20.130
David John Gagne: So, to give everyone a broad sense of what the machine learning pipeline looks like:
45
00:05:22.230 --> 00:05:24.450
David John Gagne: the first part is
46
00:05:25.140 --> 00:05:27.090
David John Gagne: Defining the problem, and this is.
47
00:05:27.570 --> 00:05:31.260
David John Gagne: a small-compute, big-people-time
48
00:05:32.370 --> 00:05:40.950
David John Gagne: aspect of the exercise. It's also a lot of meetings between machine learning people and domain experts, sketching out:
49
00:05:42.960 --> 00:05:53.280
David John Gagne: what is the data, what is the scientific outcome or hypothesis we're trying to test, what data, baselines, and metrics should we be using.
50
00:05:55.110 --> 00:05:58.680
David John Gagne: And then there's a whole process of data gathering and data processing that
51
00:05:59.070 --> 00:06:05.220
David John Gagne: often takes up about 60 to 70% of a machine learning person's time.
52
00:06:06.660 --> 00:06:17.070
David John Gagne: Just because getting the data into the right format is critical: you often don't get a successful machine learning model unless the data is properly structured, you're using the right variables, and you're predicting the right thing.
53
00:06:18.870 --> 00:06:21.030
David John Gagne: Then there's the compute-intensive, but
54
00:06:21.060 --> 00:06:21.600
Matthias Rempel: Maybe less.
55
00:06:21.810 --> 00:06:24.390
David John Gagne: people-time-intensive, part of model selection and training.
56
00:06:26.760 --> 00:06:34.830
David John Gagne: And on top of that, usually once you have a set of trained models, you want to run a lot of evaluation to see how well they're performing, and
57
00:06:35.250 --> 00:06:43.470
David John Gagne: whether they have any issues, and to get some scientific justification. So there's an interpretation process that goes along with this, too.
58
00:06:44.220 --> 00:06:54.030
David John Gagne: And once you've gone through all of these steps, then usually we will want to eventually deploy this in some form in a research context, right? Like,
59
00:06:54.060 --> 00:06:59.460
David John Gagne: if we have a system that works really well, we want to hand it off to someone like NOAA or a private company, or
60
00:07:01.350 --> 00:07:12.960
David John Gagne: or even run it ourselves in sort of quasi-real-time, to learn how well it operates. There may also be paths to operations for, say,
61
00:07:14.400 --> 00:07:19.710
David John Gagne: non-weather kinds of stuff; say, the HPC system itself. I know there's been a number of projects
62
00:07:19.710 --> 00:07:20.160
Matthias Rempel: Doing.
63
00:07:20.640 --> 00:07:22.080
David John Gagne: trying to do machine learning to help
64
00:07:22.110 --> 00:07:40.740
David John Gagne: optimize things like, how can we efficiently subset data, or come up with keywords for the data archive to populate less-well-described datasets, or
65
00:07:41.100 --> 00:07:57.090
David John Gagne: maybe there are some machine learning systems to manage the data center in Cheyenne and its heating and cooling stuff. I know there are people working on that for other data centers, but
66
00:07:58.380 --> 00:07:59.910
David John Gagne: the main point is I think there are
67
00:08:00.180 --> 00:08:04.470
David John Gagne: other opportunities out there; it's not just machine learning for weather and climate that GPUs can be used for, as
68
00:08:04.470 --> 00:08:04.860
well.
69
00:08:07.740 --> 00:08:11.640
David John Gagne: I have a list here of some of what I'd call data pre-processing HPC challenges.
70
00:08:13.110 --> 00:08:15.510
David John Gagne: I think some of these are issues that are
71
00:08:15.750 --> 00:08:17.250
David John Gagne: software based; some of them are ones where
72
00:08:17.640 --> 00:08:18.780
David John Gagne: GPUs could benefit.
73
00:08:20.520 --> 00:08:24.060
David John Gagne: One area that's always a challenge is getting access to all the data.
74
00:08:26.580 --> 00:08:30.030
David John Gagne: For a machine learning problem, versus a simulation problem:
75
00:08:30.990 --> 00:08:46.200
David John Gagne: with machine learning, we often need access to a large archive of data, and not just, say, a single event for initializing a single simulation. Now we want, you know, 50 years of storm tracks, or
76
00:08:47.310 --> 00:08:52.560
David John Gagne: ERA5: we want to do something on all of ERA5, or some broad subset of it.
77
00:08:54.930 --> 00:08:56.880
David John Gagne: So there's a lot more
78
00:08:56.970 --> 00:08:57.420
David John Gagne: kind of.
79
00:08:57.510 --> 00:09:04.200
David John Gagne: up-front data downloading and processing that's often required to do any kind of machine learning task.
80
00:09:05.430 --> 00:09:12.000
David John Gagne: In some areas we've gotten more established working with the data, so this becomes less of an issue, but
81
00:09:13.320 --> 00:09:14.520
David John Gagne: Still definitely in the.
82
00:09:16.050 --> 00:09:26.820
David John Gagne: like, lots of people wanting to apply machine learning to their own problems and figuring out that step. So there's a lot of piloting and building up of this data infrastructure, making things analysis-ready, and
83
00:09:28.110 --> 00:09:28.830
David John Gagne: And whatnot.
84
00:09:30.750 --> 00:09:46.860
David John Gagne: One challenge of this also depends on where the data is located. If it's on GLADE, then life is relatively easy for us, but there are some datasets that are entirely in the cloud, and maybe we don't want to download the entire dataset off the cloud onto
85
00:09:46.890 --> 00:09:48.390
David John Gagne: GLADE; we want to be able to
86
00:09:48.540 --> 00:09:55.350
David John Gagne: stream it as needed, and figuring out how to efficiently do that is an ongoing thing we're working on.
87
00:09:56.280 --> 00:10:11.760
David John Gagne: There's also data on external data services, you know, like Earth System Grid, or NASA tape drives of satellite data, where there's a big bottleneck in getting the data from its original source.
88
00:10:14.790 --> 00:10:19.830
David John Gagne: I know there are efforts outside what we're doing to help with that, but
89
00:10:20.520 --> 00:10:24.870
David John Gagne: those efforts will also help enable machine learning tasks being done on that kind of data.
90
00:10:26.310 --> 00:10:31.380
David John Gagne: There's also a lot of work on data conversion that we often have to do. This is things like interpolation,
91
00:10:31.380 --> 00:10:35.340
David John Gagne: regridding, converting from, say, a raw
92
00:10:36.480 --> 00:10:51.990
David John Gagne: domain-specific format to something that's more portable like netCDF or Zarr; subsetting relevant data to grab what we need for the machine learning problem. Maybe sometimes we need to shuffle, or overlay, or
93
00:10:54.270 --> 00:11:02.100
David John Gagne: reshape the data. The data access patterns can sometimes be a bit different for some machine learning tasks than for
94
00:11:03.390 --> 00:11:07.290
David John Gagne: a simulation task, because we may need to grab, say, every storm in a
95
00:11:08.700 --> 00:11:13.620
David John Gagne: climate simulation, so there's an object-orientation aspect to that,
96
00:11:14.850 --> 00:11:18.780
David John Gagne: or needing to do random access across multiple years to build a big
97
00:11:18.840 --> 00:11:21.000
David John Gagne: batch of data to go into a neural network.
98
00:11:23.070 --> 00:11:23.490
David John Gagne: and
99
00:11:23.820 --> 00:11:28.020
David John Gagne: And when doing that at scale, there are certain bottlenecks to work around there.
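[The random-access batching pattern just described can be sketched in a few lines. This is an illustrative toy, not NCAR code: each "year" is a stand-in for an archive of files, and a batch is drawn by random access across all of them rather than by sequential reads.]

```python
import random

# Hypothetical sketch: each "year" holds a list of samples (stand-ins for
# files on disk); a training batch is built by random access across all years.
def build_batch(years, batch_size, rng):
    """Draw batch_size distinct samples uniformly across every year's archive."""
    # Flatten the index space into (year, position) pairs for every sample.
    index = [(y, i) for y, samples in years.items() for i in range(len(samples))]
    picks = rng.sample(index, batch_size)  # random access, not sequential reads
    return [years[y][i] for y, i in picks]

years = {yr: [f"{yr}-sample{i}" for i in range(100)] for yr in range(1979, 1984)}
batch = build_batch(years, 32, random.Random(0))
```

[In practice each lookup here would be a seek into a different file, which is exactly the I/O bottleneck being described.]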
100
00:11:30.510 --> 00:11:35.430
David John Gagne: Another task that's often done in the pre-processing step is this kind of scaling or standardization, where we're having to
101
00:11:37.830 --> 00:11:39.870
David John Gagne: calculate, like, mean and standard deviation, and
102
00:11:41.460 --> 00:11:50.880
David John Gagne: convert the data into that format from its original raw values, or do a log transform or some other kind, or calculate derived variables.
103
00:11:53.040 --> 00:12:00.450
David John Gagne: And then we have to keep all of that around, or figure out: do we need another set of pre-processed data to work with?
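[A minimal sketch of the scaling step being described: fit the mean and standard deviation once over the training archive, then standardize (or log-transform) values as they are read. Function names are illustrative, not from any NCAR tool.]

```python
import math
import statistics

def fit_scaler(values):
    # fit statistics once over the (training) archive
    return statistics.fmean(values), statistics.pstdev(values)

def standardize(x, mean, std):
    # zero mean, unit variance
    return (x - mean) / std

def log_transform(x, eps=1e-6):
    # common for positive, heavy-tailed fields; eps guards against log(0)
    return math.log(x + eps)

raw = [2.0, 4.0, 6.0, 8.0]
mean, std = fit_scaler(raw)
scaled = [standardize(x, mean, std) for x in raw]
```

[The "keep it all around" question is whether `scaled` gets written back to disk as a second copy of the dataset, or recomputed on the fly each time.]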
104
00:12:04.530 --> 00:12:06.300
David John Gagne: And within that, how can the gpu.
105
00:12:06.300 --> 00:12:06.630
help.
106
00:12:08.010 --> 00:12:13.140
David John Gagne: Right now, most of what I've described is pretty CPU-based for the most part.
107
00:12:13.800 --> 00:12:19.680
David John Gagne: There are a few libraries that are coming out. NVIDIA has a RAPIDS library that
108
00:12:20.160 --> 00:12:30.630
David John Gagne: has, like, cuDF, that's supposed to replicate the functionality of NumPy and pandas, and some of that might help with some of these tasks that are more compute-intensive rather than I/O-intensive.
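[The appeal of cuDF is that it mirrors much of the pandas API, so dataframe code can be written once and fall back to pandas when no GPU (or no RAPIDS install) is present. A hedged sketch of that drop-in pattern, with made-up station data:]

```python
# Drop-in idea: use GPU dataframes when available, pandas otherwise.
try:
    import cudf as xdf      # RAPIDS GPU dataframes, if installed
except ImportError:
    import pandas as xdf    # CPU fallback with a near-identical API

df = xdf.DataFrame({"station": ["A", "A", "B", "B"],
                    "temp_c": [10.0, 14.0, 20.0, 22.0]})
# groupby/aggregate is the kind of compute-bound step that benefits on GPU
means = df.groupby("station")["temp_c"].mean()
```

[Not every pandas operation is covered by cuDF, so the fallback also serves as a correctness check.]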
109
00:12:33.090 --> 00:12:37.740
David John Gagne: And so I think there could be some additional benefit where GPUs can be used, if available.
110
00:12:38.430 --> 00:12:41.520
David John Gagne: The only concern, of course, is the I/O bottlenecks that come along with
111
00:12:41.610 --> 00:12:51.180
David John Gagne: needing to load the data. It's already a pretty I/O-heavy operation loading the data from disk to memory, or downloading from the cloud to disk to memory.
112
00:12:52.740 --> 00:12:58.590
David John Gagne: So adding another step: how much of an overhead cost is that?
113
00:13:00.090 --> 00:13:07.110
David John Gagne: Numba also has some ability to work on the GPU and help convert Python code to more GPU-runnable code with-
114
00:13:09.570 --> 00:13:11.280
David John Gagne: out too much work, I think.
115
00:13:12.750 --> 00:13:21.690
David John Gagne: So that would be another thing to look into supporting, but currently we're not using any of these, like, for
116
00:13:21.900 --> 00:13:24.540
David John Gagne: their GPU versions at this time.
117
00:13:27.360 --> 00:13:32.340
David John Gagne: On the other end of the spectrum is evaluation and interpretation. So,
118
00:13:35.250 --> 00:13:44.970
David John Gagne: in terms of HPC challenges: when you're training a model, you visualize how it's doing over time to see if you need to stop training, or
119
00:13:45.480 --> 00:13:56.220
David John Gagne: make sure it's actually converging toward a lower value. Sometimes, if you have something set up wrong, it will flatline, which can be a problem.
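[The "watch the curve and stop" check being described can be expressed as a small early-stopping rule. This is an illustrative sketch (not from any NCAR tool): stop when the validation loss hasn't meaningfully improved for `patience` consecutive epochs.]

```python
# Stop when the loss curve has flatlined: no improvement of at least
# min_delta over the best earlier value, for `patience` recent epochs.
def should_stop(losses, patience=3, min_delta=1e-3):
    if len(losses) <= patience:
        return False
    best_before = min(losses[:-patience])
    recent_best = min(losses[-patience:])
    return recent_best > best_before - min_delta

converging = [1.0, 0.8, 0.6, 0.5, 0.45, 0.41]           # keep training
flatlined  = [1.0, 0.7, 0.7001, 0.7002, 0.7001, 0.7002]  # stop
```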
120
00:13:57.600 --> 00:14:00.210
David John Gagne: We also are doing large amounts of machine learning inference.
121
00:14:01.470 --> 00:14:06.780
David John Gagne: as part of the evaluation and interpretation process, so making this efficient is quite useful.
122
00:14:09.750 --> 00:14:23.610
David John Gagne: Where things start to run into trouble is with some of the bigger deep learning models, things like U-Nets, when generating predictions or outputs like heat maps and saliency maps and things of that nature.
123
00:14:25.140 --> 00:14:32.850
David John Gagne: That exercise itself creates a large amount of data, so now we need to figure out how to best manage the data we're generating, not just the data we're reading in.
124
00:14:36.240 --> 00:14:44.430
David John Gagne: There are always ongoing challenges of doing our offline evaluations versus putting the machine learning model in the climate model or the weather model and running that,
125
00:14:44.910 --> 00:15:00.150
David John Gagne: and potentially even doing a fully integrated approach: training the model using simulation runs as part of the training pipeline. We're not there yet in terms of doing that, because there's still more integration work to be done.
126
00:15:01.680 --> 00:15:09.780
David John Gagne: But people are definitely moving in that direction, and it's a use case we'll expect to see running on Derecho
127
00:15:10.860 --> 00:15:14.400
David John Gagne: easily within Derecho's lifetime, if not before then.
128
00:15:15.720 --> 00:15:21.480
David John Gagne: And finally, I think one challenge in everything we're doing, that's always a challenge, is
129
00:15:22.200 --> 00:15:31.170
David John Gagne: implementing scoring and interpretation functions in, like, TensorFlow and PyTorch versus plain NumPy or regular Python. Having to use
130
00:15:31.950 --> 00:15:39.120
David John Gagne: a subtly different API causes lots of productivity issues and friction and annoyances,
131
00:15:39.840 --> 00:15:57.990
David John Gagne: because both TensorFlow and PyTorch each have their own array system under the hood that is mostly like NumPy, but not exactly NumPy-compliant. TensorFlow has exposed an experimental NumPy API, and PyTorch works more friendly with NumPy, but not perfectly. So,
132
00:15:59.010 --> 00:15:59.550
David John Gagne: The.
133
00:16:01.200 --> 00:16:15.780
David John Gagne: standardizing APIs in general, I think, helps a lot with some of these issues. So if you can make drop-in GPU support as seamless as possible, then that increases human productivity quite a bit.
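[One way to illustrate the "standardize on one array API" point: write scoring functions against operations that NumPy arrays, torch tensors, and `tf.experimental.numpy` arrays all expose, instead of a framework-specific API. A hedged sketch, demonstrated here with NumPy only:]

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    # abs() and .mean() are duck-typed: NumPy arrays and torch tensors both
    # provide them, so one function can serve offline and in-framework evaluation
    return abs(y_true - y_pred).mean()

score = mean_absolute_error(np.array([1.0, 2.0, 3.0]),
                            np.array([1.5, 2.0, 2.5]))
```

[The friction the talk describes appears exactly where this duck typing breaks down, e.g. fancy indexing or in-place ops that differ between the array systems.]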
134
00:16:17.130 --> 00:16:20.100
David John Gagne: And i'll turn it over to john to kind of talk about.
135
00:16:21.660 --> 00:16:25.380
David John Gagne: the existing stuff going on.
136
00:16:27.240 --> 00:16:28.770
David John Gagne: So John, ready to take it over?
137
00:16:29.340 --> 00:16:33.480
John Schreck: Yeah, DJ, do you want me to take the screen from you?
138
00:16:34.260 --> 00:16:36.120
David John Gagne: You can just tell me "next slide,"
139
00:16:37.320 --> 00:16:37.740
David John Gagne: that's fine.
140
00:16:40.920 --> 00:16:44.700
John Schreck: yeah that's fine um yeah so uh.
141
00:16:46.260 --> 00:16:54.840
John Schreck: I apologize in advance: some of these slides may be a little bit of a rehash if you were at the WIP talk a couple months ago, but I'm trying to mostly focus on what we're doing with GPUs.
142
00:16:55.620 --> 00:17:03.270
John Schreck: So just to start off a little bit here: one of the big things that we have to worry about is selecting a model.
143
00:17:04.500 --> 00:17:09.270
John Schreck: So just to kind of give you sort of a generic idea of neural nets.
144
00:17:10.350 --> 00:17:17.490
John Schreck: The top diagram here kind of shows you on the Left sort of like the inputs to a neural network model Okay, and it could be.
145
00:17:18.000 --> 00:17:26.040
John Schreck: Numbers, like floating point numbers or real numbers; it could be images, like two-dimensional arrays; and it can be sequential data, right, like
146
00:17:26.460 --> 00:17:31.530
John Schreck: Like like sentences, for instance, or like a sequence of floating point numbers okay.
147
00:17:32.010 --> 00:17:38.520
John Schreck: um so that gets passed through these neural network architectures and you know i'm kind of spare you the details just sticking with the illustration here.
148
00:17:38.820 --> 00:17:49.050
John Schreck: Um, as you can see, there are three main parts in this simplistic diagram: the input, the sort of in-between area of hidden layers, and then there's this model output. Okay, and sort of,
149
00:17:50.280 --> 00:18:03.510
John Schreck: roughly, or generically: the data on the left gets passed through the model, and the model sort of learns something about this data set. Okay, and it's not necessarily clear to us yet what these models are doing.
150
00:18:04.320 --> 00:18:07.560
John Schreck: But nevertheless, it will oftentimes perform well.
151
00:18:08.010 --> 00:18:11.970
John Schreck: For instance, what's coming out on the right side of this diagram here.
152
00:18:12.210 --> 00:18:16.590
John Schreck: Okay, and just to kind of like remind you, the output could also be very similar to what the input was.
153
00:18:16.770 --> 00:18:26.250
John Schreck: Right, it could just be a single output, like a number you're trying to predict, like, I don't know, a mass or something like that, or some numerical floating point number. It could also be an
154
00:18:27.900 --> 00:18:33.000
John Schreck: integer number, such as a one or two or three, which you should think of as
155
00:18:33.510 --> 00:18:42.780
John Schreck: A label, for instance, so like the input might be like an image of a giraffe right and the output is you want to pick the label that refers to giraffe and not all the other labels, you could have chosen from.
156
00:18:43.320 --> 00:18:49.800
John Schreck: Okay, so that's sort of an example of a classification problem, right? Um, but, you know, above all of this, really,
57
00:18:50.430 --> 00:18:59.430
John Schreck: independent of what you're feeding into these models, what's going on inside the models, and what's coming out of the models, you have to worry about certain settings in order to get the model to actually perform for you.
158
00:18:59.910 --> 00:19:09.390
John Schreck: Okay, and the generic term for these settings is hyperparameters. Alright, so in this example, it could be: well, how many hidden layers do you want?
159
00:19:09.870 --> 00:19:20.760
John Schreck: Okay, you have to choose that. Or, if you have three hidden layers, well, how many nodes (the green dots) do you want in each of the hidden layers? It doesn't have to be the same as it's drawn here. Okay, so
160
00:19:21.390 --> 00:19:35.730
John Schreck: a "bad choice," and I put quotes around that, basically leads you to a poor model; sometimes that's what a bad choice is, right? So how do you choose wisely, I guess? Okay, so I'll get into that, and what we're doing with the GPUs, mainly here on Casper.
161
00:19:36.930 --> 00:19:39.960
John Schreck: yeah thanks a lot so uh.
162
00:19:40.380 --> 00:19:47.070
John Schreck: For the most part, a lot of the neural nets that we're dealing with tend to be growing in size, and David John kind of pointed out, you know, that
163
00:19:47.130 --> 00:19:52.800
John Schreck: a lot of the benchmarks that have been reported are using certain models that are different sizes and doing different things.
164
00:19:53.730 --> 00:19:59.700
John Schreck: So, for the most part, at some point you're going to have to wind up using GPUs, because training these models involves loads of
165
00:20:00.240 --> 00:20:06.360
John Schreck: numerical calculations that need to be done on the GPU. Okay, so I've pointed out two cases here, really.
166
00:20:07.350 --> 00:20:14.610
John Schreck: The first one is sort of one I'm not really going to get into, which is: your model and your data don't fit onto one GPU, so you need more than one GPU.
167
00:20:14.970 --> 00:20:20.040
John Schreck: Okay. PyTorch and TensorFlow (and there are other libraries out there) are sort of the two most popular ones.
168
00:20:20.460 --> 00:20:31.320
John Schreck: They have been working a lot lately to really make this a lot easier. For instance, you can break up your data, the batches of data that you're passing in on the left in this diagram, coming out on the right, and then fitting to the truth.
169
00:20:32.130 --> 00:20:41.010
John Schreck: That can actually be broken up and handed off to a bunch of different GPUs, and that data can then be recombined to perform a single weight update on the model.
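[The split-and-recombine scheme being described can be simulated in a few lines of NumPy: the batch is split into shards, one per "GPU"; each shard computes a gradient; the averaged gradient drives one weight update. A toy sketch with a one-parameter linear model, not framework code:]

```python
import numpy as np

def grad_mse_linear(w, x, y):
    # gradient of mean((w*x - y)^2) with respect to the scalar weight w
    return (2.0 * (w * x - y) * x).mean()

def data_parallel_step(w, x, y, n_devices, lr=0.1):
    # shard the batch across "devices", compute per-shard gradients,
    # then average them into a single weight update
    x_shards = np.array_split(x, n_devices)
    y_shards = np.array_split(y, n_devices)
    grads = [grad_mse_linear(w, xs, ys) for xs, ys in zip(x_shards, y_shards)]
    return w - lr * np.mean(grads)

x = np.arange(1.0, 9.0)   # 8 samples, true relation y = 3x
y = 3.0 * x
w = data_parallel_step(0.0, x, y, n_devices=4)
```

[With equal-sized shards, the averaged gradient equals the full-batch gradient, so the four-"device" step matches a single-device step exactly.]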
170
00:20:41.700 --> 00:20:49.290
John Schreck: The latter case is what's more interesting to me in some senses: okay, the model and the data fit on one GPU, but how much of those resources are left over?
171
00:20:49.680 --> 00:21:03.870
John Schreck: Okay, because right now, the way the system works on Casper is that if you ask for a GPU, you get all of the memory that comes with it. Okay, but it's very often the case that people use GPUs and they're not using all 32 gigs, for instance, on some of our nodes.
172
00:21:04.920 --> 00:21:09.960
John Schreck: So can we use those leftover resources next slide please.
173
00:21:12.030 --> 00:21:19.680
John Schreck: Alright, so just to go back to the first part: how do we pick the right hyperparameters so we get a good machine learning model? I'll try to
174
00:21:20.190 --> 00:21:28.410
John Schreck: go over some of this stuff relatively quickly. Um, there are a few main steps involved here, and the diagram on the left illustrates them to some extent.
175
00:21:29.040 --> 00:21:37.920
John Schreck: You have some objective when you're training a model that you're trying to achieve: you want to minimize some quantity, and the example here is the mean absolute error. Okay.
176
00:21:39.060 --> 00:21:45.960
John Schreck: When you have to choose hyperparameters, I'm highlighting, for instance, the learning rate and the number of neurons. So you might just pick, I don't know, a learning rate of
177
00:21:46.440 --> 00:21:53.070
John Schreck: one tenth and 15 neurons; you train the model, you get a mean absolute error. That's a trial.
178
00:21:53.580 --> 00:22:00.810
John Schreck: Okay, so a study, then, is when we do that a whole bunch of times. We just pick different combinations of learning rate and neurons, we train them, and get an
179
00:22:01.380 --> 00:22:09.420
John Schreck: optimization objective value at the end of the day. Okay, so one way that you can sample these, for instance, is by just guessing, which,
180
00:22:09.990 --> 00:22:13.710
John Schreck: if you're not very well informed on how the neural net is going to work, is random search.
181
00:22:14.100 --> 00:22:19.380
John Schreck: Um, I will show you in a few slides that we use a combination of random searching, and then we use what
182
00:22:19.710 --> 00:22:31.890
John Schreck: I'm just going to refer to as an informed search, which is really a Gaussian mixture model based on Bayesian statistics that tries to leverage observations that you've already made. That's basically:
183
00:22:33.000 --> 00:22:38.520
John Schreck: What was a trial, what was the outcome and then tries to use that to pick a better next set.
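[The trial/study loop just described, in its simplest random-search form. This is an illustrative sketch, not ECHO code: the `objective` is a made-up stand-in for "train the model, report the mean absolute error," with its best point placed arbitrarily near a learning rate of 0.01 and 64 neurons.]

```python
import math
import random

def objective(learning_rate, n_neurons):
    # hypothetical response surface standing in for a real training run
    return (math.log10(learning_rate) + 2.0) ** 2 + ((n_neurons - 64) / 64.0) ** 2

def random_search(n_trials, rng):
    """Each iteration is one 'trial'; the whole loop is one 'study'."""
    best = None
    for _ in range(n_trials):
        lr = 10.0 ** rng.uniform(-4.0, -1.0)   # sample learning rate on a log scale
        neurons = rng.randrange(8, 257)        # 8..256 neurons
        trial = (objective(lr, neurons), lr, neurons)
        if best is None or trial[0] < best[0]:
            best = trial
    return best

best_loss, best_lr, best_neurons = random_search(50, random.Random(42))
```

[An "informed" search would replace the uniform sampling with a model fit to the (trial, outcome) history, as the talk describes.]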
184
00:22:40.680 --> 00:22:41.370
John Schreck: Next slide please.
185
00:22:43.380 --> 00:22:53.160
John Schreck: Alright, so, um, we've been working on a little package here called ECHO; I think I've presented this before. It's called Earth Computing Hyperparameter Optimization; it's a distributed, multi-GPU approach to hyperparameter optimization.
186
00:22:54.120 --> 00:22:59.760
John Schreck: There's my GitHub for it. It's a little bit of a mess right now; we're trying to get an update out pretty quickly here.
187
00:23:00.630 --> 00:23:05.880
John Schreck: Overall, we're trying to make it pretty easy to use; in some sense it's still a little hard to use.
188
00:23:06.510 --> 00:23:15.600
John Schreck: The new updates try to make that a lot easier for other people at NCAR who want to do machine learning but aren't necessarily doing it every day, and are just trying to get into it.
189
00:23:16.020 --> 00:23:23.010
John Schreck: So for now there are some dependencies I'm not going to get into too much. Later on, I hope to give you an option where you can actually just upload your data set,
190
00:23:23.400 --> 00:23:28.380
John Schreck: or, you know, point your data set at ECHO, and it might even possibly be able to suggest models for you.
191
00:23:29.370 --> 00:23:35.280
John Schreck: Right now, you have to pick a model that you want to optimize, and then
192
00:23:35.580 --> 00:23:44.040
John Schreck: ECHO will take over from there. But it's still up to you: if it's a language processing model, you still have to pick the appropriate model. Next slide, please.
193
00:23:46.350 --> 00:23:56.580
John Schreck: Alright, so the way that this works here at NCAR: for the most part we have Casper and Cheyenne. Casper is the one that has GPUs, but sometimes models are right at the cusp, or you could use both.
194
00:23:57.660 --> 00:24:04.650
John Schreck: So the way that we kind of distribute all this is by initiating a database entry for a study.
195
00:24:05.220 --> 00:24:14.130
John Schreck: Okay, so I want to optimize a model; I'm going to do a bunch of trials, so I will save all of that data in a database as part of the trial's, or the study's,
196
00:24:14.430 --> 00:24:16.440
John Schreck: record. Okay, so
197
00:24:16.740 --> 00:24:27.330
John Schreck: that means all I really need to be able to do is write to this database, which means I don't have to have everything on the same computer. I could have stuff running in, you know, South America if I really wanted to, as long as it can reach the database.
198
00:24:28.110 --> 00:24:36.570
John Schreck: So that basically means I can run as many trials as I want simultaneously and max out my resources. Okay, and that's the objective here, and that's what the drawing is trying to show you.
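[The shared-database coordination pattern just described can be sketched with SQLite: every worker, wherever it runs, records its trials in one study table and reads others' results back. ECHO's actual storage layer may differ; the table and function names here are illustrative only.]

```python
import sqlite3

# A real study would point at a shared server or networked file;
# in-memory is enough to show the pattern.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE trials
                (study TEXT, worker TEXT, learning_rate REAL, loss REAL)""")

def report_trial(study, worker, lr, loss):
    # any worker, on any machine, only needs write access to this table
    conn.execute("INSERT INTO trials VALUES (?, ?, ?, ?)",
                 (study, worker, lr, loss))

def best_trial(study):
    # workers (or an informed sampler) read the shared history back
    return conn.execute(
        "SELECT worker, learning_rate, loss FROM trials "
        "WHERE study = ? ORDER BY loss LIMIT 1", (study,)).fetchone()

report_trial("demo", "casper-gpu-1", 0.01, 0.42)
report_trial("demo", "cheyenne-17", 0.10, 0.77)
best = best_trial("demo")
```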
199
00:24:36.990 --> 00:24:44.670
John Schreck: Alright, so I'm just showing you workers, and I use that to mean a node, okay, with GPUs, or, you know, whatever you ask for.
200
00:24:45.210 --> 00:24:55.170
John Schreck: On a node you've got to specify what you want, right: you want one GPU, you want multiple GPUs, whatever. But when I say worker, that's what I have in mind, just for the purpose of how I made my slides. Next slide, please.
201
00:24:57.060 --> 00:25:04.350
John Schreck: So within each worker or node there's going to be a GPU. Let's just say there's one GPU; I just asked for one GPU per worker.
202
00:25:04.620 --> 00:25:09.450
John Schreck: Okay, and let's just suppose that the memory footprint, you know, the amount of data plus the model
203
00:25:09.870 --> 00:25:24.390
John Schreck: that's going to be mounted on the GPU at once is only ever going to be, let's just say, a quarter of the total memory available. Well, that means I can put four models and the data, basically copying it four times, onto the GPU, okay? And that's what ECHO does.
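The memory arithmetic John walks through (each model-plus-data copy fits in a fraction of GPU memory, so several copies fit at once) amounts to one division. The GPU and per-trial sizes below are illustrative, not numbers from the talk.

```python
def replicas_per_gpu(gpu_memory_gb, per_trial_gb):
    """How many model+data copies fit on one GPU: if each trial uses at most
    a quarter of the memory, four copies fit."""
    return int(gpu_memory_gb // per_trial_gb)

# Illustrative numbers: a 40 GB GPU, 10 GB per model+data copy.
copies = replicas_per_gpu(40, 10)   # four copies per GPU
concurrent = 3 * copies             # three workers, so 12 trials at once
```

The hard part, as John notes next, is estimating `per_trial_gb` up front; ECHO currently leaves that choice to the user.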
204
00:25:25.470 --> 00:25:29.520
John Schreck: So in this example here, it's kind of trivial math, right? It's like, alright, cool, I have
205
00:25:30.420 --> 00:25:40.230
John Schreck: three workers with four models per GPU each, so now I have 12 things going at once. Okay, the main problem here really is that it's tough, for me at least,
206
00:25:41.220 --> 00:25:46.230
John Schreck: to really accurately estimate how many resources you're going to need up front, right. And
207
00:25:46.710 --> 00:25:53.250
John Schreck: right now the approach is just to leave it to you, the user: well, how many copies of your model do you want to try on the same GPU?
208
00:25:53.520 --> 00:26:02.010
John Schreck: Later on it would be ideal for you not to have to do that. At the end of the day I really want ECHO to be as simple and as user-friendly as possible, because
209
00:26:02.310 --> 00:26:11.760
John Schreck: most people don't care about these nitty-gritty GPU tricks and things like that; you don't need to know that. You just need to get the best model that you can get.
210
00:26:12.180 --> 00:26:18.390
John Schreck: Okay, and obviously that's ultimately our objective. I just hit next slide on my keyboard, but I guess that's fine.
211
00:26:19.560 --> 00:26:28.170
John Schreck: Alright, so I'll try to go through my example here. HOLODEC is probably the project of ours that really uses the GPUs the most, to a certain degree. So HOLODEC is a hologram
212
00:26:29.640 --> 00:26:39.180
John Schreck: detector, a holographic detector. Our collaborators here are Aaron Bansemer and Matt Hayman over in EOL. You can see here the detector mounts to a plane.
213
00:26:39.810 --> 00:26:47.640
John Schreck: It's been mounted on multiple planes, research planes, and you basically fly this thing around clouds. Okay, and what it does, as it's
214
00:26:47.970 --> 00:26:55.440
John Schreck: written up at the top, is determine the size, the two-dimensional shape, and the three-dimensional position of hydrometeors,
215
00:26:55.860 --> 00:27:04.320
John Schreck: for instance, in these clouds. Okay, so the main things we're interested in are small liquid water particles, but I'll show you a few examples at the end of, like, other stuff
216
00:27:04.470 --> 00:27:11.910
John Schreck: in these holograms, and some of the things we don't even really know what they are. Okay, so next slide I'll give you a couple of examples of what this looks like.
217
00:27:12.510 --> 00:27:19.140
John Schreck: So the image on the left is a real example of a hologram that came from HOLODEC; I just refer to them as the HOLODEC holograms.
218
00:27:19.440 --> 00:27:25.980
John Schreck: And I zoom in on a particle there. The particle is not in focus here; it's some distance away, like into the page.
219
00:27:26.430 --> 00:27:32.340
John Schreck: Okay, on the right is a simulated hologram. You can see it's a lot better looking, right?
220
00:27:32.940 --> 00:27:44.520
John Schreck: Even though that particle on the right is not in focus, it's pretty obvious compared to the one on the left, which is kind of fading. And if you look very carefully on the left there's weird stuff, noise and other things, that are...
221
00:27:45.840 --> 00:27:48.300
John Schreck: I don't want to say that they're artifacts, but they're there.
222
00:27:49.440 --> 00:27:53.100
John Schreck: These holograms are very large; note that they're megapixels in size.
223
00:27:53.370 --> 00:27:58.470
John Schreck: Okay, and the little squares I've actually drawn as little insets are 512-by-512 subsamples,
224
00:27:58.680 --> 00:28:09.420
John Schreck: which is actually the size we're going to feed into a neural net, because even that's kind of big. There's no chance right now that you could just take this giant image, feed it through a neural net, and not run out of GPU memory real quick.
225
00:28:10.500 --> 00:28:24.750
John Schreck: Among other things. So the current way you extract these particles is a program called HoloSuite. It does not involve machine learning; it's based on physics calculations. So our main question is: can we get a neural net to do better than that? Next slide, please.
226
00:28:26.730 --> 00:28:37.560
John Schreck: So in that example I showed before, it's clear that that's a plane, right, the x-y plane, and if you're able to locate where the particle is, you've got x and y.
227
00:28:37.950 --> 00:28:44.640
John Schreck: Okay, but z kind of stymied us for about a year, and we actually wound up taking a bit of a
228
00:28:45.240 --> 00:28:54.750
John Schreck: play out of the HoloSuite playbook, which is to take advantage of something called wave propagation. All that really means is that the hologram is an electromagnetic field that we're looking at; it's just been
229
00:28:55.590 --> 00:28:59.250
John Schreck: processed in a way that we can actually make sense of it with our eyes.
230
00:29:00.000 --> 00:29:12.240
John Schreck: But as such it's governed by the laws of physics, and we can take advantage of those to take that image and reconstruct the hologram at some other distance z into the page, okay?
231
00:29:12.810 --> 00:29:19.740
John Schreck: So at certain distances the particles come into focus. Now, in order for us to take advantage of that, to
232
00:29:20.580 --> 00:29:33.450
John Schreck: wave propagate to some z and say, oh yeah, the particle's in focus there, so that's where it is in z, we have to use Fourier transforms. In fact we have to do one, and then an inverse Fourier transform after that, in order to do the wave prop.
233
00:29:34.590 --> 00:29:40.680
John Schreck: The pictures just show you two different ways the waves can come in, and we take advantage of the one on the left, where the
234
00:29:40.920 --> 00:29:47.460
John Schreck: hologram is just shooting out plane incident waves; we don't have a radially emanating detector like the one on the right.
235
00:29:48.600 --> 00:30:01.380
John Schreck: So I wanted to note that as of PyTorch 1.9 they started supporting fast Fourier transforms. We had written it ourselves, but it's a lot easier to just call their method, and it was already natively GPU-
236
00:30:02.820 --> 00:30:11.790
John Schreck: oriented. In other words, it's done on the GPU. Okay, and this is in some sense part of data preprocessing, not so much the model yet.
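The wave-propagation step John describes, one FFT, a phase factor, then an inverse FFT, can be sketched with the angular spectrum method. The talk's pipeline does this with `torch.fft` on the GPU; `numpy.fft` has the same call shape, and the wavelength and grid spacing below are illustrative, not HOLODEC's actual parameters.

```python
import numpy as np

def angular_spectrum_propagate(field, z, wavelength, dx):
    """Reconstruct a hologram plane at distance z: FFT the field, multiply by
    the free-space transfer function exp(i*kz*z), then inverse FFT back."""
    ny, nx = field.shape
    k = 2 * np.pi / wavelength
    kx = 2 * np.pi * np.fft.fftfreq(nx, d=dx)
    ky = 2 * np.pi * np.fft.fftfreq(ny, d=dx)
    # Cast to complex so evanescent modes (k_z imaginary) are handled too.
    kz = np.sqrt((k**2 - kx[None, :]**2 - ky[:, None]**2).astype(complex))
    spectrum = np.fft.fft2(field)           # forward FFT
    spectrum *= np.exp(1j * kz * z)         # propagate each plane wave by z
    return np.fft.ifft2(spectrum)           # inverse FFT to image space
```

Propagating forward by `z` and then back by `-z` recovers the original field, which is a handy sanity check on an implementation like this.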
237
00:30:12.930 --> 00:30:13.710
John Schreck: Next slide please.
238
00:30:15.570 --> 00:30:18.600
John Schreck: Let me know, by the way, Brian if I run out of time or you need me to stop.
239
00:30:19.680 --> 00:30:27.540
John Schreck: So here's the model, okay. The diagram on the left, I just want to show you, is the same little inset pictures I showed previously, those out-of-focus particles.
240
00:30:27.870 --> 00:30:32.880
John Schreck: And then the little sub-panel on the right of that shows me doing that wave-prop calculation on the GPU
241
00:30:33.390 --> 00:30:40.980
John Schreck: to the z where the particle's in focus, and when it is, it's just this nice little dark dot, right? And that's the particle's diameter that you're looking at.
242
00:30:41.460 --> 00:30:47.970
John Schreck: Okay, so that's what we want to estimate. We want to be able to get the x, y in the plane, and we want to be able to wave prop to a z that's pretty close,
243
00:30:48.180 --> 00:30:53.460
John Schreck: right where the particle looks like it's in focus, and then we can get an estimate of that diameter. Okay, so
244
00:30:54.120 --> 00:31:02.940
John Schreck: these are the things that go into a neural net. I'm not going to give you any details other than it's large; it's a U-Net, if you know what that is. U-Nets and other types of models can output
245
00:31:03.630 --> 00:31:06.900
John Schreck: basically image-type outputs, which I'm going to refer to as a mask here.
246
00:31:07.200 --> 00:31:15.750
John Schreck: And that mask on the right side is basically binary, right: it's predicting zeros where there's no particle in focus, and ones where it is in focus. So it's a little
247
00:31:16.230 --> 00:31:29.370
John Schreck: circle I'm trying to scratch in, basically where the particle is. So once I have a good mask prediction on the right, I can back out the diameter from it. Next slide, please.
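One common way to back a diameter out of a binary in-focus mask is the equivalent-area diameter: count the "one" pixels, treat that area as a circle, and invert the area formula. The talk doesn't give HOLODEC's exact formula, so this is a sketch of the idea.

```python
import numpy as np

def equivalent_diameter(mask, dx=1.0):
    """Diameter of the circle whose area matches the masked pixel area:
    solve area = pi * (d/2)**2 for d. dx is the pixel size."""
    area = mask.sum() * dx * dx
    return 2.0 * np.sqrt(area / np.pi)

# Example: rasterize a disk of radius 10 pixels and recover roughly 20
# for the diameter (rasterization makes it approximate).
yy, xx = np.mgrid[:64, :64]
disk = ((xx - 32)**2 + (yy - 32)**2 <= 10**2).astype(np.uint8)
d = equivalent_diameter(disk)
```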
248
00:31:31.200 --> 00:31:44.850
John Schreck: Alright, so here's my ECHO optimization of this model. I'll spare you the details: I tried to optimize quite a number of hyperparameters, I tried a bunch of different types of objective loss functions,
249
00:31:45.330 --> 00:31:55.710
John Schreck: I tried a whole range of segmentation models, and I varied the types of pre-trained layers in the encoder; that was a hyperparameter as well.
250
00:31:56.940 --> 00:32:08.160
John Schreck: Details aside, you can see that I drew a vertical green dashed line that basically separates the random-selection phase of hyperparameters from the informed selection.
251
00:32:08.520 --> 00:32:22.380
John Schreck: It's pretty obvious, right: to the left the blue dots are just all over the place, and then, as soon as it starts trying to take advantage of what it's already seen (not the model, but our optimization package) trying to
252
00:32:23.430 --> 00:32:30.780
John Schreck: leverage the trials it's already seen to make better choices, that's pretty obvious too; look at how the blue dots collapse,
253
00:32:31.140 --> 00:32:36.810
John Schreck: especially around trial 130. It keeps going down a little, but we didn't see too much improvement after about trial 250. So
254
00:32:36.990 --> 00:32:50.400
John Schreck: this is not cheap, right. This is already taking advantage of that 4x, four models per GPU, and there are more than 400 trials here; actually more than that, I'm not showing all of them. I don't know how many GPU hours this took. I did it when you all weren't using them.
255
00:32:51.570 --> 00:32:52.470
John Schreck: Next slide please.
256
00:32:54.000 --> 00:33:02.280
John Schreck: So how do I actually use this thing once it's trained? I noted that you have to wave prop to a z, but we don't know where the particles are, so we have to just wave prop to a whole bunch of different
257
00:33:02.580 --> 00:33:09.030
John Schreck: z's and try to figure out if there's a particle there. Okay, so that's a choice: the number of z planes at which to reconstruct the hologram.
258
00:33:09.390 --> 00:33:16.410
John Schreck: Every time I do that, I have to break down that large hologram into those little 512-by-512 cells and pass each one to the model after performing the reconstruction.
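Breaking the full hologram into 512-by-512 cells and stitching the per-cell model outputs back into a full-size mask can be sketched as follows. This assumes image dimensions that divide evenly by the tile size; the real pipeline would also need padding or overlap handling, which John says he left out.

```python
import numpy as np

def tile_hologram(image, tile=512):
    """Break a large hologram into tile x tile cells for the neural net,
    remembering each cell's origin so predictions can be reassembled."""
    ny, nx = image.shape
    tiles, origins = [], []
    for y in range(0, ny, tile):
        for x in range(0, nx, tile):
            tiles.append(image[y:y + tile, x:x + tile])
            origins.append((y, x))
    return tiles, origins

def reassemble(tiles, origins, shape):
    """Stitch per-tile model outputs back into a full-size array."""
    out = np.zeros(shape, dtype=tiles[0].dtype)
    for t, (y, x) in zip(tiles, origins):
        out[y:y + t.shape[0], x:x + t.shape[1]] = t
    return out
```

Because each tile (and each z plane) is independent, they can be dispatched to as many workers as are available, which is the scalability point John makes next.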
259
00:33:16.650 --> 00:33:24.690
John Schreck: And then I wind up with a full-size mask prediction at each z value. Okay, I've left out all the details of how I
260
00:33:24.900 --> 00:33:31.140
John Schreck: handle that: I don't shrink down the full-size image, I subsample it, and then I reassemble the model outputs back into the full
261
00:33:31.620 --> 00:33:34.710
John Schreck: size. I don't really do much on the GPU with that, so I just left it out.
262
00:33:35.580 --> 00:33:41.280
John Schreck: So the point with how this pipeline is processed is that all the z planes going into the model are independent. I
263
00:33:41.490 --> 00:33:49.140
John Schreck: mean it doesn't matter what order they go in, right; any plane can go first and any other can go later. That just means I can do them all at once if I have the resources.
264
00:33:49.440 --> 00:33:56.520
John Schreck: Okay, I did this on purpose, so that the algorithm is scalable, because HoloSuite cannot scale, not the way it was written. So
265
00:33:57.420 --> 00:34:05.130
John Schreck: if I have all the GPUs, especially when Derecho comes around and I have even more resources, I can really push how fast I can process holograms.
266
00:34:05.760 --> 00:34:11.850
John Schreck: So just to say right now, it takes about two to five minutes per hologram because it's kind of a lot of processing.
267
00:34:12.270 --> 00:34:21.930
John Schreck: As David John kind of noted, there's a ton of data generated in intermediary phases here, things going into the model, versus just having a nice clean output list of
268
00:34:22.440 --> 00:34:29.700
John Schreck: particles and their locations, which is ultimately what we want. So there's a lot of difficulty there, and, you know, it's really more of a...
269
00:34:30.390 --> 00:34:35.460
John Schreck: I don't really know what to say about it; there's a lot of optimization that could still probably be done
270
00:34:35.910 --> 00:34:46.410
John Schreck: that gets into sort of interesting usages of GPUs and so on. And I'll note too that I take advantage of my little trick of mounting more than one model to a GPU:
271
00:34:46.770 --> 00:34:55.860
John Schreck: when I do this, I try to take advantage of every possible resource I can before I crash the node, okay. Next slide, please.
272
00:34:57.510 --> 00:35:09.870
John Schreck: So these are just some predictions of the x, y, z, and d coordinates. Now, when I run all those planes through the model at different z's, we actually have to do another post-processing
273
00:35:11.130 --> 00:35:15.960
John Schreck: calculation that I'm not going to show you, because it gets into too many details and it's already past 2:30.
274
00:35:16.320 --> 00:35:21.600
John Schreck: So just to say, we perform a clustering routine, and we use a distance threshold in order to perform that clustering.
275
00:35:22.080 --> 00:35:31.020
John Schreck: That allows me in some sense to toggle how many particles we actually predict, and then we can line them up with the true particles; this data is the simulated holograms.
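The distance-threshold matching John alludes to, where loosening the threshold toggles how many predicted particles count as matched against truth, might look like the greedy nearest-neighbor pairing below. The actual clustering routine isn't shown in the talk, so treat the function names and matching strategy as illustrative.

```python
import numpy as np

def match_particles(pred, truth, threshold):
    """Greedily pair each predicted particle (rows of x, y, z) with its
    nearest unmatched true particle, accepting the pair only if the
    distance is within the threshold."""
    dists = np.linalg.norm(pred[:, None, :] - truth[None, :, :], axis=2)
    matches, used = [], set()
    for i in np.argsort(dists.min(axis=1)):      # closest predictions first
        for j in np.argsort(dists[i]):
            if dists[i, j] > threshold:
                break                            # nothing closer remains
            if j not in used:
                matches.append((i, j))
                used.add(j)
                break
    return matches

def match_rate(pred, truth, threshold):
    """Fraction of true particles matched, e.g. the talk's ~86%."""
    return len(match_particles(pred, truth, threshold)) / len(truth)
```

Raising the threshold admits more (looser) matches; lowering it trades match rate for positional accuracy, which is the tuning knob John mentions.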
276
00:35:31.950 --> 00:35:41.070
John Schreck: And overall you can see pretty strong agreement; it's about an 86% match rate here. I picked a threshold value that gave me that on purpose, and I'll get into that next slide.
277
00:35:41.550 --> 00:35:47.040
John Schreck: You can fiddle with that distance threshold a little to toggle the performance that you want.
278
00:35:48.060 --> 00:35:54.240
John Schreck: So just to give you a little example here: Gabrielle Gantos wrote a really nice visualization script to help us
279
00:35:54.750 --> 00:36:03.420
John Schreck: visualize a very large number of particles in a 3D visualizer. On the left is the truth and on the right is the prediction; this is just one random hologram that I picked.
280
00:36:04.110 --> 00:36:09.780
John Schreck: If you look very carefully you can see some differences. It's not perfect, but then it was only an 86% match in this case.
281
00:36:11.010 --> 00:36:19.890
John Schreck: But it's pretty good, right? In fact, I didn't point this out in the previous example (yeah, thanks David John), but if you look at the bottom right, the d (diameter) histogram:
282
00:36:20.250 --> 00:36:26.970
John Schreck: we mostly just suffer a little at predicting the smallest particles, and I'm going to tell you that I kind of had to do that on purpose.
283
00:36:27.210 --> 00:36:33.510
John Schreck: I had to introduce noise into the training in order to get the neural nets to perform better on the real data, which had noise in it.
284
00:36:33.990 --> 00:36:44.370
John Schreck: That noise actually makes it a little harder to predict the smallest particles. I'm not going to get into details; we can talk about that more another time. DJ, can you go two slides ahead, please?
285
00:36:46.410 --> 00:36:51.270
John Schreck: So I just want to point out this last table here. It's a lot of numbers; I'm not going to make you look at all of them.
286
00:36:51.510 --> 00:37:01.440
John Schreck: This is a comparison of what the neural net is predicting on the real holograms. All the results I showed you before were on those perfect synthetic holograms; now here's a
287
00:37:01.800 --> 00:37:14.460
John Schreck: head-to-head comparison against HoloSuite. We had to manually label these examples. The way we did that was: we took all the predictions from HoloSuite and all the predictions from the neural net, and then myself,
288
00:37:15.240 --> 00:37:24.330
John Schreck: Matt Hayman, Aaron Bansemer, and Gabrielle Gantos manually labeled them. The label was a one or a zero, and a one basically meant: the particle is in focus, in my opinion.
289
00:37:24.870 --> 00:37:31.110
John Schreck: And that opinion came with a ranking, one to five, in confidence. So we're not trying to do the x, y, z, d here; we're just trying to say
290
00:37:31.680 --> 00:37:43.320
John Schreck: which one is right, because HoloSuite is not the truth here. And at the very bottom, the boldface numbers are the ones I want you to look at. Just look at the accuracy: it's 88% to 69%, so we beat HoloSuite by
291
00:37:44.580 --> 00:37:45.510
John Schreck: 19 points.
292
00:37:46.770 --> 00:37:47.700
John Schreck: Next slide please.
293
00:37:49.470 --> 00:37:53.940
John Schreck: So these are just a couple of examples of who's getting what wrong. The top
294
00:37:54.780 --> 00:38:03.690
John Schreck: row shows what HoloSuite is getting wrong. The first two examples here: one of them is called a wave reflection, just something that bounced off, I think, the detector.
295
00:38:03.990 --> 00:38:10.200
John Schreck: And these goofy, funky-looking patterns: the top-left one is not a true particle, and the one right next to it is actually an artifact.
296
00:38:10.530 --> 00:38:17.400
John Schreck: You know how there's a little bright spot and, next to it, a dark one? I don't know what it is; I was informed by the real experts,
297
00:38:17.670 --> 00:38:24.570
John Schreck: Matt Hayman and Aaron, who are the real hologram guys here. They know what these things are, and that is not a real particle that we're interested in.
298
00:38:25.140 --> 00:38:30.570
John Schreck: The two after that show particles that we labeled as true that HoloSuite just didn't get right for some reason.
299
00:38:31.200 --> 00:38:33.420
John Schreck: The bottom row shows the examples of the neural net.
300
00:38:33.810 --> 00:38:40.650
John Schreck: So the two there on the left are these blurry, half-moon-looking ones, and we really weren't sure what to make of them, like, are they particles?
301
00:38:40.950 --> 00:38:46.710
John Schreck: There could have been a particle that's slightly out of focus, and these were examples that were near the edge of the holograms.
302
00:38:48.630 --> 00:38:54.840
John Schreck: Just to say, when you do manual labeling it ain't perfect, and this is why we had to associate a confidence score.
303
00:38:56.250 --> 00:38:59.880
John Schreck: Now the two to the right of that are just more examples from the neural net itself.
304
00:39:00.780 --> 00:39:06.210
John Schreck: The bigger one there is kind of blurry, right; it's not completely dark, there's a little bit of a bright spot in the middle.
305
00:39:06.660 --> 00:39:13.080
John Schreck: And the one way over there on the right still has more of a noisy pattern in it. You can kind of see these wavy-looking, I don't know what to call them, but
306
00:39:13.410 --> 00:39:19.080
John Schreck: it's a noise pattern, and it's a very small particle, so it's one we know the model is already going to have a hard time predicting.
307
00:39:20.130 --> 00:39:20.820
John Schreck: Next slide please.
308
00:39:22.410 --> 00:39:22.800
John Schreck: That.
309
00:39:24.210 --> 00:39:29.970
John Schreck: I hope I didn't take up too much time. I just want to thank everyone who was involved in all this. Yeah.
310
00:39:32.700 --> 00:39:35.430
Brian Vanderwende: Thanks, guys. Yes, perfect timing.
311
00:39:36.600 --> 00:39:41.640
Brian Vanderwende: Looks like we've got plenty of time if people have questions they want to field toward
312
00:39:43.470 --> 00:39:45.930
Brian Vanderwende: either of these aspects.
313
00:39:47.250 --> 00:39:48.240
Brian Vanderwende: John Dennis, go ahead.
314
00:39:49.920 --> 00:39:55.530
John Dennis (he/him): yeah I was curious you were talking about all these calculations that you're performing.
315
00:39:56.760 --> 00:40:01.680
John Dennis (he/him): I assume... does the code run on CPU as well, and do you have comparisons?
316
00:40:05.580 --> 00:40:10.470
John Schreck: If I ran it on the CPU? The models, you mean, like actually evaluating the neural nets?
317
00:40:11.160 --> 00:40:21.990
John Dennis (he/him): Well, I'm just curious: is the GPU saving you a factor of four, a factor of two, in execution time?
318
00:40:23.010 --> 00:40:31.410
John Schreck: I would say, overall, these models are so large, and even handling the data, these images, is not an easy thing to do.
319
00:40:32.370 --> 00:40:36.930
John Schreck: I mean, if I had no GPUs at all, the project would not be doable; you couldn't do it.
320
00:40:37.470 --> 00:40:51.030
John Schreck: You need the GPUs for the machine learning parts, for the neural net input/output. Other than that, for the most part, all the preprocessing and post-processing that is not wave propagation is done on the CPU.
321
00:40:54.540 --> 00:40:55.680
David John Gagne: To tack on to that:
322
00:40:57.690 --> 00:41:05.700
David John Gagne: it depends on the size of the model, but yeah, the GPU can provide, like, 100x.
323
00:41:07.350 --> 00:41:21.210
David John Gagne: A pretty big, orders-of-magnitude speedup over a single CPU. Obviously you could do distributed training across many CPUs and the difference would decrease a fair bit.
324
00:41:22.800 --> 00:41:26.850
David John Gagne: But then it does increase the complexity of the machine learning workflow,
325
00:41:28.740 --> 00:41:32.820
David John Gagne: and power consumption and stuff like that. So the GPU definitely has a
326
00:41:34.410 --> 00:41:51.300
David John Gagne: big advantage. And from the software perspective, with things like PyTorch and TensorFlow, you can run the exact same code, or nearly the exact same code, on CPU or GPU, so it's possible to do these kinds of comparisons.
327
00:41:52.560 --> 00:41:58.560
David John Gagne: We've done it for our GOES benchmark. It's been a while since I
328
00:41:59.850 --> 00:42:04.380
David John Gagne: looked at it, so I don't remember the exact numbers, but it is definitely in the...
329
00:42:07.110 --> 00:42:09.720
David John Gagne: probably one CPU to...
330
00:42:12.060 --> 00:42:22.140
David John Gagne: The training was a fairly basic convolutional neural network, one CPU versus one V100, and it was
331
00:42:23.520 --> 00:42:27.450
David John Gagne: going from like a couple hours to a minute so.
332
00:42:29.250 --> 00:42:36.030
David John Gagne: And then with multiple GPUs you can scale pretty well on a single node with distributed training.
333
00:42:39.630 --> 00:42:45.990
Brian Vanderwende: There was a follow-up question in the chat from Brian. He asks: what precision are the data and calculations in?
334
00:42:48.690 --> 00:42:49.710
David John Gagne: I think for.
335
00:42:51.120 --> 00:42:53.490
David John Gagne: all of our stuff it's float32.
336
00:42:54.840 --> 00:42:55.410
David John Gagne: But.
337
00:42:56.490 --> 00:43:01.980
David John Gagne: I know in TensorFlow and PyTorch there's support for automatic mixed precision, so you can...
338
00:43:02.880 --> 00:43:16.200
David John Gagne: It's one of the newer features; I don't know if it's turned on by default. But it can allow you to use reduced precision where it makes sense, like have TensorFlow or PyTorch figure out when to use reduced precision.
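A two-line illustration of why mixed precision needs framework support: in half precision, a small update to an order-one value vanishes outright, which is why automatic mixed precision implementations keep float32 master copies of weights and use loss scaling. This illustrates the numerics only, not the frameworks' actual code.

```python
import numpy as np

# Float16 resolves about 3 decimal digits near 1.0 (machine epsilon ~1e-3),
# so adding a 1e-4 "gradient update" to a weight of 1.0 is lost entirely.
# Float32 (epsilon ~1.2e-7) retains the same update.
w16 = np.float16(1.0) + np.float16(1e-4)   # rounds back to exactly 1.0
w32 = np.float32(1.0) + np.float32(1e-4)   # stays distinguishable from 1.0
```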
339
00:43:18.300 --> 00:43:22.530
David John Gagne: there's certainly a lot of people looking into ways to do that to maximize.
340
00:43:24.060 --> 00:43:29.790
David John Gagne: performance and reduce data storage and stuff like that. John, do you have anything to add on that?
341
00:43:30.030 --> 00:43:32.340
John Schreck: Yeah, actually, because of the output. When the model outputs
342
00:43:32.670 --> 00:43:43.800
John Schreck: these masks, you know, I said it outputs zeros and ones, but what it really does is output a number between zero and one. If it's less than, for instance, one half you label it zero; if it's greater than one half you label it one.
343
00:43:44.340 --> 00:43:58.830
John Schreck: But if I want to save all that data for, like, 1000 planes, and I want to save it at full precision, it's a tremendous amount; I filled up my quota multiple times. So actually, when I save that output I only keep three significant figures.
344
00:43:59.940 --> 00:44:06.180
John Schreck: And it's just sort of a choice, but it's something that I needed to do if I wasn't going to keep running out of space.
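The thresholding and reduced-precision saving John describes can be sketched like this. Rounding to three decimal places is used here as a simple stand-in for three significant figures, which is reasonable for probabilities in [0, 1]; the exact rounding scheme in the pipeline isn't shown in the talk.

```python
import numpy as np

# Raw model output: per-pixel probabilities in [0, 1].
probs = np.array([0.0123456, 0.51234, 0.987654, 0.499999], dtype=np.float32)

# Threshold at 0.5 to get the binary mask (0 = no particle, 1 = in focus).
binary_mask = (probs > 0.5).astype(np.uint8)

# Keep roughly three figures before writing to disk, trading a tiny amount
# of precision for a large reduction in storage, as described in the talk.
to_save = np.round(probs, 3)
```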
345
00:44:07.440 --> 00:44:19.710
Brian Dobbins: So just to follow up on that: that's for the data. For the calculations, is this something where you can go lower, or is that too low a precision? I'm just curious about the performance and the precision and how this all works.
346
00:44:20.250 --> 00:44:23.670
John Schreck: I suspect it probably wouldn't be too bad, really. Probably comparable.
347
00:44:24.150 --> 00:44:31.080
John Schreck: I originally wanted to use the original integer inputs, because the image was 0-to-255 pixel counts, but
348
00:44:31.350 --> 00:44:39.720
John Schreck: we needed to do preprocessing transformations, and I noted that I had to add noise into the data during training to get it to perform on the real stuff.
349
00:44:40.080 --> 00:44:49.350
John Schreck: And that throws off your ability to stick with integers, unless you round, which I haven't tried. But I've read some blogs and things out there where
350
00:44:49.890 --> 00:44:59.040
John Schreck: it seems like it's worth the trade-off to not use the floats, especially in the input to a model. But I haven't tested that yet with this particular project.
351
00:45:00.840 --> 00:45:02.430
David John Gagne: It may be worthwhile to do.
352
00:45:02.430 --> 00:45:03.300
David John Gagne: down the road as.
353
00:45:04.650 --> 00:45:07.650
David John Gagne: Well, maybe even test the automatic mixed precision kind of
354
00:45:08.670 --> 00:45:09.900
David John Gagne: frameworks and see if that.
355
00:45:11.670 --> 00:45:19.710
David John Gagne: I've seen in other settings that it can give you a pretty significant speedup just turning that on, without much of a loss in performance or anything.
356
00:45:21.870 --> 00:45:23.610
David John Gagne: Looks like Thomas has a question.
357
00:45:25.290 --> 00:45:37.980
Thomas Hauser: Yeah, John, I think you mentioned you're bottlenecked by GPU availability. Did I understand that correctly? And if you had more GPUs, how much would that speed up your work?
358
00:45:39.930 --> 00:45:53.700
John Schreck: In this case, for HOLODEC, probably a lot. Now, I kind of cherry-picked our HOLODEC project because it is our GPU workhorse project, I would say, for now; we'll be getting into heavier stuff later.
359
00:45:54.810 --> 00:45:56.880
John Schreck: I mean, in the operational case,
360
00:45:58.110 --> 00:46:06.090
John Schreck: the way the EOL folks might use this would be to do 1000 reconstructions with the wave prop. So if I had, for instance, 1000 GPUs,
361
00:46:06.630 --> 00:46:15.630
John Schreck: then everything is done in one go, right. And there's further parallelization that can happen, but I just haven't had enough time to get around to it.
362
00:46:17.790 --> 00:46:18.030
John Schreck: Like.
363
00:46:18.720 --> 00:46:25.680
John Schreck: It would be great to use all of Derecho to process holograms massively and try to tell the plane where to go in real time.
364
00:46:27.150 --> 00:46:31.200
Mick Coady: John, this is Mick. I missed what you said; how many more GPUs
365
00:46:32.070 --> 00:46:32.910
Mick Coady: did you just mention?
366
00:46:32.970 --> 00:46:39.720
John Schreck: In this case I'll usually say 500, because I can put two models and data on a GPU. But, I mean, you know,
367
00:46:40.860 --> 00:46:42.210
John Schreck: Amazon's got that many, right?
368
00:46:43.200 --> 00:46:44.760
John Schreck: But you know just to give you that idea.
369
00:46:44.790 --> 00:46:48.780
John Schreck: it would be like hundreds to maybe 1000 in this particular case.
370
00:46:49.020 --> 00:46:50.670
John Schreck: And i'm not like a.
371
00:46:51.720 --> 00:47:03.180
John Schreck: You know, I've been writing code for a while, but I'm not a software engineer by trade, really, so surely someone else could do a better job in certain areas of this code where they have more experience than me.
372
00:47:04.230 --> 00:47:07.200
John Schreck: Certainly, I think, data prep and data handling.
373
00:47:09.090 --> 00:47:10.920
John Schreck: I know I should probably be using like our.
374
00:47:12.900 --> 00:47:23.190
John Schreck: I won't get into it right now, but just to say: even with this pipeline, there are a number of cool things we could probably do to take advantage of more resources and parallelization.
375
00:47:25.380 --> 00:47:34.590
Brian Vanderwende: You referred to this in the project — you know, when you're starting an ML project and you have to make the decision whether the
376
00:47:35.790 --> 00:47:43.830
Brian Vanderwende: four components that can be run on the GPU — deciding whether to use CPU or GPU resources — is that more of a technical decision right now, or more of a domain decision?
377
00:47:45.840 --> 00:47:59.460
John Schreck: Well, I mean, in the case here, I can do the wave prop calculation on the CPU or on the GPU. It's quite fast on the GPU, but NumPy is still a pretty fast calculation too — it's C++ on the back end anyway. Um, I think —
378
00:48:01.380 --> 00:48:02.040
John Schreck: So, like.
379
00:48:03.300 --> 00:48:14.190
John Schreck: There are complications, for instance, where I'm trying to put too many things on the GPU and the GPU will be like, nah. And this is kind of what I was getting at earlier about me sometimes having a hard time estimating how many resources I need up front.
380
00:48:15.870 --> 00:48:26.220
John Schreck: So there I'll sacrifice performance, right — in some cases I'll just do the wave prop on the CPU and just wait, you know, and it's not that big of a deal.
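[Editor's illustration: the CPU-or-GPU choice John describes is often handled as a backend fallback. A minimal sketch, not his actual code — the FFT-based propagation step and array sizes are assumptions for illustration:]

```python
# Try the GPU backend (CuPy, if installed) and fall back to NumPy on
# the CPU, accepting the slower wave-prop run when no GPU is available.
try:
    import cupy as xp      # assumption: CuPy available alongside a GPU
except ImportError:
    import numpy as xp     # CPU fallback: same array API, just slower

def propagate(field, phase):
    """One FFT-based propagation step: transform, apply a phase
    factor in frequency space, transform back."""
    spectrum = xp.fft.fft2(field)
    return xp.fft.ifft2(spectrum * xp.exp(1j * phase))

field = xp.ones((64, 64), dtype=complex)
out = propagate(field, xp.zeros((64, 64)))  # zero phase leaves the field unchanged
```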
381
00:48:27.840 --> 00:48:33.960
John Schreck: There I'm specifically alluding to when I'm actually training a model — and you saw the result, where I was like, yeah, right, you know what I mean.
382
00:48:35.130 --> 00:48:42.180
John Schreck: I sped it up in a bunch of different other ways, and in some sense I was just trying to train this model once — when really I did it like 500 times to find the best.
383
00:48:42.810 --> 00:48:50.880
John Schreck: Right, and like — but in some sense I don't have to keep doing that over and over and over, right? So I just dealt with it in that case.
384
00:48:52.650 --> 00:48:53.850
John Schreck: Took the performance hit, yeah.
385
00:48:54.750 --> 00:48:55.530
Brian Vanderwende: That makes sense.
386
00:48:56.910 --> 00:48:59.040
David John Gagne: And in probably a more general context, it's —
387
00:49:00.660 --> 00:49:19.260
David John Gagne: The way — as I mentioned at the beginning — I think the best-supported GPU piece of machine learning is kind of the training and inference workflow. But there certainly are many more opportunities to use it in the other parts of the pipeline. It's just —
388
00:49:20.640 --> 00:49:32.850
David John Gagne: Part of the reason we haven't done it is because the software to support it is relatively new — like the RAPIDS code base; they announced cuNumeric two weeks ago, so we haven't had a chance to try it yet, yeah.
389
00:49:34.830 --> 00:49:36.060
David John Gagne: And and.
390
00:49:37.140 --> 00:49:44.280
David John Gagne: Kind of NumPy and pandas — like, the data processing on the CPU side is relatively straightforward, right, and —
391
00:49:45.390 --> 00:49:48.090
David John Gagne: Like, we haven't seen as much of a need for the speedup there, but —
392
00:49:48.900 --> 00:49:58.500
David John Gagne: Where I could see it being more useful is, like, interpolation, for instance, or regridding — that's a very intensive calculation, and it can be pretty slow on
393
00:49:59.400 --> 00:50:06.840
David John Gagne: CPU-based software. And some of it can be parallelized, I'm sure, and could benefit from what the GPU could do for it.
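[Editor's illustration: the regridding case DJ mentions is the kind of array arithmetic that ports between backends. A rough, made-up sketch of a vectorized bilinear regrid written against NumPy — swapping CuPy in for `np` would move the same calculation to the GPU (an assumption about portability, not a benchmarked claim):]

```python
import numpy as np

def regrid_bilinear(field, new_shape):
    """Interpolate a 2-D field onto a new regular grid (bilinear)."""
    ny, nx = field.shape
    yi = np.linspace(0, ny - 1, new_shape[0])   # fractional source rows
    xi = np.linspace(0, nx - 1, new_shape[1])   # fractional source cols
    y0 = np.floor(yi).astype(int); y1 = np.minimum(y0 + 1, ny - 1)
    x0 = np.floor(xi).astype(int); x1 = np.minimum(x0 + 1, nx - 1)
    wy = (yi - y0)[:, None]; wx = (xi - x0)[None, :]  # interp weights
    f00 = field[np.ix_(y0, x0)]; f01 = field[np.ix_(y0, x1)]
    f10 = field[np.ix_(y1, x0)]; f11 = field[np.ix_(y1, x1)]
    return ((1 - wy) * (1 - wx) * f00 + (1 - wy) * wx * f01
            + wy * (1 - wx) * f10 + wy * wx * f11)

coarse = np.arange(16.0).reshape(4, 4)
fine = regrid_bilinear(coarse, (8, 8))   # 4x4 grid upsampled to 8x8
```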
394
00:50:07.230 --> 00:50:15.810
David John Gagne: And it's something we're going to have to do for scaling up to bigger and bigger machine learning problems — like trying to run, say, you know, over the globe or something. Then —
395
00:50:19.650 --> 00:50:22.200
David John Gagne: I think it's the direction we need to look into at least.
396
00:50:23.700 --> 00:50:33.120
David John Gagne: The hard part, I think — I guess it's more of a people-support problem. It's just that these are additional things that people have to learn, and —
397
00:50:34.620 --> 00:50:37.050
David John Gagne: for us supporting it in our group, like —
398
00:50:39.300 --> 00:50:49.590
David John Gagne: We're learning a lot and picking up a lot of different skill sets. So it's like, what do we prioritize — learning more about the domain science, or more about, like, this other GPU
399
00:50:50.550 --> 00:51:03.360
David John Gagne: package or so. So from our perspective — and then if we want to support, say, other scientists out there at NCAR who aren't machine learning experts, who aren't, like, uber Python power users, then —
400
00:51:05.010 --> 00:51:23.250
David John Gagne: adding having to deal with the GPU, in a way that's not hidden under the hood by layers of abstraction, may or may not be a good sell — because of the extra overhead, and we were a little uncomfortable, yeah.
401
00:51:24.840 --> 00:51:29.010
Brian Vanderwende: I saw Cena put her hand up briefly — do you still have a comment or question?
402
00:51:32.790 --> 00:51:32.970
Cena: yeah.
403
00:51:34.230 --> 00:51:36.450
Cena: Maybe I missed it, but, um —
404
00:51:37.590 --> 00:51:41.190
Cena: For the data augmentation and the noise adding —
405
00:51:43.380 --> 00:51:45.750
Cena: So, was that on GPU or CPU?
406
00:51:47.100 --> 00:51:47.790
John Schreck: The noise?
407
00:51:51.390 --> 00:51:58.320
John Schreck: That I actually did on the CPU, and the reason why is because when I'm prepping the data, I parallelize that.
408
00:51:58.830 --> 00:52:04.800
John Schreck: And I need to do that on the CPU — if I do that on the GPU, with the model and also the other stuff, it's too many things to handle for me.
409
00:52:05.310 --> 00:52:16.980
John Schreck: So that's another area where I'll go, okay, if I did it on the GPU, it's worth maybe eight of the CPU workers I would have otherwise spawned — so I'll just ask for 16, you know, or something like that, to kind of balance that out.
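[Editor's illustration: the CPU-side augmentation step John describes — additive noise applied in parallel worker processes during data prep — might look roughly like this sketch. The shapes, noise level, worker count, and function names are illustrative assumptions, not his code:]

```python
import numpy as np
from multiprocessing import Pool

def add_noise(sample, sigma=0.05, seed=None):
    """Additive Gaussian noise augmentation for one sample."""
    rng = np.random.default_rng(seed)
    return sample + rng.normal(0.0, sigma, size=sample.shape)

def _augment(args):
    idx, sample = args
    return add_noise(sample, seed=idx)  # per-sample seed: reproducible

if __name__ == "__main__":
    samples = [np.zeros((32, 32)) for _ in range(8)]
    # e.g. 8-16 workers in practice, trading CPU cores against
    # keeping the GPU free for the model, as John describes.
    with Pool(processes=4) as pool:
        augmented = pool.map(_augment, list(enumerate(samples)))
```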
410
00:52:17.580 --> 00:52:22.290
John Schreck: It would be nice to be able to do everything on the GPU, though, right? It's just — going back to what DJ says —
411
00:52:22.860 --> 00:52:32.760
John Schreck: You know, machine learning pipelines get sort of complicated, especially in this case, and even just keeping track of what's going on is sometimes a bit of a challenge.
412
00:52:35.580 --> 00:52:39.120
Brian Vanderwende: And then I think we'll wrap up this topic with some brief —
413
00:52:41.190 --> 00:52:53.400
Supreeth Madapur Suresh: Yeah, thank you — thank you for the presentations. I have a quick follow-up for DJ. Maybe I misunderstood this — did you say you tried RAPIDS, or you haven't tried it, or you're planning to try?
414
00:52:54.420 --> 00:52:56.640
David John Gagne: We have — I haven't tried it yet.
415
00:52:59.580 --> 00:53:16.320
David John Gagne: I'd like to experiment with it, but we don't have, like, a defined plan — we're sort of dealing with a backlog of other stuff. But it's on my radar to try, or maybe get someone in the group — like, hey, we've got a couple of postdocs coming in, so —
416
00:53:17.580 --> 00:53:25.080
David John Gagne: maybe we could get them to mess around with some of this stuff and see if they can incorporate it into their workflows.
417
00:53:26.100 --> 00:53:28.950
David John Gagne: And if it works well for them, then we, you know, deploy it more widely.
418
00:53:31.170 --> 00:53:46.110
Supreeth Madapur Suresh: Okay — because Cena and I have been working with RAPIDS for a couple of years now. So if you need any help with that, or any advice, we'd be happy to work with you guys or the new postdocs.
419
00:53:46.710 --> 00:53:48.510
David John Gagne: Yeah, definitely appreciate that.
420
00:53:50.190 --> 00:53:53.250
David John Gagne: Which parts of RAPIDS have you been using?
421
00:53:54.150 --> 00:54:05.070
Supreeth Madapur Suresh: Mainly the data frames, and CuPy individually alongside that, but we always wanted to try some of the machine learning functions inside RAPIDS.
422
00:54:05.790 --> 00:54:16.950
Supreeth Madapur Suresh: We just didn't have any code yet. So if you have a small example, we could try it on our time — or if you have new postdocs coming in, we'll be happy to work with them.
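[Editor's illustration: the "small example" being discussed could be written once against pandas and, assuming cuDF is installed, rerun on the GPU by swapping the import — an assumption based on cuDF's documented pandas-style API; the column names and values below are made up:]

```python
import pandas as pd
# With RAPIDS available, `import cudf as pd` would run this same
# groupby on the GPU, since cuDF mirrors the pandas API.

df = pd.DataFrame({
    "station": ["A", "A", "B", "B"],
    "temp":    [10.0, 12.0, 8.0, 9.0],
})
means = df.groupby("station")["temp"].mean()  # per-station mean temp
```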
423
00:54:18.210 --> 00:54:26.100
David John Gagne: Yeah, certainly — I'll try to follow up on that. With the postdocs starting in early January, I'll try to follow up after the holidays, I think.
424
00:54:28.950 --> 00:54:34.680
David John Gagne: But with the data frames — how much of a speedup have you seen using that versus, like, pandas?
425
00:54:35.880 --> 00:54:40.860
David John Gagne: And were there any pain points or challenges for you in getting it to work?
426
00:54:42.630 --> 00:54:55.980
Supreeth Madapur Suresh: Actually, we tried this for the first time with a SIParCS student, and we have a presentation with nice detailed slides about the performance and also what we tried. I'd be happy to share that presentation with you after the call.
427
00:54:56.460 --> 00:54:57.240
David John Gagne: Yeah, that'd be great.
428
00:55:00.000 --> 00:55:04.830
Brian Vanderwende: All right, well — thanks, John; thanks, David John, for sharing the presentation.
429
00:55:06.540 --> 00:55:09.360
Brian Vanderwende: And I appreciate the questions from everybody in the group.
430
00:55:10.380 --> 00:55:30.210
Brian Vanderwende: So I think I'll skip the training topic right now — I think we'll have a lot more to say about that at the next meeting. But I just wanted to leave a couple minutes for a roundtable, if anybody has any other comments, either on the machine learning subject or on GPUs in general.
431
00:55:35.040 --> 00:55:37.590
Brian Vanderwende: And we're right up against our time limit, so that's fine too.
432
00:55:40.260 --> 00:55:41.160
Mick Coady: Well done, Brian.
433
00:55:41.640 --> 00:55:44.190
Brian Vanderwende: Yeah, yeah — right at time, too.
434
00:55:46.020 --> 00:55:55.770
Mick Coady: Thanks, DJ and John — I didn't get to catch much, but I plan to listen in on the recording, so —
435
00:55:56.790 --> 00:55:58.860
Mick Coady: I really appreciate the effort. It looks really good.
436
00:56:00.180 --> 00:56:02.310
David John Gagne: Thank you, Mick. Are your
437
00:56:03.900 --> 00:56:05.490
David John Gagne: plumbing issues under control?
438
00:56:06.840 --> 00:56:13.020
Mick Coady: They're close to fixed now — just got to figure out how to pay for it. I might not be able to retire now, I don't know.
439
00:56:15.840 --> 00:56:18.270
Brian Vanderwende: On that scary thought, we'll wrap up the meeting.
440
00:56:19.020 --> 00:56:19.380
Mick Coady: i'll just.
441
00:56:20.130 --> 00:56:28.350
Mick Coady: point out that our next meeting will be in January — it's set for January 6th, so —
442
00:56:29.400 --> 00:56:39.660
Mick Coady: I look forward to seeing everybody back here then, and hopefully my plumbing problems will be well in the past, so —
443
00:56:41.040 --> 00:56:42.690
Mick Coady: I appreciate everybody's time.
444
00:56:43.830 --> 00:56:44.730
Brian Vanderwende: Have a good holiday, everybody.
445
00:56:45.150 --> 00:56:46.290
Mick Coady: yeah take care.
446
00:56:52.710 --> 00:56:54.150
Brian Vanderwende: plan or in the background next.
447
00:56:56.520 --> 00:57:00.300
Mick Coady: He should be happy now. He was good — he was actually pretty good.
448
00:57:01.500 --> 00:57:01.980
Mick Coady: Very good.