With the release of ChatGPT, it looks like AI tools have kind of officially made their way into the layperson’s workflow. Let’s take the opportunity to get ahead of the dystopian train here, and maybe take a minute to demystify some of the mechanisms at work in these models. I’m hoping this could be the first spark of insight for some of the folks ooh-ing and aah-ing at the elusive inner workings of ChatGPT.

Here’s what’s on the docket: we’re going to hand-code a neural network in Python. It’ll be a real simple one: no frameworks, no Greek letters, no… fancy-pants terminology (blech). Although it will, technically, be a neural network. Again, real simple: just numbers and results. You’ll need a basic understanding of Python, but frankly you could probably stay afloat without it. Maybe by the end, the mysticism of AI won’t have to feel so ominous.

What’s the Problem? Where’s the Data?

Neural networks solve problems. What’s ours? In the spirit of keeping things bare-bones, we’ll be doing the following: we’re given a number x as input, along with a corresponding output y, and we’ll be teaching the network the slope and bias of the line that relates them. Contrived? Maybe. I said this was going to be simple, not realistic. What we’re doing is essentially linear regression: iteratively shimmying a line until it best strikes through a collection of data points.

This is like that PyTorch tutorial where you fit a network to the sine function, except much, much simpler.

The data

…can be downloaded here. As promised: x’s and y’s. Nothing more.

Build A Network: Step by Step

Righty-oh… Let’s build this thing.

What’s a Network?

In our case, when I say network, I’m referring to (1) parameters and (2) rules about how to apply them. Watch out, there’s a little bit of terminology coming your way (I know, I lied earlier). Our network is made up of parameters; specifically, two of them: a weight and a bias. What are these parameters? They’re numbers. In our case, to keep things simple, integers. (In the real world, parameters are more likely to be expressed as floating point numbers, or something even more exotic… but, again, we promised simplicity here.)

How do we choose these integers? Randomly. Our network functions as follows: we take our input x, multiply it by the weight (we’ll call it w), add the result to the bias (or b), and the result is our output y. For the more mathematically inclined:

$$y = w \cdot x + b \tag{1}$$

So, a quick summary: our network is essentially just two integers, w and b, such that a given input x returns a y according to equation (1).
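
If it helps to see that in code, here’s the entire network in plain Python. This is a sketch of my own: the article doesn’t specify a range for the random draw, so the ±100 below is an arbitrary choice.

```python
import random

# The whole "network": two randomly chosen integer parameters.
w = random.randint(-100, 100)  # weight (the range is an arbitrary choice)
b = random.randint(-100, 100)  # bias

def predict(x):
    """Equation (1): multiply by the weight, then add the bias."""
    return w * x + b
```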

We have a network, which does operations on inputs: our Barbies. We also have our data, which models how we want our network to behave: our Kens. So how do we make them kiss?

Training

…is the process by which we adjust w and b until they have values such that our xs produce the ys in our data. This is where I’ll show you how it works, step by step. The companion notebook with the code is right here, in case you want a peek inside the box. But first, let’s get a high-level overview:

  • Step 1: Initialize our parameters, w and b, to random values.

  • Step 2: Predict: For each row in the data, calculate and store y from equation (1), where x is a value from the first column of our data.

  • Step 3: Evaluate our calculated (predicted) ys against the “true” ys in our data.

  • Step 4: Update the network parameters to nudge ourselves towards a more accurate result.

Repeat from Step 2 onward until the predicted ys match the true ys.
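
Before we zoom in on each step, here’s the whole recipe as a Python sketch. A couple of caveats: the two update lines peek ahead to equations (3) and (4), which Step 4 works out below, and the toy data (along with the names xs and ys) is a stand-in of my own for the downloadable file.

```python
import random

# Toy stand-in for the data: 100 points on the (made-up) line y = 3x + 7.
# In practice, load xs and ys from the downloaded file instead.
xs = [random.random() for _ in range(100)]
ys = [3 * x + 7 for x in xs]

# Step 1: initialize the parameters to random values.
w = random.randint(-100, 100)
b = random.randint(-100, 100)

for _ in range(1000):  # repeat Steps 2-4
    preds = [w * x + b for x in xs]              # Step 2: predict via equation (1)
    errors = [p - y for p, y in zip(preds, ys)]  # Step 3: evaluate (equation (2))
    w -= sum(e * x for e, x in zip(errors, xs)) / len(xs)  # Step 4: equation (3)
    b -= sum(errors) / len(errors)                          # Step 4: equation (4)

print(w, b)  # should land close to 3 and 7
```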

Step 1: Parameter Initialization

Easy enough. I got w=25 and b=85.

Step 2: Run it Through

For the first row in our data, x=0.66, so substituting that for x in equation (1), along with the values for w and b from the previous step, we get:

$$y = 25 \cdot 0.66 + 85 = 101.5$$

Great, so now we have our very first guess. Is it good?

Step 3: How did we do?

For the first row in our data, x=0.66, and our network’s output is y=101.46 (which we’ll round to 101.5). Compared to the true y in that row, our network is way off, which is expected. Since we chose our parameters randomly, this first pass is a blindfolded swing. If it were right, that would be bananas.

The next step is to quantify how wrong we are. Watch out, more math coming your way:

$$\text{Error} = y_{\text{observed}} - y_{\text{expected}} \tag{2}$$

Where:

Error is a measure of how wrong the network is, $y_{\text{observed}}$ is our network’s output y (in our case, 101.5), and $y_{\text{expected}}$ is the corresponding “true” y from our data.

Think of this as a way of reprimanding poor performance: a scarlet numeral stitched on your network’s garment. A large positive (or negative) Error indicates a very inaccurate output, and the Error will shrink as the network undergoes training. It stands to reason that we would ideally, for a perfectly-trained network, get an Error of zero.

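Here’s equation (2) as code, continuing with the xs and ys lists from the earlier sketch, so ys[0] plays the role of the “true” y for the first row:

```python
y_observed = 25 * xs[0] + 85  # the network's guess for the first row
error = y_observed - ys[0]    # equation (2): positive means we guessed high
```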

Step 4: Update the Parameters

Having computed the Error, we know how incorrect our network is, but we don’t know how to use that information to improve it. That’s what this step is about.

This is where it starts to get a little hairy. We want to update the parameters in proportion to the Error, inasmuch as they contributed to the Error.

Huh?

You can think of the philosophy of parameter updates as breaking down into two principles:

  1. The greater the Error, the more you adjust the parameters, and

  2. The more an individual parameter contributes to said Error, the more that parameter should be adjusted.

The mathematical technicalities of computing this are based on some complicated calculus, which is, in my opinion, not only extraneous to the scope of this project but a distraction from what we’re trying to figure out here. Instead, we can break our problem down into 2 questions:

  1. What do we want to achieve? We want to minimize the Error.

  2. What are the tools available to us? We can tweak the parameters.

This states the problem as follows: how do we update our parameters in a way that minimizes the Error? In other words, for a parameter, say, w, do we increase or decrease it, and by how much? The answer is as follows:

$$w_{\text{new}} = w - \text{mean}(\text{Error} \cdot x) \tag{3}$$

$$b_{\text{new}} = b - \text{mean}(\text{Error}) \tag{4}$$

Where mean(Error ⋅ x) takes the mean of Error ⋅ x across all the rows in the data.

So, for the first row of our data, per the definition of Error in (2):

$$\text{Error} \cdot x = (101.5 - y_{\text{expected}}) \cdot 0.66$$

…and that’s averaged across all the data, the result of which is subtracted from w. The same goes for the bias, following (4).
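
In Python, one full pass of Step 4 looks like this, again using the xs, ys, w, and b names from the earlier sketch:

```python
preds = [w * x + b for x in xs]              # equation (1), for every row
errors = [p - y for p, y in zip(preds, ys)]  # equation (2), for every row

w = w - sum(e * x for e, x in zip(errors, xs)) / len(xs)  # equation (3)
b = b - sum(errors) / len(errors)                          # equation (4)
```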

Conclusion

And that’s it. Repeat steps 2-4 over and over again until your network learns. How many times? Think of it like real learning: there’s no definitive stopping point. It’s up to you to decide when the lesson’s stuck.
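
If you’d rather let the code decide when the lesson has stuck, one common trick (my own addition, not something the article prescribes) is to stop once the average absolute Error dips below a tolerance, with a cap on the number of passes as a safety net:

```python
for step in range(100_000):  # safety cap on the number of passes
    preds = [w * x + b for x in xs]
    errors = [p - y for p, y in zip(preds, ys)]
    if sum(abs(e) for e in errors) / len(errors) < 1e-9:
        break  # close enough: the lesson has stuck
    w -= sum(e * x for e, x in zip(errors, xs)) / len(xs)
    b -= sum(errors) / len(errors)
```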

I imagine you’re wondering where equations (3) and (4) come from, and why they’re different. To be honest, I agonized a little over how to broach this: I can’t give a proper motivation without going through the calculus, and I can’t give a proper intuition without a long, this-should-be-its-own-article explanation. For the time being, you’ll just have to accept this as reality.

In the long run the moral is this: in machine learning there’s no simple explanation that doesn’t take you down the rabbit hole. That’s as inspiring as it is discouraging. No, there isn’t a hack to understanding these concepts all at once, but if you keep an open mind, and keep pulling at the thread, you’ll be amazed at how much knowledge you can accumulate.
