In general neural networks have a sort of loss like that:

However, The part of the denominator integral is intractable of finding an analytic solution solution in practice. Therefore, we are going to make a distribution approaching the original distribution. KL divergence can be used to indicate the difference between these two distributions.