1991: first neural network distillation [1-3]. Back then I called it "collapsing," not "distilling."

References:

[1] J. Schmidhuber (1991). Neural sequence chunkers. Technical Report FKI-148-91, Tech. Univ. Munich. Sec. 3.2.2 and Sec. 4 describe "collapsing" or "distilling" or "compressing" the knowledge of a neural network into another neural network.

[2] J. Schmidhuber (1992). Learning complex, extended sequences using the principle of history compression. Neural Computation, 4(2):234-242. Based on [1].

[3] J. Schmidhuber (AI Blog, 2021, updated 2025). 1991: First very deep learning with unsupervised pre-training. First neural network distillation.
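For readers unfamiliar with the term: a minimal sketch of the general idea described in [1], training one network to imitate another's outputs so its knowledge is "collapsed" into the second network. This is not the original 1991 chunker/automatizer setup; all layer sizes and names below are illustrative assumptions.

```python
# Minimal distillation sketch (illustrative, not the 1991 architecture):
# a frozen "teacher" network's outputs serve as training targets for a
# smaller "student" network, compressing the teacher's knowledge into it.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical teacher: assume it is already trained, so we freeze it.
teacher = nn.Sequential(nn.Linear(10, 64), nn.Tanh(), nn.Linear(64, 5))
for p in teacher.parameters():
    p.requires_grad_(False)

# Smaller student that should absorb ("collapse") the teacher's knowledge.
student = nn.Sequential(nn.Linear(10, 16), nn.Tanh(), nn.Linear(16, 5))
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for step in range(1000):
    x = torch.randn(32, 10)             # unlabeled inputs
    with torch.no_grad():
        target = teacher(x)             # teacher outputs become targets
    loss = loss_fn(student(x), target)  # student imitates the teacher
    opt.zero_grad()
    loss.backward()
    opt.step()
```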