Neural Compression Limits in Parameter-Efficient AI Foundation Models
Keywords: Neural Compression, Representation Stability, Parameter-Efficient Models

Abstract
Parameter-efficient neural compression techniques are increasingly used to reduce the computational
cost of deploying foundation models in large-scale, real-time inference environments. However,
compression modifies internal representational structures, influencing semantic coherence, reasoning
continuity, and long-term stability. This study evaluates structured pruning, low-rank factorization,
quantization, and sparse expert routing approaches to identify the boundary conditions under which
compression preserves versus degrades latent representation geometry. Results show that low-rank
approximation maintains stable semantic structure and inference robustness, while aggressive pruning
and low-precision quantization introduce cumulative representational drift over time. The findings
highlight that compression must be managed with awareness of representational topology, temporal
workload patterns, and operational reliability requirements, particularly when models are integrated
into interactive or regulated enterprise systems.
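As an illustrative, hedged sketch (not the study's actual evaluation protocol), the short Python listing below shows one way the kind of comparison described above could be instrumented: a toy linear layer is replaced by its truncated-SVD low-rank approximation, and the resulting shift in output geometry is summarized by the mean cosine similarity between the original and compressed activations. The layer dimensions, synthetic inputs, and helper names here are hypothetical.

    # Minimal sketch, assuming a toy linear layer and synthetic inputs
    # rather than a real foundation model.
    import numpy as np

    rng = np.random.default_rng(0)

    def low_rank_approx(W: np.ndarray, rank: int) -> np.ndarray:
        """Best rank-r approximation of W (Frobenius norm) via truncated SVD."""
        U, s, Vt = np.linalg.svd(W, full_matrices=False)
        return (U[:, :rank] * s[:rank]) @ Vt[:rank, :]

    def mean_cosine_similarity(A: np.ndarray, B: np.ndarray) -> float:
        """Average cosine similarity between matched rows of two activation matrices."""
        num = np.sum(A * B, axis=1)
        den = np.linalg.norm(A, axis=1) * np.linalg.norm(B, axis=1) + 1e-12
        return float(np.mean(num / den))

    # Toy layer: 512-dimensional inputs projected to 512 hidden features.
    W = rng.standard_normal((512, 512)) / np.sqrt(512)
    X = rng.standard_normal((1024, 512))        # batch of synthetic inputs
    H_full = X @ W.T                            # uncompressed activations

    for r in (256, 64, 16):
        H_lr = X @ low_rank_approx(W, r).T      # activations after low-rank compression
        sim = mean_cosine_similarity(H_full, H_lr)
        print(f"rank {r:>3}: mean cosine similarity to full-rank activations = {sim:.4f}")

In this sketch, higher retained rank yields activations closer to the uncompressed layer; tracking such a similarity score across ranks is one simple proxy for the boundary between geometry-preserving and geometry-degrading compression.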