Large Model Compression: Using Fisher Information from Low-Rank Representations

Published: 2024-07-18 00:19

Fisher Information

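An earlier post, "20240616 Log: The Large-Model Compression Method DMS", introduced Fisher information in detail. For a given set of observations, the larger the Fisher information, the more accurately the unknown parameter can be estimated.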
$I_w^{\text{def}}=\mathbb{E}\left[\left(\frac{\partial}{\partial w}\log p(\mathcal{D}|w)\right)^2\right]\tag{1}$
However, computing the exact Fisher information is too expensive.

FWSVD: Fisher-Weighted SVD

Instead, an empirical Fisher information $I_w^{\mathrm{emp}}$ is used, given by Eq. (2):
$I_w^{\mathrm{def}}\approx I_w^{\mathrm{emp}}=\frac{1}{|\mathcal{D}|}\sum_{i=1}^{|\mathcal{D}|}\left(\frac{\partial}{\partial w}\mathcal{L}\left(d_i;w\right)\right)^2\tag{2}$
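Eq. (2) can be sketched in a few lines of NumPy: square each sample's gradient elementwise and average over the dataset. The toy linear layer, the squared-error loss, and the names `empirical_fisher`, `W`, `X`, `T` are assumptions made for illustration only; they do not come from the paper.

```python
import numpy as np

def empirical_fisher(per_sample_grads: np.ndarray) -> np.ndarray:
    """Eq. (2): elementwise mean of squared per-sample gradients.

    per_sample_grads has shape (num_samples, *weight_shape).
    """
    return np.mean(per_sample_grads ** 2, axis=0)

# Hypothetical toy setting: a linear layer y = W x with loss
# 0.5 * ||W x - t||^2, whose gradient w.r.t. W for a single sample
# is the outer product (W x - t) x^T.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
X = rng.normal(size=(8, 3))   # 8 input samples
T = rng.normal(size=(8, 4))   # 8 targets

residuals = X @ W.T - T                         # shape (8, 4)
grads = residuals[:, :, None] * X[:, None, :]   # shape (8, 4, 3)
I_emp = empirical_fisher(grads)                 # shape (4, 3), one value per weight
```

Each entry of `I_emp` is non-negative and has the same shape as the weight matrix, so it can act as a per-weight importance score in the objective that follows.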
Let $\hat{w} = \mathrm{SVD}(w)$ denote the low-rank approximation of $w$ obtained by SVD. The objective function is then defined as
$\min_{\text{rank }\hat{w}=r}\|\sqrt{I_w^\text{emp}}*(w-\hat{w})\|^2\tag{3}$
Row-wise weighting is applied to $I_w^{\mathrm{emp}}$ by summing each row:
$\hat{I}_w^{\mathrm{emp}}=\mathrm{diag}\left(I_w^{\mathrm{emp}}\cdot\mathbf{1}\right)\tag{4}$
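This yields the weighted SVD, where $U\Sigma V$ denotes the SVD of the row-weighted matrix $\hat{I}_w^{\mathrm{emp}}w$: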
$\mathrm{FWSVD}(w)\approx\hat{U}\hat{\Sigma}\hat{V}=(\hat{I}_{w}^{\mathrm{emp}})^{-1}U\Sigma V\tag{5}$
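A minimal sketch of Eqs. (4) and (5), assuming the weighting is applied exactly as written above: sum the Fisher matrix over each row, scale the rows of $w$, truncate the SVD, then undo the scaling. The function names `fwsvd` and `plain_svd`, the small epsilon, and the random stand-in for $I_w^{\mathrm{emp}}$ are illustrative assumptions.

```python
import numpy as np

def fwsvd(W, I_emp, rank, eps=1e-8):
    """Row-weighted truncated SVD following Eqs. (4)-(5).

    Returns (A, B, row_weights) with W ≈ A @ B,
    A of shape (d_out, rank) and B of shape (rank, d_in).
    """
    # Eq. (4): diagonal of the weighting matrix is I_emp · 1 (row sums).
    # Assumption: eps keeps the diagonal matrix invertible.
    row_weights = I_emp.sum(axis=1) + eps
    U, S, Vt = np.linalg.svd(row_weights[:, None] * W, full_matrices=False)
    U, S, Vt = U[:, :rank], S[:rank], Vt[:rank]
    A = (U * S) / row_weights[:, None]   # I^{-1} U Σ, Eq. (5)
    B = Vt
    return A, B, row_weights

def plain_svd(W, rank):
    """Unweighted rank-`rank` reconstruction, used as a baseline."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :rank] * S[:rank]) @ Vt[:rank]

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 5))
I_emp = rng.uniform(0.01, 1.0, size=(6, 5))   # random stand-in for Eq. (2)

A, B, w_row = fwsvd(W, I_emp, rank=2)
err_fw = np.linalg.norm(w_row[:, None] * (W - A @ B))
err_plain = np.linalg.norm(w_row[:, None] * (W - plain_svd(W, 2)))
```

Under the row-weighted error that this factorization optimizes directly, `err_fw` is never larger than `err_plain`, because the truncated SVD of the weighted matrix is its best rank-$r$ approximation. Rows with large Fisher weight are reconstructed more faithfully at the expense of rows with small weight.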

LoRA-Enhanced FWSVD Compression

To avoid computing the Fisher information of every individual weight, the following approximation is introduced:
$\hat{I}_w^{\mathrm{emp}}\approx\hat{I}_{\Delta w}^{\mathrm{emp}}=\hat{I}_B^{\mathrm{emp}}\hat{I}_A^{\mathrm{emp}}\tag{6}$
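Here "≈" means that the right-hand term $\hat{I}_{\Delta w}^{\mathrm{emp}}$ is used to approximate the left-hand term $\hat{I}_w^{\mathrm{emp}}$. Note that $\Delta w$ does not denote a rate of change; it is the low-rank LoRA update $\Delta w = BA$. The model is fine-tuned in the LoRA manner, $w+\Delta w$ then replaces the original $w$, and $\hat{I}_{\Delta w}$ is used for compression: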
$\begin{aligned} \mathrm{FWSVD}(w)& \approx\mathrm{FWSVD}(\Delta w) \\ &=\mathrm{SVD}(\hat{I}_{\Delta w}^{\mathrm{emp}}(w+\Delta w)) \\ &=\mathrm{SVD}(\hat{I}_{\Delta w}^{\mathrm{emp}}w) \end{aligned}\tag{7}$
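The sketch below shows one possible reading of Eqs. (6) and (7), not a confirmed implementation of the paper. It assumes that $I_B$ (shape `d_out × r`) and $I_A$ (shape `r × d_in`) are elementwise empirical Fisher estimates of the LoRA factors, and that their matrix product stands in for the full Fisher matrix of $w$; the text above does not spell out these shapes. Under that assumption the row sums needed for the weighting can be computed without ever forming the `d_out × d_in` product. The names `lora_row_weights` and `compress_merged` are hypothetical.

```python
import numpy as np

def lora_row_weights(I_B, I_A):
    """Row sums of I_B @ I_A without materialising the (d_out, d_in) product.

    Assumption: I_B (d_out, r) and I_A (r, d_in) are elementwise empirical
    Fisher estimates of the LoRA factors B and A.
    """
    return I_B @ I_A.sum(axis=1)

def compress_merged(W, B, A, I_B, I_A, rank, eps=1e-8):
    """Merge the LoRA update into W, then apply the row-weighted SVD of Eq. (7)."""
    W_merged = W + B @ A                             # w + Δw replaces w
    weights = lora_row_weights(I_B, I_A) + eps       # eps is an assumption for invertibility
    U, S, Vt = np.linalg.svd(weights[:, None] * W_merged, full_matrices=False)
    left = (U[:, :rank] * S[:rank]) / weights[:, None]
    return left, Vt[:rank]

rng = np.random.default_rng(0)
d_out, d_in, r = 6, 5, 2
W = rng.normal(size=(d_out, d_in))
B = rng.normal(size=(d_out, r)) * 0.1
A = rng.normal(size=(r, d_in)) * 0.1
# Random stand-ins for the squared LoRA gradients averaged over the data.
I_B = rng.uniform(0.01, 1.0, size=(d_out, r))
I_A = rng.uniform(0.01, 1.0, size=(r, d_in))

left, right = compress_merged(W, B, A, I_B, I_A, rank=3)
```

The appeal of this arrangement is memory: only gradients of the small factors `B` and `A` are needed to build the weighting, rather than a gradient for every entry of `W`.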
Reference
[1] Parameter and Memory Efficient Language Model Compression using Fisher Information from Low-Rank Representations. ACL 2024.