大模型入门(四)—— 基于peft 微调 LLaMa模型

avatar
作者
筋斗云
阅读量:7

llama-7b模型大小大约27G,本文在单张/两张 16G V100上基于hugging face的peft库实现了llama-7b的微调。

1、模型和数据准备

使用的大模型:https://huggingface.co/decapoda-research/llama-7b-hf,已经是float16的模型。

微调数据集:https://github.com/LC1332/Chinese-alpaca-lora/blob/main/data/trans_chinese_alpaca_data.json

微调的代码已上传到github:https://github.com/jiangxinyang227/LLM-tuning/tree/master/llama_tuning

2、微调技巧

1)lora微调。float16的模型刚刚好存放在16G的GPU上,没有太多显存用于存放梯度、优化器等参数,因此在这里使用lora微调部分参数。

2)混合精度训练,因为llama-7b有27g,想在单张V100上加载就需要转换成float16才行,而lora参数用的是float32,需要使用混合精度训练。同时混合精度训练也会有所加速。

3)梯度累积,单张gpu在存放完模型参数,lora参数、梯度、优化器等参数之后只剩下很少的显存给到输入输出等中间变量,经测试单张V100的极限大致是batch size=1,sequence length=200,只能使用梯度累积实现mini-batch训练。

4)当有多张卡时,可以使用数据并行、模型并行等方法微调,数据并行只是将模型复制到每张GPU上,因此单张GPU的batch size仍然只能是1,模型并行会将模型均分到每个GPU上,可以增大每张GPU上的batch size,在2张V100上测试了ddp(数据并行)和 基于zero-3 + cpu offload(数据并行+模型并行+CPU)。

3、要注意的代码讲解

3.1 data_helper.py

data_helper.py中主要注意下tokenizer()函数,一是padding是在左边padding,和我们通常的右边padding不太一样;二是labels中的pad_id=-100,因为pytorch中label=-100时不参与loss的计算。

def tokenize(self, prompt, add\_eos\_token=True):         # there's probably a way to do this with the tokenizer settings         # but again, gotta move fast         result = self.tokenizer(             prompt,             truncation\=True,             max\_length\=self.sequence\_len,             padding\=False,             return\_tensors\=None         )         input\_ids, attention\_mask, labels \= \[\], \[\], \[\]         if (             result\["input\_ids"\]\[-1\] != self.eos\_token\_id             and len(result\["input\_ids"\]) < self.sequence\_len             and add\_eos\_token         ):             result\["input\_ids"\].append(self.eos\_token\_id)             result\["attention\_mask"\].append(1)                  pad\_len \= self.sequence\_len - len(result\["input\_ids"\])         if pad\_len <= 0:             input\_ids \= result\["input\_ids"\]\[:self.sequence\_len\]             attention\_mask \= result\["attention\_mask"\]\[:self.sequence\_len\]             labels \= input\_ids.copy()         else:             input\_ids \= \[self.pad\_token\_id\] \* pad\_len + result\["input\_ids"\]             attention\_mask \= \[0\] \* pad\_len + result\["attention\_mask"\]             labels \= \[self.label\_pad\_token\_id\] \* pad\_len + result\["input\_ids"\]                      return input\_ids, attention\_mask, labels 

3.2 metric.py

在指标计算中只实现了准确率,在这里要注意的是生成任务是前n-1个token生成第n个token,因此这里的预测结果和label要做一次不同的移位,即

pred_y = pred_y[:-1]

true_y = true_y[1:]

只要注意这里就好了,剩下的你需要计算什么指标都可以。

def accuracy(pred\_ys, true\_ys, masks):     total \= 0     corr \= 0      for pred\_y, true\_y, mask in zip(pred\_ys, true\_ys, masks):         # 做一层转换,让生成的结果对应上预测的结果,即前n-1个token预测第n个token         pred\_y = pred\_y\[:-1\]         true\_y \= true\_y\[1:\]         mask \= mask\[:-1\]                  for p, t, m in zip(pred\_y, true\_y, mask):             if m == 1:                 total += 1                 if p == t:                     corr += 1          return corr / total if total > 0 else 0 

4、训练方式

4.1 单GPU训练

单GPU训练很好理解,训练的时候只要注意下面的一段代码即可,混合精度训练+梯度累积

                with autocast():                      loss, predictions \= self.model(input\_ids, attention\_mask, labels)                  # 梯度累积训练                 loss /= self.accu\_steps                 # loss.backward()                  # 放大loss,并求梯度                 scaled\_loss = self.scaler.scale(loss)                 scaled\_loss.backward()                  if current\_step % self.accu\_steps == 0:                     # 先将梯度缩放回去,再执行梯度裁剪                     self.scaler.unscale\_(self.optimizer)                      clip\_grad\_norm\_(self.model.parameters(), 1.0)                      self.scaler.step(self.optimizer)                      self.scheduler.step()                     self.scaler.update()                     self.optimizer.zero\_grad() 

4.2 多GPU + DDP训练

DDP训练也是大家最常用的方法,尤其是在模型没那么大的情况下,DDP训练就是主流,就不多赘述,在这里值得注意的是,每个GPU会分担一部分数据,在验证的时候如果需要拿到全部数据的验证结果并输出时,需要通过dist.all_gather 或者 dist.gather的方法将验证集的结果聚合到一块。详细代码见https://github.com/jiangxinyang227/LLM-tuning/blob/master/llama_tuning/lora_ddp/trainer.py

def eval(self):         self.model.eval()         with torch.no\_grad():             eval\_losses \= \[\]             eval\_word\_preds \= \[\]             eval\_word\_labels \= \[\]             eval\_masks \= \[\]             for batch\_data in self.valid\_data\_loader:                 input\_ids \= batch\_data\[0\].cuda()                 attention\_mask \= batch\_data\[1\].cuda()                 labels \= batch\_data\[2\].cuda()                  with autocast():                     loss, predictions \= self.model(input\_ids, attention\_mask, labels)                  # 获取所有gpu上输出的数据                 avg\_loss\_multi\_gpu = reduce\_value(loss, average=True)                 gather\_preds \= \[torch.zeros\_like(predictions, dtype=predictions.dtype) for \_ in range(Config.world\_size)\]                 gather\_labels \= \[torch.zeros\_like(labels, dtype=labels.dtype) for \_ in range(Config.world\_size)\]                 gather\_masks \= \[torch.zeros\_like(attention\_mask, dtype=attention\_mask.dtype) for \_ in range(Config.world\_size)\]                 gather\_value(predictions, gather\_preds)                 gather\_value(labels, gather\_labels)                 gather\_value(attention\_mask, gather\_masks)                  eval\_losses.append(float(avg\_loss\_multi\_gpu))                 for pred, label, mask in zip(gather\_preds, gather\_labels, gather\_masks):                     eval\_word\_preds.extend(pred.tolist())                     eval\_word\_labels.extend(label.tolist())                     eval\_masks.extend(mask.tolist())                          if is\_main\_process():                 acc \= accuracy(pred\_ys=eval\_word\_preds, true\_ys=eval\_word\_labels, masks=eval\_masks)                  logger.info("\\n")                 logger.info("eval: num: {},  loss: {}, acc: {}".format(                     len(eval\_word\_preds), mean(eval\_losses), acc))                 logger.info("\\n") 

4.3 deepspeed的zero-3 + cpu offload

在这里使用的是hugging face的accelerate库中的deepspeed方法,zero-3会将模型、梯度、优化器参数都分割到不同的GPU,并且使用cpu offload将一些中间变量放到cpu上,经实测使用两张GPU时,每张GPU的使用大概5个G多一点,单张卡的batch size可以设置到8,但是在实际训练过程中速度比DDP还要慢一点,这里的原因还是因为模型并行、CPU offload等带来了大量的通信工作,所以单张gpu能存放一整个模型时还是首推DDP。

使用accelerate中的deepspeed时,首先要通过accelerate config这个命令互动式配置训练参数,以下是我在配置时选择的参数

在使用deepspeed时可以通过json文件去配置其他参数,accelerate config只配置一些通用参数。zero-3 + cpu offload的json文件如下,配置的时候有几个参数(如allgather_bucket_size 和 reduce_bucket_size)要设小一点,不然显存会爆掉,默认的值会比较大,主要是V100太小了。

{     "fp16": {         "enabled": true,         "loss\_scale": 0,         "loss\_scale\_window": 1000,         "initial\_scale\_power": 16,         "hysteresis": 2,         "min\_loss\_scale": 1     },     "optimizer": {         "type": "AdamW",         "params": {             "lr": 3e-4,             "weight\_decay": 0.0         }     },     "scheduler": {         "type": "WarmupDecayLR",         "params": {             "warmup\_min\_lr": "auto",             "warmup\_max\_lr": "auto",             "warmup\_num\_steps": "auto",             "total\_num\_steps": "auto"         }     },     "zero\_optimization": {         "stage": 3,         "offload\_optimizer": {             "device": "cpu",             "pin\_memory": true         },         "offload\_param": {             "device": "cpu",             "pin\_memory": true         },         "overlap\_comm": true,         "contiguous\_gradients": true,         "allgather\_bucket\_size": 1e6,  # 参数要小,不然容易内存爆掉         "reduce\_bucket\_size": 1e6,  # 参数要小,不然容易内存爆掉         "stage3\_prefetch\_bucket\_size": 1e6,  # 参数要小,不然容易内存爆掉         "stage3\_param\_persistence\_threshold": 1e6,  # 参数要小,不然容易内存爆掉         "sub\_group\_size": 1e9,         "stage3\_max\_live\_parameters": 1e9,         "stage3\_max\_reuse\_distance": 1e9,         "stage3\_gather\_16bit\_weights\_on\_model\_save": true     },     "gradient\_accumulation\_steps": 1,     "gradient\_clipping": 1.0,     "steps\_per\_print": 2000,     "train\_batch\_size": "auto",     "train\_micro\_batch\_size\_per\_gpu": "auto",     "wall\_clock\_breakdown": false } 

在使用的时候有一个问题一直没有解决,保存模型时,保存完之后会出现GPU1掉线的情况,所以在这里将保存模型放在整个训练结束后保存,这个问题还没找到解决的办法,有知道怎么解的还麻烦指导下。

如果在运行时报这样的错误的话:

Traceback (most recent call last):   File "/mnt/workspace/project/llm/local\_proj/chatglm\_tune/lora\_deepspeed/trainer.py", line 271, in <module>     main()   File "/mnt/workspace/project/llm/local\_proj/chatglm\_tune/lora\_deepspeed/trainer.py", line 265, in main     trainer \= Trainer()   File "/mnt/workspace/project/llm/local\_proj/chatglm\_tune/lora\_deepspeed/trainer.py", line 93, in \_\_init\_\_     self.model, self.optimizer, self.train\_data\_loader, self.valid\_data\_loader, self.scheduler \= self.accelerator.prepare(   File "/home/pai/lib/python3.9/site-packages/accelerate/accelerator.py", line 1118, in prepare     result \= self.\_prepare\_deepspeed(\*args)   File "/home/pai/lib/python3.9/site-packages/accelerate/accelerator.py", line 1415, in \_prepare\_deepspeed     engine, optimizer, \_, lr\_scheduler \= deepspeed.initialize(\*\*kwargs)   File "/home/pai/lib/python3.9/site-packages/deepspeed/\_\_init\_\_.py", line 165, in initialize     engine \= DeepSpeedEngine(args=args,   File "/home/pai/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 308, in \_\_init\_\_     self.\_configure\_optimizer(optimizer, model\_parameters)   File "/home/pai/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1173, in \_configure\_optimizer     self.optimizer \= self.\_configure\_zero\_optimizer(basic\_optimizer)   File "/home/pai/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1463, in \_configure\_zero\_optimizer     optimizer \= DeepSpeedZeroOptimizer\_Stage3(   File "/home/pai/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 298, in \_\_init\_\_     largest\_partitioned\_param\_numel \= max(\[   File "/home/pai/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 299, in <listcomp>     max(\[max(tensor.numel(), tensor.ds\_numel) for tensor in fp16\_partitioned\_group\]) ValueError: max() arg is an empty sequence 

具体原因不知道为什么会导致这样,可以进入到/home/pai/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py(具体的路径看报错的日志)文件中,将

largest\_partitioned\_param\_numel = max(\[             max(\[max(tensor.numel(), tensor.ds\_numel) for tensor in fp16\_partitioned\_group\])             for fp16\_partitioned\_group in self.fp16\_partitioned\_groups         \]) 

改成

largest\_partitioned\_param\_numel = max(\[             max(\[max(tensor.numel(), tensor.ds\_numel) for tensor in fp16\_partitioned\_group\])             for fp16\_partitioned\_group in self.fp16\_partitioned\_groups if len (fp16\_partitioned\_group) > 0         \]) 

即可运行。

广告一刻

为您即时展示最新活动产品广告消息,让您随时掌握产品活动新动态!