for i inrange(2): print(instruct_tune_dataset['train']['prompt'][i]) print('---'*3)
数据预处理
对于Mixtral模型,数据集需要按照以下格式准备:
1 2 3 4 5
<s>[INST] Use the provided input to create an instruction that could have been used to generate the response with an LLM.
{input} [/INST]
{response}</s>
而我们已有的数据集长这样:
1
instruct_tune_dataset["train"][0]
1 2 3
{'prompt': 'Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction\nWhat are different types of grass?\n\n### Response\n', 'response': 'There are more than 12,000 species of grass. The most common is Kentucky Bluegrass, because it grows quickly, easily, and is soft to the touch. Rygrass is shiny and bright green colored. Fescues are dark green and shiny. Bermuda grass is harder but can grow in drier soil.', 'source': 'dolly_hhrlhf'}
定义一个处理的函数:
1 2 3 4 5 6 7 8 9 10
defcreate_prompt(sample): bos_token = "<s>" original_system_message = "Below is an instruction that describes a task. Write a response that appropriately completes the request." system_message = "[INST]Use the provided input to create an instruction that could have been used to generate the response with an LLM." response = sample["prompt"].replace(original_system_message, "").replace("\n\n### Instruction\n", "").replace("\n### Response\n", "").strip() input = sample["response"] eos_token = "</s>" full_prompt = bos_token + system_message + "\n" + input + "[/INST]" + response + eos_token
return {"full_prompt": full_prompt}
拿之前的数据测试一下:
1
create_prompt(instruct_tune_dataset["train"][0])
1
'<s>[INST]Use the provided input to create an instruction that could have been used to generate the response with an LLM.\nThere are more than 12,000 species of grass. The most common is Kentucky Bluegrass, because it grows quickly, easily, and is soft to the touch. Rygrass is shiny and bright green colored. Fescues are dark green and shiny. Bermuda grass is harder but can grow in drier soil.[/INST]What are different types of grass?</s>'
prompt="""[INST]Use the provided input to create an instruction that could have been used to generate the response with an LLM. \nThere are more than 12,000 species of grass. The most common is Kentucky Bluegrass, because it grows quickly, easily, and is soft to the touch. Rygrass is shiny and bright green colored. Fescues are dark green and shiny. Bermuda grass is harder but can grow in drier soil.[\INST]"""
model = prepare_model_for_kbit_training(model) # 用来使得模型能够训练在4Bits精度 model = get_peft_model(model, peft_config)
打印一下可训练参数数量。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
defprint_trainable_parameters(model): """ Prints the number of trainable parameters in the model. """ trainable_params = 0 all_param = 0 for _, param in model.named_parameters(): all_param += param.numel() if param.requires_grad: trainable_params += param.numel() print( f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}" )
print_trainable_parameters(model)
1
trainable params: 56836096 || all params: 23539437568 || trainable%: 0.24145052674182907
设置训练超参数
还可以设置一些训练超参数:
num_train_epochs/max_steps: 数据迭代次数,如果过高会造成过拟合。
1 2 3 4
if torch.cuda.device_count() > 1: # If more than 1 GPU print(torch.cuda.device_count()) model.is_parallelizable = True model.model_parallel = True
prompt = "[INST]Use the provided input to create an instruction that could have been used to generate the response with an LLM.\nThere are more than 12,000 species of grass. The most common is Kentucky Bluegrass, because it grows quickly, easily, and is soft to the touch. Rygrass is shiny and bright green colored. Fescues are dark green and shiny. Bermuda grass is harder but can grow in drier soil.[/INST]"