Artificial Intelligence
Optimizing Memory for Large Language Model Inference and Fine-Tuning
Large language models (LLMs) such as GPT-4, Bloom, and LLaMA have achieved remarkable capabilities by scaling up to billions of parameters. However, deploying these massive models for inference or fine-tuning is challenging because of their enormous memory requirements. In this technical blog, we explore techniques for estimating and optimizing memory consumption during LLM inference and fine-tuning across a variety of hardware setups.
Understanding the Memory Requirements
The memory needed to load an LLM is determined primarily by the number of parameters and the numerical precision used to store them. A simple rule of thumb is:
- Loading a model with X billion parameters requires roughly 4X GB of VRAM in 32-bit float precision
- Loading a model with X billion parameters requires roughly 2X GB of VRAM in 16-bit bfloat16/float16 precision
For example, loading the 175B-parameter GPT-3 model would require around 350GB of VRAM in bfloat16 precision. As of today, the largest commercially available GPUs, such as the NVIDIA A100 and H100, offer only 80GB of VRAM, which makes tensor parallelism and model parallelism techniques necessary.
During inference, the memory footprint is dominated by the model parameters and the temporary activation tensors produced along the way. A high-level estimate of the peak memory usage during inference is the sum of the memory needed to load the model parameters and the memory for the activations.
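The rule of thumb above can be sketched in a few lines of plain Python (the helper name is ours, chosen for illustration):

```python
def load_vram_gb(params_billions: float, bytes_per_param: int) -> float:
    """Rough VRAM (in GB) needed just to load the model weights.

    X billion parameters * bytes per parameter ~= X * bytes GB,
    since a billion bytes is approximately one gigabyte.
    """
    return params_billions * bytes_per_param

# 32-bit float: 4 bytes/param -> ~4X GB; bfloat16: 2 bytes/param -> ~2X GB
print(load_vram_gb(175, 4))  # GPT-3 175B in float32: ~700 GB
print(load_vram_gb(175, 2))  # GPT-3 175B in bfloat16: ~350 GB
print(load_vram_gb(15, 2))   # a 15B model in bfloat16: ~30 GB
```

Note that this covers only the weights; activations add on top of it, which is why measured peaks land slightly above or below these round numbers.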
Calculating Inference Memory
Let's quantify the inference memory requirements using the OctoCoder model, which has around 15 billion parameters in bfloat16 format (~31GB). We'll use the Transformers library to load the model and generate text:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch

model = AutoModelForCausalLM.from_pretrained(
    "bigcode/octocoder", torch_dtype=torch.bfloat16, device_map="auto", pad_token_id=0
)
tokenizer = AutoTokenizer.from_pretrained("bigcode/octocoder")
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

prompt = "Question: Please write a Python function to convert bytes to gigabytes.\n\nAnswer:"
result = pipe(prompt, max_new_tokens=60)[0]["generated_text"][len(prompt):]

def bytes_to_gigabytes(num_bytes):
    return num_bytes / 1024 / 1024 / 1024

bytes_to_gigabytes(torch.cuda.max_memory_allocated())
```
Output:
29.0260648727417
The peak GPU memory usage is close to 29GB, which is in line with our estimate of ~31GB for loading the model parameters in bfloat16 format.
Optimizing Inference Memory with Quantization
While bfloat16 is the standard precision used for training LLMs, researchers have found that quantizing the model weights to lower-precision data types such as 8-bit integers (int8) or 4-bit integers can significantly reduce memory usage with minimal accuracy loss on inference tasks such as text generation.
Let's look at the memory savings from 8-bit and 4-bit quantization of the OctoCoder model:
```python
# 8-bit quantization
model = AutoModelForCausalLM.from_pretrained(
    "bigcode/octocoder", load_in_8bit=True, pad_token_id=0
)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
result = pipe(prompt, max_new_tokens=60)[0]["generated_text"][len(prompt):]
bytes_to_gigabytes(torch.cuda.max_memory_allocated())
```
Output:
15.219234466552734
```python
# 4-bit quantization
model = AutoModelForCausalLM.from_pretrained(
    "bigcode/octocoder", load_in_4bit=True, low_cpu_mem_usage=True, pad_token_id=0
)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
result = pipe(prompt, max_new_tokens=60)[0]["generated_text"][len(prompt):]
bytes_to_gigabytes(torch.cuda.max_memory_allocated())
```
Output:
9.543574333190918
With 8-bit quantization, the memory requirement drops from 31GB to 15GB, while 4-bit quantization shrinks it further to just 9.5GB! This makes it possible to run the 15B-parameter OctoCoder model on consumer GPUs such as the RTX 3090 (24GB VRAM).
However, note that more aggressive quantization such as 4-bit can sometimes lead to accuracy degradation compared to 8-bit or bfloat16 precision. There is a trade-off between memory savings and accuracy that users should evaluate for their use case.
Quantization is a powerful technique that can enable LLM deployment in resource-constrained environments such as cloud instances, edge devices, or even mobile phones by drastically shrinking the memory footprint.
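To put those savings in perspective, here is a back-of-the-envelope sketch (our own helper, not part of any library) of how weight storage scales with precision:

```python
def weight_memory_gb(params_billions: float, bits_per_param: int) -> float:
    """Approximate weight storage in GB: params * (bits / 8) bytes each."""
    return params_billions * bits_per_param / 8

for bits in (16, 8, 4):
    print(f"15B model at {bits}-bit: ~{weight_memory_gb(15, bits):.1f} GB of weights")
# The measured peaks above (29 / 15.2 / 9.5 GB) track these estimates, with
# the difference coming from activations and quantization overhead.
```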
Estimating Memory for Fine-Tuning
While quantization is primarily used for efficient inference, techniques such as tensor parallelism and model parallelism are crucial for managing memory requirements during training or fine-tuning of large language models.
Peak memory usage during fine-tuning is typically 3-4x higher than during inference because of the additional memory required for:
- Gradients
- Optimizer states
- Activations saved from the forward pass for backpropagation
A conservative estimate is that fine-tuning an LLM with X billion parameters requires around 4 * (2X) = 8X GB of VRAM in bfloat16 precision.
For example, fine-tuning the 7B-parameter LLaMA model would require approximately 7 * 8 = 56GB of VRAM per GPU in bfloat16 precision. This exceeds the memory capacity of current GPUs, making distributed fine-tuning techniques essential.
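The conservative 8X rule can be sketched as follows (the helper name is ours):

```python
def finetune_vram_gb(params_billions: float) -> float:
    """Conservative peak VRAM estimate for full fine-tuning in bfloat16.

    The weights take ~2X GB; gradients, optimizer states, and saved
    activations multiply that by roughly 4, giving ~8X GB total.
    """
    weights_gb = 2 * params_billions   # bfloat16 weights: 2 bytes/param
    return 4 * weights_gb              # ~4x overhead for training state

print(finetune_vram_gb(7))    # LLaMA 7B: ~56 GB
print(finetune_vram_gb(15))   # OctoCoder 15B: ~120 GB
```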
Distributed Fine-Tuning Techniques
Several distributed fine-tuning methods have been proposed to overcome the GPU memory constraints of large models:
- Data Parallelism: The standard data-parallel approach replicates the entire model across multiple GPUs while splitting the batches of training data among them. This reduces training time linearly with the number of GPUs but does not reduce the peak memory requirement on each GPU.
- ZeRO Stage 3: An advanced form of data parallelism that partitions the model parameters, gradients, and optimizer states across all GPUs. It reduces memory compared to classic data parallelism by keeping only the partitioned data each GPU needs during the different stages of training.
- Tensor Parallelism: Instead of replicating the model, tensor parallelism splits the model's parameter tensors into rows or columns and distributes them across GPUs. Each GPU operates on its own slice of the parameters, gradients, and optimizer states, leading to substantial memory savings.
- Pipeline Parallelism: This technique partitions the model's layers across different GPUs/workers, with each device executing a subset of the layers. Activations flow between workers, reducing peak memory but increasing communication overhead.
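A highly simplified per-GPU comparison of the approaches above can be sketched as follows. This is our own toy model, not a real estimator: it ignores activations and communication buffers, and it assumes the total training state is the 8X GB estimate from earlier.

```python
def per_gpu_memory_gb(params_billions: float, n_gpus: int, strategy: str) -> float:
    """Toy per-GPU memory model for distributed fine-tuning in bfloat16.

    total = weights + gradients + optimizer states (~8X GB);
    activations and communication buffers are deliberately ignored.
    """
    total_gb = 8 * params_billions
    if strategy == "data_parallel":
        # Full replica on every GPU: no per-GPU memory savings
        return total_gb
    if strategy in ("zero3", "tensor_parallel"):
        # Parameters, gradients, and optimizer states sharded across GPUs
        return total_gb / n_gpus
    if strategy == "pipeline_parallel":
        # Layers split across GPUs: roughly 1/n of the model per device
        return total_gb / n_gpus
    raise ValueError(f"unknown strategy: {strategy}")

for s in ("data_parallel", "zero3", "tensor_parallel", "pipeline_parallel"):
    print(s, per_gpu_memory_gb(7, 4, s))   # LLaMA 7B across 4 GPUs
```

In practice the real savings differ between ZeRO-3, tensor, and pipeline parallelism, which is exactly why accurate estimation is hard, as the next paragraph explains.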
Estimating the memory usage of these distributed methods is non-trivial, since the distribution of parameters, gradients, activations, and optimizer states varies across techniques. Moreover, different components such as the transformer body and the language modeling head may exhibit different memory allocation behaviors.
The LLMem Solution
Researchers recently proposed LLMem, a solution that accurately estimates GPU memory usage when applying distributed fine-tuning methods to LLMs across multiple GPUs.
LLMem accounts for factors such as recombining parameters before computation (ZeRO Stage 3), output gathering in the backward pass (tensor parallelism), and the different memory allocation strategies used for the transformer body and the language modeling head.
Experimental results show that LLMem estimates peak GPU memory usage for fine-tuning LLMs on a single GPU with error rates of up to 1.6%, outperforming the state-of-the-art DNNMem's average error rate of 42.6%. When applying distributed fine-tuning methods to LLMs with more than a billion parameters on multiple GPUs, LLMem achieves an average error rate of 3.0%.