C. Gallicchio, A. Micheli, L. Pedrelli (2017). Deep reservoir computing: A critical experimental analysis. Neurocomputing, 268.
F. Shah (2021). Sign language recognition using multiple kernel learning: a case study of Pakistan sign language. IEEE Access, 9.
W. Maass, H. Markram (2004). On the computational power of circuits of spiking neurons. J. Comput. Syst. Sci., 69.
C. Lugaresi (2019). MediaPipe: a framework for perceiving and processing reality.
Bharatram Natarajan, Rajalakshmi E, R. Elakkiya, K. Kotecha, Ajith Abraham, L. Gabralla, V. Subramaniyaswamy (2022). Development of an End-to-End Deep Learning Framework for Sign Language Recognition, Translation, and Video Generation. IEEE Access, 10.
W. Maass, T. Natschläger, H. Markram (2002). Real-Time Computing Without Stable States: A New Framework for Neural Computation Based on Perturbations. Neural Computation, 14.
H. Jaeger (2001). The "echo state" approach to analysing and training recurrent neural networks.
Zhe Cao, Gines Hidalgo, T. Simon, S. Wei, Yaser Sheikh (2018). OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43.
Razieh Rastgoo, K. Kiani, Sergio Escalera (2021). Sign Language Recognition: A Deep Survey. Expert Syst. Appl., 164.
M. Lukoševičius, H. Jaeger (2009). Reservoir computing approaches to recurrent neural network training. Comput. Sci. Rev., 3.
Natsuki Takayama, Gibran Benitez-Garcia, Hiroki Takahashi (2021). Masked Batch Normalization to Improve Tracking-Based Sign Language Recognition Using Graph Convolutional Networks. 2021 16th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2021).
Qianli Ma, Lifeng Shen, G. Cottrell (2020). DeePr-ESN: A deep projection-encoding echo-state network. Inf. Sci., 511.
H. Hua, Yutong Li, Tonghe Wang, Nanqing Dong, Wei Li, Junwei Cao (2022). Edge Computing with Artificial Intelligence: A Machine Learning Perspective. ACM Computing Surveys, 55.
Yugam Bajaj, P. Malhotra (2020). American Sign Language Identification Using Hand Trackpoint Analysis. ArXiv, abs/2010.10590.
Masanao Yasumuro, K. Jin'no (2022). Japanese fingerspelling identification by using MediaPipe. Nonlinear Theory and Its Applications, IEICE.
Kanta Yoshioka, Yuichi Katori, Yuichiro Tanaka, O. Nomura, T. Morie, H. Tamukoh (2023). FPGA Implementation of a Chaotic Boltzmann Machine Annealer. 2023 International Joint Conference on Neural Networks (IJCNN).
Montaser Abdelsattar, Mohamed Ismeil, Karim Menoufi, Ahmed Abdelmoety, Ahmed Emad-Eldeen (2025). Evaluating Machine Learning and Deep Learning models for predicting Wind Turbine power output from environmental factors. PLOS ONE, 20.
Yuichiro Tanaka, T. Morie, H. Tamukoh (2020). An Amygdala-Inspired Classical Conditioning Model Implemented on an FPGA for Home Service Robots. IEEE Access, 8.
(2019). Optuna.
S. Kamal, Yidong Chen, Shaozi Li, X. Shi, Jiangbin Zheng (2019). Technical Approaches to Chinese Sign Language Processing: A Review. IEEE Access, 7.
B. Subramanian (2022). Fast convergence GRU model for sign language recognition. J Korea Multimedia Soc, 25.
Muhammad Al-Qurishi, Thariq Khalid, R. Souissi (2021). Deep Learning for Sign Language Recognition: Current Techniques, Benchmarks, and Open Issues. IEEE Access, 9.
M. Lukoševičius, H. Jaeger, B. Schrauwen (2012). Reservoir Computing Trends. KI - Künstliche Intelligenz, 26.
Gerges Samaan, Abanoub Wadie, Abanoub Attia, Abanoub Asaad, Andrew Kamel, Salwa Slim, Mohamed Abdallah, Young-Im Cho (2022). MediaPipe's Landmarks with RNN for Dynamic Sign Language Recognition. Electronics.
Yuichiro Tanaka, H. Tamukoh (2022). Reservoir-based convolution. Nonlinear Theory and Its Applications, IEICE.
Yunus Bilge, R. Cinbis, Nazli Ikizler-Cinbis (2022). Towards Zero-Shot Sign Language Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45.
Nehal Attia, Mohamed Ahmed, M. Alshewimy (2023). Efficient deep learning models based on tension techniques for sign language recognition. Intell. Syst. Appl., 20.
Montaser Abdelsattar, Ahmed Abdelmoety, M. Ismeil, Ahmed Emad-Eldeen (2025). Automated Defect Detection in Solar Cell Images Using Deep Learning Algorithms. IEEE Access, 13.
F. Bianchi, Simone Scardapane, Sigurd Løkse, R. Jenssen (2018). Reservoir Computing Approaches for Representation and Classification of Multivariate Time Series. IEEE Transactions on Neural Networks and Learning Systems, 32.
K. Honda, H. Tamukoh (2020). A Hardware-Oriented Echo State Network and its FPGA Implementation. J. Robotics Netw. Artif. Life, 7.
Dongxu Li, Cristian Rodriguez-Opazo, Xin Yu, Hongdong Li (2019). Word-level Deep Sign Language Recognition from Video: A New Large-scale Dataset and Methods Comparison. 2020 IEEE Winter Conference on Applications of Computer Vision (WACV).
G. Tanaka, T. Yamane, J. Héroux, R. Nakane, Naoki Kanazawa, Seiji Takeda, H. Numata, D. Nakano, A. Hirose (2018). Recent Advances in Physical Reservoir Computing: A Review. Neural Networks, 115.
Ziqiang Li, Yun Liu, G. Tanaka (2023). Multi-Reservoir Echo State Networks with Hodrick-Prescott Filter for nonlinear time-series prediction. Appl. Soft Comput., 135.
Hamzah Luqman (2022). An Efficient Two-Stream Network for Isolated Sign Language Recognition Using Accumulative Video Motion. IEEE Access, 10.
Ichiro Kawashima, Yuichi Katori, T. Morie, H. Tamukoh (2021). An area-efficient multiply-accumulation architecture and implementations for time-domain neural processing. 2021 International Conference on Field-Programmable Technology (ICFPT).
Y. Usami, Bram Ven, Dilu Mathew, Tao Chen, T. Kotooka, Yuya Kawashima, Yuichiro Tanaka, Y. Otsuka, H. Ohoyama, H. Tamukoh, Hirofumi Tanaka, W. Wiel, T. Matsumoto (2021). In-Materio Reservoir Computing in a Sulfonated Polyaniline Network. Advanced Materials, 33.
Kanta Yoshioka, Yuichiro Tanaka, H. Tamukoh (2023). LUTNet-RC: Look-Up Tables Networks for Reservoir Computing on an FPGA. 2023 International Conference on Field Programmable Technology (ICFPT).
Francesco Martinuzzi, Chris Rackauckas, Anas Abdelrehim, M. Mahecha, Karin Mora (2022). ReservoirComputing.jl: An Efficient and Modular Library for Reservoir Computing Models. ArXiv, abs/2204.05117.
Ziqiang Li, G. Tanaka (2021). Multi-reservoir echo state networks with sequence resampling for nonlinear time-series prediction. Neurocomputing, 467.
B. Subramanian, Bekhzod Olimov, Shraddha Naik, Sangchul Kim, Kil-Houm Park, Jeonghong Kim (2022). An integrated mediapipe-optimized GRU model for Indian sign language recognition. Scientific Reports, 12.
Sign language recognition (SLR) has the potential to bridge communication gaps and empower hearing-impaired communities. To ensure the portability and accessibility of the SLR system, its implementation on a portable, server-independent device becomes imperative. This approach facilitates usage in areas without internet connectivity and addresses the need for data privacy protection. Although deep neural network models are potent, their efficacy is hindered by computational constraints on edge devices. This study delves into reservoir computing (RC), which is renowned for its edge-friendly characteristics. By leveraging RC, our objective is to craft a cost-effective SLR system optimized for operation on edge devices with limited resources. To enhance the recognition capabilities of RC, we introduce multiple reservoirs with distinct leak rates, extracting diverse features from input videos. Prior to feeding sign language videos into the RC, we employ preprocessing via MediaPipe. This step involves extracting the coordinates of the signer's body and hand locations, referred to as keypoints, and normalizing their spatial positions. This combined approach, which incorporates keypoint extraction via MediaPipe and normalization during preprocessing, enhances the SLR system's robustness against complex background effects and varying signer positions. Experimental results demonstrate that the integration of MediaPipe and multiple reservoirs yields competitive outcomes compared with deep recurrent neural and echo state networks and promises significantly lower training times. Our proposed MRC achieved accuracies of 60.35%, 84.65%, and 91.51% for the top-1, top-5, and top-10, respectively, on the WLASL100 dataset, outperforming the deep learning-based approaches Pose-TGCN and Pose-GRU. Furthermore, because of the RC characteristics, the training time was shortened to 52.7 s, compared with 20 h for I3D, with a competitive inference time.

OPEN ACCESS

Citation: Syulistyo AR, Tanaka Y, Pramanta D, Fuengfusin N, Tamukoh H (2025) Low-cost computation for isolated sign language video recognition with multiple reservoir computing. PLoS One 20(7): e0322717. https://doi.org/10.1371/journal.pone.0322717

Editor: Fahd Saeed Alakbari, Universiti Teknologi Petronas: Universiti Teknologi, MALAYSIA

Received: November 8, 2024; Accepted: March 26, 2025; Published: July 30, 2025

Copyright: © 2025 Syulistyo et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data availability statement: The data underlying the results presented in the study are available from https://dxli94.github.io/WLASL/ or by contacting [email protected] for further assistance.

Funding: JST ALCA-Next (https://www.jst.go.jp/alca/en/index.html): (a) JPMJAN23F3 = Prof. Hakaru Tamukoh (https://researchmap.jp/read0109207?lang=en). JSPS KAKENHI (https://www.jsps.go.jp/english/e-grants/): (a) 23K28158, 23K18495 = Prof. Hakaru Tamukoh (https://researchmap.jp/read0109207?lang=en), (b) 23K28158, 22K17968 = Assoc. Prof. Yuichiro Tanaka (https://researchmap.jp/tanaka-yuichiro), (c) 23K28158 = Dinda Pramanta (https://researchmap.jp/read030909?lang=en). The funders did not participate in the research. This paper is supported by the NEDO project; the principal investigator (Prof. Takashi Morie, https://hyokadb02.jimu.kyutech.ac.jp/html/339_en.html) is not directly related to this paper, but the co-investigators (Prof. Hakaru Tamukoh and Assoc. Prof. Yuichiro Tanaka) contribute to this paper. The New Energy and Industrial Technology Development Organization (https://www.nedo.go.jp/english/): Grant number JPNP16007. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Language serves as a vital means of communication, each language having its own syntax and grammar [1]. Sign language, which is utilized by individuals with hearing impairments, presents a unique linguistic form. The World Health Organization (WHO) estimates that, as of 2021, 430 million people grapple with deafness [2]. Deafness extends its impact across various facets, including education, employment, social dynamics, loneliness, and stigma. Despite the universal right to equal opportunities, global disparities persist, notably in education. Communication barriers, especially for those reliant on sign language, contribute to this inequality.

Challenges arise when individuals using sign language attempt to communicate with those unfamiliar with it, hindering the smooth exchange of information [3]. Advanced technologies offer a potential solution, bridging the communication gap between hearing-impaired individuals and others. A pivotal tool in this regard is a sign language recognition (SLR) system, which processes inputs to recognize specific labels [4-6]. This study aims to develop a model requiring modest computational resources for integration into edge devices. The implementation of SLR in edge computing offers advantages such as portability, enhanced data privacy, reduced transmission costs, and usability in areas lacking internet connectivity [7].

SLR research falls into two primary categories [6]: continuous SLR, which recognizes one or more labels in a continuous stream input, and isolated SLR, which identifies one sign at a time. This study specifically targets isolated SLR with low computational resource requirements. SLR is also categorized by input type, distinguishing between vision-based, sensor-based, and hybrid approaches [3, 5, 8]. Vision-based input involves image or video acquisition for processing the signer's pose information. Sensor-based methods utilize wearable sensors to capture hand gestures and their positions. Hybrid approaches integrate vision-based cameras and various sensors, such as depth camera sensors. Given the user-friendly nature of vision-based approaches, particularly the minimal restraint imposed on users compared with sensor-based methods, SLR researchers predominantly emphasize vision-based systems. Calibration challenges between vision-based modalities and wearable sensors, as encountered in hybrid systems, can be particularly intricate. Considering the advantages of the vision-based approach and previous studies, this study concentrates on a vision-based methodology, employing videos as input. Employing an empirical method, the SLR function uses a camera to capture signer movements, subsequently processing them further through a classification algorithm.
The domain of SLR presents a multitude of challenges, encompassing disparate video lengths, analogous gestures affiliated with distinct labels, variations in gestures within the same label [9], and the imperative of real-time SLR [8]. Noteworthy endeavors have been undertaken by scholars, including Li et al. [9], who proposed a sizable American Sign Language video dataset, thereby contributing a publicly accessible repository. On a parallel trajectory, Subramanian et al. [10] devised a streamlined approach by developing a minimized gated recurrent unit (GRU) model. This innovative model not only expedites convergence but also mitigates the computational overhead associated with the conventional GRU. Extending their contributions, Subramanian et al. [11] suggested the fusion of MediaPipe [12] with an optimized GRU architecture, ensuring efficient information processing. MediaPipe, an instrument created by Google, serves the purpose of constructing efficient on-device machine learning pipelines tailored for the processing of video, image, text, and audio.

The application of deep learning in SLR has been frequent owing to its inherent ability to classify both spatial and temporal features accurately. The deep learning systems applied include the pose-based temporal graph convolution network (Pose-TGCN) [9], pose-gated recurrent unit (Pose-GRU) [9], inflated 3D ConvNet (I3D) [9], and MediaPipe Optimized GRU (MOPGRU) [10]. Recent studies have proposed utilizing deep neural networks (DNNs) in SLR systems. However, DNNs possess intricate architectures that heavily depend on GPUs, posing challenges for their implementation on edge devices [7]; they require a significant amount of computation [13], which can lead to increased power consumption and latency. Additionally, DNNs typically require long training times, which can delay model updates [14]. To overcome these challenges, an alternative approach involving RC has been suggested [12, 15-17]. RC, known for its suitability for low-cost real-time computation, holds promise for the development of machine learning hardware devices [18-21]. It is essential to underscore RC's proficiency in classifying temporal features relevant to this area and its ability to handle multivariate features [22]. Furthermore, the hypothesis posited by Li and Tanaka suggests that the enrichment of feature representations extracted from the input can lead to improved accuracy [23]. In the context of this study, we propose the integration of multiple reservoir-based RC (MRC) with MediaPipe for SLR. Compared with conventional RC, MRC attains a more comprehensive feature representation, employing distinct leak rates within each reservoir to enhance learning from video input. The proposed method processes temporal input data, specifically hand and body keypoints extracted by MediaPipe from the input videos. A distinctive contribution of this study lies in the integration of MediaPipe with MRC, an aspect that has not been explored in previous studies on SLR employing echo state network (ESN)-based methods.

The primary contributions of this study are as follows:
• To the best of our knowledge, this study is the first to employ RC for the task of SLR, offering a novel approach to this domain.
• We introduce an RC-based framework that demonstrates performance comparable to that of existing deep learning methods while substantially reducing the computational training time.
• The implementation is made publicly available as open-source code at https://github.com/tamukohlaboratory/MultipleReservoirComputing-MRC, promoting transparency and facilitating further research in the field.
The remainder of this paper is structured as follows: Section 2 provides an overview of related work in SLR. Section 3 elucidates the concept of RC. Section 4 gives a comprehensive account of the research methodology, encompassing the utilized data and an in-depth exposition of the proposed method. Sections 5, 6, and 7 present the experimental results, discuss the results, and draw conclusions, respectively.

Related work

The advancement of machine learning and deep learning algorithms has yielded promising results in SLR. Several studies have been conducted to solve the problem of isolated SLR. The input to an SLR system can be classified into static images and videos. Through an extensive review of the literature, we identified four studies employing static images as inputs: Shah et al. [1], Yasumuro and Jin'no [24], Bajaj et al. [25], and Attia et al. [26]. These studies are summarized in Table 1.

Table 1. Summary of sign language recognition research.

Related Work | Features | Classifier | Used Labels
Shah et al. [1] | SURF, LBP, EOH, HOG | SVM with multiple kernels | 36 Pakistan
Yasumuro and Jin'no [24] | Keypoints | SVM | 41 Hiragana, 24 Alphabetic
Bajaj et al. [25] | Keypoints | KNN, random forest and neural network | 24 American
Attia et al. [26] | CNN | YOLOv5x + attention methods | 36 American, 66 Bangla
Li et al. [9] | Keypoints | TGCN | 2000 American
Bilge et al. [6] | Spatial, temporal, text and attribute | GZSSLR | 200 American
Takayama et al. [27] | Keypoints | SLGCN-Transformer | 2000 American, 275 Japanese
Subramanian et al. [11] | Keypoints | MOPGRU | 12 Indian, 100 American, 64 Argentinian
Luqman et al. [28] | Spatial features | Stacked CNN and LSTM | 502 Arabic, 64 Argentinian
Samaan et al. [29] | Keypoints | RNN | 10 American

Shah et al. [1] pioneered the development of an SLR system tailored for 36 labels within the context of Pakistan Sign Language, predominantly relying on vision modalities. Their method encompasses four distinct feature extractions, namely, speeded-up robust features (SURFs), local binary patterns (LBPs), edge-oriented histograms (EOHs), and histograms of oriented gradients (HOGs). Each feature space subsequently undergoes processing via tenfold cross-validation to ascertain the optimal kernel among linear, Gaussian, and polynomial support vector machines (SVMs) in terms of achieving the highest average accuracy. Following this, the feature space associated with the kernel demonstrating the highest average accuracy is selected, and that kernel is used as the SVM kernel to classify the output pertaining to that particular feature space.

Yasumuro and Jin'no [24] focused on the recognition of Japanese fingerspelling, employing MediaPipe. Their approach involves the utilization of an SVM for the classification task as an alternative to deep learning methods [25], aiming to increase computational efficiency. Their study employed video, processing each frame as input to recognize fingerspelling, encompassing 24 labels for the alphabet and 41 labels for the hiragana datasets. Notably, the SVM-based methodology demonstrated a reduction in computation time compared with deep learning while simultaneously achieving a higher recognition rate.
Bajaj et al. [25] undertook a comprehensive investigation comparing three classification algorithms in the context of SLR systems: K-nearest neighbor (KNN), random forest, and neural networks. Their research explored 28 distinct preprocessing combinations with the goal of enhancing the classification algorithm. The experimental results revealed that the application of preprocessing techniques significantly improves accuracy, with the most effective combination involving rounding, shifting, and scaling. Moreover, the optimal classification algorithm identified in their study was a neural network coupled with the aforementioned preprocessing technique.

Attia et al. [26] developed three deep learning models based on YOLOv5x, incorporating two attention methods, squeeze-and-excitation and a convolutional block attention module, for the SLR system. The dataset employed for the study comprised 36 American labels and 66 Bangla labels. The rationale behind selecting YOLOv5x, an extension of YOLOv5, as the foundational model lies in its lightweight and rapid deployment capabilities on diverse edge devices. It is crucial to note, however, that this model necessitates bounding box labeling, rendering it trainable but requiring a considerable time investment for annotation.

As shown in Table 1, three of the four studies that utilized static images employed classical machine learning, whereas one study used deep learning. Notably, considerable emphasis has been placed by researchers on optimizing the computation time of SLR systems. Importantly, the practical application of SLR involves the analysis of videos to identify labels on the basis of motion sequences. Consequently, this study intentionally abstained from the use of static images, aligning with the dynamic nature inherent in SLR applications. The challenge encountered in the isolated SLR of video inputs revolves around the scarcity of publicly available datasets. This predicament was effectively addressed by Li [8] through the introduction of the Word-Level American Sign Language (WLASL) video dataset. The notable features of this dataset include a frame rate of 25 frames per second (fps) and a video resolution of 256x256. Ambiguity emerges as a notable challenge within WLASL. This ambiguity manifests in instances where identical sign language labels exhibit different signs. Furthermore, distinct labels, such as "wish" and "hungry", may feature similar signs or movements [8]. Li proposed a method designed for recognizing isolated sign language, denoted as pose-based temporal graph convolution networks (Pose-TGCNs). This method relies on OpenPose [21] for extracting keypoints, encompassing 13 upper-body points and 21 joint points for each of the left and right hands. Remarkably, the Pose-TGCN demonstrates commendable performance, particularly when confronted with a limited vocabulary size of 100 labels.
Bilge et al. [6] presented an SLR system designed to identify novel classes through knowledge transfer from the training dataset, specifically addressing zero-shot learning sign language recognition (ZSSLR) and generalized ZSSLR (GZSSLR). The authors employed a zero-shot learning (ZSL) framework to extend the recognition model's applicability to both seen and unseen classes, incorporating visual and auxiliary class representations. ZSSLR and GZSSLR share similarities, differing only in the test data utilized: ZSSLR uses novel, unseen test data, whereas GZSSLR uses both seen and unseen test data. Visual representations were extracted from a spatiotemporal deep model encompassing body and hand regions. An auxiliary class representation was derived from textual dictionary definitions and attribute combinations. The authors introduced three benchmark datasets in this study: ASL-Text, comprising 250 labels; and MS-ZSSLR-W and MS-ZSSLR-W, each containing 200 labels. Despite promising results, the accuracy remained below 40%, relatively low compared with that of other ZSL methods.

Takayama et al. [27] extended batch normalization in deep learning by inserting masked batch normalization (MBN) into an existing SLR system. The MBN normalized the input features in the GCN model while masking the dummy signals. The experimental outcomes revealed a noteworthy enhancement in the accuracy of the GCN, establishing MBN as an effective component of the classification algorithm. In the context of their study, the most proficient algorithm identified was a Sign Language Graph Convolution Network with a Transformer (SLGCN-Transformer), which exhibited superior performance within their experimental framework.

Subramanian et al. [11] directed their research toward Indian SLR involving 12 distinct classes. The authors introduced an optimized fusion of MediaPipe and a GRU, denoted as the MOPGRU (MediaPipe Optimized Gated Recurrent Unit), designed to process video datasets effectively. Within the MOPGRU, modifications were applied to the update gates of the standard GRU, ensuring that the outputs of the reset gates re-evaluated the information, eliminating unwanted data and prioritizing meaningful information. Furthermore, the method proposed by the researchers underwent a comparative analysis with state-of-the-art algorithms on WLASL100 (Word-Level American Sign Language with 100 labels).

Luqman et al. [28] devised an SLR model that synergistically employs a convolutional neural network (CNN) and long short-term memory (LSTM). This integration was evaluated via datasets comprising 502 Arabic and 64 Argentinian samples. The optimal configuration was identified through the utilization of stacked MobileNet for feature extraction, followed by subsequent processing with stacked LSTM. This combination emerged as the most effective in achieving the desired outcomes in their experimental framework.

Samaan et al. [29] introduced the dynamic sign language (DSL) 10 dataset, comprising 10 labels of ASL. Their approach involves the application of RNN-based models, such as the GRU, LSTM, and BiLSTM.
All six studies focused on video inputs, as outlined in Table 1, and employed deep learning methodologies. According to the experimentation conducted by Samaan et al. [29], the use of facial keypoints is not advised because of the sixfold increase in processed features, leading to heightened computational demands. This results in extended processing times compared with scenarios where facial keypoints are not employed, while the achieved accuracy remains comparable. Similarly, other researchers [11, 24, 26, 29] also consider the computational efficiency of SLR, acknowledging its significance in ensuring streamlined processing. The collective findings from SLR research underscore real-time implementation on edge devices as an ongoing challenge within SLR systems. This observation drives our research efforts, with a focus on developing a cost-effective SLR solution applicable to edge devices and adept at classifying dynamic inputs. Furthermore, our proposed method combines computational efficiency and competitive performance, unlike deep learning methods, which often demand substantial computational power and training time.

Reservoir computing

ESN

RC is inspired by a natural phenomenon: when a droplet of water falls onto a still water surface, it generates ripples that spread outward. The pattern and intensity of these ripples are determined by the size and force of the droplet, as illustrated in Fig 1. Therefore, by observing the water surface, one can analyze what droplets have fallen and how.

Fig 1. Reservoir concept depicted with ripples.

RC consists of an input, a reservoir, and an output, as shown in Fig 2. The water surface can be regarded as an analogy for the reservoir, with the droplet representing the input signal. As the droplet interacts with the water, it disturbs the surface and generates a complex ripple pattern, analogous to how input time series data are transformed by the dynamic reservoir in RC. The reservoir captures temporal dependencies and maps the input into a high-dimensional space called a reservoir state. In the final stage of the model's development, the readout employs the transformed states, or ripple patterns, to construct the model and perform classification.

Fig 2. Basic architecture of ESN.

RC presents a recurrent model capable of training without relying on a gradient descent-based approach. This design seeks to overcome the challenges associated with RNNs, which are known for being difficult to train via gradient descent methods and computationally intensive [30]. In the RC architecture, input data undergo processing within a fixed random internal layer known as the reservoir, and the output is generated through a linear combination, often implemented as linear regression [12]. Compared with the deep learning approach, this methodology enables RC to achieve faster computation times with fewer parameters [31].

RC encompasses two primary types: ESNs [17] and liquid state machines (LSMs) [32]. The primary distinction lies in the implementation of the neurons. The ESN utilizes discrete dynamics and rate-coded neurons that integrate inputs and recurrent connections, whereas the LSM employs continuous dynamics and spiking neurons. This study focuses predominantly on the ESN approach because of its simplicity and robust theoretical foundation [33]. The fundamental architecture of the ESN is depicted in Fig 2 and comprises four steps:
1. Generate an input weight $W_{in}$ via Eq (1), a reservoir weight $W$ via Eq (4), and a leak rate $\alpha$, scaled in the range $[0,1]$, which controls the effect of the reservoir state at the previous time step on the next reservoir state. Let $N_u$ and $N_r$ denote the dimensions of the input and reservoir vectors, respectively. $W_{in} \in \mathbb{R}^{N_r \times N_u}$ represents the weight matrix of the input data, scaled in the range $[-\sigma, \sigma]$. $W \in \mathbb{R}^{N_r \times N_r}$ denotes the weight matrix of the internal neurons, which is generated via Eqs (2), (3), and (4).

$W_{in} = (2\,\mathrm{randomBinomial}(N_r, N_u) - 1)\,\sigma$,  (1)

$W_0 = \mathrm{random}(N_r, N_r, \theta)$,  (2)

$\rho_0 = \max(|\mathrm{eigen}(W_0)|)$,  (3)

$W = W_0\,(\rho / \rho_0)$  (4)

Here, $\mathrm{randomBinomial}(N_r, N_u)$ represents a random function that draws samples from the binomial distribution to generate a matrix $\in \mathbb{R}^{N_r \times N_u}$. $\sigma$ represents the input scaling hyperparameter, which controls the influence of the input in the dynamic reservoir. $\mathrm{random}(N_r, N_r, \theta)$ represents a sparse random function that generates a matrix of dimension $N_r \times N_r$ on the basis of the reservoir dimension and the connectivity parameter $\theta$, which represents the percentage of nonzero values in the reservoir and lies in the range $[0,1]$. $\rho$ represents the spectral radius hyperparameter, which defines the maximum absolute eigenvalue of the reservoir weight matrix, and $\mathrm{eigen}(W_0)$ is a function for calculating the eigenvalues of the random matrix generated via Eq (2).

2. Process the input $U$ and calculate the corresponding reservoir activation states $x_{(t)}$. We define the input and the reservoir activation states in Eqs (5) and (6), respectively, as follows:

$U = [u_{(1)}, u_{(2)}, u_{(3)}, \dots, u_{(N_t)}] \in \mathbb{R}^{N_t \times N_u}$,  (5)

where $N_t$ represents the time length of the input data.

$x_{(t+\Delta t)} = (1 - \alpha)\,x_{(t)} + \alpha\,\mathrm{func}(W x_{(t)} + W_{in} u_{(t+\Delta t)})$,  (6)

where $u_{(t+\Delta t)}$ represents the input data, $x_{(t)}$ represents the reservoir state, $t$ represents the discrete time $(1, 2, \dots, T)$, and $\mathrm{func}$ represents an activation function, typically a hyperbolic tangent.

3. Compute the linear readout weights $W_{out}$ from the reservoir using linear regression. In this study, we used ridge regression, which minimizes the error between $Y_{(t)}$, the predicted label at time $t$, and the actual label $Y_{target}$, as defined in Eq (7), while preventing overfitting via Eq (8).

$Y_{target} = [y_{target(1)}, y_{target(2)}, \dots, y_{target(N_t)}] \in \mathbb{R}^{N_t \times N_y}$,  (7)

where $N_y$ represents the number of dimensions of a target vector.

$W_{out} = (X^{\top} X + \beta_1 I)^{-1} X^{\top} Y_{target}$,  (8)

where $\beta_1$ represents the regularization coefficient, $I \in \mathbb{R}^{N_r \times N_r}$ represents the identity matrix, and $X \in \mathbb{R}^{N_t \times N_r}$ represents the reservoir state matrix $X = x_{(1)}, x_{(2)}, \dots, x_{(N_t)}$.

4. The trained network is used on new input data $U$ to compute the predicted label $Y \in \mathbb{R}^{N_t \times N_y}$ by utilizing the trained output weights $W_{out} \in \mathbb{R}^{N_r \times N_y}$, as formulated in Eq (9).

$Y = X W_{out}$  (9)
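To make the four steps above concrete, the following is a minimal NumPy sketch of a single ESN. It is illustrative only: the function and variable names (init_esn, run_reservoir, and so on) and the toy data are our own assumptions and do not reproduce the released implementation of this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_esn(n_inputs, n_reservoir, sigma=0.5, theta=0.2, rho=0.3):
    # Eq (1): input weights drawn from {-sigma, +sigma}
    W_in = (2 * rng.binomial(1, 0.5, size=(n_reservoir, n_inputs)) - 1) * sigma
    # Eqs (2)-(4): sparse random reservoir rescaled to spectral radius rho
    W0 = rng.uniform(-1, 1, size=(n_reservoir, n_reservoir))
    W0 = W0 * (rng.random((n_reservoir, n_reservoir)) < theta)  # connectivity mask
    rho0 = np.max(np.abs(np.linalg.eigvals(W0)))
    W = W0 * (rho / rho0)
    return W_in, W

def run_reservoir(U, W_in, W, alpha=0.9):
    # Eq (6): leaky-integrator state update over the input sequence U (N_t x N_u)
    x = np.zeros(W.shape[0])
    states = []
    for u in U:
        x = (1 - alpha) * x + alpha * np.tanh(W @ x + W_in @ u)
        states.append(x.copy())
    return np.asarray(states)            # X: N_t x N_r

def train_readout(X, Y_target, beta=1.0):
    # Eq (8): ridge-regression readout
    return np.linalg.solve(X.T @ X + beta * np.eye(X.shape[1]), X.T @ Y_target)

# toy usage: one sequence of 20 steps with 3 input features and 5 output dimensions
U = rng.standard_normal((20, 3))
Y_target = rng.standard_normal((20, 5))
W_in, W = init_esn(n_inputs=3, n_reservoir=50)
X = run_reservoir(U, W_in, W)
W_out = train_readout(X, Y_target)
Y_pred = X @ W_out                       # Eq (9)
```

Only the readout in train_readout is fitted; the input and reservoir weights stay fixed after initialization, which is what makes the training step a single linear solve.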
Grouped ESN

A grouped ESN [34-36] comprises more than one parallel reservoir, denoted as $N_p$, and a single linear readout serves as the decoder, as illustrated in Fig 3. This approach is introduced to extract diverse features from time series inputs, enhancing prediction performance by expanding the reservoir state space to augment its representational capabilities. The corresponding reservoir state can be computed via Eq (10) [34]. In the grouped ESN, a constant leak rate is employed to calculate the reservoir state, with independent $W_{in}$ and $W$ values for each reservoir.

Fig 3. Illustration of grouped ESNs.

$x^{p}_{(t+\Delta t)} = (1 - \alpha)\,x^{p}_{(t)} + \alpha\,\mathrm{func}(W^{p} x^{p}_{(t)} + W^{p}_{in} u_{(t+\Delta t)})$,  (10)

where $p$ represents the index of a parallel reservoir. $W^{p}_{in}$ and $W^{p}$ have the same generation and distribution as in the ESN, as obtained via Eqs (1) and (4).

Reservoir state representation

In this study, we drew inspiration from the ESN implementation proposed by Bianchi et al. [23]. In their implementation, they used a drop parameter $\delta$, which sets the length of the time steps that will be processed in training by dropping a certain number of reservoir state time steps, as formulated in Eq (11). The $\delta$ parameter is useful for omitting time steps that do not significantly contribute to the recognition process. The result of dropping time steps is denoted as $X_{drop} \in \mathbb{R}^{N_d \times N_r}$, where $N_d$ is the number of time steps remaining after dropping $\delta$ steps.

$X_{drop} = X[0 : N_t - \delta,\; 0 : N_r]$,  (11)

where the notation $[0 : N_t]$ is defined as a slice of a range starting from zero and ending at $N_t - 1$.

We also adopt the reservoir state representation module shown in Eq (13), represented by $s$. This module utilizes all reservoir dynamics, in contrast to the standard ESN approach, which employs the final reservoir state; utilizing only the final state may introduce bias in the output modeling space. The other objective of this module is to increase the generalization capacity of reservoirs that rely on heterogeneous dynamics arising from the inputs. Bianchi et al. [23] developed a new model space in which each multivariate time series is represented by linear model parameters. The linear model is trained to predict the subsequent reservoir state, denoted as $x_{(t+1)}$, by employing Eq (12). $s$ is a vector of length $N_{rep}$, where $N_{rep}$ is equal to $(N_r + 1) N_r$. The notation $V^{\top} = \mathrm{Concat}(V_1^{\top}, v_2^{\top})$ represents a matrix resulting from the concatenation of a weight matrix $V_1 \in \mathbb{R}^{N_r \times N_r}$ and a vector $v_2 \in \mathbb{R}^{N_r}$. $V$, given by Eq (17), denotes the outcome of the ridge regression of $X_2 \in \mathbb{R}^{(N_d - 1) \times (N_r + 1)}$ from Eq (15), where $X_{next} \in \mathbb{R}^{(N_d - 1) \times N_r}$ from Eq (16) serves as the target. $X_2$ is formed by concatenating $X_{prev}$ in Eq (14) with a column of ones serving as the input bias, and $v_2$ acts as a bias that adjusts the regression line to fit the data. $V$ in Eq (17) and $W_{out}$ in Eq (8) have different purposes, despite both equations utilizing ridge regression: Eq (17) exploits all of the reservoir dynamics by training a linear model to predict the subsequent state of the reservoir at each time step, whereas Eq (8) trains the model to predict the outputs of the given task.

$x_{(t+1)} = x_{(t)} V_1 + v_2$,  (12)

$s = \mathrm{vec}(V) = \mathrm{Concat}(\mathrm{vec}(V_1), v_2)$,  (13)

where

$X_{prev} = X_{drop}[0 : N_d - 1,\; 0 : N_r]$,  (14)

$X_2 = \mathrm{Concat}(X_{prev}, 1)$,  (15)

$X_{next} = X_{drop}[1 : N_d,\; 0 : N_r]$,  (16)

$V = (X_2^{\top} X_2 + \beta_2 I_2)^{-1} X_2^{\top} X_{next}$,  (17)

where $\mathrm{Concat}(.)$ is the concatenation function used to join a sequence of arrays with the same shape along an existing axis. The vectorization function, designated as $\mathrm{vec}(.)$, transforms a matrix into a column vector, whereby the columns of the matrix are stacked vertically. $\beta_2$ is the regularization parameter of the ridge regression, and $I_2$ is the identity matrix.

The utilization of $s$ in place of the standard reservoir state requires the modification of the readout, designated as $\hat{W} \in \mathbb{R}^{N_{rep} \times N_y}$, and the predicted label, designated as $\hat{Y} \in \mathbb{R}^{N \times N_y}$, as demonstrated in Eqs (18) and (19).

$\hat{W} = (S^{\top} S + \beta_3 I_3)^{-1} S^{\top} Y_{rep}$,  (18)

$\hat{Y} = S \hat{W}$,  (19)

where $S \in \mathbb{R}^{N \times N_{rep}}$ represents $S = s_{(1)}, s_{(2)}, \dots, s_{(N)}$, $N$ is the number of data samples, $\beta_3$ represents the regularization coefficient, $I_3 \in \mathbb{R}^{N_{rep} \times N_{rep}}$ represents the identity matrix, and $Y_{rep} \in \mathbb{R}^{N \times N_y}$ is the target matrix.
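As a rough illustration of Eqs (11)-(19), the sketch below drops the last $\delta$ states, fits a ridge model that predicts the next reservoir state from the current one, and flattens the fitted parameters into the representation $s$. The function names and default regularization values are illustrative assumptions, not the authors' exact code.

```python
import numpy as np

def reservoir_representation(X, delta=5, beta2=10.0):
    """Map a reservoir state sequence X (N_t x N_r) to a fixed-length vector s."""
    N_t, N_r = X.shape
    X_drop = X[:N_t - delta, :]                      # Eq (11)
    X_prev = X_drop[:-1, :]                          # Eq (14)
    X_next = X_drop[1:, :]                           # Eq (16)
    ones = np.ones((X_prev.shape[0], 1))
    X2 = np.hstack([X_prev, ones])                   # Eq (15): append bias column
    # Eq (17): ridge regression predicting the next state from the current one
    V = np.linalg.solve(X2.T @ X2 + beta2 * np.eye(N_r + 1), X2.T @ X_next)
    return V.flatten()                               # Eq (13): s, length (N_r + 1) * N_r

def train_readout_from_repr(S, Y_rep, beta3=3.0):
    # Eq (18): ridge readout over the stacked representations S (N x N_rep)
    return np.linalg.solve(S.T @ S + beta3 * np.eye(S.shape[1]), S.T @ Y_rep)

# usage, assuming the state sequences of several videos are collected in a list all_X:
# S = np.stack([reservoir_representation(X) for X in all_X])
# W_hat = train_readout_from_repr(S, Y_onehot)       # then Y_hat = S @ W_hat, Eq (19)
```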
Research method

Data acquisition

This study employs sign language videos as input data. MediaPipe is employed to extract keypoints from each frame of the video dataset. The extracted keypoints encompass the body, left hand, and right hand, collectively amounting to 150 features. More precisely, 66 features pertain to the body, and 42 features each are dedicated to the left and right hands. The dataset utilized in this study is WLASL100, encompassing 100 distinct labels.

Processing each video frame

The processing of each frame involves a two-step procedure: preprocessing and extracting keypoints through the utilization of MediaPipe. Data preprocessing plays a pivotal role in this research, as variations in the video dataset conditions can impact the accuracy of the classification algorithm. To address this, preprocessing techniques, namely normalization and zero padding, are employed. Normalization plays a crucial role in accommodating the diverse positions of signers, using the nose position as a reference for each signer. The process involves several steps. Initially, the nose is detected as a reference point located at index 0 in the pose landmarks, as illustrated in Fig 4. If the pose is not detected in certain frames, those frames are removed. The nose is chosen as a reference because its position is relatively stable and not affected by hand movement, and this point is appropriate when the head is stable.

Fig 4. Illustration of pose landmarks.

The next step involves mapping the keypoints into image coordinates, followed by subtracting the nose coordinate from all keypoints, termed the distance keypoint $dKeypoint$, as expressed in Eq (20). The mean of $dKeypoint$, defined in Eq (23), is subsequently computed and subtracted from $dKeypoint$, yielding $meanKeypoint$ as demonstrated in Eq (21). In the final step, as per Eq (22), the normalization result $u_{normalized}$ is obtained by dividing $meanKeypoint$ by its standard deviation, computed through Eq (24).

$dKeypoint = allKeypoint - nosePosition$,  (20)

$meanKeypoint = dKeypoint - \overline{dKeypoint}$,  (21)

$u_{normalized} = meanKeypoint / \mathrm{std}(meanKeypoint)$,  (22)

where

$\overline{input} = \sum (input) / N$,  (23)

$\mathrm{std}(input) = \sqrt{\dfrac{1}{N-1} \sum_{i=1}^{N} \left(input_i - \overline{input}\right)^2}$,  (24)

$N$ represents the number of inputs, and $u_{normalized}$ represents one time step, which is combined over all time steps from one video to become $U_{normalized}$.

This study also explored an alternative normalization approach using the shoulder position as the reference point. The shoulders are chosen as a reference because sign language primarily involves the upper body and hands, so this reference can ensure hand position alignment for SLR. The normalization is performed by computing the center point of the shoulders via Eq (25) and the length of the shoulder line via Eq (26). In the final step, $allKeypoint$, which combines hand and pose landmarks, is normalized via Eq (27).

$middle = \dfrac{leftShoulder + rightShoulder}{2}$,  (25)

where $leftShoulder$ and $rightShoulder$ represent the x and y coordinates of the left and right shoulder positions, respectively.

$lScale = \lVert leftShoulder - rightShoulder \rVert$,  (26)

where $\lVert . \rVert$ denotes the norm or absolute-value function.

$sNormalize = \dfrac{allKeypoint - middle}{lScale}$  (27)
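The two normalization schemes can be summarized with a short NumPy sketch. The keypoint layout assumed here (an array of (x, y) coordinates with the nose at pose index 0 and the shoulders at indices 11 and 12, following MediaPipe's pose landmark convention) is an illustrative assumption and not a guaranteed detail of the authors' preprocessing code.

```python
import numpy as np

def normalize_by_nose(all_keypoints, nose):
    """Eqs (20)-(24): shift by the nose position, center, and divide by the std."""
    d = all_keypoints - nose                  # Eq (20)
    m = d - d.mean()                          # Eqs (21), (23)
    return m / m.std(ddof=1)                  # Eqs (22), (24): sample std with N - 1

def normalize_by_shoulders(all_keypoints, left_shoulder, right_shoulder):
    """Eqs (25)-(27): shift to the shoulder midpoint and scale by shoulder width."""
    middle = (left_shoulder + right_shoulder) / 2                  # Eq (25)
    l_scale = np.linalg.norm(left_shoulder - right_shoulder)       # Eq (26)
    return (all_keypoints - middle) / l_scale                      # Eq (27)

# usage for one frame, assuming frame_kp has shape (num_keypoints, 2) with
# MediaPipe-style pose indices (nose = 0, shoulders = 11 and 12)
# u = normalize_by_nose(frame_kp, nose=frame_kp[0])
# u = normalize_by_shoulders(frame_kp, frame_kp[11], frame_kp[12])
```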
Another preprocessing technique, zero padding, denoted as $U_{padding} \in \mathbb{R}^{(N_t + padding) \times N_u}$, is applied after normalization. This step standardizes the length of the video time steps across the dataset, ensuring uniformity in the temporal dimension. Both normalization and zero padding are integral components of both the training and testing processes. In addition to these techniques, an extra preprocessing step, exclusively employed during training, is incorporated, termed augmentation. Augmentation is crucial in addressing a specific challenge encountered in sign language videos, where signers predominantly employ either the left or right hand. To mitigate this bias, horizontal flipping is applied in this study. By doing so, the classification algorithm is adept at learning and adapting to scenarios where the signer predominantly uses either the left or the right hand.

Proposed methods

This study introduces a novel approach, termed MRC, that integrates MediaPipe into the SLR pipeline, as illustrated in Fig 5. Preceding the RC processing step, feature normalization and zero padding are executed, involving the calculations outlined in Eqs (20), (21), and (22).

Fig 5. Sign language recognition pipeline.

The preprocessed features are then fed into the MRC, as depicted in Fig 6(a), employing a distinct leak rate $\alpha^{p}$ for each reservoir. The parallel reservoirs, denoted by the index $p$, calculate the reservoir state via Eq (28). The influence of the previous state on the current state varies on the basis of the leak rate: a lower rate implies a more significant influence, whereas a higher rate results in less impact. This diversification in reservoir characteristics within the MRC facilitates the extraction of distinct signing speeds, contributing to a richer data representation than a conventional RC. The reservoir states from all the reservoirs in the MRC are aggregated, and the resulting representation is further processed through Eq (13). Subsequently, linear regression is applied for training or inference via Eq (18).

$x^{p}_{(t+\Delta t)} = (1 - \alpha^{p})\,x^{p}_{(t)} + \alpha^{p}\,\mathrm{func}(W^{p} x^{p}_{(t)} + W^{p}_{in} u_{(t+\Delta t)})$  (28)

Algorithm 1 presents the pseudocode for training the MRC, whereas Algorithm 2 outlines the pseudocode for inference. Throughout the training and inference processes, various functions come into play. Specifically, generateInternalWeight(.) is utilized to generate $W$, as illustrated in Eq (4). The function generateInputWeight(.) is employed to create $W_{in}$, following Eq (1). The function reservoirState(.) is invoked to calculate the reservoir state, as indicated in Eq (28). The function s(.) is employed for computing the reservoir representation, as depicted in Eq (13). The function TrainRegression is utilized to train the readout weight, following Eq (18). The weights generated in the training process outlined in Algorithm 1 are subsequently employed to predict the labels $\hat{Y}$ of the test data, as detailed in Algorithm 2. This process involves utilizing the loadTrainingInternalWeight() function for $W$, loadTrainingInputWeight() for $W_{in}$, and the trained readout weight $\hat{W}$.
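Conceptually, the multiple-reservoir step amounts to running several reservoirs with different leak rates over the same normalized input and column-stacking their state sequences before the representation and readout stages. The following sketch illustrates this under the same assumptions as the earlier snippets (hypothetical helper names such as init_esn and reservoir_representation); it is not the released implementation.

```python
import numpy as np

def run_mrc(U, reservoirs, leak_rates):
    """Eq (28): run each parallel reservoir with its own leak rate over the same
    input sequence U (N_t x N_u) and column-stack the state sequences (allX)."""
    all_states = []
    for (W_in, W), alpha in zip(reservoirs, leak_rates):
        x = np.zeros(W.shape[0])
        states = []
        for u in U:
            # leaky-integrator update of reservoir p with its own leak rate alpha^p
            x = (1 - alpha) * x + alpha * np.tanh(W @ x + W_in @ u)
            states.append(x.copy())
        all_states.append(np.asarray(states))
    return np.hstack(all_states)        # N_t x (total nodes across all reservoirs)

# example: three 100-node reservoirs (300 nodes in total) with leak rates 0.9, 0.8, 0.6,
# reusing the hypothetical init_esn and reservoir_representation helpers sketched above
# reservoirs = [init_esn(n_inputs=150, n_reservoir=100) for _ in range(3)]
# allX = run_mrc(U_normalized, reservoirs, [0.9, 0.8, 0.6])
# s = reservoir_representation(allX)    # then train the ridge readout as in Eq (18)
```

In this sketch the drop of the last reservoir states is handled inside the representation step rather than before stacking, which is a simplification relative to Algorithm 1.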
Experiments

Experimental setting

The SLR experiment was conducted using Python version 3.10 on a personal computer featuring an Intel Core i7 central processing unit (CPU), 32 GB of random access memory (RAM), and a 12 GB NVIDIA GeForce RTX 4070 Ti graphics processing unit (GPU). The WLASL100 dataset was partitioned into three segments, training, validation, and testing, comprising 1780 videos, 258 videos, and 258 videos, respectively.

The proposed MRC encompasses two distinct architectural configurations, comprising 300 and 510 reservoir nodes, respectively. These architectures are composed of either two or three parallel reservoirs. The leak rates applied in each reservoir vary to enhance temporal feature extraction; the values are set at 0.9 for the first reservoir, 0.8 for the second reservoir, and 0.6 for the third reservoir in the three-reservoir configuration. Furthermore, a value of 0.3 is assigned to the spectral radius $\rho$, which determines the largest absolute eigenvalue of the reservoir. Other key parameters include five for the number of reservoir states to be dropped $\delta$, 0.2 for the connectivity value $\theta$, and regularization coefficients of $\beta_2 = 15$ (for $V$ in Eq (17)) and $\beta_3 = 3$ (for $\hat{W}$ in Eq (18)); both coefficients are used with the ridge regression algorithm. These values were obtained from a hyperparameter optimization framework, Optuna [37]. The search space for each hyperparameter is shown in Table 2.

The hyperparameter importance analysis in Fig 7 shows the average of the Optuna hyperparameter importance values during fine-tuning over 10 optimization runs with 30 trials each. The optimization runs reveal that w_ridge_embedding ($\beta_3$) has the most significant impact on model performance, indicating that controlling $\beta_3$ in training is crucial for improving generalization. Similarly, the spectral radius $\rho$ contributes almost equally, suggesting that both parameters play a key role in model stability and feature transformation. The leak parameters (leak rates 1 ($\alpha_1$), 2 ($\alpha_2$), and 3 ($\alpha_3$)) play a significant yet secondary role, indicating that fine-tuning them could optimize memory and state propagation in reservoir computing. Moreover, input scaling ($\sigma$) has a noticeable but lower influence, meaning that it affects model sensitivity but is not as critical as the other parameters. On the other hand, w_ridge ($\beta_2$), the drop reservoir ($\delta$), and connectivity ($\theta$) have minimal impacts, suggesting that their tuning is less critical and that default values may be sufficient.

Fig 7. Importance of each hyperparameter.

In Fig 6(b), the worst-case experimental scenario for matrix operations in this research is illustrated. The MRC 170*3 (MRC510) configuration, comprising three reservoirs with 170 nodes each, results in a total of 510 reservoir nodes. Here, $N$ represents the number of matrix samples, $N_t$ denotes the number of time steps, $N_f$ represents the number of features, $N_p$ represents the number of parallel reservoirs, and $N_y$ represents the number of labels. The matrix size in the MRC is comparable to that in the standard RC, involving three matrix multiplication processes in the reservoir state layer, the reservoir state representation, and the readout layer, all of which employ linear regression. Following the reservoir state layer, a time step reduction from 203 to 198 occurs because the $\delta$ value is set to five.

Fig 6. (a) Reservoir computing based on multiple reservoirs. (b) Illustration of matrix sizes in MRC510.

The ESN and grouped ESN differ from the MRC primarily in one hyperparameter: the ESN shares the same constant leak rate as the grouped ESN but has a single reservoir. To align the number of reservoir nodes with the MRC and grouped ESN, we set reservoir sizes of 300 and 510 for the ESN. Conversely, the grouped ESN maintains the same leak rate across reservoirs but features two or three reservoirs, akin to the MRC. We determined the optimal leak rate for the grouped ESN to be 0.9.
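For reference, a hyperparameter search of this kind can be set up with Optuna [37] along the following lines. The objective below is a schematic placeholder rather than the authors' tuning script: the parameter names and the train_and_evaluate_mrc stub are illustrative assumptions, while the value ranges follow Table 2.

```python
import optuna

def train_and_evaluate_mrc(params):
    # Placeholder standing in for MRC training and evaluation; it should return
    # the validation accuracy obtained with the sampled hyperparameters.
    return 0.0

def objective(trial):
    params = {
        "leak_rate_1": trial.suggest_float("leak_rate_1", 0.1, 1.0),
        "leak_rate_2": trial.suggest_float("leak_rate_2", 0.1, 1.0),
        "leak_rate_3": trial.suggest_float("leak_rate_3", 0.1, 1.0),
        "spectral_radius": trial.suggest_float("spectral_radius", 0.1, 1.0),
        "connectivity": trial.suggest_float("connectivity", 0.1, 1.0),
        "input_scaling": trial.suggest_float("input_scaling", 0.1, 1.0),
        "w_ridge": trial.suggest_int("w_ridge", 1, 30),
        "w_ridge_embedding": trial.suggest_int("w_ridge_embedding", 1, 30),
        "drop_reservoir": trial.suggest_int("drop_reservoir", 1, 10),
    }
    return train_and_evaluate_mrc(params)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=30)
print(study.best_params)
# optuna.importance.get_param_importances(study) yields the kind of
# importance ranking summarized in Fig 7.
```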
Algorithm 1. Training process of MRC
Input: input data matrix $U$, input data at time step $t$ $u_{(t)}$, target data matrix $Y_{rep}$, number of internal reservoir units $N_r$, number of parallel reservoirs $N_p$, leak rate for each reservoir $\alpha_p$, spectral radius $\rho$, connectivity $\theta$, input scaling $\sigma$, weight matrices of the internal neurons $W$, weight matrices of the input data $W_{in}$, time length of the input data $N_t$, and the number of reservoir states to be dropped $\delta$
Output: decoding module $\hat{W}$
1: for p = 1 to N_p do
2:   W[p] = generateInternalWeight(N_r, rho, theta)
3:   W_in[p] = generateInputWeight(U, N_r, sigma)
4: end for
5: for p = 1 to N_p do
6:   for t = 0 to N_t - 1 do
7:     x^p_(t+dt) = reservoirState(alpha_p, W[p], x^p_(t), W_in[p], u_(t+dt))
8:   end for
9:   X^p = Concat(x^p_(t), x^p_(t+1), ..., x^p_(N_t-1))
10:  X_drop[p] = X[: N_t - delta, : N_r]
11:  if p = 1 then
12:    allX = X_drop[p]
13:  else
14:    allX = ColumnStack(allX, X_drop[p])
15:  end if
16: end for
17: S = Concat(s(allX[0]), ..., s(allX[N]))
18: W_hat = TrainRegression(S, Y_rep)
19: return W_hat

The proposed method underwent a comparative analysis with two deep learning approaches: the bidirectional gated recurrent unit (BiGRU) and one-dimensional convolution (Conv1D) combined with the BiGRU, denoted as Conv1D+BiGRU. The selection of the BiGRU as a benchmark algorithm is grounded in compelling findings from Subramanian's research [12]. The BiGRU architecture encompasses nine layers, featuring three GRU layers, one batch normalization layer, two dropouts with ratios of 0.2 and 0.3, and three dense layers. The training was conducted over 150 epochs with a learning rate of $10^{-4}$, utilizing Adam optimization with exponential decay rates of 0.9 and 0.999. The BiGRU architecture is visually depicted in Fig 8. Fig 9 illustrates the Conv1D+BiGRU, whose Conv1D layer is absent in the BiGRU architecture. The inclusion of Conv1D is motivated by the temporal nature of the data, which are organized as time series with each row corresponding to a time step. The output shapes of each layer in the architectures are displayed in both figures. The dimensions $N$, $N_t$, and $N_f$ represent the number of samples, time steps, and features, respectively. Notably, the BiGRU3(64) layer outputs a two-dimensional shape because the network returns the final cell state without the input sequence. This final state is comprehensive in features, facilitating label prediction from the input data.

Fig 8. Architecture of BiGRU.

Fig 9. Architecture of Conv1D+BiGRU.
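A hedged Keras sketch of a baseline along these lines is given below. It only mirrors the description above (three bidirectional GRU layers, one batch-normalization layer, dropouts of 0.2 and 0.3, three dense layers, and Adam with a learning rate of $10^{-4}$); the dense layer widths, the exact layer ordering, and the 64-unit GRU size are illustrative assumptions rather than the configuration shown in Fig 8.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_bigru(num_timesteps, num_features, num_classes, units=64):
    # nine-layer baseline: 3 BiGRU + 1 batch norm + 2 dropout + 3 dense
    model = models.Sequential([
        layers.Input(shape=(num_timesteps, num_features)),
        layers.Bidirectional(layers.GRU(units, return_sequences=True)),
        layers.BatchNormalization(),
        layers.Dropout(0.2),
        layers.Bidirectional(layers.GRU(units, return_sequences=True)),
        layers.Dropout(0.3),
        layers.Bidirectional(layers.GRU(units)),   # returns the final state only (2-D)
        layers.Dense(128, activation="relu"),
        layers.Dense(128, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4, beta_1=0.9, beta_2=0.999),
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model

# model = build_bigru(num_timesteps=203, num_features=150, num_classes=100)
# model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=150)
```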
In accordance with the aforementioned experimental setup, the achieved accuracy over 150 epochs is depicted in Figs 10 and 11. Both the BiGRU and Conv1D+BiGRU exhibit a continual improvement in accuracy on both the training and validation data throughout the epochs, indicating effective learning from the dataset. Notably, an in-depth analysis reveals that both algorithms reach their best performance even before completing the 150 epochs. In light of this observation, the model with the optimal accuracy is selected for predicting the test data in this study. Moreover, the reservoir algorithm's processing is notably more straightforward than that of the deep learning algorithms: only the final layer, referred to as the readout layer, undergoes weight updates via Eq (18). Importantly, the training of the reservoir algorithm is a one-time process.

Fig 10. Training accuracy.

Fig 11. Validation accuracy.

The experimental scenarios are divided into three parts. First, a sensitivity analysis of the leak rate optimized with Optuna was performed. A comparison of the SLR performance of the deep RNN and ESN-based algorithms was then carried out on three types of extracted features: the first type was extracted without normalization, the second type was normalized based on the shoulders as a reference point, and the third type was normalized based on the nose as a reference point. In the third scenario, the optimal results from the second scenario were selected and then compared with those of existing SLR algorithms.

Algorithm 2. Inference process of MRC
Input: input data matrix $U$, input data at time step $t$ $u_{(t)}$, number of internal reservoir units $N_r$, number of parallel reservoirs $N_p$, leak rate for each reservoir $\alpha_p$, weight matrices of the internal neurons $W$, weight matrices of the input data $W_{in}$, time length of the input data $N_t$, trained output weights $\hat{W}$, and the number of reservoir states to be dropped $\delta$
Output: predicted label $\hat{Y}$
1: W = loadTrainingInternalWeight()
2: W_in = loadTrainingInputWeight()
3: W_hat = loadTrainingOutputWeight()
4: for p = 1 to N_p do
5:   for t = 0 to N_t - 1 do
6:     x^p_(t+dt) = reservoirState(alpha_p, W[p], x^p_(t), W_in[p], u_(t+dt))
7:   end for
8:   X^p = Concat(x^p_(t), x^p_(t+1), ..., x^p_(N_t-1))
9:   X_drop[p] = X[: N_t - delta, : N_r]
10:  if p = 1 then
11:    allX = X_drop[p]
12:  else
13:    allX = ColumnStack(allX, X_drop[p])
14:  end if
15: end for
16: S = Concat(s(allX[0]), ..., s(allX[N]))
17: Y_hat = S W_hat
18: return Y_hat

Table 2. Hyperparameter value range search space.

Hyperparameter | Symbol | Value
Leak rate | alpha | 0.1 to 1
Spectral radius | rho | 0.1 to 1
Connectivity | theta | 0.1 to 1
Reservoir state representation regularization coefficient (w_ridge) | beta_2 | 1 to 30
Readout regularization coefficient (w_ridge_embedding) | beta_3 | 1 to 30
Drop reservoir | delta | 1 to 10
Input scaling | sigma | 0.1 to 1
Experimental results

The sensitivity analysis conducted in this study aimed to validate the leak rate values suggested by Optuna. In this scenario, the features used were extracted without normalization. The results are presented in Fig 12, where the accuracy variation across different leak rates can be observed. The figure clearly shows that the accuracy differences across the various leak rates were not substantial, indicating that the model remains relatively stable within the tested range. Optuna suggested leak rates of 0.9, 0.8, and 0.6, which achieved accuracies of 42.17%, 41.98%, and 42.33%, respectively. The highest recorded accuracy was 42.44% at a leak rate of 0.5, showing a 0.27% difference from the Optuna-selected 0.9 leak rate.

Fig 12. Impact of the leak rate on the SLR accuracy.

These results suggest that Optuna's selection is reasonable and falls within a stable region. However, the highest accuracy did not occur at the exact Optuna-suggested values, indicating that slight adjustments to the leak rate may further enhance performance. Given the minor fluctuations in accuracy (all within 1.24% of the peak value), it can be concluded that the model is not highly sensitive to variations in the leak rate within this range.

The second experimental scenario was concerned with a comparison of the SLR accuracy of the deep RNN and ESN-based algorithms. A summary of the experimental results is presented in Table 3, which shows the recognition performance without normalization. Additionally, Tables 4 and 5 display the recognition performance with normalization using the shoulders and the nose as reference points, respectively. The normalization is computed via Eqs (22) and (27) for nose normalization and shoulder normalization, respectively. In these tables, Acc refers to accuracy, and SD indicates the standard deviation. The average training and inference times are represented in mm:ss.ms, meaning minutes, seconds, and milliseconds. The impact of nose normalization is visually depicted in Fig 13. The normalization process involves shifting based on the nose position and scaling of the original keypoints, as illustrated in Figs 13(b) and 13(e). These images reveal distinct distributions of keypoints due to variations in signer positions and postures. Following normalization, the keypoint distributions become comparable, as evident in Figs 13(c) and 13(f).

Fig 13. Illustration of (a) a single frame from the "accident" sign, (b) a plot of the "accident" keypoints without normalization, (c) a plot of the "accident" keypoints after normalization, (d) a single frame from the "apple" sign, (e) a plot of the "apple" keypoints without normalization, and (f) a plot of the "apple" keypoints after normalization.

The experimental results revealed that normalization significantly improved the recognition accuracy across all the models. From Tables 4 and 5, nose-based normalization outperforms shoulder-based normalization. For example, MRC 100*3 achieved 44.81% accuracy without normalization, 56.43% accuracy with shoulder normalization, and 60.35% accuracy with nose normalization, reflecting an improvement of approximately 15.54 points. Similarly, BiGRU's accuracy increases from 35.74% without normalization to 46.94% with shoulder normalization and 50.36% with nose normalization, whereas Conv1D+BiGRU improves from 29.65% to 40.54% with shoulder normalization and 46.59% with nose normalization. This suggests
that normalization enhances the spatial representation, enabling the models to better capture the dynamic patterns of sign language gestures. Guided by these findings, normalization was employed in subsequent experiments to optimize model performance.

Five iterations were used in the experiments, with the aim of scrutinizing the standard deviation (SD) of each algorithm. The SD serves as a metric to gauge the variability in the accuracy values obtained during the experiments, with lower values being preferable. For the deep learning algorithms, 150 epochs were employed. The accuracy in each table depicts the average accuracy attained by the algorithm across five training and testing sessions with the best-performing model from each session. Notably, in the case of RC, the last weight is utilized, as updates occur only at the final layer via Eq (18).

Among the various configurations tested, the MRC exhibited the highest accuracy with 300 reservoir nodes, i.e., three parallel reservoirs with 100 nodes each, achieving a notable 60.35% coupled with a commendably low SD of 1.52%, as detailed in Table 5. Notably, the MRC exhibited superior accuracy compared with its deep learning counterparts, particularly the BiGRU and Conv1D+BiGRU.

Table 3. Comparison of recognition performance without normalization.

Method | Acc±SD (%) | Average Training Time (mm:ss.ms) | Average Inference Time (mm:ss.ms)
BiGRU | 35.74±3.38 | 33:59.9 | 00:00.1
Conv1D+BiGRU | 29.65±2.72 | 35:27.6 | 00:00.1
ESN 300 reservoir | 42.21±1.02 | 00:58.9 | 00:06.2
ESN 510 reservoir | 46.16±0.48 | 02:11.3 | 00:09.4
grouped ESN [34] 150*2 reservoir | 42.25±0.49 | 00:54.4 | 00:05.5
grouped ESN [34] 255*2 reservoir | 45.66±0.40 | 01:56.5 | 00:09.2
grouped ESN [34] 100*3 reservoir | 42.48±1.42 | 00:57.9 | 00:05.6
grouped ESN [34] 170*3 reservoir | 46.71±0.53 | 01:53.9 | 00:09.3
MRC 150*2 reservoir | 42.56±1.43 | 00:54.8 | 00:05.5
MRC 255*2 reservoir | 46.55±0.63 | 02:04.6 | 00:09.6
MRC 100*3 reservoir | 44.81±0.87 | 00:54.7 | 00:05.4
MRC 170*3 reservoir | 47.64±0.75 | 02:00.2 | 00:10.7

Table 4. Comparison of SLR performance using normalization with both shoulders as reference points.

Method | Acc±SD (%) | Average Training Time (mm:ss.ms) | Average Inference Time (mm:ss.ms)
BiGRU | 46.94±2.15 | 36:44.5 | 00:00.1
Conv1D+BiGRU | 40.54±1.14 | 39:17.9 | 00:00.2
ESN 300 reservoir | 56.67±0.50 | 01:17.8 | 00:06.4
ESN 510 reservoir | 56.01±1.27 | 02:05.0 | 00:09.3
grouped ESN [34] 150*2 reservoir | 56.82±1.32 | 01:12.7 | 00:06.4
grouped ESN [34] 255*2 reservoir | 54.96±1.19 | 02:00.6 | 00:09.3
grouped ESN [34] 100*3 reservoir | 56.09±1.71 | 01:14.0 | 00:06.3
grouped ESN [34] 170*3 reservoir | 55.50±0.58 | 02:01.4 | 00:09.8
MRC 150*2 reservoir | 55.97±0.94 | 01:12.0 | 00:05.9
MRC 255*2 reservoir | 55.31±1.29 | 02:03.0 | 00:09.0
MRC 100*3 reservoir | 56.43±0.83 | 01:15.7 | 00:05.9
MRC 170*3 reservoir | 55.58±0.33 | 01:58.2 | 00:09.1

Table 5. Comparison of SLR performance using normalization with the nose as a reference point.
Method                              Acc ± SD (%)    Avg. Training Time (mm:ss.ms)    Avg. Inference Time (mm:ss.ms)
BiGRU                               50.36 ± 1.41    33:54.1                          00:00.1
Conv1D+BiGRU                        46.59 ± 2.53    35:28.1                          00:00.1
ESN 300 reservoir                   58.64 ± 1.56    00:58.8                          00:05.1
ESN 510 reservoir                   58.29 ± 1.21    02:23.1                          00:09.9
grouped ESN [34] 150*2 reservoir    58.45 ± 1.31    00:52.6                          00:05.5
grouped ESN [34] 255*2 reservoir    58.26 ± 0.84    02:06.1                          00:09.1
grouped ESN [34] 100*3 reservoir    58.68 ± 1.13    00:53.5                          00:05.2
grouped ESN [34] 170*3 reservoir    58.45 ± 0.73    01:59.5                          00:09.4
MRC 150*2 reservoir                 59.42 ± 1.09    00:53.4                          00:04.9
MRC 255*2 reservoir                 59.42 ± 1.27    02:04.1                          00:09.0
MRC 100*3 reservoir                 60.35 ± 1.52    00:52.7                          00:05.2
MRC 170*3 reservoir                 58.37 ± 1.18    02:01.8                          00:09.4
https://doi.org/10.1371/journal.pone.0322717.t005

Among the various configurations tested, the MRC exhibited the highest accuracy with 300 reservoir nodes (three parallel reservoirs of 100 nodes each), achieving a notable 60.35% coupled with a commendably low SD of 1.52%, as detailed in Table 5. Notably, MRC exhibited superior accuracy compared with its deep learning counterparts, particularly BiGRU and Conv1D+BiGRU. Upon comparing MRC's accuracy against ESN and grouped ESN, MRC consistently demonstrated superior performance, as exemplified by the 300- and 510-node configurations. For example, the MRC 100*3 configuration achieved an accuracy that was 1.71 points higher than that of ESN 300, 1.9 points higher than that of grouped ESN 150*2, and 1.67 points higher than that of grouped ESN 100*3. However, in one instance, the MRC 170*3 configuration did not outperform the grouped ESN 170*3 configuration, although it did exceed both the grouped ESN 255*2 and ESN 510 configurations. Overall, the arrangement of 300 reservoir nodes beats 510 nodes under an identical approach. This emphasizes the importance of selecting the number of reservoir nodes for an ESN-based model: larger reservoir sizes do not necessarily guarantee superior performance, and having too many nodes can negatively impact the ability of the model to effectively distinguish between features.

Fig 13. Illustration of (a) a single frame from the "accident" sign, (b) a plot of the "accident" keypoints without normalization, (c) a plot of the "accident" keypoints after normalization, (d) a single frame from the "apple" sign, (e) a plot of the "apple" keypoints without normalization, and (f) a plot of the "apple" keypoints after normalization. https://doi.org/10.1371/journal.pone.0322717.g013

Significant discrepancies in the training times in Table 5 were observed between the ESN, MRC, and grouped ESN approaches and the deep learning methods. The BiGRU and Conv1D+BiGRU models took 33:54.1 and 35:28.1 minutes, respectively, whereas the fastest ESN-based models, such as MRC 100*3, completed training in 52.7 seconds. This demonstrates the advantage of the ESN-based methods in terms of computational efficiency during training. Notably, the ESN, MRC, and grouped ESN exhibited comparable training times when equivalent reservoir sizes were employed. For example, ESN 510 finished training in 2:23.1 minutes, grouped ESN 255*2 required 2:06.1 minutes, and MRC 170*3 took 2:01.8 minutes, indicating that the parallel reservoirs did not increase the training time.

Furthermore, all algorithms, including the deep learning models, achieved remarkably fast processing times, demonstrating their potential for real-time SLR applications. Both BiGRU and Conv1D+BiGRU had negligible inference times of 00:00.1 s, while the ESN-based models such as MRC 100*3 had slightly greater but still efficient inference times of 00:05.2 s. The inference times across the ESN, grouped ESN, and MRC were all less than 10 s.

Overall, MRC 100*3 demonstrated the best balance between performance and computational efficiency, attaining the highest accuracy with a minimal training period and rapid inference time. These findings render MRC ideal for tasks that necessitate rapid model updates and real-time recognition.
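The short training times reported above reflect the fact that, in the ESN-based models, only the linear readout is trained while the reservoir weights stay fixed. As an illustration (not a reproduction of the paper's Eq (18)), the sketch below trains such a readout with ridge regression on already-collected reservoir states; the shapes and the regularization strength are assumed values.

```python
import numpy as np

def train_readout(states: np.ndarray, labels: np.ndarray, ridge: float = 1e-6) -> np.ndarray:
    """Train a linear readout W_out by ridge regression.

    states: (N, R) matrix of collected reservoir states (one row per sample).
    labels: (N, C) one-hot matrix of class labels.
    Returns W_out of shape (R, C) such that states @ W_out approximates labels.
    """
    R = states.shape[1]
    # Closed-form ridge solution: (S^T S + ridge * I)^{-1} S^T Y
    gram = states.T @ states + ridge * np.eye(R)
    return np.linalg.solve(gram, states.T @ labels)

def predict(states: np.ndarray, w_out: np.ndarray) -> np.ndarray:
    """Predict class indices from reservoir states."""
    return np.argmax(states @ w_out, axis=1)

# Illustrative usage: 500 samples, 300 reservoir units, 100 classes (WLASL100).
rng = np.random.default_rng(0)
S = rng.standard_normal((500, 300))
Y = np.eye(100)[rng.integers(0, 100, size=500)]
W_out = train_readout(S, Y)
pred = predict(S, W_out)
```

Because this step is a single linear solve, retraining the readout on new data is cheap, which is what makes rapid model updates practical.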
In the final scenario, a comparative analysis was performed between our proposed method and existing algorithms, all of which use deep learning. Table 6 presents a comprehensive overview of the recognition performance, where accuracy (Acc) serves as the metric for evaluating correctness in dataset recognition, considering top-k accuracy (top-1, top-5, and top-10). The average training time is reported in hours, minutes, seconds, and milliseconds (hh:mm:ss.ms), whereas the inference time is recorded in minutes, seconds, and milliseconds (mm:ss.ms). Additionally, the "Device" column indicates whether the program was executed on a GPU or a CPU. The analysis was conducted using the available code from Li et al. [9].

Table 6. Accuracy comparison of different approaches on WLASL100.

Method           Acc top-1 (%)   Acc top-5 (%)   Acc top-10 (%)   Training Time (hh:mm:ss.ms)   Inference Time (mm:ss.ms)   Device
Pose-TGCN [9]    55.43           78.68           87.60            00:38:18.9                    00:04.2                     GPU
Pose-GRU [9]     46.51           76.74           85.66            -                             -                           -
I3D [9]          65.89           84.11           89.92            20:13:42.5                    00:12.5                     GPU
MOPGRU [11]      63.18           -               -                -                             -                           -
MRC              60.35           84.65           91.51            00:00:52.7                    00:05.2                     CPU
https://doi.org/10.1371/journal.pone.0322717.t006

The results demonstrate that MRC achieves competitive performance while significantly reducing training time. Although I3D attains the highest top-1 accuracy, it comes at the cost of prolonged training and inference times, making it computationally expensive. By contrast, MRC achieves the best top-5 and top-10 accuracies while training in less than one minute, highlighting its efficiency. Additionally, MRC is the only approach that operates entirely on a CPU, making it more accessible than GPU-dependent models. Pose-TGCN achieves solid performance but is slightly outperformed by MRC in terms of the top-5 and top-10 accuracies. Pose-GRU exhibits lower accuracy than the other methods, whereas MOPGRU shows promising performance but lacks complete benchmarking data. These findings suggest that MRC provides a highly efficient and practical alternative for sign language recognition on the WLASL100 dataset.

The algorithms under scrutiny include Pose-TGCN, Pose-GRU, I3D, MOPGRU, and MRC. I3D achieved the highest top-1 accuracy, with a score of 65.89%, followed by MOPGRU, which achieved 63.18%. Our proposed MRC secured the third-highest accuracy, reaching 60.35%, which surpassed the performance of both Pose-TGCN (55.43%) and Pose-GRU (46.51%). Furthermore, MRC achieved the best top-5 (84.65%) and top-10 (91.51%) accuracies, demonstrating its robustness in recognizing sign language variations. In particular, MRC achieved this competitive performance with a substantially shorter training time of 52.7 s and an inference time of 5.2 s while running on a CPU. This underscores the computational efficiency of MRC in comparison with GPU-dependent models such as I3D, which requires more than 20 hours of training, and it shows that MRC's competitive performance with respect to deep learning approaches is achieved at an efficient computational cost. Another key advantage of the MRC model is its ability to run on a CPU, as opposed to other models that require GPU acceleration. This enables MRC to be implemented in low-power and edge computing contexts while maintaining real-time performance.

Discussion

In the subsection presenting the experimental results, we presented a series of experiments, including a sensitivity analysis, normalization, and comparisons with state-of-the-art algorithms. A sensitivity analysis was performed to validate the hyperparameter suggestions from Optuna, and the results confirmed their correctness.
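For context, a leak-rate search of this kind can be set up with Optuna as sketched below. The helper build_and_evaluate_esn and the search range are hypothetical placeholders rather than the actual configuration used in this work.

```python
import optuna

def build_and_evaluate_esn(leak_rate: float) -> float:
    """Hypothetical helper: train an ESN with the given leak rate on the
    keypoint features and return its validation accuracy.
    Here a synthetic score is returned so the example runs; replace it with
    real training and evaluation."""
    return 1.0 - (leak_rate - 0.9) ** 2

def objective(trial: optuna.Trial) -> float:
    # Search the leak rate; other hyperparameters (spectral radius,
    # input scaling, ridge strength) could be added in the same way.
    leak_rate = trial.suggest_float("leak_rate", 0.1, 1.0)
    return build_and_evaluate_esn(leak_rate)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)  # e.g., {"leak_rate": 0.9}
```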
Given the inherent variability in the signer's position and posture across videos, we underscore the importance of normalization in SLR for enhancing accuracy. The primary objective of normalization is to mitigate discrepancies in keypoint positions, ensuring that they exist on comparable scales and thereby diminishing the impact of signer-specific variations in position and posture. These variations, which carry no distinctive information, can potentially affect the accuracy of SLR algorithms. In this study, normalization was centered on the nose as a reference point, given its relative stability. For comparison, we also applied normalization using the shoulders as reference points. However, the results showed that nose-based normalization outperformed the shoulder-based approach. This may be due to the inherent instability of the shoulder position compared with that of the nose: the nose is not affected by hand movement, and the signer's head is relatively stable. The experimental outcome revealed performance enhancements in all algorithms following keypoint normalization.

We posited that augmenting features and utilizing leak rates could enhance the efficacy of the ESN algorithm, a conjecture supported by the superior performance exhibited by MRC over ESN, grouped ESN, and various deep learning algorithms. Notably, the reservoir size in the ESN-based algorithms remained constant across the experiments. The principal distinction arose from the incorporation of distinct leak rates for each reservoir within the multireservoir structure of MRC. The leak rate governs the extent to which the prior state is retained, influencing the network's capacity to store information, as outlined in Eq (28). A higher leak rate implies a diminished impact from historical states, allowing the model to prioritize new inputs.

Our experimental results demonstrated that MRC consistently outperformed the other ESN-based models, especially when the reservoir size was set to 300 nodes. In one instance, ESN-based approaches with 510 reservoir nodes performed worse than those with 300 reservoir nodes. This discrepancy might stem from the increased difficulty of distinguishing a larger number of extracted features and from misaligned hyperparameter combinations. The performance of ESN-based algorithms is intricately tied to various hyperparameters, including the sparsity, the reservoir spectral radius, the input weight scaling, and the readout weight regularization. This highlights the importance of carefully tuning hyperparameters in ESN-based approaches to avoid reducing the model's ability to generalize.
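To illustrate this mechanism, the sketch below runs several parallel leaky-integrator reservoirs, each with its own leak rate, and concatenates their final states for the readout, in the spirit of the MRC described above. The update form follows the common leaky ESN convention and is assumed here for illustration (the exact formulation used in this work is Eq (28)); the input dimensionality, weight initialization, and seeds are arbitrary.

```python
import numpy as np

class LeakyReservoir:
    """A single leaky-integrator echo state reservoir (illustrative)."""

    def __init__(self, n_inputs: int, n_nodes: int, leak_rate: float,
                 spectral_radius: float = 0.9, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.leak = leak_rate
        self.w_in = rng.uniform(-0.1, 0.1, size=(n_nodes, n_inputs))
        w = rng.standard_normal((n_nodes, n_nodes))
        # Rescale the recurrent weights to the desired spectral radius.
        w *= spectral_radius / max(abs(np.linalg.eigvals(w)))
        self.w = w
        self.state = np.zeros(n_nodes)

    def step(self, u: np.ndarray) -> np.ndarray:
        # Leaky update: a high leak rate favors the new input,
        # a low leak rate retains more of the previous state.
        pre = np.tanh(self.w_in @ u + self.w @ self.state)
        self.state = (1.0 - self.leak) * self.state + self.leak * pre
        return self.state

def mrc_states(sequence: np.ndarray, reservoirs: list) -> np.ndarray:
    """Run a keypoint sequence through parallel reservoirs and return the
    concatenated final states, which feed the linear readout."""
    for r in reservoirs:
        r.state = np.zeros_like(r.state)
    for u in sequence:
        parts = [r.step(u) for r in reservoirs]
    return np.concatenate(parts)

# Example: three 100-node reservoirs with distinct leak rates (cf. MRC 100*3).
reservoirs = [LeakyReservoir(n_inputs=150, n_nodes=100, leak_rate=a, seed=i)
              for i, a in enumerate([0.9, 0.8, 0.6])]
seq = np.random.rand(40, 150)           # 40 frames of 150 keypoint features
features = mrc_states(seq, reservoirs)  # shape (300,)
```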
All the MRC configurations and the other two ESN-based algorithms exhibit faster training times than their deep learning counterparts. This efficiency stems from the inherently simpler learning process embedded in ESN-based algorithms, as opposed to the deep learning algorithms' reliance on backpropagation. In the ESN-based paradigm, learning unfolds solely during the readout phase, employing Eq (18). Compared with the deep learning counterparts, the linear model underpinning the output layer contributes to the lower computational demands of ESN-based algorithms. The expeditious training time is significant for SLR because of its potential scalability, enabling the training of more extensive datasets within a reasonable timeframe. Moreover, the accelerated training process supports real-time applications by expediting the deployment and refinement of models.

The inference time of each algorithm remains consistently below 10 seconds. In general, the inference times of ESN-based algorithms with identical reservoir sizes should be uniform. However, in this research, slight disparities are observed, likely attributable to variations in computational resources, such as the available memory during program execution. Notably, the deep learning algorithms achieved shorter inference times than the ESN-based models, which is potentially attributable to the more efficient implementation of the deep learning frameworks compared with the ESN developed here. Upon scrutinizing the processing matrices of each layer in the ESN-based algorithms and the deep learning models, as depicted in Figs 6(b), 8, and 9, a discernible difference emerges. Fig 6(b) illustrates the output matrix shape of the ESN-based algorithm in this study, specifically the MRC with 510 nodes, which is equivalent to ESN 510 and the grouped ESN with 510 nodes. This figure provides insight into the ESN-based algorithm's streamlined process for predicting labels compared with the more intricate nature of deep learning. For instance, Figs 8 and 9 show the complexity of the deep learning approach, which comprises three layers of bidirectional gated recurrent units (BiGRU1, BiGRU2, and BiGRU3) housing numerous BiGRU cells. Each BiGRU cell, in turn, encompasses four independently functioning gates, operating both forward and backward. The finding that the ESN processes fewer matrices than deep learning underscores the former's efficiency in demanding fewer computational resources than its deep learning counterparts.

A comparison was conducted between the proposed method and other approaches, including Pose-TGCN, Pose-GRU, I3D, and MOPGRU. All of the comparison algorithms employ a deep learning architecture to develop the SLR system; Pose-TGCN, Pose-GRU, and I3D utilize 2D keypoints extracted by OpenPose [38], whereas MOPGRU employs MediaPipe, which is similar to the proposed method. In addition, I3D combines spatial and temporal features. The proposed method demonstrated performance comparable to the deep learning approaches, achieving 60.35% and outperforming Pose-TGCN and Pose-GRU. I3D achieved the highest top-1 accuracy at 65.89% because of its high model capacity owing to its large number of parameters. Therefore, I3D relies on extensive GPU training, and its training time exceeds 20 h on a GPU, which limits its practical applicability in the context of on-device training. By contrast, MRC achieved the best accuracies in the top-5 (84.65%) and top-10 (91.51%) metrics, demonstrating its ability to capture the features of sign language.
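For reference, the top-k accuracies reported in Table 6 can be computed directly from a model's class scores. A minimal sketch, assuming an illustrative score matrix and integer labels:

```python
import numpy as np

def top_k_accuracy(scores: np.ndarray, labels: np.ndarray, k: int) -> float:
    """scores: (N, C) class scores; labels: (N,) true class indices."""
    # Indices of the k highest-scoring classes for each sample.
    top_k = np.argsort(scores, axis=1)[:, -k:]
    hits = np.any(top_k == labels[:, None], axis=1)
    return float(hits.mean())

# Illustrative usage with random scores over 100 classes (WLASL100).
rng = np.random.default_rng(0)
scores = rng.standard_normal((200, 100))
labels = rng.integers(0, 100, size=200)
for k in (1, 5, 10):
    print(f"top-{k}: {top_k_accuracy(scores, labels, k):.4f}")
```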
Our algorithm leveraged the dynamics in the reservoir layer to represent the input features, and the multiple-reservoir model enabled the extraction of features with greater variation than a standard reservoir. The proposed method, MRC, has been shown to achieve a balance between efficiency and accuracy: it achieved a top-1 accuracy of 60.35% with a training time of 52.7 s on a CPU, substantiating its feasibility for edge computing.

Conclusions

In this study, we explored the performance of ESNs through the standard ESN, MRC, and grouped ESN approaches. The findings of this study indicate that the proposed MRC method, which incorporates various leak rates, enhances feature representation, enabling the network to acquire a more profound understanding than the standard ESN. Consequently, it demonstrates competitive performance when juxtaposed with deep learning approaches, achieving 60.35% top-1 accuracy, 84.65% top-5 accuracy, and 91.51% top-10 accuracy. Moreover, MRC has efficiency advantages, requiring less training time and fewer resources than deep learning does, which is attributed to its streamlined processes and reduced number of matrix computations within the ESN. This implies the feasibility of deploying RC on portable devices with constrained computational resources, such as limited RAM and processors.

Although the results are promising, they fall short of state-of-the-art benchmarks. Future research efforts will focus on refining accuracy by employing a modified ESN in conjunction with other machine learning methods. Additionally, we aim to implement multiple reservoirs on embedded hardware, such as field-programmable gate arrays (FPGAs) [39–41], and to explore physical RC. This approach will empower users to carry the system portably and deploy it as needed.

Author contributions

Conceptualization: Arie Rachmad Syulistyo, Yuichiro Tanaka.
Data curation: Arie Rachmad Syulistyo.
Formal analysis: Arie Rachmad Syulistyo.
Funding acquisition: Yuichiro Tanaka, Dinda Pramanta, Hakaru Tamukoh.
Methodology: Arie Rachmad Syulistyo, Yuichiro Tanaka.
Software: Arie Rachmad Syulistyo.
Supervision: Yuichiro Tanaka, Hakaru Tamukoh.
Validation: Arie Rachmad Syulistyo.
Visualization: Arie Rachmad Syulistyo.
Writing – original draft: Arie Rachmad Syulistyo.
Writing – review & editing: Arie Rachmad Syulistyo, Yuichiro Tanaka, Dinda Pramanta, Ninnart Fuengfusin, Hakaru Tamukoh.

References

1. Shah F, Shah MS, Akram W, Manzoor A, Mahmoud RO, Abdelminaam DS. Sign language recognition using multiple kernel learning: a case study of Pakistan sign language. IEEE Access. 2021;9:67548–58. https://doi.org/10.1109/access.2021.3077386
2. WHO. Deafness and hearing loss. https://www.who.int/news-room/fact-sheets/detail/deafness-and-hearing-loss. 2024.
3. Kamal SM, Chen Y, Li S, Shi X, Zheng J. Technical approaches to Chinese sign language processing: a review. IEEE Access. 2019;7:96926–35. https://doi.org/10.1109/access.2019.2929174
4. Natarajan B, Rajalakshmi E, Elakkiya R, Kotecha K, Abraham A, Gabralla LA, et al. Development of an end-to-end deep learning framework for sign language recognition, translation, and video generation. IEEE Access. 2022;10:104358–74. https://doi.org/10.1109/access.2022.3210543
5. Al-Qurishi M, Khalid T, Souissi R. Deep learning for sign language recognition: current techniques, benchmarks, and open issues. IEEE Access. 2021;9:126917–51. https://doi.org/10.1109/access.2021.3110912
6. Bilge YC, Cinbis RG, Ikizler-Cinbis N. Towards zero-shot sign language recognition. IEEE Trans Pattern Anal Mach Intell. 2023;45(1):1217–32.
https://doi.org/10.1109/TPAMI.2022.3143074 PMID:
7. Hua H, Li Y, Wang T, Dong N, Li W, Cao J. Edge computing with artificial intelligence: a machine learning perspective. ACM Comput Surv. 2023;55(9):1–35. https://doi.org/10.1145/3555802
8. Rastgoo R, Kiani K, Escalera S. Sign language recognition: a deep survey. Expert Syst Appl. 2021;164:113794. https://doi.org/10.1016/j.eswa.2020.113794
9. Li D, Rodriguez C, Yu X, Li H. Word-level deep sign language recognition from video: a new large-scale dataset and methods comparison. In: Proceedings of the IEEE Winter Conference on Applications of Computer Vision. 2020, pp. 1459–69.
10. Subramanian B, Olimov B, Kim J. Fast convergence GRU model for sign language recognition. J Korea Multimedia Soc. 2022;25(9):1257–65.
11. Subramanian B, Olimov B, Naik SM, Kim S, Park K-H, Kim J. An integrated MediaPipe-optimized GRU model for Indian sign language recognition. Sci Rep. 2022;12(1):11964. https://doi.org/10.1038/s41598-022-15998-7 PMID: 35831393
12. Lugaresi C, Tang J, Nash H, McClanahan C, Uboweja E, Hays M, et al. MediaPipe: a framework for perceiving and processing reality. In: Third Workshop on Computer Vision for AR/VR at IEEE Computer Vision and Pattern Recognition (CVPR) 2019; 2019. Available from: https://mixedreality.cs.cornell.edu/s/NewTitle_May1_MediaPipe_CVPR_CV4ARVR_Workshop_2019.pdf
13. Abdelsattar M, Abdelmoety A, Ismeil MA, Emad-Eldeen A. Automated defect detection in solar cell images using deep learning algorithms. IEEE Access. 2025;13:4136–57. https://doi.org/10.1109/access.2024.3525183
14. Abdelsattar M, A Ismeil M, Menoufi K, AbdelMoety A, Emad-Eldeen A. Evaluating machine learning and deep learning models for predicting wind turbine power output from environmental factors. PLoS One. 2025;20(1):e0317619. https://doi.org/10.1371/journal.pone.0317619 PMID: 39847588
15. Lukoševičius M, Jaeger H. Reservoir computing approaches to recurrent neural network training. Comput Sci Rev. 2009;3(3):127–49. https://doi.org/10.1016/j.cosrev.2009.03.005
16. Maass W, Natschläger T, Markram H. Real-time computing without stable states: a new framework for neural computation based on perturbations. Neural Comput. 2002;14(11):2531–60. https://doi.org/10.1162/089976602760407955 PMID: 12433288
17. Jaeger H. The "echo state" approach to analysing and training recurrent neural networks. GMD Report 148. 2001. Available from: https://api.semanticscholar.org/CorpusID:15467150
18. Tanaka Y, Tamukoh H. Reservoir-based convolution. NOLTA. 2022;13(2):397–402. https://doi.org/10.1587/nolta.13.397
19. Tanaka G, Yamane T, Héroux JB, Nakane R, Kanazawa N, Takeda S, et al. Recent advances in physical reservoir computing: a review. Neural Netw. 2019;115:100–23. https://doi.org/10.1016/j.neunet.2019.03.005 PMID: 30981085
20. Kawashima I, Katori Y, Morie T, Tamukoh H. An area-efficient multiply-accumulation architecture and implementations for time-domain neural processing. In: 2021 International Conference on Field-Programmable Technology (ICFPT). IEEE; 2021, pp. 1–4. https://doi.org/10.1109/icfpt52863.2021.9609809
21. Usami Y, van de Ven B, Mathew DG, Chen T, Kotooka T, Kawashima Y, et al. In-materio reservoir computing in a sulfonated polyaniline network. Adv Mater. 2021;33(48):e2102688. https://doi.org/10.1002/adma.202102688 PMID: 34533867
22. Honda K, Tamukoh H. A hardware-oriented echo state network and its FPGA implementation. JRNAL. 2020;7(1):58. https://doi.org/10.2991/jrnal.k.200512.012
23. Bianchi FM, Scardapane S, Lokse S, Jenssen R. Reservoir computing approaches for representation and classification of multivariate time series. IEEE Trans Neural Netw Learn Syst. 2021;32(5):2169–79. https://doi.org/10.1109/TNNLS.2020.3001377 PMID: 32598284
24. Yasumuro M, Jin'no K. Japanese fingerspelling identification by using MediaPipe. NOLTA. 2022;13(2):288–93. https://doi.org/10.1587/nolta.13.288
25. Bajaj Y, Malhotra P. American sign language identification using hand trackpoint analysis. In: International Conference on Innovative Computing and Communications. Singapore: Springer; 2022, pp. 159–71.
26. Attia NF, Ahmed MTFS, Alshewimy MAM. Efficient deep learning models based on tension techniques for sign language recognition. Intell Syst Appl. 2023;20:200284. https://doi.org/10.1016/j.iswa.2023.200284
27. Takayama N, Benitez-Garcia G, Takahashi H. Masked batch normalization to improve tracking-based sign language recognition using graph convolutional networks. In: 2021 16th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2021). 2021, pp. 1–5.
28. Luqman H. An efficient two-stream network for isolated sign language recognition using accumulative video motion. IEEE Access. 2022;10:93785–98. https://doi.org/10.1109/access.2022.3204110
29. Samaan GH, Wadie AR, Attia AK, Asaad AM, Kamel AE, Slim SO, et al. MediaPipe's landmarks with RNN for dynamic sign language recognition. Electronics. 2022;11(19):3228. https://doi.org/10.3390/electronics11193228
30. Lukoševičius M, Jaeger H, Schrauwen B. Reservoir computing trends. Künstl Intell. 2012;26(4):365–71. https://doi.org/10.1007/s13218-012-0204-5
31. Martinuzzi F, Rackauckas C, Abdelrehim A, Mahecha M, Mora K. ReservoirComputing.jl: an efficient and modular library for reservoir computing models. J Mach Learn Res. 2022;23(288):1–8.
32. Maass W, Markram H. On the computational power of circuits of spiking neurons. J Comput Syst Sci. 2004;69(4):593–616. https://doi.org/10.1016/j.jcss.2004.04.001
33. Ma Q, Shen L, Cottrell GW. DeePr-ESN: a deep projection-encoding echo-state network. Inform Sci. 2020;511:152–71. https://doi.org/10.1016/j.ins.2019.09.049
34. Li Z, Tanaka G. Multi-reservoir echo state networks with sequence resampling for nonlinear time-series prediction. Neurocomputing. 2022;467:115–29. https://doi.org/10.1016/j.neucom.2021.08.122
35. Gallicchio C, Micheli A, Pedrelli L. Deep reservoir computing: a critical experimental analysis. Neurocomputing. 2017;268:87–99. https://doi.org/10.1016/j.neucom.2016.12.089
36. Li Z, Liu Y, Tanaka G. Multi-reservoir echo state networks with Hodrick–Prescott filter for nonlinear time-series prediction. Appl Soft Comput. 2023;135:110021. https://doi.org/10.1016/j.asoc.2023.110021
37. Akiba T, Sano S, Yanase T, Ohta T, Koyama M. Optuna. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM Press; 2019, pp. 2623–31. https://doi.org/10.1145/3292500.3330701
38. Cao Z, Hidalgo G, Simon T, Wei S-E, Sheikh Y. OpenPose: realtime multi-person 2D pose estimation using part affinity fields. IEEE Trans Pattern Anal Mach Intell. 2021;43(1):172–86. https://doi.org/10.1109/TPAMI.2019.2929257 PMID: 31331883
39. Tanaka Y, Morie T, Tamukoh H. An amygdala-inspired classical conditioning model implemented on an FPGA for home service robots. IEEE Access. 2020;8:212066–78. https://doi.org/10.1109/access.2020.3038161
40. Yoshioka K, Tanaka Y, Tamukoh H. LUTNet-RC: look-up tables networks for reservoir computing on an FPGA. In: 2023 International Conference on Field Programmable Technology (ICFPT). 2023, pp. 170–8.
41. Yoshioka K, Katori Y, Tanaka Y, Nomura O, Morie T, Tamukoh H. FPGA implementation of a chaotic Boltzmann machine annealer. In: 2023 International Joint Conference on Neural Networks (IJCNN). IEEE; 2023, pp. 1–8.