Simple Linear Regression
Covariance: the expectation of the product of the two variables' deviations from their means.
Put simply, it measures the direction and strength of the relationship between Y and X.
X: the predictor variable
Y: the response variable
The sample covariance of Y and X [it measures how the two variables deviate from their means together]:
Cov(X, Y) = Σ(x_i − x̄)(y_i − ȳ) / (n − 1)
Note: we divide by n − 1 rather than n because this gives an unbiased estimate of the population covariance even from a small sample (the statistical "unbiased estimator"); the variance is simply the square of the standard deviation.
If the result is positive, the two variables are positively correlated (the correlation coefficient, defined below, is derived from the covariance);
if the result is negative, they are negatively correlated;
if it is 0, they are uncorrelated (note that zero covariance does not by itself guarantee statistical independence).
Why does the sign behave this way? Center the axes at (x̄, ȳ):
if most points fall in the 1st and 3rd quadrants, the summed products are positive, so Y increases as X increases;
if most points fall in the 2nd and 4th quadrants, the summed products are negative, so Y decreases as X increases.
Cov(Y, X) is affected by the units of measurement,
so we introduce the correlation coefficient:
r = Corr(X, Y) = Cov(X, Y) / (s_X · s_Y), where s_X and s_Y are the sample standard deviations of X and Y.
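As a quick check of these definitions, here is a small sketch (mine, not from the original post; the data values are made up for illustration) showing that NumPy's np.cov and np.corrcoef match the formulas above, and that rescaling X changes the covariance but leaves the correlation unchanged:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical predictor values
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # hypothetical responses

# Sample covariance (divides by n - 1, matching the note above).
print(np.cov(x, y)[0, 1])               # about 4.9

# Correlation coefficient: the covariance of the standardized variables.
print(np.corrcoef(x, y)[0, 1])          # close to 1 for this nearly linear data

# Covariance depends on units; the correlation coefficient does not.
x_scaled = 100 * x                       # the same X measured in different units
print(np.cov(x_scaled, y)[0, 1])         # 100 times larger
print(np.corrcoef(x_scaled, y)[0, 1])    # unchanged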
In Python it is computed with the algebraically equivalent single-pass formula below [the definitional formula above is awkward to program because it needs several passes over the data; be aware that the single-pass form can amplify small rounding errors]:
r = [Σx_i·y_i − (Σx_i)(Σy_i)/n] / ( √(Σx_i² − (Σx_i)²/n) · √(Σy_i² − (Σy_i)²/n) )
# coding: utf-8
"""The correlation coefficient of Y and X is the covariance of the standardized variables."""
__author__ = 'similarface'  # QQ: 841196883@qq.com

import os
from math import sqrt

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt  # if pip install fails, install from the GitHub source
from sklearn import linear_model


def load_data_in_tab(filename):
    """Load a tab-separated data file into a DataFrame."""
    if os.path.isfile(filename):
        try:
            return pd.read_table(filename)
        except Exception as e:
            print(e)
    else:
        print("File does not exist!")
    return None


def pearson(rating1, rating2):
    """Pearson product-moment correlation coefficient (PPMCC, usually written r).

    Measures the linear correlation between two variables; the value lies in [-1, 1].
    Rough interpretation of |r|:
        0.8-1.0 very strong, 0.6-0.8 strong, 0.4-0.6 moderate,
        0.2-0.4 weak, 0.0-0.2 very weak or none.
    """
    sum_xy = sum_x = sum_y = sum_x2 = sum_y2 = n = 0
    for i in range(len(rating1)):
        n += 1
        x, y = rating1[i], rating2[i]
        sum_xy += x * y
        sum_x += x
        sum_y += y
        sum_x2 += x ** 2
        sum_y2 += y ** 2
    if n == 0:
        return 0
    denominator = sqrt(sum_x2 - (sum_x ** 2) / n) * sqrt(sum_y2 - (sum_y ** 2) / n)
    if denominator == 0:
        return 0
    return (sum_xy - (sum_x * sum_y) / n) / denominator


data = load_data_in_tab('./AnscombeQuartet')

# The four Anscombe data sets:
#   (X1, Y1) genuinely linear;             (X2, Y2) a non-linear, quadratic relationship;
#   (X3, Y3) linear except for an outlier; (X4, Y4) one extreme point determines the line.
groups = [('X1', 'Y1', 'black', 6), ('X2', 'Y2', 'red', 4),
          ('X3', 'Y3', 'blue', 2), ('X4', 'Y4', 'green', 1)]

for xcol, ycol, color, width in groups:
    x_test = data[xcol]
    y_test = data[ycol]
    plt.scatter(x_test, y_test, color=color)
    print("Pearson correlation of the {} points: {}".format(color, pearson(x_test, y_test)))
    regr = linear_model.LinearRegression()
    x_train = x_test.values[:, np.newaxis]
    y_train = y_test.values[:, np.newaxis]
    regr.fit(x_train, y_train)
    plt.plot(x_train, regr.predict(x_train), color=color, linewidth=width)

plt.xticks(())
plt.yticks(())
plt.show()

# Running a linear regression on all four groups produces fitted lines with
# (almost) the same slope and intercept, and the four correlation coefficients
# are nearly identical. Look at the scatter plot before interpreting a
# correlation coefficient: relying on the coefficient alone, you cannot tell
# these very different data structures apart.
Reference data (Anscombe's quartet):
Y1      X1    Y2      X2    Y3      X3    Y4      X4
8.04    10    9.14    10    7.46    10    6.58    8
6.95    8     8.14    8     6.77    8     5.76    8
7.58    13    8.74    13    12.74   13    7.71    8
8.81    9     8.77    9     7.11    9     8.84    8
8.33    11    9.26    11    7.81    11    8.47    8
9.96    14    8.1     14    8.84    14    7.04    8
7.24    6     6.13    6     6.08    6     5.25    8
4.26    4     3.1     4     5.39    4     12.5    19
10.84   12    9.13    12    8.15    12    5.56    8
4.82    7     7.26    7     6.42    7     7.91    8
5.68    5     4.74    5     5.73    5     6.89    8
Example: computer repair
# coding: utf-8
"""Example: computer repair.

Data columns: repair time (Minutes) and number of replaced parts (Units), one
observation per row.
"""
__author__ = 'similarface'

import os
from math import sqrt

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt  # if pip install fails, install from the GitHub source
from sklearn import linear_model


def load_data_in_tab(filename):
    """Load a tab-separated data file into a DataFrame."""
    if os.path.isfile(filename):
        try:
            return pd.read_table(filename)
        except Exception as e:
            print(e)
    else:
        print("File does not exist!")
    return None


def pearson(rating1, rating2):
    """Pearson correlation coefficient of Y and X (single-pass form)."""
    sum_xy = sum_x = sum_y = sum_x2 = sum_y2 = n = 0
    for i in range(len(rating1)):
        n += 1
        x, y = rating1[i], rating2[i]
        sum_xy += x * y
        sum_x += x
        sum_y += y
        sum_x2 += x ** 2
        sum_y2 += y ** 2
    if n == 0:
        return 0
    denominator = sqrt(sum_x2 - (sum_x ** 2) / n) * sqrt(sum_y2 - (sum_y ** 2) / n)
    if denominator == 0:
        return 0
    return (sum_xy - (sum_x * sum_y) / n) / denominator


def get_covariance(x_list, y_list):
    """Sample covariance: sum of centered products divided by n - 1."""
    average_x = sum(x_list) / float(len(x_list))
    average_y = sum(y_list) / float(len(y_list))
    sum_all = 0.0
    for i in range(len(x_list)):
        sum_all += (x_list[i] - average_x) * (y_list[i] - average_y)
    return sum_all / (len(x_list) - 1)


computer_repair_data = load_data_in_tab('./data/computerrepair.data')
minutes = computer_repair_data['Minutes']
units = computer_repair_data['Units']
print("Covariance of this data set: {}".format(get_covariance(minutes, units)))
print("Correlation coefficient of this data set: {}".format(pearson(minutes, units)))

# Scatter plot of repair time vs. number of units.
plt.scatter(minutes, units, color='black')
regr = linear_model.LinearRegression()
minutes_train = minutes.values[:, np.newaxis]
units_train = units.values[:, np.newaxis]
regr.fit(minutes_train, units_train)

print("Summary")
print("==================================================")
print("Slope:     {}".format(regr.coef_))
print("Intercept: {}".format(regr.intercept_))
print("Fitted line: y = {}x + {}".format(regr.coef_[0][0], regr.intercept_[0]))
print("==================================================")

plt.plot(minutes_train, regr.predict(minutes_train), color='red', linewidth=1)
plt.xticks(())
plt.yticks(())
plt.show()
Result:
Covariance of this data set: 136.0
Correlation coefficient of this data set: 0.993687243916
Summary
==================================================
Slope:     [[ 0.06366959]]
Intercept: [ 1.43215942]
Fitted line: y = 0.0636695930877x + 1.43215942092
==================================================
=============== Parameter estimation =================
How were the slope and intercept above computed?
The simple linear regression model:
Y = β0 + β1·X + ε
β0: the constant term (intercept)
β1: the regression coefficient (slope)
ε: the random disturbance, or error [the effect on Y of random factors other than the linear relationship between X and Y]
A single observation:
y_i = β0 + β1·x_i + ε_i, where y_i is the i-th observed value of the response variable and x_i is the i-th value of the predictor X.
The error for observation i is e_i = y_i − (β0 + β1·x_i).
Sum of squared vertical distances [the sum over all observations of (observed response − predicted response)², the least-squares criterion]:
S(β0, β1) = Σ (y_i − β0 − β1·x_i)²
The values of β0 and β1 that minimize this expression are the least-squares estimates.
Setting the partial derivatives of S with respect to β0 and β1 to zero and simplifying gives the closed-form solution:
β̂1 = Σ(x_i − x̄)(y_i − ȳ) / Σ(x_i − x̄)²   [= Cov(X, Y) / Var(X)]
β̂0 = ȳ − β̂1·x̄
A worked least-squares example:
(x, y): (1, 6), (2, 5), (3, 7), (4, 10)
Assume the model y = β1 + β2·x. Plugging in each point gives the (inconsistent) system:
β1 + 1·β2 = 6
β1 + 2·β2 = 5
β1 + 3·β2 = 7
β1 + 4·β2 = 10
The sum of squared vertical distances is
S(β1, β2) = (6 − β1 − β2)² + (5 − β1 − 2β2)² + (7 − β1 − 3β2)² + (10 − β1 − 4β2)²
Simplifying: set both partial derivatives to zero (note: we use partial derivatives here; on this bowl-shaped convex surface the minimum is the point where the tangent slopes are zero), which gives the normal equations
∂S/∂β1 = 0  ⇒  4β1 + 10β2 = 28
∂S/∂β2 = 0  ⇒  10β1 + 30β2 = 77
This reduces the problem to a two-dimensional one:
solving this system of two linear equations in two unknowns gives β1 = 3.5 and β2 = 1.4.
Using the closed-form formula instead (with x̄ = 2.5, ȳ = 7):
β2 = [(1 − 2.5)(6 − 7) + (2 − 2.5)(5 − 7) + (3 − 2.5)(7 − 7) + (4 − 2.5)(10 − 7)] / [(1 − 2.5)² + (2 − 2.5)² + (3 − 2.5)² + (4 − 2.5)²] = 7/5 = 1.4
β1 = ȳ − β2·x̄ = 7 − 1.4 × 2.5 = 3.5
Computing it in Python: SciPy provides scipy.sparse.linalg.lsqr, an implementation of the LSQR algorithm of Paige and Saunders for sparse linear equations and sparse least-squares problems. It iteratively finds the least-squares solution of A·x = b (optionally damped, min ||Ax − b||² + d²·||x||²) for a square or rectangular matrix A of any rank, and returns the solution together with stopping diagnostics such as the residual norms and an estimate of cond(A). References: C. C. Paige and M. A. Saunders, "LSQR: An algorithm for sparse linear equations and sparse least squares", ACM TOMS 8(1), 43-71 (1982); "Algorithm 583. LSQR: Sparse linear equations and least squares problems", ACM TOMS 8(2), 195-209 (1982); M. A. Saunders, "Solution of sparse rectangular systems using LSQR and CRAIG", BIT 35, 588-604 (1995).
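A minimal sketch (mine, not from the original post) of using that routine on the worked example above, together with the closed-form formula; the design matrix A stacks a column of ones (for β1) against the x values (for β2):

import numpy as np
from scipy.sparse.linalg import lsqr

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([6.0, 5.0, 7.0, 10.0])

# Design matrix: a column of ones for beta1 and a column of x values for beta2.
A = np.column_stack([np.ones_like(x), x])

# lsqr returns a tuple; the first element is the least-squares solution.
beta = lsqr(A, y)[0]
print(beta)                 # approximately [3.5, 1.4]

# The same result from the closed-form formulas above.
beta2 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta1 = y.mean() - beta2 * x.mean()
print(beta1, beta2)         # 3.5, 1.4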
================================
Checking the assumptions:
The example above produced a fitted linear relationship, but how do we judge whether the assumptions we set up at the start should stand or be rejected? They are, after all, only assumptions: if I assume I will have a hundred million within a week, reality deviates so far from that assumption that it has to be rejected.
Back to the point:
We started from the model Y = β0 + β1·X + ε.
To test hypotheses about it we need an additional assumption:
for every fixed value of X, all the errors ε_i are mutually independent and follow a normal distribution with mean 0 and variance σ².
This gives the estimate of σ²:
σ̂² = SSE / (n − 2), where SSE = Σ(y_i − ŷ_i)² is the residual sum of squares and n − 2 is the degrees of freedom [degrees of freedom = number of observations − number of regression parameters to be estimated].
Substituting and simplifying, the standard errors of the estimates are:
s.e.(β̂1) = σ̂ / √(Σ(x_i − x̄)²)
s.e.(β̂0) = σ̂ · √(1/n + x̄² / Σ(x_i − x̄)²)
The standard error of β̂1 describes how precisely the slope is estimated: the smaller the standard error, the higher the precision.
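To make these quantities concrete, here is a small sketch (mine, applied to the four-point worked example above rather than to the original data files) computing SSE, σ̂ and the two standard errors directly:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([6.0, 5.0, 7.0, 10.0])

# Least-squares estimates from the closed-form formulas.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)  # slope, 1.4
b0 = y.mean() - b1 * x.mean()                                               # intercept, 3.5

n = len(x)
residuals = y - (b0 + b1 * x)
sse = np.sum(residuals ** 2)           # residual sum of squares, SSE = 4.2
sigma_hat = np.sqrt(sse / (n - 2))     # sigma estimate with n - 2 degrees of freedom

se_b1 = sigma_hat / np.sqrt(np.sum((x - x.mean()) ** 2))
se_b0 = sigma_hat * np.sqrt(1.0 / n + x.mean() ** 2 / np.sum((x - x.mean()) ** 2))

print(sse, sigma_hat)   # 4.2, about 1.449
print(se_b1, se_b0)     # about 0.648 and 1.775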