Simple Linear Regression
Covariance: the expectation of the product of the two variables' deviations from their means.
Put simply, it measures the direction and strength of the relationship between Y and X.
X: the predictor variable
Y: the response variable
The sample covariance of Y and X [it measures how the two variables deviate from their means together]:
Cov(X, Y) = Σ(x_i − x̄)(y_i − ȳ) / (n − 1)
Note: we divide by n − 1 rather than n because this gives an unbiased estimate of the population covariance even from a small sample (the statistical "unbiased estimator"); the variance is simply the square of the standard deviation.
If the result is positive, the two variables are positively correlated (the correlation coefficient, defined below, is derived from the covariance);
if the result is negative, they are negatively correlated;
if it is 0, they are uncorrelated (note that zero covariance does not by itself guarantee statistical independence).
Why does the sign behave this way? Center the axes at (x̄, ȳ):
if most points fall in the 1st and 3rd quadrants, the summed products are positive, so Y increases as X increases;
if most points fall in the 2nd and 4th quadrants, the summed products are negative, so Y decreases as X increases.
Cov(Y, X) is affected by the units of measurement,
so we introduce the correlation coefficient:
r = Corr(X, Y) = Cov(X, Y) / (s_X · s_Y), where s_X and s_Y are the sample standard deviations of X and Y.
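As a quick check of these definitions, here is a small sketch (mine, not from the original post; the data values are made up for illustration) showing that NumPy's np.cov and np.corrcoef match the formulas above, and that rescaling X changes the covariance but leaves the correlation unchanged:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical predictor values
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # hypothetical responses

# Sample covariance (divides by n - 1, matching the note above).
print(np.cov(x, y)[0, 1])               # about 4.9

# Correlation coefficient: the covariance of the standardized variables.
print(np.corrcoef(x, y)[0, 1])          # close to 1 for this nearly linear data

# Covariance depends on units; the correlation coefficient does not.
x_scaled = 100 * x                       # the same X measured in different units
print(np.cov(x_scaled, y)[0, 1])         # 100 times larger
print(np.corrcoef(x_scaled, y)[0, 1])    # unchanged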
In Python it is computed with the algebraically equivalent single-pass formula below [the definitional formula above is awkward to program because it needs several passes over the data; be aware that the single-pass form can amplify small rounding errors]:
r = [Σx_i·y_i − (Σx_i)(Σy_i)/n] / ( √(Σx_i² − (Σx_i)²/n) · √(Σy_i² − (Σy_i)²/n) )
# coding: utf-8
"""The correlation coefficient of Y and X is the covariance of the standardized variables."""
__author__ = 'similarface'  # QQ: 841196883@qq.com

import os
from math import sqrt

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt  # if pip install fails, install from the GitHub source
from sklearn import linear_model


def load_data_in_tab(filename):
    """Load a tab-separated data file into a DataFrame."""
    if os.path.isfile(filename):
        try:
            return pd.read_table(filename)
        except Exception as e:
            print(e)
    else:
        print("File does not exist!")
    return None


def pearson(rating1, rating2):
    """Pearson product-moment correlation coefficient (PPMCC, usually written r).

    Measures the linear correlation between two variables; the value lies in [-1, 1].
    Rough interpretation of |r|:
        0.8-1.0 very strong, 0.6-0.8 strong, 0.4-0.6 moderate,
        0.2-0.4 weak, 0.0-0.2 very weak or none.
    """
    sum_xy = sum_x = sum_y = sum_x2 = sum_y2 = n = 0
    for i in range(len(rating1)):
        n += 1
        x, y = rating1[i], rating2[i]
        sum_xy += x * y
        sum_x += x
        sum_y += y
        sum_x2 += x ** 2
        sum_y2 += y ** 2
    if n == 0:
        return 0
    denominator = sqrt(sum_x2 - (sum_x ** 2) / n) * sqrt(sum_y2 - (sum_y ** 2) / n)
    if denominator == 0:
        return 0
    return (sum_xy - (sum_x * sum_y) / n) / denominator


data = load_data_in_tab('./AnscombeQuartet')

# The four Anscombe data sets:
#   (X1, Y1) genuinely linear;             (X2, Y2) a non-linear, quadratic relationship;
#   (X3, Y3) linear except for an outlier; (X4, Y4) one extreme point determines the line.
groups = [('X1', 'Y1', 'black', 6), ('X2', 'Y2', 'red', 4),
          ('X3', 'Y3', 'blue', 2), ('X4', 'Y4', 'green', 1)]

for xcol, ycol, color, width in groups:
    x_test = data[xcol]
    y_test = data[ycol]
    plt.scatter(x_test, y_test, color=color)
    print("Pearson correlation of the {} points: {}".format(color, pearson(x_test, y_test)))
    regr = linear_model.LinearRegression()
    x_train = x_test.values[:, np.newaxis]
    y_train = y_test.values[:, np.newaxis]
    regr.fit(x_train, y_train)
    plt.plot(x_train, regr.predict(x_train), color=color, linewidth=width)

plt.xticks(())
plt.yticks(())
plt.show()

# Running a linear regression on all four groups produces fitted lines with
# (almost) the same slope and intercept, and the four correlation coefficients
# are nearly identical. Look at the scatter plot before interpreting a
# correlation coefficient: relying on the coefficient alone, you cannot tell
# these very different data structures apart.
Reference data (Anscombe's quartet):
Y1      X1    Y2      X2    Y3      X3    Y4      X4
8.04    10    9.14    10    7.46    10    6.58    8
6.95    8     8.14    8     6.77    8     5.76    8
7.58    13    8.74    13    12.74   13    7.71    8
8.81    9     8.77    9     7.11    9     8.84    8
8.33    11    9.26    11    7.81    11    8.47    8
9.96    14    8.1     14    8.84    14    7.04    8
7.24    6     6.13    6     6.08    6     5.25    8
4.26    4     3.1     4     5.39    4     12.5    19
10.84   12    9.13    12    8.15    12    5.56    8
4.82    7     7.26    7     6.42    7     7.91    8
5.68    5     4.74    5     5.73    5     6.89    8
Example: computer repair
# coding: utf-8
"""Example: computer repair.

Data columns: repair time (Minutes) and number of replaced parts (Units), one
observation per row.
"""
__author__ = 'similarface'

import os
from math import sqrt

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt  # if pip install fails, install from the GitHub source
from sklearn import linear_model


def load_data_in_tab(filename):
    """Load a tab-separated data file into a DataFrame."""
    if os.path.isfile(filename):
        try:
            return pd.read_table(filename)
        except Exception as e:
            print(e)
    else:
        print("File does not exist!")
    return None


def pearson(rating1, rating2):
    """Pearson correlation coefficient of Y and X (single-pass form)."""
    sum_xy = sum_x = sum_y = sum_x2 = sum_y2 = n = 0
    for i in range(len(rating1)):
        n += 1
        x, y = rating1[i], rating2[i]
        sum_xy += x * y
        sum_x += x
        sum_y += y
        sum_x2 += x ** 2
        sum_y2 += y ** 2
    if n == 0:
        return 0
    denominator = sqrt(sum_x2 - (sum_x ** 2) / n) * sqrt(sum_y2 - (sum_y ** 2) / n)
    if denominator == 0:
        return 0
    return (sum_xy - (sum_x * sum_y) / n) / denominator


def get_covariance(x_list, y_list):
    """Sample covariance: sum of centered products divided by n - 1."""
    average_x = sum(x_list) / float(len(x_list))
    average_y = sum(y_list) / float(len(y_list))
    sum_all = 0.0
    for i in range(len(x_list)):
        sum_all += (x_list[i] - average_x) * (y_list[i] - average_y)
    return sum_all / (len(x_list) - 1)


computer_repair_data = load_data_in_tab('./data/computerrepair.data')
minutes = computer_repair_data['Minutes']
units = computer_repair_data['Units']
print("Covariance of this data set: {}".format(get_covariance(minutes, units)))
print("Correlation coefficient of this data set: {}".format(pearson(minutes, units)))

# Scatter plot of repair time vs. number of units.
plt.scatter(minutes, units, color='black')
regr = linear_model.LinearRegression()
minutes_train = minutes.values[:, np.newaxis]
units_train = units.values[:, np.newaxis]
regr.fit(minutes_train, units_train)

print("Summary")
print("==================================================")
print("Slope:     {}".format(regr.coef_))
print("Intercept: {}".format(regr.intercept_))
print("Fitted line: y = {}x + {}".format(regr.coef_[0][0], regr.intercept_[0]))
print("==================================================")

plt.plot(minutes_train, regr.predict(minutes_train), color='red', linewidth=1)
plt.xticks(())
plt.yticks(())
plt.show()
Result:
Covariance of this data set: 136.0
Correlation coefficient of this data set: 0.993687243916
Summary
==================================================
Slope:     [[ 0.06366959]]
Intercept: [ 1.43215942]
Fitted line: y = 0.0636695930877x + 1.43215942092
==================================================
=============== Parameter estimation =================
How were the slope and intercept above computed?
The simple linear regression model:
Y = β0 + β1·X + ε
β0: the constant term (intercept)
β1: the regression coefficient (slope)
ε: the random disturbance, or error [the effect on Y of random factors other than the linear relationship between X and Y]
A single observation:
y_i = β0 + β1·x_i + ε_i, where y_i is the i-th observed value of the response variable and x_i is the i-th value of the predictor X.
The error for observation i is e_i = y_i − (β0 + β1·x_i).
Sum of squared vertical distances [the sum over all observations of (observed response − predicted response)², the least-squares criterion]:
S(β0, β1) = Σ (y_i − β0 − β1·x_i)²
The values of β0 and β1 that minimize this expression are the least-squares estimates.
Setting the partial derivatives of S with respect to β0 and β1 to zero and simplifying gives the closed-form solution:
β̂1 = Σ(x_i − x̄)(y_i − ȳ) / Σ(x_i − x̄)²   [= Cov(X, Y) / Var(X)]
β̂0 = ȳ − β̂1·x̄
A worked least-squares example:
(x, y): (1, 6), (2, 5), (3, 7), (4, 10)
Assume the model y = β1 + β2·x. Plugging in each point gives the (inconsistent) system:
β1 + 1·β2 = 6
β1 + 2·β2 = 5
β1 + 3·β2 = 7
β1 + 4·β2 = 10
The sum of squared vertical distances is
S(β1, β2) = (6 − β1 − β2)² + (5 − β1 − 2β2)² + (7 − β1 − 3β2)² + (10 − β1 − 4β2)²
Simplifying: set both partial derivatives to zero (note: we use partial derivatives here; on this bowl-shaped convex surface the minimum is the point where the tangent slopes are zero), which gives the normal equations
∂S/∂β1 = 0  ⇒  4β1 + 10β2 = 28
∂S/∂β2 = 0  ⇒  10β1 + 30β2 = 77
This reduces the problem to a two-dimensional one:
solving this system of two linear equations in two unknowns gives β1 = 3.5 and β2 = 1.4.
Using the closed-form formula instead (with x̄ = 2.5, ȳ = 7):
β2 = [(1 − 2.5)(6 − 7) + (2 − 2.5)(5 − 7) + (3 − 2.5)(7 − 7) + (4 − 2.5)(10 − 7)] / [(1 − 2.5)² + (2 − 2.5)² + (3 − 2.5)² + (4 − 2.5)²] = 7/5 = 1.4
β1 = ȳ − β2·x̄ = 7 − 1.4 × 2.5 = 3.5
Computing it in Python: SciPy provides scipy.sparse.linalg.lsqr, an implementation of the LSQR algorithm of Paige and Saunders for sparse linear equations and sparse least-squares problems. It iteratively finds the least-squares solution of A·x = b (optionally damped, min ||Ax − b||² + d²·||x||²) for a square or rectangular matrix A of any rank, and returns the solution together with stopping diagnostics such as the residual norms and an estimate of cond(A). References: C. C. Paige and M. A. Saunders, "LSQR: An algorithm for sparse linear equations and sparse least squares", ACM TOMS 8(1), 43-71 (1982); "Algorithm 583. LSQR: Sparse linear equations and least squares problems", ACM TOMS 8(2), 195-209 (1982); M. A. Saunders, "Solution of sparse rectangular systems using LSQR and CRAIG", BIT 35, 588-604 (1995).
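A minimal sketch (mine, not from the original post) of using that routine on the worked example above, together with the closed-form formula; the design matrix A stacks a column of ones (for β1) against the x values (for β2):

import numpy as np
from scipy.sparse.linalg import lsqr

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([6.0, 5.0, 7.0, 10.0])

# Design matrix: a column of ones for beta1 and a column of x values for beta2.
A = np.column_stack([np.ones_like(x), x])

# lsqr returns a tuple; the first element is the least-squares solution.
beta = lsqr(A, y)[0]
print(beta)                 # approximately [3.5, 1.4]

# The same result from the closed-form formulas above.
beta2 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta1 = y.mean() - beta2 * x.mean()
print(beta1, beta2)         # 3.5, 1.4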
================================
Checking the assumptions:
The example above produced a fitted linear relationship, but how do we judge whether the assumptions we set up at the start should stand or be rejected? They are, after all, only assumptions: if I assume I will have a hundred million within a week, reality deviates so far from that assumption that it has to be rejected.
Back to the point:
We started from the model Y = β0 + β1·X + ε.
To test hypotheses about it we need an additional assumption:
for every fixed value of X, all the errors ε_i are mutually independent and follow a normal distribution with mean 0 and variance σ².
This gives the estimate of σ²:
σ̂² = SSE / (n − 2), where SSE = Σ(y_i − ŷ_i)² is the residual sum of squares and n − 2 is the degrees of freedom [degrees of freedom = number of observations − number of regression parameters to be estimated].
Substituting and simplifying, the standard errors of the estimates are:
s.e.(β̂1) = σ̂ / √(Σ(x_i − x̄)²)
s.e.(β̂0) = σ̂ · √(1/n + x̄² / Σ(x_i − x̄)²)
The standard error of β̂1 describes how precisely the slope is estimated: the smaller the standard error, the higher the precision.
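To make these quantities concrete, here is a small sketch (mine, applied to the four-point worked example above rather than to the original data files) computing SSE, σ̂ and the two standard errors directly:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([6.0, 5.0, 7.0, 10.0])

# Least-squares estimates from the closed-form formulas.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)  # slope, 1.4
b0 = y.mean() - b1 * x.mean()                                               # intercept, 3.5

n = len(x)
residuals = y - (b0 + b1 * x)
sse = np.sum(residuals ** 2)           # residual sum of squares, SSE = 4.2
sigma_hat = np.sqrt(sse / (n - 2))     # sigma estimate with n - 2 degrees of freedom

se_b1 = sigma_hat / np.sqrt(np.sum((x - x.mean()) ** 2))
se_b0 = sigma_hat * np.sqrt(1.0 / n + x.mean() ** 2 / np.sum((x - x.mean()) ** 2))

print(sse, sigma_hat)   # 4.2, about 1.449
print(se_b1, se_b0)     # about 0.648 and 1.775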