Probability and Statistics: Topic 5 - Linear Regression: 2018

Wednesday, May 30, 2018

Welcome and comments on the creation of the blog

This blog was created to present the information, formulas, methods, and analysis of the topics covered, which are needed to solve the exercises posed here.

Some difficulties arose during its creation, such as unfamiliarity with a subtopic or the complexity of developing it and solving the proposed exercises. Thanks to several books offered by the institution, which are listed in the bibliography section, and to consultations with the course advisor, the objective was achieved satisfactorily.

It is also important to thank the owners of the YouTube channels from which some of the videos explaining the exercises were taken.

As a final conclusion, it is worth highlighting the clear difference in learning and knowledge (especially on the part of the authors) between the start of the project and the final presentation of the blog: most of the authors showed a certain command of the topic, and it was a good experience.

Clarifications about the English version

From this post on, the same content of the blog is offered in English, with the purpose of expanding its reach and offering the information presented here to more students and to the general public. Everything has been adapted, including the images; only the videos have been omitted.

Enjoy the content in English, and if you find any error in the translation, do not hesitate to leave a comment so that it can be corrected as soon as possible and the content improved.

5.1.7 Measurement errors

The measurement error is defined as the difference between the measured value and the "true value". Measurement errors affect any measuring instrument and can be due to different causes. Errors that can in some way be predicted, calculated, or eliminated through calibration and compensation are called deterministic or systematic, and they are related to the accuracy of the measurements. Errors that cannot be predicted, because they depend on unknown or stochastic causes, are called random, and they are related to the precision of the instrument.

Random error. The laws or mechanisms that cause it are not known, either because of their excessive complexity or because of their small influence on the final result.

To characterize this type of error, we must first take a sample of measurements. With the data from the successive measurements we can calculate their mean and sample standard deviation.

Systematic error. It remains constant in absolute value and in sign when a magnitude is measured under the same conditions, and the laws that cause it are known.

To determine the systematic error of the measurement, a series of measurements must be made on a magnitude X0, the arithmetic mean of these measurements must be calculated, and then the difference between the mean and the magnitude X0 must be found.

Systematic error = | mean - X0 |
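The two procedures described above can be sketched in a few lines of Python. This is only an illustration: the readings and the reference value X0 are hypothetical, and the function names are chosen here for clarity rather than taken from any source.

```python
from statistics import mean, stdev

def systematic_error(measurements, x0):
    """Absolute difference between the mean of the measurements and the reference X0."""
    return abs(mean(measurements) - x0)

def random_error(measurements):
    """Sample standard deviation of the measurements (their spread around their own mean)."""
    return stdev(measurements)

# Hypothetical readings of a reference piece whose accepted value is 10.00 (illustrative only).
readings = [10.12, 10.08, 10.11, 10.09, 10.10]
x0 = 10.00

bias = systematic_error(readings, x0)
spread = random_error(readings)
print(f"systematic error = {bias:.3f}")
print(f"sample standard deviation = {spread:.3f}")
```

In this made-up sample the readings cluster tightly (small random error) but sit consistently above X0 (noticeable systematic error), which is the situation a calibration would correct.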

Although it is impossible to know all the causes of the error, it is convenient to know all the important causes and have an idea that allows us to evaluate the most frequent errors. The main causes that produce errors can be classified as:

Error due to the measuring instrument.
Error due to the operator.
Error due to environmental factors.
Error due to geometric tolerances of the piece itself.

5.1.6 Confidence intervals and tests for the correlation coefficient

In statistics, a confidence interval is a pair of numbers (or several pairs) between which an unknown value is estimated to lie with a given probability of success. Formally, these numbers determine an interval, which is calculated from sample data, and the unknown value is a population parameter.

The probability of success in the estimation is written 1 - α and is called the confidence level. Here α is the so-called random error or significance level, that is, a measure of the chance that the interval fails to contain the parameter.

Use the confidence interval to evaluate the estimate of the population parameter. For example, a manufacturer wants to know whether the mean length of the pencils it produces differs from the target length. The manufacturer takes a random sample of pencils and finds that the sample mean length is 52 millimeters and that the 95% confidence interval is (50, 54). Therefore, one can be 95% confident that the mean length of all the pencils is between 50 and 54 millimeters.
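As a rough illustration of how such an interval can be obtained, the sketch below computes a confidence interval for a mean using the normal approximation (mean ± z · s / √n). The pencil lengths are hypothetical, not the manufacturer's data, and for small samples a Student t quantile would usually be preferred over the normal quantile assumed here.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation interval for the mean: mean ± z * s / sqrt(n)."""
    n = len(sample)
    center = mean(sample)
    alpha = 1 - confidence
    # z is the standard normal quantile that leaves alpha/2 in each tail.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    margin = z * stdev(sample) / sqrt(n)
    return center - margin, center + margin

# Hypothetical pencil lengths in millimeters (not data from the post).
lengths = [49.0, 55.0, 51.5, 53.0, 50.5, 54.5, 52.0, 50.0, 53.5, 51.0]
low, high = mean_confidence_interval(lengths)
print(f"95% confidence interval for the mean: ({low:.2f}, {high:.2f})")
```

Raising the confidence level (a smaller α) widens the interval, while a larger sample narrows it.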

5.1.5 Two-dimensional normal distribution

In statistics, the binomial distribution is a discrete probability distribution that counts the number of successes in a sequence of n independent Bernoulli trials, with a fixed probability p of success on each trial.

A Bernoulli experiment is dichotomous, that is, only two outcomes are possible. One of them is called "success" and has probability p; the other, "failure", has probability q = 1 - p. In the binomial distribution, the experiment is repeated n times, independently, and the goal is to calculate the probability of a given number of successes.

To indicate that a random variable X follows a binomial distribution with parameters n and p, we write:

$$X \sim B(n, p)$$

Its probability function is

$$P(X = x) = \binom{n}{x} p^{x} q^{n-x}$$

where x = 0, 1, ..., n and

$$\binom{n}{x} = \frac{n!}{x!\,(n-x)!}$$

is the number of combinations of n elements taken x at a time.
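A minimal Python sketch of the binomial probability function, assuming the standard form P(X = x) = C(n, x) · p^x · q^(n-x) with q = 1 - p. The function name and the example values are illustrative choices, not part of the original post.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) for X ~ B(n, p): comb(n, x) * p**x * q**(n - x), with q = 1 - p."""
    q = 1 - p
    return comb(n, x) * p**x * q**(n - x)

# Hypothetical example: probability of exactly 3 successes in 10 trials with p = 0.5.
print(binomial_pmf(3, 10, 0.5))

# The probabilities over x = 0, 1, ..., n should add up to 1.
total = sum(binomial_pmf(x, 10, 0.5) for x in range(11))
print(total)
```

Checking that the probabilities sum to 1 is a quick sanity test that the function describes a valid distribution.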

5.1.4 Linear correlation coefficient

The linear correlation coefficient is the quotient between the covariance and the product of the standard deviations of both variables.

Properties

1. The correlation coefficient does not change when the measurement scale changes. That is, whether we express height in meters or in centimeters, the correlation coefficient is the same.

2. The sign of the correlation coefficient is the same as that of the covariance.

3. The linear correlation coefficient is a real number between -1 and 1.

4. If the linear correlation coefficient takes values close to -1, the correlation is strong and inverse, and it is stronger the closer r is to -1.

5. If the linear correlation coefficient takes values close to 1, the correlation is strong and direct, and it is stronger the closer r is to 1.

6. If the linear correlation coefficient takes values close to 0, the correlation is weak.

7. If r = 1 or r = -1, the points of the cloud lie on an increasing or decreasing line, respectively. There is functional dependence between the two variables.
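Some of these properties can be checked numerically. The sketch below computes r as the covariance divided by the product of the standard deviations (population formulas, dividing by N, which is an assumption made here) and then tests the scale property and the perfect-line case. The height and weight values are hypothetical.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Covariance divided by the product of the standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = sqrt(sum((y - my) ** 2 for y in ys) / n)
    return cov / (sx * sy)

# Hypothetical heights (m) and weights (kg), for illustration only.
heights_m = [1.60, 1.65, 1.70, 1.75, 1.80]
weights_kg = [55.0, 61.0, 64.0, 72.0, 78.0]

# Changing the scale (meters to centimeters) leaves r unchanged up to rounding.
r_meters = pearson_r(heights_m, weights_kg)
r_centimeters = pearson_r([h * 100 for h in heights_m], weights_kg)
print(r_meters, r_centimeters)

# Points lying exactly on a decreasing line give r = -1.
r_line = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])
print(r_line)
```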

5.1.3 Correlation

By definition, correlation is the correspondence or relationship between two or more things. In statistics, it is the degree of dependence between the random variables involved in a multidimensional distribution. It indicates the strength and direction of the linear relationship between two random variables.

Two quantitative variables are considered to be correlated when the values of one vary systematically with respect to the corresponding values of the other. For example, given two variables A and B, they are correlated if the values of B increase as the values of A increase, and vice versa.

5.1.2 Simple Linear Regression

The objective of a regression model is to explain the relationship between a dependent variable (the response variable) and a set of independent variables (the explanatory variables).
In the simple linear regression model, we try to explain the relationship between the response variable Y and a single explanatory variable X.

Y = α + βX + ε

where α is the ordinate at the origin, or intercept (the value that Y takes when X is 0),

β is the slope of the line (it indicates how Y changes when X increases by one unit), and
ε is a variable that gathers a large set of factors, each of which influences the response only slightly; it is called the "error".

ESTIMATION OF THE REGRESSION LINE BY THE LEAST SQUARES METHOD

First, we plot the scatter diagram, or point cloud. Suppose it is the one shown in the figure. Although the cloud shows a large dispersion, we can observe a certain linear tendency: Y increases as X increases. This tendency is not exact; for example, if X is age and Y is height, height obviously does not depend on age alone, and there may also be measurement errors.

The regression line should act as a middle line: it should fit most of the data well, that is, it should pass as close as possible to all the points. Requiring the line to be a short distance from each and every point means adopting a particular criterion, generally known as LEAST SQUARES. This criterion requires the sum of the squares of the vertical distances from the points to the line to be as small as possible.
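The least squares criterion can be sketched in Python as follows. The closed-form estimates used here (slope = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)², intercept = ȳ - slope · x̄) are the standard solution of that minimization; the names a and b for the estimates of α and β, and the age and height data, are assumptions made for this example.

```python
def least_squares_line(xs, ys):
    """Return (a, b) that minimize the sum of (y - (a + b*x))**2 over the data."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx        # estimated slope
    a = my - b * mx      # estimated intercept
    return a, b

def sum_squared_residuals(xs, ys, a, b):
    """Sum of squared vertical distances from the points to the line Y = a + b*X."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical ages (years) and heights (cm), for illustration only.
ages = [2, 4, 6, 8, 10]
heights = [86, 102, 115, 128, 138]

a, b = least_squares_line(ages, heights)
print(f"Y = {a:.2f} + {b:.2f} X")
print(sum_squared_residuals(ages, heights, a, b))
```

Any other choice of intercept or slope gives a larger sum of squared residuals for the same data, which is what "as small as possible" means in the criterion.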

5.1.1 Scatter diagrams

The scatter diagram is a graphical tool that helps identify a possible relationship between two variables. It represents the relationship between the two variables graphically, which makes the data easier to visualize and interpret. It is a type of mathematical diagram that uses Cartesian coordinates to show the values of two variables for a set of data.

The scatter diagram makes it possible to analyze whether there is any kind of relationship between two variables. For example, two variables may be related so that increasing the value of one increases the value of the other; in this case we speak of a positive correlation. It may also happen that when one moves in one direction, the other moves in the opposite direction; for example, increasing the value of the variable x reduces the value of the variable y. This is a negative correlation. If the values of the two variables turn out to be independent of each other, we say that there is no correlation.
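To make the idea concrete without a plotting library, the sketch below draws a very rough scatter diagram on a character grid. It is only a stand-in for a real chart (a plotting library would normally be used), the function name is made up for this example, and the data are hypothetical.

```python
def text_scatter(xs, ys, width=20, height=8):
    """Draw the points on a character grid with Cartesian-style axes (y grows upward)."""
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        # Scale each coordinate to a column/row index inside the grid.
        col = round((x - x_min) / (x_max - x_min) * (width - 1))
        row = round((y - y_min) / (y_max - y_min) * (height - 1))
        grid[height - 1 - row][col] = "*"  # grid row 0 is the top line
    return "\n".join("".join(line).rstrip() for line in grid)

# Hypothetical data in which y increases with x (a positive correlation).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 7, 9]
plot = text_scatter(xs, ys)
print(plot)
```

With these values the stars climb from the lower left to the upper right, the visual signature of a positive correlation; reversing the y values would make them descend instead.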

One of the most powerful aspects of a scatter plot, however, is its ability to show non-linear relationships between variables. Furthermore, if the data are represented by a mixture model of simple relationships, these relationships will be visually evident as superimposed patterns.

Tuesday, May 29, 2018

5.1.7 Measurement errors

The measurement error is defined as the difference between the measured value and the "true value". Measurement errors affect any measuring instrument and can be due to different causes. Errors that can in some way be predicted, calculated, or eliminated through calibration and compensation are called deterministic or systematic, and they are related to the accuracy of the measurements. Errors that cannot be predicted, because they depend on unknown or stochastic causes, are called random, and they are related to the precision of the instrument.

Random error. The laws or mechanisms that cause it are not known, either because of their excessive complexity or because of their small influence on the final result.

To characterize this type of error, we must first take a sample of measurements. With the data from the successive measurements we can calculate their mean and sample standard deviation.

Systematic error. It remains constant in absolute value and in sign when a magnitude is measured under the same conditions, and the laws that cause it are known.

To determine the systematic error of the measurement, a series of measurements must be made on a magnitude X0, the arithmetic mean of these measurements must be calculated, and then the difference between the mean and the magnitude X0 must be found.

Systematic error = | mean - X0 |

Although it is impossible to know all the causes of error, it is useful to know all the important ones and to have an idea that allows the most frequent errors to be evaluated. The main causes of error can be classified as:

Error due to the measuring instrument.

Error due to the operator.

Error due to environmental factors.

Error due to the geometric tolerances of the piece itself.

5.1.6 Confidence intervals and tests for the correlation coefficient

In statistics, a confidence interval is a pair of numbers (or several pairs) between which an unknown value is estimated to lie with a given probability of success. Formally, these numbers determine an interval, which is calculated from sample data, and the unknown value is a population parameter.

The probability of success in the estimation is written 1 - α and is called the confidence level. Here α is the so-called random error or significance level, that is, a measure of the chance that the interval fails to contain the parameter.

Use the confidence interval to evaluate the estimate of the population parameter. For example, a manufacturer wants to know whether the mean length of the pencils it produces differs from the target length. The manufacturer takes a random sample of pencils and finds that the sample mean length is 52 millimeters and that the 95% confidence interval is (50, 54). Therefore, one can be 95% confident that the mean length of all the pencils is between 50 and 54 millimeters.

Two-dimensional distribution exercises

Using the formulas shown earlier, solve the exercise below.

Exercise 1.

A coin is tossed 37 times, and we want to know the probability that it lands heads 18 times.
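One way to set this exercise up, assuming a fair coin (p = q = 1/2, which the statement does not say explicitly) and the binomial probability function with n = 37 trials and x = 18 successes:

```latex
P(X = 18) = \binom{37}{18} \left(\tfrac{1}{2}\right)^{18} \left(\tfrac{1}{2}\right)^{19}
          = \frac{17\,672\,631\,900}{2^{37}}
          \approx 0.1286
```

Under that assumption, the probability of getting exactly 18 heads is a little under 13%.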

Monday, May 28, 2018

5.1.5 Two-dimensional normal distribution

In statistics, the binomial distribution is a discrete probability distribution that counts the number of successes in a sequence of n independent Bernoulli trials, with a fixed probability p of success on each trial.

A Bernoulli experiment is dichotomous, that is, only two outcomes are possible. One of them is called "success" and has probability p; the other, "failure", has probability q = 1 - p. In the binomial distribution, the experiment is repeated n times, independently, and the goal is to calculate the probability of a given number of successes.

To indicate that a random variable X follows a binomial distribution with parameters n and p, we write:

$$X \sim B(n, p)$$

Its probability function is

$$P(X = x) = \binom{n}{x} p^{x} q^{n-x}$$

where x = 0, 1, ..., n and

$$\binom{n}{x} = \frac{n!}{x!\,(n-x)!}$$

is the number of combinations of n elements taken x at a time.

Correlation coefficient exercises

The formula for the correlation coefficient (denoted r) is:

$$r = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$$

that is, "the covariance divided by the product of the standard deviations of X and Y".

Note: as can be seen, computing the correlation coefficient requires first knowing the covariance and the standard deviations, as follows:

For the standard deviation

$$\sigma_x = \sqrt{\frac{\sum (x_i - \bar{x})^2}{N}}, \qquad \sigma_y = \sqrt{\frac{\sum (y_i - \bar{y})^2}{N}}$$

For the covariance

$$\sigma_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{N}$$

Two exercises follow.

Exercise No. 1:

An advertising company has the following data, where X = number of advertisements broadcast and Y = number of sales achieved. Find the correlation coefficient between X and Y.

X	Y
1	10
2	17
3	30
4	28
5	39
6	47

Exercise No. 2:

The grades of 12 students in a secondary school group in two different subjects are shown below, where X corresponds to the grades in mathematics and Y to the grades in physics.

X	Y
2	1
3	3
4	2
4	4
5	4
6	4
6	6
7	4
7	6
8	7
10	9
10	10

*Calculate the correlation coefficient.
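A possible way to check the answers to both exercises with Python's standard library. The sketch assumes the population (divide by N) versions of the covariance and standard deviation; the function name is an illustrative choice, and the data are the two tables given in the exercises.

```python
from statistics import mean, pstdev

def correlation_coefficient(xs, ys):
    """r = covariance / (sigma_x * sigma_y), using population (divide-by-N) formulas."""
    products = [x * y for x, y in zip(xs, ys)]
    covariance = mean(products) - mean(xs) * mean(ys)
    return covariance / (pstdev(xs) * pstdev(ys))

# Exercise 1: advertisements broadcast (X) and sales achieved (Y).
ads = [1, 2, 3, 4, 5, 6]
sales = [10, 17, 30, 28, 39, 47]
r1 = correlation_coefficient(ads, sales)

# Exercise 2: grades in mathematics (X) and physics (Y) for 12 students.
maths = [2, 3, 4, 4, 5, 6, 6, 7, 7, 8, 10, 10]
physics = [1, 3, 2, 4, 4, 4, 6, 4, 6, 7, 9, 10]
r2 = correlation_coefficient(maths, physics)

print(f"Exercise 1: r = {r1:.2f}")
print(f"Exercise 2: r = {r2:.2f}")
```

Both coefficients come out close to 1, so in each exercise the correlation is strong and direct.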

5.1.4 Linear correlation coefficient

The linear correlation coefficient is the quotient between the covariance and the product of the standard deviations of both variables.

Video: calculation and analysis of the linear correlation coefficient

Properties

The correlation coefficient does not change when the measurement scale changes.

That is, whether we express height in meters or in centimeters, the correlation coefficient is the same.

The sign of the correlation coefficient is the same as that of the covariance.

The linear correlation coefficient is a real number between -1 and 1.

If the linear correlation coefficient takes values close to -1, the correlation is strong and inverse, and it is stronger the closer r is to -1.

If the linear correlation coefficient takes values close to 1, the correlation is strong and direct, and it is stronger the closer r is to 1.

If the linear correlation coefficient takes values close to 0, the correlation is weak.

If r = 1 or r = -1, the points of the cloud lie on an increasing or decreasing line, respectively. There is functional dependence between the two variables.

5.1.3 Correlation

By definition, correlation is the correspondence or relationship between two or more things. In statistics, it is the degree of dependence between the random variables involved in a multidimensional distribution. It indicates the strength and direction of the linear relationship between two random variables.

Two quantitative variables are considered to be correlated when the values of one vary systematically with respect to the corresponding values of the other. For example, given two variables A and B, they are correlated if the values of B increase as the values of A increase, and vice versa.

5.1.2 Simple Linear Regression

The objective of a regression model is to explain the relationship between a dependent variable (the response variable) and a set of independent variables (the explanatory variables).

In the simple linear regression model, we try to explain the relationship between the response variable Y and a single explanatory variable X.

Y = α + βX + ε

where α is the ordinate at the origin, or intercept (the value that Y takes when X is 0),

β is the slope of the line (it indicates how Y changes when X increases by one unit), and

ε is a variable that gathers a large set of factors, each of which influences the response only slightly; it is called the "error".

ESTIMATION OF THE REGRESSION LINE BY THE LEAST SQUARES METHOD

First, we plot the scatter diagram, or point cloud. Suppose it is the one shown in the figure. Although the cloud shows a large dispersion, we can observe a certain linear tendency: Y increases as X increases. This tendency is not exact; for example, if X is age and Y is height, height obviously does not depend on age alone, and there may also be measurement errors.

The regression line should act as a middle line: it should fit most of the data well, that is, it should pass as close as possible to all the points. Requiring the line to be a short distance from each and every point means adopting a particular criterion, generally known as LEAST SQUARES. This criterion requires the sum of the squares of the vertical distances from the points to the line to be as small as possible.
