## Reading the PNAD 2013 with R

With the recent error in the release of the PNAD 2013 results, the name of IBGE, and the results of this survey as well, reached the mainstream media in a very negative way. Even so, most people do not know what the PNAD is, how its data are obtained, or how they can be downloaded and used. In this post I will give a brief explanation of what the PNAD is, how its data are distributed (as microdata), and how you can easily obtain and use them with free tools such as R.

But how is this sample chosen? How can such a small sample, less than 0.3% of all households, yield reliable data about the population? The answer is related to sampling theory, which I will not cover in this post; I will only say that it is possible, with the right methodology, to obtain reliable information about the population from such a small sample. Besides making the data available, IBGE also publishes the sampling procedure used. At the link mentioned, after downloading the file Metodologia.zip, the file “Notas Metodológicas Pesquisa Básica 2013.doc” in the unzipped folder describes the sample selection process in detail. Basically, it is a probability sample of households, carried out in three stages:

1) In the first stage, municipalities are classified into two categories, self-representing and non-self-representing, that is, municipalities that will CERTAINLY be part of the sample and those that MAY be part of it. The non-self-representing municipalities then go through a stratification process, and in each stratum some municipalities are selected WITH REPLACEMENT and with probability proportional to the resident population according to the 2010 Census.

2) In the second stage, within the municipalities chosen in the first stage, census tracts are selected with replacement and with probability proportional to the resident population according to the 2010 Census.

So, after collection, IBGE compiles these data and offers them as microdata. But what are microdata? See the following example:

As you can see, they are the raw data, with no clear separation between fields, so reading the microdata requires a dictionary that gives the start of each field, the width of the field, and which variable it holds. You also need access to the questionnaire used and to a description of the variables. All this information is available in the file Dados.zip, so from it you can reproduce the reading procedure I am about to present.

First, download these files, and from the spreadsheets “Dicionário de variáveis de domicílios da Pesquisa Básica – 2013.xls” and “Dicionário de variáveis de pessoas da Pesquisa Básica – 2013”, export the first three columns to a CSV: Posição Inicial (start position), Tamanho (width), Código de variável (variable code). In my case I saved them as dicdom.csv and dicpes.csv. As you will see, there are two separate data files, one for households and one for persons (DOM2013.txt and PES2013.txt), so that each household may have information on one person (a single resident) or more. You will notice that the persons file is considerably larger.

Assuming you are running R in a given directory, and the folder with the microdata, here called Dados, is in that same directory, run the script:

and that's it! Seconds later you will have read all the data on households and persons. Note that the reading is very fast, and this procedure is an improvement on another solution that you can check here.

If you have any questions about the procedure, get in touch through the comments, and happy analyzing!
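The idea of reading a fixed-width file from a dictionary of start positions and widths can be sketched as below. This is only a minimal illustration and not the script of this post: it assumes base R's `read.fwf()` (chosen for portability, it is not the fastest reader), the helper name `read_fixed_width()` is made up, and the toy dictionary and data are invented so the example runs on its own.

```r
# Hypothetical helper: read a fixed-width file using a dictionary CSV whose
# three columns are, in order, start position, width and variable code.
read_fixed_width <- function(data_file, dict_file) {
  dict <- read.csv(dict_file, header = TRUE, stringsAsFactors = FALSE)
  names(dict) <- c("start", "width", "variable")
  dict <- dict[order(dict$start), ]
  # Characters not covered by any field become negative widths,
  # which read.fwf() skips.
  gaps <- dict$start - c(1, head(dict$start + dict$width, -1))
  widths <- as.vector(rbind(-gaps, dict$width))
  widths <- widths[widths != 0]
  # Read everything as character so leading zeros in codes are preserved.
  read.fwf(data_file, widths = widths, col.names = dict$variable,
           colClasses = "character")
}

# Toy dictionary and data (invented, not the real PNAD layout).
dict_file <- tempfile(fileext = ".csv")
data_file <- tempfile(fileext = ".txt")
write.csv(data.frame(start = c(1, 5, 8), width = c(4, 2, 3),
                     variable = c("YEAR", "STATE", "ID")),
          dict_file, row.names = FALSE)
writeLines(c("201311X001", "201335X002"), data_file)

dom <- read_fixed_width(data_file, dict_file)
dom
```

With the real files, and assuming the paths described in this post, the call would look like `read_fixed_width("Dados/DOM2013.txt", "dicdom.csv")`.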

## Data Preparation – Part II

This time I will talk about how to deal with large text files in chunks with R. To have some real data to work with, download the data relative to 1988; from now on I will work with this file.

To work with this data I will use the iterators package. This package lets you go through the file line by line, or chunk by chunk, without loading the whole file into memory. To get a feel for the idea, try this code:

OK, now you have a connection to your file. Let’s create an iterator:

As you can see, you are printing line by line. So you can work with one line, or a chunk of data, even with a large file. If you want to read line by line until the end of the file, you can use something like this:

which returns FALSE at the end of the file. This is a very useful trick in data preparation with large text files.
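The steps above can be sketched end to end as follows. This is a minimal sketch under assumptions: a small temporary file stands in for the 1988 data file, and the end-of-file handling relies on `nextElem()` raising an error once the iterator is exhausted, which `tryCatch()` turns into `FALSE`.

```r
library(iterators)

# Toy stand-in for the large file: five lines in a temporary file.
path <- tempfile(fileext = ".txt")
writeLines(paste("record", 1:5), path)

con <- file(path, open = "r")   # connection to the file
it <- ireadLines(con, n = 2)    # iterator yielding chunks of up to 2 lines

n_lines <- 0
repeat {
  # At the end of the file nextElem() signals an error; map it to FALSE.
  chunk <- tryCatch(nextElem(it), error = function(e) FALSE)
  if (identical(chunk, FALSE)) break
  n_lines <- n_lines + length(chunk)  # process the chunk here
}
close(con)
n_lines
```

Setting `n = 1` gives line-by-line reading; a larger `n` trades memory for fewer iterations.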

## MOOCs and courses to learn R

Inspired by this article, I thought about gathering here all the multimedia resources that I know of for learning to use R. Today there are a lot of online courses, some MOOCs too, that offer reasonable resources to start with R.

I will just list the materials in sequence and offer my evaluation of them. Of course your evaluation may be different; in that case, feel free to comment. In the future I may update the material. Let’s begin:

This course was offered multiple times from 2012 to 2014. I took the course in 2013 and it was very well organized, with good exercises and lots of resources. It was offered on the Coursera platform, which I personally think is excellent, and it takes an average effort to finish. There are quiz questions and programming assignments. The course is free.

This course is offered on Coursera too, by the same instructor as Computing for Data Analysis. According to the course syllabus: “The course covers practical issues in statistical computing which includes programming in R, reading data into R, accessing R packages, writing R functions, debugging, and organizing and commenting R code. Topics in statistical data analysis and optimization will provide working examples.”

Well, now that there are Coursera specializations, I think that this course is “Computing for Data Analysis” rebuilt. So, if you already took the first course, I don’t see any advantage in taking this one, unless you want to get the specialization certificate.

This course was about the practical aspects of data analysis with R. All activities were in R, but the course wasn’t about R itself. Still, it was very good and you can learn a lot of R with it. The course is free and offered on the Coursera platform.

Note: both Computing for Data Analysis and Data Analysis can be replaced by other courses in the Coursera specialization. I won’t comment on these new courses now because some of them are being offered now and others will be offered in the next months. Anyway, both course materials are available on YouTube.

This course is being offered on the bigdatauniversity platform. The course is a good starting point, but I don’t like this platform so much. The pros of this course are that you can do it at your own pace and it has less material to cover. It’s a quick introduction to R.

This course will also be offered at bigdatauniversity, but it’s not available right now. It will be recorded, you can participate through a live stream from bigdatauniversity, and it will be released on that platform. The course doesn’t have a syllabus available, but I think that the content of this book will probably be the topics discussed.

Rattle is a GUI for data mining that uses R as its backend. It’s very intuitive and resembles the Weka interface. While it’s not as flexible as using R directly, it provides a quick way to explore and build models with R. With Rattle, every step taken is saved in a log that you can use as a script to automate tasks.

This course is also being offered at bigdatauniversity. It assumes you have some R skills and is about the use of databases with R. It uses IBM DB2 specifically, but you can apply the concepts to other databases as well. It’s free and I liked it.

I’m taking this course right now and I will write a post specifically about it. But, just to clarify, it’s a course about machine learning (or statistical learning, if you prefer) based on the introductory book

The course isn’t about R itself, but all the techniques are implemented using R. This course is an easy way to read the book. The course is free and is offered on the Open edX platform, which is a wonderful platform for MOOCs.

This course is offered on the Udemy platform. Udemy is a great platform both for creating your own course and for taking one. The course is not free, but it has good user reviews and you have 30 days to evaluate the material and be refunded.

The author claims that you can learn the basics in 2 days of course work.

This course will be released in September on edX. So no further comments.

This is another Udemy course. Worth checking out.

Note: Coursera will offer Developing Data Products for free next June. This course covers the very same topics, so maybe it’s worth waiting.

## Genetic data, large matrices and glmnet()

Recently, talking to a colleague, I came across a problem that I had never worked with before: modeling with genetic data. I have no special knowledge of the subject, but looking at some articles in the area I learned that one of the most used techniques for this type of data is the lasso.

In R, one of the most used packages for the lasso is glmnet, which, unlike most modeling functions such as lm, does not accept a data.frame as input. So, before you start modeling, you must perform a pre-processing step that converts the data to matrix format. Once that is done, you pass glmnet one matrix with the predictor variables plus a vector with the response variable; the usual way to build the predictor matrix is from a formula.

The problem with the formula approach is that, in general, genomic data has more columns than observations. The data that I worked with in this case had 40,000 columns and only 73 observations. To create a small test data set, run the following code:

So, with this data set we will try to fit a model with glmnet():

And unless you have a computer with more RAM than mine, you will probably run out of memory and crash R. The solution? My first idea was to try sparse.model.matrix(), which creates a sparse model matrix using the same formula. Unfortunately it did not work, because even as a sparse matrix the final model matrix is still too big! Interestingly, this dataset occupies only 24 MB of RAM, but when you use model.matrix the result is a matrix of more than 1 GB.

The solution I found was to build the matrix by hand. To do this we encode the columns as dummy variables, column by column, and store the result in a sparse matrix. Then we use this matrix as input to the model and check that it does not run out of memory:

NOTE: Since we are using a sparse matrix, the Matrix package is required. Also note that the columns are bound using cBind() instead of cbind().

The matrix generated this way was much smaller: less than 70 MB when I tested. Fortunately glmnet() supports sparse matrices and you can run the model:

So you can create models with this type of data without blowing up the memory and without using R packages for large datasets like bigmemory and ff.
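The column-by-column encoding can be sketched as follows. This is a minimal sketch under assumptions, not the code used in the test above: the data are a scaled-down, randomly generated stand-in (73 rows, 200 three-level factor columns), the names `geno`, `encode_column` and `snp1`, `snp2`, ... are made up, and plain `cbind()` is used because, as far as I know, recent versions of Matrix no longer need `cBind()`. The glmnet call only runs if the package is installed.

```r
library(Matrix)

set.seed(1)
n <- 73
p <- 200                       # hypothetical stand-in for 40,000 columns
lv <- c("AA", "AB", "BB")      # invented genotype codes
geno <- as.data.frame(setNames(
  lapply(seq_len(p), function(j) factor(sample(lv, n, replace = TRUE), levels = lv)),
  paste0("snp", seq_len(p))
))
y <- rnorm(n)                  # simulated response

# Encode one factor column as sparse dummy variables, dropping the intercept,
# so the first level acts as the reference category.
encode_column <- function(name) {
  f <- geno[[name]]
  m <- sparse.model.matrix(~ f)[, -1, drop = FALSE]
  colnames(m) <- paste0(name, levels(f)[-1])
  m
}

# Bind the small sparse blocks side by side instead of expanding one huge formula.
x <- do.call(cbind, lapply(names(geno), encode_column))
dim(x)

if (requireNamespace("glmnet", quietly = TRUE)) {
  fit <- glmnet::glmnet(x, y, alpha = 1)   # alpha = 1 is the lasso penalty
}
```

Encoding each column separately keeps every intermediate object small, which is the point of building the matrix by hand.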

## Data Preparation – Part I

The R language provides tools for modeling and visualization, but it is also an excellent tool for handling and preparing data. As in C++ or Python, there are some tricks that improve performance, make the code cleaner, or both, but especially with R these choices can have a huge impact on performance and on the “size” of your code. A seasoned R user can manage this effectively, but it can be a headache for a new user. So, in this series of posts I will present some data preparation techniques that anyone should know about, or at least the ones I know!

1. Using apply, lapply, tapply

Sometimes the apply functions can make your code faster, sometimes just cleaner. But the fact is that, at least in R, it is recommended to avoid for loops. So, instead of using loops, you can iterate over matrices, lists and vectors using these functions. As an example, see this code:

matriz <- matrix(round(runif(9,1,10),0),nrow=3)
apply(matriz, 1, sum) ## sum by row
apply(matriz, 2, sum) ## sum by column


In this particular example there is no gain in performance, but you get cleaner code.

Talking about means, tapply can be very useful in this regard. Let’s say you want to get means by group; you can do this with one line too. For example, considering the mtcars dataset:
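For sums and means in particular, base R also ships vectorized shortcuts (`rowSums()`, `colSums()`, `rowMeans()`, `colMeans()`) that give the same results as the corresponding `apply()` calls. A small check, with a seed added only to make the example reproducible:

```r
set.seed(42)
matriz <- matrix(round(runif(9, 1, 10), 0), nrow = 3)

row_totals <- apply(matriz, 1, sum)   # sum by row with apply
col_means  <- apply(matriz, 2, mean)  # mean by column with apply

# The dedicated functions return the same values.
all.equal(row_totals, rowSums(matriz))
all.equal(col_means, colMeans(matriz))
```

`apply()` remains the general tool when the function is not a plain sum or mean.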

tapply(mtcars$hp, mtcars$cyl, mean)


and you get the mean horsepower by number of cylinders. This function is very useful in descriptive analysis. But sometimes you have lists, not vectors. In this case just use lapply or sapply (which simplifies the output). Let’s generate some data:

lista <- list(a=c('one', 'two', 'three'), b=c(1,2,3), c=c(12, 'a'))


Each element of this list is a vector. Let’s say you want to know how many elements there are in each vector:

lapply(lista, length) ## return a list
sapply(lista, length) ## coerce to a vector
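A related base R function is `vapply()`, which works like `sapply()` but makes you state the type and length of each result, so a surprise in the data fails loudly instead of silently changing the output shape. A short sketch with the same list:

```r
lista <- list(a = c("one", "two", "three"), b = c(1, 2, 3), c = c(12, "a"))

# Each call to length() must return exactly one integer.
sizes <- vapply(lista, length, integer(1))
sizes

# c(12, "a") was coerced to character when the list was built.
classes <- vapply(lista, class, character(1))
classes
```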


2. Split, apply and recombine

This is a technique you must know. Basically, we split the data, apply a function to each piece, and combine the results. There is a package created with this in mind, but we will use just base R functions: split, the *apply family, and cbind() or rbind() when needed. Looking again at the mtcars dataset, let’s say we want to fit a model of mpg against disp, grouped by gear, and compare the regression coefficients.

data <- split(mtcars, mtcars$gear) ## split
fits <- lapply(data, function(x) coef(lm(mpg ~ disp, data = x))) ## apply
do.call(rbind, fits) ## recombine


This technique is powerful. You can use it in different contexts.

In the next part I will talk about some tricks with dates.
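As one example of another context, the same three steps can build a small table of descriptive statistics by group. This is just an illustrative sketch using base R and mtcars; the object names are arbitrary.

```r
groups <- split(mtcars$hp, mtcars$cyl)                   ## split
stats <- lapply(groups, function(v) {
  c(n = length(v), mean = mean(v), sd = sd(v))           ## apply
})
summary_table <- do.call(rbind, stats)                   ## recombine
summary_table
```

Each row of `summary_table` is one cylinder group and each column is one statistic of hp.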