Likes are interesting from a data analysis standpoint because each one is only a binary indicator, and the negative value confounds don't care, don't like, and never exposed to the tag. This makes the data set fairly weak compared with other categories of data that are easily accessible online. On the upside, the number of tags is a limited analytical space, which facilitates regression. Principal Component Analysis (PCA) would likely produce the same kinds of outputs with 20 or 30 variables rather than the 700+ used here. Also problematic is that they include profiles with only a single like, which necessarily hurts predictive accuracy. It would have been interesting to look at the model with an increasing minimum number of likes to see when enough data has accumulated to form a useful model.
The article also demonstrates the problem of outliers. The correlation of liking Science with high intelligence probably reflects a general correlation, while the correlation of 'curly fries' with high intelligence is likely an outlier resulting from a few data points. One problem with data mining is that if you fail to make the p-value threshold for correlations strict enough then you will detect more noise events in your data, while if you make it too strict then you will overlook useful information. With 700 inputs, a threshold of p = 0.05 would be expected to produce about 35 'significant' correlations by mere chance. The article uses p values of <0.001, but this still doesn't prevent the problem of spurious correlation.
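To make the arithmetic above concrete, here is a minimal Python sketch. The function names are my own, and it assumes the 700 tests are independent and that every null hypothesis is true, which is a simplification and not something the article states.

```python
import random


def expected_false_positives(n_tests, alpha):
    """Expected number of 'significant' results when every null is true."""
    return n_tests * alpha


def simulate_false_positives(n_tests, alpha, seed=0):
    """Count chance 'hits' in one simulated batch of tests.

    Assumption: under the null hypothesis each p-value is uniform on [0, 1],
    and the tests are independent of one another.
    """
    rng = random.Random(seed)
    return sum(rng.random() < alpha for _ in range(n_tests))


if __name__ == "__main__":
    for alpha in (0.05, 0.001):
        expected = expected_false_positives(700, alpha)
        observed = simulate_false_positives(700, alpha)
        print(f"alpha={alpha}: expected ~{expected:.1f}, one simulated run gave {observed}")
```

Even at the stricter threshold the expected count is a little under one, so a single spurious tag out of 700 would not be surprising.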
More interesting is the predictive power when dealing with non-binary characteristics (e.g. age). Significant correlations were found with Intelligence, Extroversion, Openness, Emotional Stability, Agreeableness, etc. The problem is that, given the very large number of measurements used, the threshold for significance is trivially low, so a correlation can be statistically significant and still be irrelevant when compared with the natural variation. In this instance, given that n > 50,000, a Pearson's correlation of 0.01 has a (one-tailed) p value of less than 0.013. This raises the question of what significance means in a large data set.
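The figure for r = 0.01 can be checked with a short Python sketch. The helper name is hypothetical, and it uses a normal approximation to the t distribution, which should be very close at this sample size; treat it as a rough check and not as the article's method.

```python
import math


def pearson_p_value(r, n, two_tailed=False):
    """Approximate p-value for a Pearson correlation r from n samples.

    Uses t = r * sqrt((n - 2) / (1 - r^2)) and then treats t as a standard
    normal variable (an assumption that is reasonable only for large n).
    """
    t = r * math.sqrt((n - 2) / (1.0 - r * r))
    one_tail = 0.5 * math.erfc(abs(t) / math.sqrt(2.0))
    return 2.0 * one_tail if two_tailed else one_tail


if __name__ == "__main__":
    r, n = 0.01, 50_000
    print("one-tailed p:", pearson_p_value(r, n))
    print("two-tailed p:", pearson_p_value(r, n, two_tailed=True))
    # Share of variance explained by such a correlation.
    print("r squared:", r * r)
```

The one-tailed value comes out just under 0.013, while r squared is 0.0001, meaning such a correlation accounts for about one hundredth of one percent of the variance.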
Still, it is an excellent insight into the type of data available from a moderately sized data set, and we are unlikely to see many sample sets such as this published.
This is the study that I referenced in class. I think this is of particular interest when it comes to profiling (similar to what we saw in U.S. v. Sokolow). We could determine that, for profiling purposes, certain questions or data go beyond what we are willing to use against individuals as a matter of policy. But this study shows that even if government does not look for certain characteristics or ask certain questions, our own activity can indicate this type of information without being explicit. It certainly raises eyebrows about what our online activity says about us and how it could be used.