What do Vegans do in their Spare Time? Latent Interest Detection in Multi-Community Networks
Jack Hessel, Alexandra Schofield, Lillian Lee, David Mimno
Abstract
Most social network analysis works at the level of interactions
between users. But the vast growth in size and complexity of social
networks enables us to examine interactions at larger scale. In this
work we use a dataset of 76M submissions to the social network Reddit,
which is organized into distinct sub-communities called subreddits. We
measure the similarity between entire subreddits both in terms of user
similarity and topical similarity. Our goal is to find community pairs
with similar userbases, but dissimilar content; we refer to this type
of relationship as a "latent interest." Detection of latent interests
not only provides a perspective on individual users as they shift
between roles (student, sports fan, political activist) but also gives
insight into the dynamics of Reddit as a whole. Latent interest
detection also has potential applications for recommendation systems
and for researchers examining community evolution.
So... What do they do in their spare time?
Our definition of spare time relates to "latent interests." In
short, the list below contains subreddits where vegans spend a
lot of time not talking directly about veganism.
- AnimalRights
- Anarchism
- yoga
- VegRecipes
- Feminism
- environment
- philosophy
- gardening
- bicycling
- Buddhism
Here are some examples for liberals and conservatives, as well...
The top latent interests for liberals:
- California
- GunsAreCool
- Bad_Cop_No_Donut
- economy
- Feminism
- immigration
- RenewableEnergy
- energy
- newyork
- democrats
The top latent interests for conservatives:
- Bad_Cop_No_Donut
- guns
- Christianity
- Military
- economy
- Economics
- Catholicism
- progun
- climateskeptics
- religion
If you're interested in the latent interests of your favorite subreddit, you can download the full results list here. For each subreddit, there are the top latent interests (labeled "ALL") and the top latent interests with the top 10 most similar user-wise subreddits disregarded (labeled "JUST LATENT"). Results are presented here in two different ways because controlling for textual similarity still doesn't always produce novel latent interests when compared to a simple "most similar user" baseline.
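As a rough illustration of how the "JUST LATENT" list relates to the "ALL" list, the filtering step can be sketched as follows. This is a minimal sketch, not the code used to produce the results: the function name `just_latent`, its arguments, and the input shapes (an already-ranked list of latent interests plus a dictionary of user similarities to the target subreddit) are all assumptions made for the example.

```python
def just_latent(ranked_interests, user_sims, n_disregard=10):
    """Drop the subreddits most similar to the target by user overlap.

    ranked_interests: subreddit names, best latent interest first
                      (the "ALL" list; how it is ranked is not shown here).
    user_sims:        hypothetical dict mapping subreddit name to its
                      user similarity with the target subreddit.
    n_disregard:      how many of the most user-similar subreddits to drop
                      (10 in the description above).
    """
    # Find the n_disregard subreddits with the highest user similarity.
    most_similar = set(
        sorted(user_sims, key=user_sims.get, reverse=True)[:n_disregard]
    )
    # Keep the original ranking, minus those subreddits.
    return [sub for sub in ranked_interests if sub not in most_similar]
```

For example, with `n_disregard=2`, a ranked list `["a", "b", "c", "d"]` and user similarities `{"a": 0.9, "b": 0.1, "c": 0.8, "d": 0.2}` would be reduced to `["b", "d"]`.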
Paper Link and Citation
This paper appeared in the NIPS 2015 Networks Workshop. The arXiv version is available here.
Dataset download
The readme of the dataset is reproduced here. The ~2GB download is available here.
A processed subset of data from Tan and Lee (2015). This is a companion dataset to "What do Vegans do in their Spare Time? Latent Interest Detection in Multi-Community Networks" by Jack Hessel, Alexandra Schofield, Lillian Lee, and David Mimno.
Included here are:
- finalSubs.txt -- a list of ~3.2K subs that both had enough users and enough text posts to analyze.
- redditTextBalanced.txt -- a list of just under 6.6M text posts from the 3.2K subs in finalSubs.txt. Subreddits have a maximum of 5K text posts in this dataset, but perfect "class balance" is not attained just with this limit.
- clusters.txt -- a set of 51 ground truth, potentially overlapping clusters. In the paper, we refer to "37" ground truth clusters. This number arises from comparing this list to finalSubs.txt, and only considering the 37 clusters for which 4 of their members were in our final set of considered subreddits.
- userSims/
  - jaccardSims-sparse.txt -- a list of pairs of subreddits and the Jaccard similarity of their submitting-user sets. If a pair is not included in the list, the similarity is zero.
- textSims/
  - *-out-all.graph -- 1-JSD (as in equation (1) in our paper) between topic distributions for all pairs of subreddits in our considered set.
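
The two similarity measures stored in these files can be illustrated with a short sketch. This is not the code used to build the dataset: the function names are hypothetical, and the base-2 logarithm (which keeps JSD, and therefore 1-JSD, between 0 and 1) is an assumption rather than something stated in this readme. It also does not parse the files themselves, since their exact layout is not described here.

```python
import math


def jaccard_similarity(users_a, users_b):
    """Jaccard similarity of two collections of submitting users."""
    a, b = set(users_a), set(users_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def one_minus_jsd(p, q):
    """1 minus the Jensen-Shannon divergence of two topic distributions.

    p and q are equal-length sequences of probabilities summing to 1.
    Assumes base-2 logs, so the result lies in [0, 1].
    """
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # mixture distribution
    jsd = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return 1.0 - jsd
```

Under these assumptions, two subreddits with identical topic distributions score 1 and two with no topical overlap score 0, so a pair with a high Jaccard similarity and a low 1-JSD score has a similar userbase but dissimilar content.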
Acknowledgements
This material is based upon work supported by the National Science Foundation under Grant No. 0910664, a Google Research grant, the Cornell Institute for the Social Sciences, and a Cornell University fellowship. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the supporting institutions.