What do Vegans do in their Spare Time? Latent Interest Detection in Multi-Community Networks

Jack Hessel, Alexandra Schofield, Lillian Lee, David Mimno

Abstract

Most social network analysis works at the level of interactions between users. But the vast growth in size and complexity of social networks enables us to examine interactions at a larger scale. In this work we use a dataset of 76M submissions to the social network Reddit, which is organized into distinct sub-communities called subreddits. We measure the similarity between entire subreddits both in terms of user similarity and topical similarity. Our goal is to find community pairs with similar userbases, but dissimilar content; we refer to this type of relationship as a "latent interest." Detection of latent interests not only provides a perspective on individual users as they shift between roles (student, sports fan, political activist) but also gives insight into the dynamics of Reddit as a whole. Latent interest detection also has potential applications for recommendation systems and for researchers examining community evolution.

So... What do they do in their spare time?

Our definition of spare time relates to "latent interests." In short, the list below shows subreddits where vegans spend a lot of time while not talking directly about veganism.

  1. AnimalRights
  2. Anarchism
  3. yoga
  4. VegRecipes
  5. Feminism
  6. environment
  7. philosophy
  8. gardening
  9. bicycling
  10. Buddhism

Here are some examples for liberals and conservatives, as well...

The top latent interests for liberals:
  1. California
  2. GunsAreCool
  3. Bad_Cop_No_Donut
  4. economy
  5. Feminism
  6. immigration
  7. RenewableEnergy
  8. energy
  9. newyork
  10. democrats

The top latent interests for conservatives:
  1. Bad_Cop_No_Donut
  2. guns
  3. Christianity
  4. Military
  5. economy
  6. Economics
  7. Catholicism
  8. progun
  9. climateskeptics
  10. religion

If you're interested in the latent interests of your favorite subreddit, you can download the full results list here. For each subreddit, we list the top latent interests (labeled "ALL") and the top latent interests after disregarding the 10 subreddits most similar in terms of users (labeled "JUST LATENT"). Results are presented in these two ways because controlling for textual similarity still doesn't always produce novel latent interests when compared to a simple "most similar user" baseline.
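
As a rough illustration of the difference between the two labels, the sketch below drops the k subreddits with the highest user similarity from an already ranked list. The function name `just_latent`, the input layout, and the toy values are hypothetical; this is not the code that produced the results file, only one way to read the description above.

```python
def just_latent(ranked, user_sim, k=10):
    """Return `ranked` without the k subreddits most similar by users.

    ranked   -- subreddit names, best latent interest first (assumed layout)
    user_sim -- dict mapping subreddit name to a user-similarity score
    k        -- how many of the most user-similar subreddits to disregard
    """
    # Pick the k subreddits with the highest user similarity.
    most_similar = set(sorted(user_sim, key=user_sim.get, reverse=True)[:k])
    # Keep the original ranking order for everything else.
    return [name for name in ranked if name not in most_similar]


# Toy example with made-up scores and k=2 instead of 10.
toy_ranked = ["subA", "subB", "subC", "subD"]
toy_user_sim = {"subA": 0.9, "subB": 0.1, "subC": 0.5, "subD": 0.2}
toy_filtered = just_latent(toy_ranked, toy_user_sim, k=2)
print(toy_filtered)
```

With the toy scores above, the two most user-similar entries are removed and the remaining ones keep their relative order.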

Paper Link and Citation

This paper appeared in the NIPS 2015 Networks Workshop. The arXiv version is available here.

Dataset download

The readme of the dataset is reproduced here. The ~2GB download is available here.

A processed subset of data from Tan and Lee (2015). This is a companion dataset to "What do Vegans do in their Spare Time? Latent Interest Detection in Multi-Community Networks" by Jack Hessel, Alexandra Schofield, Lillian Lee, and David Mimno.

Included here are:

  1. finalSubs.txt -- a list of ~3.2K subs that both had enough users and enough text posts to analyze.
  2. redditTextBalanced.txt -- a list of just under 6.6M text posts from the 3.2K subs in finalSubs.txt. Subreddits have a maximum of 5K text posts in this dataset, but perfect "class balance" is not attained just with this limit.
  3. clusters.txt -- a set of 51 ground truth, potentially overlapping clusters. In the paper, we refer to "37" ground truth clusters. This number arises from comparing this list to finalSubs.txt, and only considering the 37 clusters for which 4 of their members were in our final set of considered subreddits.
  4. userSims/
    1. jaccardSims-sparse.txt -- a list of pairs of subreddits and the Jaccard similarity of their submitting-user sets. If a pair is not included in the list, the similarity is zero.
  5. textSims/
    1. *-out-all.graph: 1-JSD (as in Equation (1) in our paper) between topic distributions for all pairs of subreddits in our considered set.
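
For readers unfamiliar with the two similarity measures stored in userSims/ and textSims/, here is a minimal sketch of how each could be computed for a single pair of subreddits. It assumes user sets are plain Python sets and topic distributions are equal-length lists of probabilities, and it assumes a base-2 logarithm so that the Jensen-Shannon divergence (JSD) falls in [0, 1]; the exact form of Equation (1) in the paper may differ. The function names and toy values are illustrative only and do not read the dataset files.

```python
import math


def jaccard(users_a, users_b):
    """Jaccard similarity of two user sets: |A & B| / |A | B|."""
    a, b = set(users_a), set(users_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def one_minus_jsd(p, q):
    """1 - JSD(p, q) for two topic distributions (base-2 log assumed)."""
    # Midpoint distribution between p and q.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    jsd = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return 1.0 - jsd


# Toy pair: heavy user overlap, but disjoint topics.
user_similarity = jaccard({"u1", "u2", "u3"}, {"u2", "u3", "u4"})
text_similarity = one_minus_jsd([1.0, 0.0], [0.0, 1.0])
print(user_similarity, text_similarity)
```

A pair like the toy one, with relatively high user similarity and low text similarity, is the kind of relationship the paper calls a latent interest.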

Acknowledgements

This material is based upon work supported by the National Science Foundation under Grant No. 0910664, a Google Research grant, the Cornell Institute for the Social Sciences, and a Cornell University fellowship. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the supporting institutions.