Dataset to accompany "Science, AskScience, and BadScience: On the Coexistence of Highly Related Communities" (ICWSM 2016). If you use this dataset in your research, please give credit to Jason Baumgartner of pushshift.io for his scraping work. If you use the reconstructed discussion trees provided here, please cite the following article:

@inproceedings{hessel2016science,
  title={Science, AskScience, and BadScience: On the Coexistence of Highly Related Communities},
  author={Hessel, Jack and Tan, Chenhao and Lee, Lillian},
  booktitle={The 10th International AAAI Conference on Web and Social Media},
  year={2016}
}

Update, April 2018: Version 1.0 -> 1.1

- A small amount of data was missing from this dataset. While we were aware of this when we published V1.0 (see the "stats" files described below), the negligibility of the missing information was recently called into question. More information is available here:
  http://devingaffney.com/caveat-emptor-computational-social-science-large-scale-missing-data-in-a-widely-published-reddit-corpus/
  To correct this, we re-scraped all post/comment gaps in March 2018 (we scraped each possible missing ID 3 times to ensure that missing posts/comments were not merely casualties of intermittent API errors). Compared to version 1.0, version 1.1 contains an additional 1.3M posts (missing rate of 1.1%) and 2.6M comments through 2015 (missing rate of .125%). We re-ran key experiments from the paper with the missing data incorporated (in particular, re-doing the pairing experiments giving rise to Figures 6 and 7), and the results are robust. See http://www.cs.cornell.edu/~jhessel/reddit/gaps.html for more detail.
- Removed "score" from the meta files; it is not needed for any of the results in the paper, and its semantics have changed on reddit itself.
- Added *-dangling files, which describe the number of dangling references per community.

~~~

Files:

(10GB) ICWSM2016.tar.bz2 contains the following files for each of ~5.7K subreddits. The subreddits selected follow the filtering process detailed in "What do Vegans do in their Spare Time? Latent Interest Detection in Multi-Community Networks" (Hessel et al. 2015; NIPS networks workshop). Specifically, a community is only included in this dataset if at least 300 unique users have submitted text or link posts to it.

- X.jsonlist-meta: one line per post/comment. Data on the line is of the form

  (spaces) username utc_timestamp post_id

  The number of leading spaces on a line indicates the depth of the post/comment. Normal reddit submissions have zero leading spaces, first-level comments have one leading space, etc. Consider the following example:

SalamiJack 1333520401 t3_rsiqs
 alyvian 1333559101 t1_c48fmft
  sirhotalot 1333573483 t1_c48ixz5
 nonesaid 1333562944 t1_c48gie3

  In this case, /u/SalamiJack started a thread, to which /u/alyvian and /u/nonesaid replied directly. /u/sirhotalot replied to /u/alyvian's reply. A small number of reddit users have a space in their username, so please keep this in mind when parsing these files (see the parsing sketch after this list).

- X.jsonlist-stats: a python prettytable-formatted summary estimating the extent of scraping errors. The posts and comments come from different data sources (posts are from Tan and Lee (WWW 2015); comments are from Jason Baumgartner's reddit post, see below). The posts dataset has a "num_comments" field that provides a noisy estimate of how many comments should be recoverable from the comments dataset; this is the number of "comments expected." The rest of the file contains statistics on the extent to which the number of comments we expect exceeds the number of comments actually present. Most subreddits have negligible missing data.

- X.jsonlist-dangling: we count the number of "dangling references" per community. A dangling reference is discovered as follows: each comment has a pointer to the ID of its parent (which can be a post or a comment). If a comment's parent cannot be found, we know that the parent existed at one point but cannot be recovered. Each *-dangling file contains two values: the number of dangling comments due to missing posts, and then the number of dangling comments due to missing comments. If a comment is dangling, it (and all of its children) are discarded; a sketch of this pruning step appears at the end of this file. The vast majority of subreddits we consider have zero dangling references.
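The following is a minimal Python sketch (not code distributed with the dataset) of how a *-meta file could be parsed into discussion trees, based solely on the format described above. Splitting each line from the right handles the rare usernames that contain spaces; the file name in the usage comment and the dictionary field names are illustrative assumptions.

# Sketch: parse an X.jsonlist-meta file into discussion trees.
# Depth is encoded by leading spaces; each line is "username utc id".
# Usernames may contain spaces, so the id and timestamp are split off
# from the right.

def parse_meta(path):
    trees = []   # one root node per submission
    stack = []   # stack[d] = most recently seen node at depth d
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line.strip():
                continue
            depth = len(line) - len(line.lstrip(" "))
            user, timestamp, node_id = line.strip().rsplit(" ", 2)
            node = {"user": user, "utc": int(timestamp),
                    "id": node_id, "children": []}
            if depth == 0:
                trees.append(node)          # new submission/thread
            else:
                stack[depth - 1]["children"].append(node)
            # this node becomes the parent candidate at its depth
            del stack[depth:]
            stack.append(node)
    return trees

# Example usage (file name is illustrative):
# for tree in parse_meta("science.jsonlist-meta"):
#     print(tree["id"], len(tree["children"]))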
History:

This dataset includes meta information from roughly 1 billion posts + comments from the website www.reddit.com, spanning reddit's inception in 2006 through Nov. 2014. It was constructed using data provided by Jason Baumgartner (pushshift.io), who has done an excellent job scraping reddit. You can read more about the comments dataset here (Baumgartner's reddit username is /u/Stuck_In_The_Matrix):
https://www.reddit.com/r/datasets/comments/3bxlg7/i_have_every_publicly_available_reddit_comment/

The data contribution of the authors is the reconstruction of the entire discussion trees. While Baumgartner's dataset includes a list of all comments, significant processing was required to reconstruct entire discussion trees: in its raw form, each comment simply contains a pointer to its parent. The list of posts for which discussions were reconstructed was taken from "All Who Wander: On the Prevalence and Characteristics of Multi-community Engagement" (WWW 2015); that version of the dataset contains no comments. If you use only the post information from this dataset, please consider citing the following article instead:

@inproceedings{tan+lee:15,
  author = {Chenhao Tan and Lillian Lee},
  title = {All Who Wander: On the Prevalence and Characteristics of Multi-community Engagement},
  year = {2015},
  booktitle = {Proceedings of WWW}
}

More information about Tan and Lee 2015 can be found here: https://chenhaot.com/pages/multi-community.html

In 2018, Gaffney and Matias pointed out that a small amount of data was potentially missing from the set (https://arxiv.org/pdf/1803.05046.pdf). In March 2018, we re-scraped post and comment gaps in accordance with their sequential reddit ID analysis. The meaning of post scores has changed on reddit since the initial scraping period (http://redd.it/5gvd6b) and, as a result, we have omitted score information from the version 1.1 release.

Additional Information:

This dataset contains only the meta-data for posts and comments; however, the full text of each comment is also available. The entire dataset (including comment text) is around 120GB compressed, and this study uses none of the comment text. If you're interested in gaining access to the whole dataset, contact jhessel@cs.cornell.edu.
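Appendix: to make the dangling-reference pruning described under X.jsonlist-dangling concrete, here is a rough Python sketch of one way to count dangling comments and drop them together with their descendants, given a flat list of comments that each carry a fullname and a parent pointer. The field names ("name", "parent_id") and the t1_/t3_ prefix convention follow reddit's usual API conventions and are assumptions for illustration, not a specification of the released files or of the exact procedure we used.

# Sketch: count dangling references and prune dangling comments.
# known_posts: set of post fullnames (e.g., "t3_rsiqs").
# comments: list of dicts with "name" (e.g., "t1_c48fmft") and "parent_id".

def prune_dangling(known_posts, comments):
    known_comments = {c["name"] for c in comments}
    children = {}   # parent fullname -> list of child comment dicts
    for c in comments:
        children.setdefault(c["parent_id"], []).append(c)

    dangling_due_to_posts = 0
    dangling_due_to_comments = 0
    dropped = set()

    def drop_subtree(comment):
        # discard this comment and, transitively, all of its children
        dropped.add(comment["name"])
        for child in children.get(comment["name"], []):
            drop_subtree(child)

    for c in comments:
        parent = c["parent_id"]
        if parent.startswith("t3_") and parent not in known_posts:
            dangling_due_to_posts += 1       # parent post is missing
            drop_subtree(c)
        elif parent.startswith("t1_") and parent not in known_comments:
            dangling_due_to_comments += 1    # parent comment is missing
            drop_subtree(c)

    kept = [c for c in comments if c["name"] not in dropped]
    return dangling_due_to_posts, dangling_due_to_comments, kept

Under these assumptions, the two returned counts correspond to the two values stored in each *-dangling file.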