DataSets

Introduction

Beta-Recsys provides users a wide range of datasets for recommendation system training. For convenience, we preprocess a number of datasets for you to train, getting you rid of splitting them on you local machine. Also this framework provides users a set of useful interfaces for data split.


Usage

The following codes can automatically download the Movielens_100k dataset, and split it using the leave-one-out splitting strategy.

from beta_rec.datasets.movielens import Movielens_100k
dataset = Movielens_100k()
split_dataset = dataset.load_leave_one_out()

To clean the dataset by filtering before splitting, you can use the filtering parameters. E.g., to filter out users that have less then 30 items and items that have less then 15 records, you can run:

dataset = Movielens_100k(min_u_c=15, min_i_c=30)

with these filtering parameters showing as follows:

min_u_c: filter the items that were purchased by less than min_u_c users.
(default: :obj:`0`)
min_i_c: filter the users that have purchased by less than min_i_c items.
(default: :obj:`3`)
min_o_c: filter the users that have purchased by less than min_o_c orders.
(default: :obj:`0`)

By default, the testing set will sample 100 negative items to reduce the evaluation cost. To reduce the bias of certain negative items, each splitting strategy will generate 10 different validation and testing sets. You can also specify these parameters:

split_dataset = dataset.load_leave_one_out(n_test=15,n_negative=200)

If you want to use the data splits generate by our Beta team, you can specify the download parameter.

from beta_rec.datasets.movielens import Movielens_100k
dataset = Movielens_100k()
split_dataset = dataset.load_leave_one_out(download=True)

For some very large datasets, generating negative items can be time-costly. This feature can greatly reduce some repeated work, and provide a benchmarking.

Note: 25, Oct. 2020. Now the preprocessed splits of each dataset are a bit out-of-date, we will regenerate a new version as soon as possible.

Dataset Statistics

Here we present some basic staticstics for the datasets in our framework.

Dataset Interactions Baskets Temporal
MovieLens-100K ✔️ ✖️ ✔️
MovieLens-1M ✔️ ✖️ ✔️
MovieLens-25M ✔️ ✖️ ✔️
Last.FM ✔️ ✖️ ✖️
Epinions ✔️ ✖️ ✖️
Tafeng ✔️ ✖️ ✔️
Dunnhumby ✔️ ✔️ ✔️
Instacart ✔️ ✖️ ✔️
citeulike-a ✔️ ✖️ ✖️
citeulike-t ✔️ ✖️ ✖️
HetRec MoiveLens ✔️ ✖️ ✔️
HetRec Delicious ✔️ ✔️ ✖️
HetRec LastFM ✔️ ✔️ ✔️
Yelp ✔️ ✖️ ✔️
Gowalla ✔️ ✖️ ✔️
Yoochoose ✔️ ✖️ ✔️
Diginetica ✔️ ✖️ ✔️
Taobao ✔️ ✖️ ✔️
Ali-mobile ✔️ ✖️ ✔️
Retailrocket ✔️ ✖️ ✔️
Amazon Reviews ✔️

Because some split methods require a specific features, like random_basket expect the dataset has a Basket column. Here we list all the split methods for each dataset.

The prerequisite for each split methods are:

  • leave_one_out: none
  • leave_one_basket: require a Basket column in dataset
  • random: none
  • random_basket: require a Basket column in dataset
  • temporal: require a Timestamp(Temporal) column in dataset
  • temporal_basket: require a Timestamp(Temporal) and a Basket column in dataset
Dataset leave_one_out leave_one_basket random random_basket temporal temporal_basket
MovieLens-100K ✔️ ✖️ ✔️ ✖️ ✔️ ✖️
MovieLens-1M ✔️ ✖️ ✔️ ✖️ ✔️ ✖️
MovieLens-25M ✔️ ✖️ ✔️ ✖️ ✖️
Last.FM ✔️ ✖️ ✔️ ✖️ ✖️ ✖️
Epinions ✔️ ✖️ ✔️ ✖️ ✖️ ✖️
Tafeng ✔️ ✖️ ✔️ ✖️ ✔️ ✖️
Dunnhumby ✔️ ✔️ ✔️ ✔️ ✔️ ✔️
Instacart ✔️ ✖️ ✔️ ✖️ ✔️ ✖️
citeulike-a ✔️ ✖️ ✔️ ✖️ ✖️ ✖️
citeulike-t ✔️ ✖️ ✔️ ✖️ ✖️ ✖️
HetRec MoiveLens ✔️ ✖️ ✔️ ✖️ ✔️ ✖️
HetRec Delicious ✔️ ✔️ ✔️ ✖️ ✖️ ✖️
HetRec LastFM ✔️ ✔️ ✔️ ✔️ ✔️ ✔️
Yelp ✔️ ✖️ ✔️ ✖️ ✖️
Gowalla ✔️ ✖️ ✔️ ✖️ ✖️
Yoochoose ✔️ ✖️ ✔️ ✖️ ✖️
Diginetica ✔️ ✖️ ✔️ ✖️ ✖️
Taobao ✔️ ✖️ ✔️ ✖️ ✖️
Ali-mobile ✔️ ✖️ ✔️ ✖️ ✖️
Retailrocket ✔️ ✖️ ✔️ ✖️ ✖️
Amazon Reviews

Also, we provide some information about the dataset content such as the number of items, users and so on. This may give you a brief view of the dataset.

Dataset #Interactions #User #Item #Rating #Timestamp
MovieLens-100K 100,000 943 1,682 5 49,282
MovieLens-1M 1,000,209 6,040 3,706 5 458,455
MovieLens-25M 25,000,095 162,541 59,047 10 20,115,267
Last.FM 92,834 1,892 17,632 5,436 1
Epinions 664,825 40,163 139,738 5 1
Tafeng 464118 9238 7973 1 464118
Dunnhumby 2595732 2500 92339 1 2595732
Instacart 33,819,106 206,209 49,685 1 3,346,083
citeulike-a 204,986 240 16,980 1 1
citeulike-t 134,860 216 25,584 1 1
HetRec MoiveLens 855,598 2,113 10,109 10 809,328
HetRec Delicious 437,593 1,867 69,223 1 104,093
HetRec LastFM 186,479 1,892 12,523 1 9,749
Yelp 8,021,122 1,968,703 209,393 5 7,853,102
Gowalla 6,442,892 107,092 1,280,969 1 5,561,957
Yoochoose 1,150,753 509,696 735 1 19,949
Diginetica 1,235,380 310,324 122,993 1 152
Taobao 3,835,331 37,376 930,607 1 698,889
Ali-mobile 12,256,906 10,000 2,876,947 1 1
Retailrocket 2,756,101 1,407,58 235,061 1 2,749,921
Amazon Reviews -- Amazon Instant Video 583,933 426,922 23,965 5 3,027
Amazon Reviews -- Musical Instruments 500,176 339,231 83,046 5 5,339
Amazon Reviews -- Digital Music 836,006 478,235 266,414 5 5,941
Amazon Reviews -- Baby 915,446 531,890 64,426 5 4,869
Amazon Reviews -- Grocery and Gourmet Food 1,297,156 768,438 166,049 5 3,831
Amazon Reviews -- Patio, Lawn and Garden 993,490 714,791 105,984 5 4,929
Amazon Reviews -- Automotive 1,373,768 851,418 320,112 5 3,704
Amazon Reviews -- Pet Supplies 1,235,316 740,985 103,288 5 3,900
Amazon Reviews -- Cell Phones and Accessories 3,447,249 2,261,045 319,678 5 4,724
Amazon Reviews -- Health and Personal Care 2,982,326 1,851,132 252,331 5 4,733
Amazon Reviews -- Toys and Games 2,252,771 1,342,911 327,698 5 5,151
Amazon Reviews -- Video Games 1,324,753 826,767 50,210 5 5,396
Amazon Reviews -- Tools and Home Improvement 1,926,047 1,212,468 260,659 5 5,366
Amazon Reviews -- Beauty 2,023,070 1,210,271 249,274 5 4,231
Amazon Reviews -- Apps for Android 2,638,173 1,323,884 61,275 5 1,283
Amazon Reviews -- Office Products 1,243,186 909,314 130,006 5 5,400
Amazon Reviews -- Sports And Outdoors 3,268,695 1,990,521 478,898 5 4,786
Amazon Reviews -- Kindle Store 3205467 1,406,890 1,406,890 5 3,328
Amazon Reviews -- Home And Kitchen 4,253,926 2,511,610 410,243 5 5,202
Amazon Reviews -- Clothing Shoes And Jewelry 5,748,920 3,117,268 1,136,004 5 4,209
Amazon Reviews -- CDs And Vinyl 3,749,004 1,578,597 486,360 5 6,041
Amazon Reviews -- Movies And TV 4,607,047 2,088,620 200,941 5 6,004
Amazon Reviews -- Electronics 7,824,482 4,201,696 476,002 5 5,489
Amazon Reviews -- Books 22,507,155 8,026,324 2,330,066 5 6,296

Dataset Usage

Download Data

Beta-Recsys provides download interface for users to download different dataset. Here is an example:

import sys
import os
sys.path.append(os.path.abspath('.'))
from beta_rec.datasets.movielens import Movielens_1m

movielens_1m = Movielens_1m()
movielens_1m.download()

However, not every dataset could be downloaded directly with our framework. For some datasets, you will still have to download them manually. You are supposed to follow our tips to download and put the dataset in the correct folder in order to be detected by our framework.

Load Data

Downloading and preprocessing giant datasets may be a disturbing things, and in order to deal with this issue, we have preprocessed a wide range of datasets and stored the processed data in our remote server. Users can access them easily by using our load function.

import sys
import os
sys.path.append(os.path.abspath('.'))
from beta_rec.datasets.movielens import Movielens_1m

movielens_1m = Movielens_1m()
movielens_1m.load_leave_one_out()
movielens_1m.load_random_split()

Due to storage limitation, we only store a copy of split data with default parameters. If you want a custom split, you’ll still have to split them on you local machine.

Make Data

Users can simply ignore these functions because when you use custom parameters in load functions, it will automatically call make functions. So you don’t need to care about this functions. We strongly recommend you to use load function directly in most of you time.


Data Split

For users who are willing to split some datasets that are not covered by our framework, we still provide various methods to make it easy to split huge data, without caring the implementation details. There are 6 main methods for users to split data.

random_split

This method splits data into random train and test subsets.

This method will first shuffle all the data and then select a portion of records based on the given test_rate randomly.

random_basket_split

This method will select a portion of baskets(one basket may cover more than one record) based on the given test_rate randomly.

leave_one_out

This method will first rank all the records by time (if a timestamp column is provided), and then select the last record.

leave_one_basket

This method provides train/test indices to split data in train/test sets. Each sample is used once as a test set while the remaining samples form the training set.

This method will first rank all the records by time (if a timestamp column is provided), and then select the last basket.

Due to the high number of test sets this method can be very costly.

temporal_split

This method will first rank all the records by time (if a timestamp column is provided), and then select the last portion of records.

This splitting approach is for evaluating how well a model performs on segments drawn from the same time series but excluded from the training set.

temporal_basket_split

This method will first rank all the records by time (if a timestamp column is provided), and then select the last portion of baskets.


Disclaimer on Datasets

This is a utility library that downloads and prepares public datasets. We do not host or distribute these datasets, vouch for their quality or fairness, or claim that you have license to use the dataset. It is your responsibility to determine whether you have permission to use the dataset under the dataset’s license.

If you’re a dataset owner and wish to update any part of it (description, citation, etc.), or do not want your dataset to be included in this library, please get in touch through a GitHub issue. Thanks for your contribution to the RecSys community!

More

For any quesitons, please tell us by creating an issue or contact us by sending an email to recsys.beta@gmail.com. We will try to respond it as soon as possible.