beta_rec.datasets package

beta_rec.datasets.ali_mobile module

class beta_rec.datasets.ali_mobile.AliMobile(min_u_c=0, min_i_c=3, root_dir=None)[source]

Bases: beta_rec.datasets.dataset_base.DatasetBase

AliMobile Dataset.

AliMobile dataset. This dataset is used to develop an individualized recommendation system for all items; it is similar to the Taobao dataset.

The dataset cannot be downloaded automatically via a url; you need to download it manually from ‘https://tianchi.aliyun.com/dataset/dataDetail?dataId=46’ and then put it into the directory ali_mobile/raw.

preprocess()[source]

Preprocess the raw file.

Preprocess the file downloaded via the url, convert it to a DataFrame consisting of the user-item interactions, and save it in the processed directory.

Download the dataset if it does not exist. ali_mobile_name: UserBehavior.csv

  1. Download the ali_mobile dataset if it does not exist.
  2. Load AliMobile <ali-mobile-interaction> table from ‘tianchi_mobile_recommend_train_user.csv’.
  3. Save dataset model.
beta_rec.datasets.ali_mobile.process_time(standard_time=None)[source]

Transform time format “xxxx-xx-xxTxx-xx-xxZ” into format “xxxx-xx-xx xx-xx-xx”.

Transform a standard time into our specified format.

Parameters:standard_time – str with format “xxxx-xx-xxTxx-xx-xxZ”.
Returns:timestamp data.
Return type:timestamp
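
For illustration, a minimal sketch of such a conversion (the exact format string is an assumption based on the docstring; the library's actual parsing may differ)::

    from datetime import datetime, timezone

    def process_time_sketch(standard_time):
        """Convert a string like "2014-12-08T10:30:00Z" into a Unix timestamp."""
        dt = datetime.strptime(standard_time, "%Y-%m-%dT%H:%M:%SZ")
        return int(dt.replace(tzinfo=timezone.utc).timestamp())

    print(process_time_sketch("2014-12-08T10:30:00Z"))  # 1418034600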

beta_rec.datasets.citeulike module

class beta_rec.datasets.citeulike.CiteULikeA(min_u_c=0, min_i_c=3, root_dir=None)[source]

Bases: beta_rec.datasets.dataset_base.DatasetBase

CiteULike-A.

CiteULike-A dataset. The dataset cannot be downloaded automatically via a url; you need to download it manually from ‘https://github.com/js05212/citeulike-a’ and then put it into the directory citeulike-a/raw.

preprocess()[source]

Preprocess the raw file.

Preprocess the file downloaded via the url, convert it to a DataFrame consisting of the user-item interactions, and save it in the processed directory.

class beta_rec.datasets.citeulike.CiteULikeT(dataset_name='citeulike-t', min_u_c=0, min_i_c=3)[source]

Bases: beta_rec.datasets.dataset_base.DatasetBase

CiteULike-T.

CiteULike-T dataset. The dataset cannot be downloaded automatically via a url; you need to download it manually from ‘https://github.com/js05212/citeulike-t’ and then put it into the directory citeulike-t/raw/citeulike-t.

load_leave_one_out(random=False, n_negative=100, n_test=10, download=False)[source]

Load leave one out split data.

preprocess()[source]

Preprocess the raw file.

Preprocess the file downloaded via the url, convert it to a DataFrame consisting of the user-item interactions, and save it in the processed directory.

beta_rec.datasets.data_load module

beta_rec.datasets.data_load.load_item_fea_dic(config, fea_type)[source]

Load item feature.

Parameters:
  • config (dict) – Dictionary of configuration
  • fea_type (str) – A string describing the feature type. Options: one_hot, word2vec, bert, cate.
Returns:

A dictionary mapping each item_id to a numpy feature vector.

Return type:

dict
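
A usage sketch (the config keys shown are hypothetical; the real keys come from a Beta-Rec configuration file)::

    # Hypothetical configuration; the real key names depend on your Beta-Rec config.
    config = {"dataset": "ml_100k", "root_dir": "./data"}

    item_fea = load_item_fea_dic(config, fea_type="one_hot")
    vector = item_fea[42]  # numpy feature vector for item id 42, if present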

beta_rec.datasets.data_load.load_split_dataset(config)[source]

Load split dataset.

Parameters:config (dict) – Dictionary of configuration
Returns:Interactions for training. valid_data list(DataFrame): List of interactions for validation. test_data list(DataFrame): List of interactions for testing.
Return type:train_data (DataFrame)
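
For example, unpacking the returned triple (config as in the sketch above, assumed to point at an existing split)::

    train_df, valid_dfs, test_dfs = load_split_dataset(config)
    print(len(train_df))   # number of training interactions
    print(len(valid_dfs))  # number of validation copies (e.g. 10)
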
beta_rec.datasets.data_load.load_user_fea_dic(config, fea_type)[source]

Load user feature.

Parameters:
  • config (dict) – Dictionary of configuration
  • fea_type (str) – A string describing the feature type.
Returns:

A dictionary mapping each user_id to a numpy feature vector.

Return type:

dict

beta_rec.datasets.data_load.load_user_item_feature(config)[source]

Load features of users and items.

Parameters:config (dict) – Dictionary of configuration
Returns:The first column is the user id, rest column are feat vectors item_feat (numpy.ndarray): The first column is the itm id, rest column are feat vectors
Return type:user_feat (numpy.ndarray)

beta_rec.datasets.data_split module

beta_rec.datasets.data_split.check_data_available(data)[source]

Check if a dataset is available after filtering.

Check whether a given dataset is available for later use.

Parameters:data (DataFrame) – interaction DataFrame to be processed.
Raises:RuntimeError – An error is raised if there are no interactions.
beta_rec.datasets.data_split.feed_neg_sample(data, negative_num, item_sampler)[source]

Sample negative items for an interaction DataFrame.

Parameters:
  • data (DataFrame) – interaction DataFrame to be processed.
  • negative_num (int) – number of negative items. If negative_num < 0, all negative items for each user are kept.
  • item_sampler (AliasTable) – an AliasTable sampler that contains the items.
Returns:

interaction DataFrame with a new ‘flag’ column labeled “train”, “test” or “valid”.

Return type:

DataFrame

beta_rec.datasets.data_split.filter_by_count(df, group_col, filter_col, num)[source]

Filter out the group_col values that have fewer than num associated filter_col entries.

Parameters:
  • df (DataFrame) – interaction DataFrame to be processed.
  • group_col (string) – column name to be filtered.
  • filter_col (string) – column with the filter condition.
  • num (int) – minimum count; groups below this threshold are filtered out.
Returns:

The filtered interactions.

Return type:

DataFrame
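
A minimal pandas sketch of this filtering behaviour (whether the library counts rows or distinct values per group is an assumption)::

    import pandas as pd

    def filter_by_count_sketch(df, group_col, filter_col, num):
        """Keep only rows whose group_col value has at least num filter_col entries."""
        counts = df.groupby(group_col)[filter_col].count()
        keep = counts[counts >= num].index
        return df[df[group_col].isin(keep)]

    df = pd.DataFrame({"user": [1, 1, 1, 2], "item": [10, 11, 12, 10]})
    print(filter_by_count_sketch(df, "user", "item", 2))  # user 2 is dropped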

beta_rec.datasets.data_split.filter_user_item(df, min_u_c=5, min_i_c=5)[source]

Filter data by the minimum purchase number of items and users.

Parameters:
  • df (DataFrame) – interaction DataFrame to be processed.
  • min_u_c (int) – filter the items that were purchased by fewer than min_u_c users. (default: 5)
  • min_i_c (int) – filter the users that have purchased fewer than min_i_c items. (default: 5)
Returns:

The filtered interactions.

Return type:

DataFrame
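
One plausible implementation alternates the two filters until both constraints hold, since dropping sparse users can make previously dense items sparse again. A sketch reusing filter_by_count_sketch from above; the column names and the fixed-point iteration are assumptions::

    def filter_user_item_sketch(df, min_u_c=5, min_i_c=5):
        """Alternately drop sparse items and sparse users until stable."""
        while True:
            before = len(df)
            df = filter_by_count_sketch(df, "col_item", "col_user", min_u_c)
            df = filter_by_count_sketch(df, "col_user", "col_item", min_i_c)
            if len(df) == before:  # fixed point: both constraints satisfied
                return df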

beta_rec.datasets.data_split.filter_user_item_order(df, min_u_c=5, min_i_c=5, min_o_c=5)[source]

Filter data by the minimum purchase number of items, users and orders.

Parameters:
  • df (DataFrame) – interaction DataFrame to be processed.
  • min_u_c (int) – filter the items that were purchased by fewer than min_u_c users. (default: 5)
  • min_i_c (int) – filter the users that have purchased fewer than min_i_c items. (default: 5)
  • min_o_c (int) – filter the users that have fewer than min_o_c orders. (default: 5)
Returns:

The filtered DataFrame.

beta_rec.datasets.data_split.generate_parameterized_path(test_rate=0, random=False, n_negative=100, by_user=False)[source]

Generate parameterized path.

Encode the split parameters into the path to differentiate between different split settings.

Parameters:
  • by_user (bool) – split by user.
  • test_rate (float) – percentage of the test data. Note that the percentage of the validation data will be the same as that of the test data.
  • random (bool) – Whether to randomly leave one item/basket out as test data. Only for leave_one_out and leave_one_basket.
  • n_negative (int) – Number of negative samples for testing and validation data.
Returns:

A string that encodes parameters.

Return type:

string

beta_rec.datasets.data_split.generate_random_data(n_interaction, user_id, item_id)[source]

Generate random data for testing.

Generate random interaction data for unit tests.

beta_rec.datasets.data_split.leave_one_basket(data, random=False)[source]

leave_one_basket split.

Parameters:
  • data (DataFrame) – interaction DataFrame to be split.
  • random (bool) – Whether to randomly leave one item/basket out as test data. Only for leave_one_out and leave_one_basket.
Returns:

DataFrame that has already been labeled by a flag column with “train”, “test” or “valid”.

Return type:

DataFrame

beta_rec.datasets.data_split.leave_one_out(data, random=False)[source]

leave_one_out split.

Parameters:
  • data (DataFrame) – interaction DataFrame to be split.
  • random (bool) – Whether to randomly leave one item/basket out as test data. Only for leave_one_out and leave_one_basket.
Returns:

DataFrame that has already been labeled by a flag column with “train”, “test” or “valid”.

Return type:

DataFrame
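
A pandas sketch of this labeling behaviour (the column names and the handling of users with fewer than two interactions are assumptions; the library's implementation may differ)::

    import pandas as pd

    def leave_one_out_sketch(data, random=False):
        """Flag each user's last item "test", second-to-last "valid", rest "train"."""
        # Random order if requested, otherwise chronological order.
        data = data.sample(frac=1, random_state=2020) if random else data.sort_values("col_timestamp")
        def label(group):
            flags = ["train"] * len(group)
            if len(group) >= 2:
                flags[-1], flags[-2] = "test", "valid"
            return pd.Series(flags, index=group.index)
        data = data.copy()
        data["flag"] = data.groupby("col_user", group_keys=False).apply(label)
        return data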

beta_rec.datasets.data_split.load_split_data(path, n_test=10)[source]

Load split DataFrame from a specified path.

Parameters:
  • path (string) – split data path.
  • n_test (int) – number of testing and validation copies. If n_test==0, the original valid and test datasets (without negative items) are loaded.
Returns:

DataFrame of training interactions, list of validation DataFrames, and list of testing DataFrames.

Return type:

(DataFrame, list(DataFrame), list(DataFrame))

beta_rec.datasets.data_split.random_basket_split(data, test_rate=0.1, by_user=False)[source]

random_basket_split.

Parameters:
  • data (DataFrame) – interaction DataFrame to be split.
  • test_rate (float) – percentage of the test data. Note that the percentage of the validation data will be the same as that of the test data.
  • by_user (bool) – Default False. True: user-based split; False: global split.
Returns:

DataFrame that has already been labeled by a flag column with “train”, “test” or “valid”.

Return type:

DataFrame

beta_rec.datasets.data_split.random_split(data, test_rate=0.1, by_user=False)[source]

random_split.

Parameters:
  • data (DataFrame) – interaction DataFrame to be split.
  • test_rate (float) – percentage of the test data. Note that the percentage of the validation data will be the same as that of the test data.
  • by_user (bool) – Default False. True: user-based split; False: global split.
Returns:

DataFrame that has already been labeled by a flag column with “train”, “test” or “valid”.

Return type:

DataFrame
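
A sketch of the global variant (by_user=False); the user-based variant would apply the same assignment within each user group. The seed here is illustrative (the docs elsewhere mention random seed 2020 for the team-produced splits)::

    import numpy as np

    def random_split_sketch(data, test_rate=0.1, seed=2020):
        """Randomly flag test_rate of rows "test", the same fraction "valid", rest "train"."""
        n = len(data)
        n_test = int(n * test_rate)
        flags = np.array(["train"] * n, dtype=object)
        idx = np.random.default_rng(seed).permutation(n)
        flags[idx[:n_test]] = "test"
        flags[idx[n_test:2 * n_test]] = "valid"  # validation gets the same fraction
        out = data.copy()
        out["flag"] = flags
        return out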

beta_rec.datasets.data_split.save_split_data(data, base_dir, data_split='leave_one_basket', parameterized_dir=None, suffix='train.npz')[source]

Save DataFrame to compressed npz.

Parameters:
  • data (DataFrame) – interaction DataFrame to be saved.
  • parameterized_dir (string) – data_split parameter string.
  • suffix (string) – suffix of the data to be saved.
  • base_dir (string) – directory to save.
  • data_split (string) – sub folder name for saving the data.
beta_rec.datasets.data_split.split_data(data, split_type, test_rate, random=False, n_negative=100, save_dir=None, by_user=False, n_test=10)[source]

Split data by split_type and other parameters.

Parameters:
  • data (DataFrame) – interaction DataFrame to be split
  • split_type (string) – options: random, random_basket, leave_one_out, leave_one_basket, temporal, temporal_basket.
  • random (bool) – Whether to randomly leave one item/basket out as test data. Only for leave_one_out and leave_one_basket.
  • test_rate (float) – percentage of the test data. Note that the percentage of the validation data will be the same as that of the test data.
  • n_negative (int) – Number of negative samples for testing and validation data.
  • save_dir (string or Path) – Default None. If specified, the split data will be saved to the dir.
  • by_user (bool) – Default False. True: user-based split; False: global split.
  • n_test (int) – Default 10. The number of testing and validation copies.
Returns:

The split data. Note that the returned data will not have negative samples.

Return type:

DataFrame
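
A hedged usage sketch; interactions stands for an interaction DataFrame prepared upstream (hypothetical variable)::

    split_df = split_data(
        data=interactions,      # interaction DataFrame (hypothetical; prepared upstream)
        split_type="temporal",  # one of the options listed above
        test_rate=0.2,
        n_negative=100,
        save_dir="./splits",    # optional: persist the split to disk
        by_user=False,
        n_test=10,
    )
    # split_df carries a flag column marking train/valid/test rows.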

beta_rec.datasets.data_split.temporal_basket_split(data, test_rate=0.1, by_user=False)[source]

temporal_basket_split.

Parameters:
  • data (DataFrame) – interaction DataFrame to be split. It must have a DEFAULT_ORDER_COL column.
  • test_rate (float) – percentage of the test data. Note that the percentage of the validation data will be the same as that of the test data.
  • by_user (bool) – Default False. True: user-based split; False: global split.
Returns:

DataFrame that has already been labeled by a flag column with “train”, “test” or “valid”.

Return type:

DataFrame

beta_rec.datasets.data_split.temporal_split(data, test_rate=0.1, by_user=False)[source]

temporal_split.

Parameters:
  • data (DataFrame) – interaction DataFrame to be split.
  • test_rate (float) – percentage of the test data. Note that the percentage of the validation data will be the same as that of the test data.
  • by_user (bool) – Default False. True: user-based split; False: global split.
Returns:

DataFrame that has already been labeled by a flag column with “train”, “test” or “valid”.

Return type:

DataFrame
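
A sketch of the global temporal variant (the timestamp column name is an assumption)::

    def temporal_split_sketch(data, test_rate=0.1):
        """Chronological split: newest rows become "test", the slice before them "valid"."""
        data = data.sort_values("col_timestamp").copy()
        n = len(data)
        n_test = int(n * test_rate)
        data["flag"] = ["train"] * (n - 2 * n_test) + ["valid"] * n_test + ["test"] * n_test
        return data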

beta_rec.datasets.dataset_base module

class beta_rec.datasets.dataset_base.DatasetBase(dataset_name, min_u_c=0, min_i_c=3, min_o_c=0, url=None, root_dir=None, manual_download_url=None, processed_leave_one_out_url='', processed_leave_one_basket_url='', processed_random_split_url='', processed_random_basket_split_url='', processed_temporal_split_url='', processed_temporal_basket_split_url='', tips=None)[source]

Bases: object

Base class for processing raw dataset into interactions, making and loading data splits.

This is a base dataset class from which other datasets can be derived. Several directories that store the dataset files are created during initialization.

dataset_name: the dataset name.

min_u_c: filter the items that were purchased by fewer than min_u_c users. (default: 0)

min_i_c: filter the users that have purchased fewer than min_i_c items. (default: 3)

min_o_c: filter the users that have fewer than min_o_c orders. (default: 0)

url: the url of the raw files.

manual_download_url: the url that users can use to download the raw files manually.
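
A typical end-to-end usage sketch with a concrete subclass (whether preprocess() can be skipped when processed files already exist is an assumption here)::

    from beta_rec.datasets.movielens import Movielens_100k

    dataset = Movielens_100k()
    dataset.preprocess()  # convert the raw files into a processed interaction file
    train_df, valid_list, test_list = dataset.load_leave_one_out(n_negative=100)
    print(train_df.head())
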
download()[source]

Download the raw dataset.

Download the dataset with the given url and unpack the file.

load_interaction()[source]

Load the user-item interactions and filter users, items or orders.

Returns:Loaded interactions after filtering.
Return type:DataFrame

Load the interactions from the processed file (the raw file needs to be preprocessed before loading).

load_leave_one_basket(random=False, n_negative=100, n_test=10, download=False, force_redo=False)[source]

Load split data generated by leave_one_basket without random selection.

Load split data generated by leave_one_basket without random selection from OneDrive.

Parameters:
  • random (bool) – Whether to randomly leave one basket out as test data.
  • n_negative (int) – Number of negative samples for testing and validation data.
  • download (bool) – Whether to download the split produced by the Beta-Rec team (with random seed 2020).
  • force_redo (bool) – Whether to force a re-split of the dataset.
  • n_test (int) – Default 10. The number of testing and validation copies. If n_test==0, the original valid and test datasets (without negative items) are loaded.
Returns:

Interactions for training. valid_data list(DataFrame): List of interactions for validation. test_data list(DataFrame): List of interactions for testing.

Return type:

train_data (DataFrame)

load_leave_one_out(random=False, n_negative=100, n_test=10, download=False, force_redo=False)[source]

Load split data generated by leave_one_out without random selection.

Load split data generated by leave_one_out without random selection from OneDrive.

Parameters:
  • random (bool) – Whether to randomly leave one item out as test data.
  • n_negative (int) – Number of negative samples for testing and validation data.
  • download (bool) – Whether to download the split produced by the Beta-Rec team (with random seed 2020).
  • force_redo (bool) – Whether to force a re-split of the dataset.
  • n_test (int) – Default 10. The number of testing and validation copies. If n_test==0, the original valid and test datasets (without negative items) are loaded.
Returns:

Interactions for training. valid_data list(DataFrame): List of interactions for validation. test_data list(DataFrame): List of interactions for testing.

Return type:

train_data (DataFrame)

load_random_basket_split(test_rate=0.1, random=False, n_negative=100, by_user=False, n_test=10, download=False, force_redo=False)[source]

Load split data generated by random_basket_split.

Load split data generated by random_basket_split from OneDrive, with test_rate = 0.1 and by_user = False.

Parameters:
  • test_rate (float) – percentage of the test data. Note that the percentage of the validation data will be the same as that of the test data.
  • random (bool) – Whether to randomly leave one basket out as test data.
  • download (bool) – Whether to download the split produced by the Beta-Rec team (with random seed 2020).
  • force_redo (bool) – Whether to force a re-split of the dataset.
  • n_negative (int) – Number of negative samples for testing and validation data.
  • by_user (bool) – Default False. True: user-based split; False: global split.
  • n_test (int) – Default 10. The number of testing and validation copies. If n_test==0, the original valid and test datasets (without negative items) are loaded.
Returns:

Interactions for training. valid_data list(DataFrame): List of interactions for validation. test_data list(DataFrame): List of interactions for testing.

Return type:

train_data (DataFrame)

load_random_split(test_rate=0.1, random=False, n_negative=100, by_user=False, n_test=10, download=False, force_redo=False)[source]

Load split data generated by random_split.

Load split data generated by random_split from OneDrive, with test_rate = 0.1 and by_user = False.

Parameters:
  • test_rate (float) – percentage of the test data. Note that the percentage of the validation data will be the same as that of the test data.
  • random (bool) – Whether to randomly leave one basket out as test data.
  • download (bool) – Whether to download the split produced by the Beta-Rec team (with random seed 2020).
  • force_redo (bool) – Whether to force a re-split of the dataset.
  • n_negative (int) – Number of negative samples for testing and validation data.
  • by_user (bool) – Default False. True: user-based split; False: global split.
  • n_test (int) – Default 10. The number of testing and validation copies. If n_test==0, the original valid and test datasets (without negative items) are loaded.
Returns:

Interactions for training. valid_data list(DataFrame): List of interactions for validation. test_data list(DataFrame): List of interactions for testing.

Return type:

train_data (DataFrame)

load_split(config)[source]

Load split data by config dict.

Parameters:config (dict) – Dictionary of configuration.
Returns:Interactions for training. valid_data list(DataFrame): List of interactions for validation. test_data list(DataFrame): List of interactions for testing.
Return type:train_data (DataFrame)
load_temporal_basket_split(test_rate=0.1, n_negative=100, by_user=False, n_test=10, download=False, force_redo=False)[source]

Load split data generated by temporal_basket_split.

Load split data generated by temporal_basket_split from OneDrive, with test_rate = 0.1 and by_user = False.

Parameters:
  • test_rate (float) – percentage of the test data. Note that the percentage of the validation data will be the same as that of the test data.
  • n_negative (int) – Number of negative samples for testing and validation data.
  • by_user (bool) – Default False. True: user-based split; False: global split.
  • n_test (int) – Default 10. The number of testing and validation copies. If n_test==0, the original valid and test datasets (without negative items) are loaded.
  • download (bool) – Whether to download the split produced by the Beta-Rec team (with random seed 2020).
  • force_redo (bool) – Whether to force a re-split of the dataset.
Returns:

Interactions for training. valid_data list(DataFrame): List of interactions for validation. test_data list(DataFrame): List of interactions for testing.

Return type:

train_data (DataFrame)

load_temporal_split(test_rate=0.1, n_negative=100, by_user=False, n_test=10, download=False, force_redo=False)[source]

Load split data generated by temporal_split.

Load split data generated by temporal_split from OneDrive, with test_rate = 0.1 and by_user = False.

Parameters:
  • test_rate (float) – percentage of the test data. Note that the percentage of the validation data will be the same as that of the test data.
  • n_negative (int) – Number of negative samples for testing and validation data.
  • by_user (bool) – Default False. True: user-based split; False: global split.
  • n_test (int) – Default 10. The number of testing and validation copies. If n_test==0, the original valid and test datasets (without negative items) are loaded.
  • download (bool) – Whether to download the split produced by the Beta-Rec team (with random seed 2020).
  • force_redo (bool) – Whether to force a re-split of the dataset.
Returns:

Interactions for training. valid_data list(DataFrame): List of interactions for validation. test_data list(DataFrame): List of interactions for testing.

Return type:

train_data (DataFrame)

make_leave_one_basket(data=None, random=False, n_negative=100, n_test=10)[source]

Generate split data with leave_one_basket.

Generate split data with leave_one_basket method.

Parameters:
  • data (DataFrame) – DataFrame to be split.
  • random (bool) – Whether to randomly leave one basket out as test data.
  • n_negative (int) – Number of negative samples for testing and validation data.
  • n_test (int) – Default 10. The number of testing and validation copies.
Returns:

Interactions for training. valid_data list(DataFrame): List of interactions for validation. test_data list(DataFrame): List of interactions for testing.

Return type:

train_data (DataFrame)

make_leave_one_out(data=None, random=False, n_negative=100, n_test=10)[source]

Generate split data with leave_one_out.

Generate split data with leave_one_out method.

Parameters:
  • data (DataFrame) – DataFrame to be split. Default is None, in which case the raw interactions are loaded with the default filter filter_user_item(data, min_u_c=0, min_i_c=3). Users can supply their own filtered data by using the filter methods in data_split.py.
  • random (bool) – Whether to randomly leave one item out as test data.
  • n_negative (int) – Number of negative samples for testing and validation data.
  • n_test (int) – Default 10. The number of testing and validation copies.
Returns:

Interactions for training. valid_data list(DataFrame): List of interactions for validation. test_data list(DataFrame): List of interactions for testing.

Return type:

train_data (DataFrame)

make_random_basket_split(data=None, test_rate=0.1, n_negative=100, by_user=False, n_test=10)[source]

Generate split data with random_basket_split.

Generate split data with random_basket_split method.

Parameters:
  • data (DataFrame) – DataFrame to be split.
  • test_rate (float) – percentage of the test data. Note that the percentage of the validation data will be the same as that of the test data.
  • n_negative (int) – Number of negative samples for testing and validation data.
  • by_user (bool) – Default False. True: user-based split; False: global split.
  • n_test (int) – Default 10. The number of testing and validation copies.
Returns:

Interactions for training. valid_data list(DataFrame): List of interactions for validation. test_data list(DataFrame): List of interactions for testing.

Return type:

train_data (DataFrame)

make_random_split(data=None, test_rate=0.1, n_negative=100, by_user=False, n_test=10)[source]

Generate split data with random_split.

Generate split data with the random_split method.

Parameters:
  • data (DataFrame) – DataFrame to be split. Default is None, in which case the raw interactions are loaded with the default filter data = filter_user_item(data, min_u_c=3, min_i_c=3). Users can supply their own filtered data by using the filter methods in data_split.py.
  • test_rate (float) – percentage of the test data. Note that the percentage of the validation data will be the same as that of the test data.
  • n_negative (int) – Number of negative samples for testing and validation data.
  • by_user (bool) – Default False. True: user-based split; False: global split.
  • n_test (int) – Default 10. The number of testing and validation copies.
Returns:

Interactions for training. valid_data list(DataFrame): List of interactions for validation. test_data list(DataFrame): List of interactions for testing.

Return type:

train_data (DataFrame)

make_temporal_basket_split(data=None, test_rate=0.1, n_negative=100, by_user=False, n_test=10)[source]

Generate split data with temporal_basket_split.

Generate split data with temporal_basket_split method.

Parameters:
  • data (DataFrame) – DataFrame to be split. Default is None, in which case the raw interactions are loaded with a default filter. Users can supply their own filtered data by using the filter methods in data_split.py.
  • test_rate (float) – percentage of the test data. Note that the percentage of the validation data will be the same as that of the test data.
  • n_negative (int) – Number of negative samples for testing and validation data.
  • by_user (bool) – Default False. True: user-based split; False: global split.
  • n_test (int) – Default 10. The number of testing and validation copies.
Returns:

Interactions for training. valid_data list(DataFrame): List of interactions for validation. test_data list(DataFrame): List of interactions for testing.

Return type:

train_data (DataFrame)

make_temporal_split(data=None, test_rate=0.1, n_negative=100, by_user=False, n_test=10)[source]

Generate split data with temporal_split.

Generate split data with temporal_split method.

Parameters:
  • data (DataFrame) – DataFrame to be split. Default is None, in which case the raw interactions are loaded with the default filter data = filter_user_item(data, min_u_c=3, min_i_c=3). Users can supply their own filtered data by using the filter methods in data_split.py.
  • test_rate (float) – percentage of the test data. Note that the percentage of the validation data will be the same as that of the test data.
  • n_negative (int) – Number of negative samples for testing and validation data.
  • by_user (bool) – Default False. True: user-based split; False: global split.
  • n_test (int) – Default 10. The number of testing and validation copies.
Returns:

Interactions for training. valid_data list(DataFrame): List of interactions for validation. test_data list(DataFrame): List of interactions for testing.

Return type:

train_data (DataFrame)

preprocess()[source]

Preprocess the raw file.

A virtual function that needs to be implemented in the derived class.

Preprocess the file downloaded via the url, convert it to a DataFrame consisting of the user-item interactions, and save it in the processed directory.

beta_rec.datasets.diginetica module

class beta_rec.datasets.diginetica.Diginetica(dataset_name='diginetica', min_u_c=0, min_i_c=3, root_dir=None)[source]

Bases: beta_rec.datasets.dataset_base.DatasetBase

Diginetica Dataset.

This is a dataset provided by DIGINETICA and its partners containing anonymized search and browsing logs, product data, anonymized transactions, and a large data set of product images. The participants have to predict search relevance of products according to the personal shopping, search, and browsing preferences of the users. Both ‘query-less’ and ‘query-full’ sessions are possible. The evaluation is based on click and transaction data.

The dataset cannot be downloaded automatically via a url; you need to download it manually from ‘https://cikm2016.cs.iupui.edu/cikm-cup/’, put it into the directory diginetica/raw, unzip the file, and rename the new directory to ‘diginetica’.

Note: you also need to unzip the files in ‘diginetica/raw/diginetica’.

preprocess()[source]

Preprocess the raw file.

Preprocess the file downloaded via the url, convert it to a DataFrame consisting of the user-item interactions, and save it in the processed directory.

Download the dataset if it does not exist. diginetica_name: train-item-views.csv

  1. Download the diginetica dataset if it does not exist.
  2. Load diginetica <diginetica-item-views> table from ‘diginetica.csv’.
  3. Add rating column and create timestamp column.
  4. Save data model.
beta_rec.datasets.diginetica.process_time(standard_time=None)[source]

Transform time format “xxxx-xx-xx” into format “xxxx-xx-xx xx-xx-xx”.

If there is no specified hour-minute-second data, we use 00:00:00 as default value.

Parameters:standard_time – str with format “xxxx-xx-xx”.
Returns:timestamp data.
Return type:timestamp

beta_rec.datasets.dunnhumby module

class beta_rec.datasets.dunnhumby.Dunnhumby(min_u_c=0, min_i_c=3, min_o_c=0, root_dir=None)[source]

Bases: beta_rec.datasets.dataset_base.DatasetBase

Dunnhumby Dataset.

If the dataset cannot be downloaded via the url, you need to download it manually from the link:

then put it into the directory dunnhumby/raw.

parse_raw_data(data_base_dir='./unzip/')[source]

Parse raw dunnhumby csv data from transaction_data.csv.

Parameters:data_base_dir (path) – Default dir is “./unzip/”.
Returns:DataFrame of interactions.
preprocess()[source]

Preprocess the raw file.

Preprocess the file downloaded via the url, convert it to a DataFrame consisting of the user-item interactions, and save it in the processed directory.

beta_rec.datasets.epinions module

class beta_rec.datasets.epinions.Epinions(dataset_name='epinions', min_u_c=0, min_i_c=3, root_dir=None)[source]

Bases: beta_rec.datasets.dataset_base.DatasetBase

Epinions Dataset.

preprocess()[source]

Preprocess the raw file.

Preprocess the file downloaded via the url, convert it to a DataFrame consisting of the user-item interactions, and save it in the processed directory.

beta_rec.datasets.gowalla module

class beta_rec.datasets.gowalla.Gowalla(dataset_name='gowalla', min_u_c=0, min_i_c=3, root_dir=None)[source]

Bases: beta_rec.datasets.dataset_base.DatasetBase

Gowalla Dataset.

Gowalla is a location-based social networking website where users share their locations by checking in. The friendship network is undirected and was collected using their public API, and consists of 196,591 nodes and 950,327 edges. We have collected a total of 6,442,890 check-ins of these users over the period of Feb. 2009 - Oct. 2010.

If the dataset cannot be downloaded via the url, you need to download it manually from the link:

then put it into the directory gowalla/raw and unzip it.

preprocess()[source]

Preprocess the raw file.

Preprocess the file downloaded via the url, convert it to a DataFrame consisting of the user-item interactions, and save it in the processed directory.

Download the dataset if it does not exist. Gowalla_checkin_name: Gowalla_totalCheckins.txt; Gowalla_edges_name: Gowalla_edges.txt

  1. Download the gowalla dataset if it does not exist.
  2. Load gowalla <Gowalla_checkin> table from ‘Gowalla_totalCheckins.txt’.
  3. Process the time column and transform it into a timestamp.
  4. Rename and save dataset model.
beta_rec.datasets.gowalla.process_time(standard_time=None)[source]

Transform time format “xxxx-xx-xxTxx-xx-xxZ” into format “xxxx-xx-xx xx-xx-xx”.

Parameters:standard_time – str with format “xxxx-xx-xxTxx-xx-xxZ”.
Returns:timestamp data.
Return type:timestamp

beta_rec.datasets.hetrec module

class beta_rec.datasets.hetrec.Delicious_2k(dataset_name='delicious-2k', min_u_c=0, min_i_c=3, root_dir=None)[source]

Bases: beta_rec.datasets.dataset_base.DatasetBase

delicious-2k Dataset.

This dataset contains social networking, bookmarking, and tagging information from a set of 2K users from the Delicious social bookmarking system (http://www.delicious.com).

If the dataset cannot be downloaded via the url, you need to download it manually from the following link: ‘http://files.grouplens.org/datasets/hetrec2011/hetrec2011-delicious-2k.zip’ and then put it into the directory delicious-2k/raw.

preprocess()[source]

Preprocess the raw file.

Preprocess the file downloaded via the url, convert it to a DataFrame consisting of the user-item interactions, and save it in the processed directory.

class beta_rec.datasets.hetrec.LastFM_2k(dataset_name='lastfm-2k', min_u_c=0, min_i_c=3, root_dir=None)[source]

Bases: beta_rec.datasets.dataset_base.DatasetBase

Lastfm-2k Dataset.

This dataset contains social networking, tagging, and music artist listening information from a set of 2K users from the Last.fm online music system.

If the dataset cannot be downloaded via the url, you need to download it manually from the link:

then put it into the directory lastfm-2k/raw.

preprocess()[source]

Preprocess the raw file.

Preprocess the file downloaded via the url, convert it to a DataFrame consisting of the user-item interactions, and save it in the processed directory.

class beta_rec.datasets.hetrec.MovieLens_2k(dataset_name='movielens-2k', min_u_c=0, min_i_c=3, root_dir=None)[source]

Bases: beta_rec.datasets.dataset_base.DatasetBase

MovieLens-2k Dataset.

If the dataset cannot be downloaded via the url, you need to download it manually from the link: ‘http://files.grouplens.org/datasets/hetrec2011/hetrec2011-movielens-2k-v2.zip’ and then put it into the directory movielens-2k/raw.

preprocess()[source]

Preprocess the raw file.

Preprocess the file downloaded via the url, convert it to a DataFrame consisting of the user-item interactions, and save it in the processed directory.

beta_rec.datasets.instacart module

class beta_rec.datasets.instacart.Instacart(dataset_name='instacart', min_u_c=0, min_i_c=3, min_o_c=0, root_dir=None)[source]

Bases: beta_rec.datasets.dataset_base.DatasetBase

Instacart Dataset.

If the dataset cannot be downloaded via the url, you need to download it manually from the link:

then put it into the directory instacart/raw, unzip the file, and rename the directory to ‘instacart’.

The Instacart dataset is used to predict when users will buy a product next; we construct it with the structure [order_id, product_id].

preprocess()[source]

Preprocess the raw file.

Preprocess the file downloaded via the url, convert it to a DataFrame consisting of the user-item interactions, and save it in the processed directory.

Download and load datasets:
  1. Download the instacart dataset if it does not exist.
  2. Load the <order> table and <order_products> table from “orders.csv” and “order_products__train.csv”.
  3. Merge the two tables above.
  4. Add additional columns [rating, timestamp].
  5. Rename columns and save the data model.

class beta_rec.datasets.instacart.Instacart_25(dataset_name='instacart_25', min_u_c=0, min_i_c=3, min_o_c=0)[source]

Bases: beta_rec.datasets.dataset_base.DatasetBase

Instacart Dataset.

If the dataset cannot be downloaded via the url, you need to download it manually from the link:

then put it into the directory instacart/raw, unzip the file, and rename the directory to ‘instacart’.

The Instacart dataset is used to predict when users will buy a product next; we construct it with the structure [order_id, product_id].

preprocess()[source]

Preprocess the raw file.

Preprocess the file downloaded via the url, convert it to a DataFrame consisting of the user-item interactions, and save it in the processed directory.

Download and load datasets:
  1. Download the instacart dataset if it does not exist.
  2. Load the <order> table and <order_products> table from “orders.csv” and “order_products__train.csv”.
  3. Merge the two tables above.
  4. Add additional columns [rating, timestamp].
  5. Rename columns and save the data model.

beta_rec.datasets.last_fm module

class beta_rec.datasets.last_fm.LastFM(dataset_name='last_fm', min_u_c=0, min_i_c=3, root_dir=None)[source]

Bases: beta_rec.datasets.dataset_base.DatasetBase

LastFM Dataset.

preprocess()[source]

Preprocess the raw file.

Preprocess the file downloaded via the url, convert it to a DataFrame consisting of the user-item interactions, and save it in the processed directory.

beta_rec.datasets.movielens module

class beta_rec.datasets.movielens.Movielens_100k(dataset_name='ml_100k', min_u_c=0, min_i_c=3, root_dir=None)[source]

Bases: beta_rec.datasets.dataset_base.DatasetBase

Movielens 100k Dataset.

load_fea_vec()[source]

Load feature vectors for users and items.

  1. For items (movies), we use the last 19 fields as features, which are the genres, with 1 indicating the movie is of that genre and 0 indicating it is not; movies can be in several genres at once.
  2. For users, we construct a one-hot encoding of age, gender and occupation as their feature, where ages are categorized into 8 groups.
Returns:The first column is the user id, the rest are feature vectors. item_feat (numpy.ndarray): The first column is the item id, the rest are feature vectors.
Return type:user_feat (numpy.ndarray)
make_fea_vec()[source]

Make feature vectors for users and items.

  1. For items (movies), we use the last 19 fields as features, which are the genres, with 1 indicating the movie is of that genre and 0 indicating it is not; movies can be in several genres at once.
  2. For users, we construct a one-hot encoding of age, gender and occupation as their feature, where ages are categorized into 8 groups.
Returns:The first column is the user id, the rest are feature vectors. item_feat (numpy.ndarray): The first column is the item id, the rest are feature vectors.
Return type:user_feat (numpy.ndarray)
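
For example, a usage sketch for consuming the returned arrays::

    ml = Movielens_100k()
    user_feat, item_feat = ml.load_fea_vec()

    user_ids = user_feat[:, 0]       # first column: user id
    user_vectors = user_feat[:, 1:]  # one-hot age/gender/occupation features
    item_genres = item_feat[:, 1:]   # 19 genre indicator columns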
preprocess()[source]

Preprocess the raw file. Preprocess the file downloaded via the url, convert it to a DataFrame consisting of the user-item interactions and save it in the processed directory.

class beta_rec.datasets.movielens.Movielens_10m(dataset_name='ml_10m', min_u_c=0, min_i_c=3, root_dir=None)[source]

Bases: beta_rec.datasets.dataset_base.DatasetBase

Movielens 10m Dataset.

preprocess()[source]

Preprocess the raw file. Preprocess the file downloaded via the url, convert it to a DataFrame consisting of the user-item interactions and save it in the processed directory.

class beta_rec.datasets.movielens.Movielens_1m(dataset_name='ml_1m', min_u_c=0, min_i_c=3, root_dir=None)[source]

Bases: beta_rec.datasets.dataset_base.DatasetBase

Movielens 1m Dataset.

preprocess()[source]

Preprocess the raw file. Preprocess the file downloaded via the url, convert it to a DataFrame consisting of the user-item interactions and save it in the processed directory.

class beta_rec.datasets.movielens.Movielens_25m(dataset_name='ml_25m', min_u_c=0, min_i_c=3, root_dir=None)[source]

Bases: beta_rec.datasets.dataset_base.DatasetBase

Movielens 25m Dataset.

preprocess()[source]

Preprocess the raw file. Preprocess the file downloaded via the url, convert it to a DataFrame consisting of the user-item interactions and save it in the processed directory.

beta_rec.datasets.retailrocket module

class beta_rec.datasets.retailrocket.RetailRocket(dataset_name='retailrocket', min_u_c=0, min_i_c=3, root_dir=None)[source]

Bases: beta_rec.datasets.dataset_base.DatasetBase

RetailRocket Dataset.

This data has been collected from a real-world e-commerce website. It is raw data without any content transformations; however, all values are hashed due to confidentiality issues. The purpose of publishing it is to motivate research in the field of recommender systems with implicit feedback.

If the dataset cannot be downloaded via the url, you need to download it manually from the link:

then put it into the directory retailrocket/raw and unzip it.

preprocess()[source]

Preprocess the raw file.

Preprocess the file downloaded via the url, convert it to a DataFrame consisting of the user-item interactions, and save it in the processed directory.

Download the dataset if it does not exist. retail_rocket_name: UserBehavior.csv

  1. Download the RetailRocket dataset if it does not exist.
  2. Load RetailRocket <retail-rocket-interaction> table from ‘events.csv’.
  3. Save dataset model.

beta_rec.datasets.seq_data_utils module

class beta_rec.datasets.seq_data_utils.SeqDataset(data, print_info=True)[source]

Bases: torch.utils.data.dataset.Dataset

Sequential Dataset.

beta_rec.datasets.seq_data_utils.collate_fn(data)[source]

Pad the sequences.

This function will be used to pad the sessions to the max length in the batch and transpose the batch from batch_size x max_seq_len to max_seq_len x batch_size. It returns the padded vectors, the labels, and the length of each session (before padding). It will be used in the DataLoader.

Parameters:data (pytorch Dataset) – Sequential dataset.
Returns:Padded vectors. labels (Tensor): Target item. lens (list): Lengths of each padded vector.
Return type:padded_sesss (Tensor)
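A sketch of such a collate function (item ids are assumed to be positive integers with 0 reserved for padding; sorting longest-first is a common convention for packed RNN input and is an assumption here)::

    import torch

    def collate_fn_sketch(batch):
        """Pad sessions to the batch max length; transpose to (max_seq_len, batch_size)."""
        batch.sort(key=lambda pair: len(pair[0]), reverse=True)
        seqs, labels = zip(*batch)
        lens = [len(s) for s in seqs]
        padded = torch.zeros(len(seqs), max(lens), dtype=torch.long)  # 0 = padding id
        for i, seq in enumerate(seqs):
            padded[i, : lens[i]] = torch.tensor(seq, dtype=torch.long)
        return padded.transpose(0, 1), torch.tensor(labels), lens
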
beta_rec.datasets.seq_data_utils.create_seq_db(data)[source]

Convert interactions of a user to a sequence.

Parameters:data (pandas.DataFrame) – The dataset to be transformed.
Returns:Transformed dataset with “col_user” and “col_sequence”.
Return type:result (pandas.DataFrame)
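A pandas sketch of the transformation (column names are assumptions)::

    def create_seq_db_sketch(data):
        """Collect each user's items, in time order, into one sequence row per user."""
        return (
            data.sort_values("col_timestamp")
                .groupby("col_user")["col_item"]
                .apply(list)
                .reset_index(name="col_sequence")
        )
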
beta_rec.datasets.seq_data_utils.dataset_to_seq_target_format(data)[source]

Convert a list of sequences to (seq,target) format.

Parameters:data (pandas.DataFrame) – The dataset to be transformed.
Returns:Context sequence. labs (List): Labels of the context sequence, each element is the last item in the origin sequence.
Return type:out_seqs (List)
beta_rec.datasets.seq_data_utils.load_dataset(config)[source]

Load datasets.

Parameters:config (dict) – Dictionary of configuration.
Returns:Full dataset.
Return type:dataset (pandas.DataFrame)
beta_rec.datasets.seq_data_utils.reindex_items(train_data, valid_data=None, test_data=None)[source]

Reindex the item ids.

Item ids are reindexed from 1. “0” is left for padding.

Parameters:
  • train_data (pandas.DataFrame) – Training set.
  • valid_data (pandas.DataFrame) – Validation set.
  • test_data (pandas.DataFrame) – Test set.
Returns:

Reindexed training set. valid_data (pandas.DataFrame): Reindexed validation set. test_data (pandas.DataFrame): Reindexed test set.

Return type:

train_data (pandas.DataFrame)
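
A sketch of the reindexing (the item column name is an assumption; items absent from the training set would map to NaN here, and the real implementation may handle them differently)::

    def reindex_items_sketch(train_data, valid_data=None, test_data=None):
        """Map item ids to 1..N based on the training set; 0 stays free for padding."""
        new_ids = {old: new for new, old in enumerate(train_data["col_item"].unique(), 1)}
        def remap(df):
            if df is None:
                return None
            df = df.copy()
            df["col_item"] = df["col_item"].map(new_ids)
            return df
        return remap(train_data), remap(valid_data), remap(test_data)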

beta_rec.datasets.tafeng module

class beta_rec.datasets.tafeng.Tafeng(dataset_name='tafeng', min_u_c=0, min_i_c=3, min_o_c=0, root_dir=None)[source]

Bases: beta_rec.datasets.dataset_base.DatasetBase

Tafeng Dataset.

The dataset cannot be downloaded automatically via a url; you need to download it manually from ‘https://1drv.ms/u/s!AjMahLyQeZqugjc2k3eCAwKavccB?e=Qn5ppw’ and then put it into the directory tafeng/raw.

preprocess()[source]

Preprocess the raw file.

Preprocess the file downloaded via the url, convert it to a DataFrame consisting of the user-item interactions, and save it in the processed directory.

beta_rec.datasets.taobao module

class beta_rec.datasets.taobao.Taobao(dataset_name='taobao', min_u_c=0, min_i_c=3, root_dir=None)[source]

Bases: beta_rec.datasets.dataset_base.DatasetBase

Taobao Dataset.

This dataset was created by randomly selecting about 1 million users who had behaviors including clicking, purchasing, adding items to the shopping cart, and favoriting items between November 25 and December 3, 2017.

The dataset is organized in a very similar form to MovieLens-20M, i.e., each line represents a specific user-item interaction, which consists of user ID, item ID, item’s category ID, behavior type and timestamp, separated by commas.

The dataset cannot be downloaded automatically via a url; you need to download it manually from ‘https://tianchi.aliyun.com/dataset/dataDetail?dataId=649’ and then put it into the directory taobao/raw.

preprocess()[source]

Preprocess the raw file.

Preprocess the file downloaded via the url, convert it to a DataFrame consisting of the user-item interactions, and save it in the processed directory.

Download the dataset if it does not exist. taobao_name: UserBehavior.csv.

  1. Download the taobao dataset if it does not exist.
  2. Load taobao <taobao-interaction> table from ‘taobao.csv’.
  3. Save dataset model.

beta_rec.datasets.yelp module

class beta_rec.datasets.yelp.Yelp(dataset_name='yelp', min_u_c=0, min_i_c=3, root_dir=None)[source]

Bases: beta_rec.datasets.dataset_base.DatasetBase

Yelp Dataset.

The dataset cannot be downloaded automatically via a url; you need to download it manually from ‘https://www.yelp.com/dataset’ and then put it into the directory yelp/raw/yelp.

preprocess()[source]

Preprocess the raw file.

Preprocess the file downloaded via the url, convert it to a DataFrame consisting of the user-item interactions, and save it in the processed directory.

beta_rec.datasets.yoochoose module

class beta_rec.datasets.yoochoose.YooChoose(dataset_name='yoochoose', min_u_c=0, min_i_c=3, root_dir=None)[source]

Bases: beta_rec.datasets.dataset_base.DatasetBase

YooChoose Dataset.

Task of the YooChoose dataset: given a sequence of click events performed by a user during a typical session on an e-commerce website, the goal is to predict whether the user is going to buy something or not, and if so, which items they are going to buy. The task can therefore be divided into two sub-goals:

  1. Is the user going to buy items in this session? YES|NO
  2. If yes, what are the items that are going to be bought?
This dataset contains two subsets:
  1. yoochoose-clicks.dat
    • SessionID: the id of the session.
    • Timestamp: the time when the click occurred.
    • ItemID: the unique identifier of the item.
    • Category: the category of the item.
  2. yoochoose-buys.dat
    • SessionID: the id of the session.
    • Timestamp: the time when the purchase occurred.
    • ItemID: the unique identifier of the item.
    • Price: the price of the item.
    • Quantity: how many of this item were bought.

If the dataset cannot be downloaded via the url, you need to download it manually from the link:

then put it into the directory yoochoose/raw and unzip it.

preprocess()[source]

Preprocess the raw file.

Preprocess the file downloaded via the url, convert it to a DataFrame consisting of the user-item interactions, and save it in the processed directory.

Download the dataset if it does not exist. yoochoose_name: yoochoose-buys.dat

  1. Download the yoochoose dataset if it does not exist.
  2. Load yoochoose <yoochoose-buy> table from ‘yoochoose-buys.dat’.
  3. Rename and save dataset model.

Module contents

Datasets Module.