beta_rec.utils package

beta_rec.utils.alias_table module

class beta_rec.utils.alias_table.AliasTable(obj_freq)[source]

Bases: object

AliasTable Class.

A list of indices of tokens in the vocab following a power law distribution, used to draw negative samples.

sample(count, obj_num=1, no_repeat=False)[source]

Generate samples.

Parameters:
  • count – the number of tokens in a draw.
  • obj_num – the number of draws.
  • no_repeat – whether repeated tokens are allowed in a single draw.
Returns:

A list of tokens.

Raises:

ValueError – if count is larger than vocab_size when no_repeat is True.
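
Example (illustrative; the frequency list and values below are not from the library):

    from beta_rec.utils.alias_table import AliasTable

    # Token frequencies following a rough power law; index = token id.
    obj_freq = [1000, 400, 150, 60, 20, 5]
    table = AliasTable(obj_freq)

    # Two independent draws of 5 tokens each, without repeats inside a draw.
    negatives = table.sample(count=5, obj_num=2, no_repeat=True)

    # With no_repeat=True, count must not exceed the vocab size:
    # table.sample(count=10, obj_num=1, no_repeat=True)  # raises ValueError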

beta_rec.utils.common_util module

class beta_rec.utils.common_util.DictToObject(dictionary)[source]

Bases: object

Python dict to object.

beta_rec.utils.common_util.ensureDir(dir_path)[source]

Ensure that a directory exists; create it if it does not.

Parameters:dir_path (str) – the target directory.
beta_rec.utils.common_util.get_data_frame_from_gzip_file(path)[source]

Get a DataFrame from a gzip file.

Parameters:path – the file path of the gzip file.
Returns:A dataframe extracted from the gzip file.
beta_rec.utils.common_util.get_dataframe_from_npz(data_file)[source]

Get the DataFrame from an npz file.

Parameters:data_file (str or Path) – File path.
Returns:the decompressed data.
Return type:DataFrame
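
Example round trip with this helper and save_dataframe_as_npz (documented below); the column names are illustrative:

    import pandas as pd
    from beta_rec.utils.common_util import (
        get_dataframe_from_npz,
        save_dataframe_as_npz,
    )

    df = pd.DataFrame(
        {"col_user": [1, 1, 2], "col_item": [10, 11, 10], "col_rating": [5.0, 3.0, 4.0]}
    )
    save_dataframe_as_npz(df, "interactions.npz")       # compressed on disk
    restored = get_dataframe_from_npz("interactions.npz")
    print(restored.head())
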
beta_rec.utils.common_util.get_random_rep(raw_num, dim)[source]

Generate a random embedding from a normal (Gaussian) distribution.

Parameters:
  • raw_num – number of raw embeddings (rows) to be generated.
  • dim – The dimension of the embeddings.
Returns:

ndarray or scalar. Drawn samples from the normal distribution.

beta_rec.utils.common_util.normalized_adj_single(adj)[source]

Normalize an adjacency matrix.

Parameters:adj – the adjacency matrix to be normalized.
Returns:the normalized adjacency matrix.
beta_rec.utils.common_util.parse_gzip_file(path)[source]

Parse gzip file.

Parameters:path – the file path of gzip file.
beta_rec.utils.common_util.print_dict_as_table(dic, tag=None, columns=['keys', 'values'])[source]

Print a dictionary as a table.

Parameters:
  • dic (dict) – dict object to be formatted.
  • tag (str) – A name for this dictionary.
  • columns ([str, str]) – column names for keys and values; defaults to [“keys”, “values”].
Returns:

None

beta_rec.utils.common_util.save_dataframe_as_npz(data, data_file)[source]

Save DataFrame in compressed format.

Save and convert the DataFrame to an npz file.

Parameters:
  • data (DataFrame) – DataFrame to be saved.
  • data_file – Target file path.

beta_rec.utils.common_util.save_to_csv(result, result_file)[source]

Save a result dict to disk.

Parameters:
  • result – the result dict to be saved.
  • result_file – the target file path.
beta_rec.utils.common_util.set_seed(seed)[source]

Initialize all the random seeds in the system.

Parameters:seed – A global random seed.
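
The exact libraries seeded are not spelled out here; a sketch of what a global seeding helper typically covers (an assumption, not beta_rec's actual implementation):

    import os
    import random

    import numpy as np
    import torch

    def set_seed_sketch(seed):
        """Illustrative re-implementation; beta_rec's set_seed may differ."""
        os.environ["PYTHONHASHSEED"] = str(seed)
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
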
beta_rec.utils.common_util.str2bool(v)[source]

Convert a string to a bool variable.
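
Typical use is as an argparse type so that flags like --use_gpu false parse to a real bool (the exact accepted strings are an assumption):

    import argparse
    from beta_rec.utils.common_util import str2bool

    parser = argparse.ArgumentParser()
    parser.add_argument("--use_gpu", type=str2bool, default=True)
    args = parser.parse_args(["--use_gpu", "false"])
    print(args.use_gpu)  # False, as a bool rather than the string "false"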

beta_rec.utils.common_util.timeit(method)[source]

Decorator for tracking the execution time of a specific method.

Parameters:method – the method to be timed.
To use:

    @timeit
    def method(self):
        pass

Returns:the wrapped method, which reports its execution time when called.
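
A self-contained sketch of such a timing decorator (illustrative; beta_rec's timeit may report the time differently):

    import time
    from functools import wraps

    def timeit_sketch(method):
        @wraps(method)
        def timed(*args, **kwargs):
            start = time.time()
            result = method(*args, **kwargs)
            print(f"{method.__name__} took {time.time() - start:.3f}s")
            return result
        return timed

    @timeit_sketch
    def train_epoch():
        time.sleep(0.1)

    train_epoch()  # prints roughly: train_epoch took 0.100s
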
beta_rec.utils.common_util.un_zip(file_name, target_dir=None)[source]

Unzip zip files.

Parameters:
  • file_name (str or Path) – zip file path.
  • target_dir (str or Path) – target path to save the unzipped files to.
beta_rec.utils.common_util.update_args(config, args)[source]

Update config parameters with the parameters received from the command line.

Parameters:
  • config (dict) – Initial dict of the parameters from JSON config file.
  • args (object) – An argparse Argument object with attributes being the parameters to be updated.
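
A usage sketch (the override semantics for unset arguments are an assumption):

    import argparse
    from beta_rec.utils.common_util import update_args

    config = {"lr": 0.001, "batch_size": 256}  # e.g. loaded from a JSON file

    parser = argparse.ArgumentParser()
    parser.add_argument("--lr", type=float)
    args = parser.parse_args(["--lr", "0.01"])

    update_args(config, args)  # command-line values override the JSON defaults
    print(config["lr"])        # expected: 0.01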

beta_rec.utils.constants module

beta_rec.utils.download module

beta_rec.utils.download.download_file(url, store_file_path)[source]

Download the raw dataset file.

Download the dataset with the given url and save to the store_path.

Parameters:
  • url – the URL from which the dataset file can be downloaded.
  • store_file_path – the path that stores the downloaded file.
Returns:

the archive format inferred from the file suffix.

beta_rec.utils.download.download_file_from_onedrive(url, path)[source]

Download processed file from OneDrive.

Download a file from OneDrive with the given URL and save it to the given path.

Parameters:
  • url – the shared link generated by OneDrive.
  • path – the path supposed to store the file.
beta_rec.utils.download.get_format(suffix)[source]

Get the archive format.

Get the archive format of an archive file from its suffix.

Parameters:suffix – suffix of the archive file.
Returns:the archive format of the suffix.

beta_rec.utils.evaluation module

class beta_rec.utils.evaluation.PandasHash(pandas_object)[source]

Bases: object

Wrapper class to allow pandas objects (DataFrames or Series) to be hashable.

pandas_object
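
A sketch of how such a wrapper can make DataFrames usable as cache keys (an illustrative re-implementation, not beta_rec's actual code):

    import pandas as pd

    class PandasHashSketch:
        def __init__(self, pandas_object):
            self.pandas_object = pandas_object

        def __hash__(self):
            # Hash the content row-wise, then fold into a single hash.
            return hash(tuple(pd.util.hash_pandas_object(self.pandas_object)))

        def __eq__(self, other):
            return hash(self) == hash(other)

    df = pd.DataFrame({"a": [1, 2]})
    cache = {PandasHashSketch(df): "cached result"}
    print(cache[PandasHashSketch(df.copy())])  # same content -> same key
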
beta_rec.utils.evaluation.auc(rating_true, rating_pred, col_user='col_user', col_item='col_item', col_rating='col_rating', col_prediction='col_prediction')[source]

Calculate the Area-Under-Curve metric.

Calculate the Area-Under-Curve metric for implicit-feedback recommenders, where the rating is binary and the prediction is a float ranging from 0 to 1.

https://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_the_curve

Note

The evaluation does not require a leave-one-out scenario. This metric does not calculate group-based AUC, i.e., AUC scores averaged across users. It is also not limited to k; instead, it calculates the score on the entire prediction result regardless of the users.

Parameters:
  • rating_true (pd.DataFrame) – True data.
  • rating_pred (pd.DataFrame) – Predicted data.
  • col_user (str) – column name for user.
  • col_item (str) – column name for item.
  • col_rating (str) – column name for rating.
  • col_prediction (str) – column name for prediction.
Returns:

auc_score (min=0, max=1).

Return type:

float
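
A usage example with the default column names (values are illustrative):

    import pandas as pd
    from beta_rec.utils.evaluation import auc

    rating_true = pd.DataFrame(
        {"col_user": [1, 1, 2, 2], "col_item": [1, 2, 1, 2], "col_rating": [1, 0, 0, 1]}
    )
    rating_pred = pd.DataFrame(
        {
            "col_user": [1, 1, 2, 2],
            "col_item": [1, 2, 1, 2],
            "col_prediction": [0.9, 0.3, 0.2, 0.8],
        }
    )

    print(auc(rating_true, rating_pred))  # 1.0 for a perfectly ranked prediction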

beta_rec.utils.evaluation.check_column_dtypes(func)[source]

Check columns of DataFrame inputs.

This includes the checks on
  1. whether the input columns exist in the input DataFrames.
  2. whether the data types of col_user as well as col_item are matched in the two input DataFrames.
Parameters:func (function) – function that will be wrapped.
beta_rec.utils.evaluation.exp_var(rating_true, rating_pred, col_user='col_user', col_item='col_item', col_rating='col_rating', col_prediction='col_prediction')[source]

Calculate explained variance.

Parameters:
  • rating_true (pd.DataFrame) – True data. There should be no duplicate (userID, itemID) pairs.
  • rating_pred (pd.DataFrame) – Predicted data. There should be no duplicate (userID, itemID) pairs.
  • col_user (str) – column name for user.
  • col_item (str) – column name for item.
  • col_rating (str) – column name for rating.
  • col_prediction (str) – column name for prediction.
Returns:

Explained variance (min=0, max=1).

Return type:

float

beta_rec.utils.evaluation.get_top_k_items(dataframe, col_user='col_user', col_rating='col_rating', k=10)[source]

Get the top k items for each user.

Take the input customer-item-rating tuples as a pandas DataFrame and output a pandas DataFrame of the top k items for each user, in dense format.

Note

For implicit ratings, append a column of constants to serve as the ratings.

Parameters:
  • dataframe (pandas.DataFrame) – DataFrame of rating data (in the format customerID-itemID-rating).
  • col_user (str) – column name for user.
  • col_rating (str) – column name for rating.
  • k (int) – number of items for each user.
Returns:

DataFrame of top k items for each user.

Return type:

pd.DataFrame
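
A usage sketch (column names follow the defaults above):

    import pandas as pd
    from beta_rec.utils.evaluation import get_top_k_items

    ratings = pd.DataFrame(
        {
            "col_user": [1, 1, 1, 2, 2],
            "col_item": [10, 11, 12, 10, 11],
            "col_rating": [4.0, 5.0, 3.0, 2.0, 5.0],
        }
    )

    # Keep the 2 highest-rated items per user.
    print(get_top_k_items(ratings, k=2))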

beta_rec.utils.evaluation.has_columns(df, columns)[source]

Check if DataFrame has necessary columns.

Parameters:
  • df (pd.DataFrame) – DataFrame.
  • columns (list(str)) – columns to check for.
Returns:

True if DataFrame has specified columns.

Return type:

bool

beta_rec.utils.evaluation.has_same_base_dtype(df_1, df_2, columns=None)[source]

Check if specified columns have the same base dtypes across both DataFrames.

Parameters:
  • df_1 (pd.DataFrame) – first DataFrame.
  • df_2 (pd.DataFrame) – second DataFrame.
  • columns (list(str)) – columns to check, None checks all columns.
Returns:

True if DataFrames columns have the same base dtypes.

Return type:

bool

beta_rec.utils.evaluation.logloss(rating_true, rating_pred, col_user='col_user', col_item='col_item', col_rating='col_rating', col_prediction='col_prediction')[source]

Calculate the logloss metric.

Calculate the logloss metric for implicit-feedback recommenders, where the rating is binary and the prediction is a float ranging from 0 to 1.

https://en.wikipedia.org/wiki/Loss_functions_for_classification#Cross_entropy_loss_(Log_Loss)

Parameters:
  • rating_true (pd.DataFrame) – True data.
  • rating_pred (pd.DataFrame) – Predicted data.
  • col_user (str) – column name for user.
  • col_item (str) – column name for item.
  • col_rating (str) – column name for rating.
  • col_prediction (str) – column name for prediction.
Returns:

log_loss_score (min=0, max=inf).

Return type:

float

beta_rec.utils.evaluation.lru_cache_df(maxsize, typed=False)[source]

Least-recently-used cache decorator.

Parameters:
  • maxsize (int|None) – maximum size of the cache; if None, the cache is unbounded.
  • typed (bool) – if True, arguments of different types are cached separately.
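
A usage sketch mirroring functools.lru_cache, but for DataFrame arguments (the wrapped function is hypothetical):

    from beta_rec.utils.evaluation import lru_cache_df

    @lru_cache_df(maxsize=32)
    def expensive_merge(df_1, df_2):
        # Repeated calls with equal DataFrames hit the cache instead of re-merging.
        return df_1.merge(df_2, on="col_user")
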
beta_rec.utils.evaluation.mae(rating_true, rating_pred, col_user='col_user', col_item='col_item', col_rating='col_rating', col_prediction='col_prediction')[source]

Calculate Mean Absolute Error.

Parameters:
  • rating_true (pd.DataFrame) – True data. There should be no duplicate (userID, itemID) pairs.
  • rating_pred (pd.DataFrame) – Predicted data. There should be no duplicate (userID, itemID) pairs.
  • col_user (str) – column name for user.
  • col_item (str) – column name for item.
  • col_rating (str) – column name for rating.
  • col_prediction (str) – column name for prediction.
Returns:

Mean Absolute Error.

Return type:

float

beta_rec.utils.evaluation.map_at_k(rating_true, rating_pred, col_user='col_user', col_item='col_item', col_rating='col_rating', col_prediction='col_prediction', relevancy_method='top_k', k=10, threshold=10)[source]

Mean Average Precision at k.

The implementation of MAP is referenced from Spark MLlib evaluation metrics. https://spark.apache.org/docs/2.3.0/mllib-evaluation-metrics.html#ranking-systems

A good reference can be found at: http://web.stanford.edu/class/cs276/handouts/EvaluationNew-handout-6-per.pdf

Note

  1. The evaluation function is named ‘MAP at k’ because it takes the top k items of the predictions; the naming differs from Spark.
  2. MAP calculates the average precision over the relevant items, so it is normalized by the number of relevant items in the ground-truth data, not by k.

Parameters:
  • rating_true (pd.DataFrame) – True DataFrame.
  • rating_pred (pd.DataFrame) – Predicted DataFrame.
  • col_user (str) – column name for user.
  • col_item (str) – column name for item.
  • col_rating (str) – column name for rating.
  • col_prediction (str) – column name for prediction.
  • relevancy_method (str) – method for determining relevancy [‘top_k’, ‘by_threshold’].
  • k (int) – number of top k items per user.
  • threshold (float) – threshold of top items per user (optional).
Returns:

MAP at k (min=0, max=1).

Return type:

float
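
The four ranking metrics in this module (map_at_k, ndcg_at_k, precision_at_k, recall_at_k) share the same call signature; a small illustrative example:

    import pandas as pd
    from beta_rec.utils.evaluation import (
        map_at_k,
        ndcg_at_k,
        precision_at_k,
        recall_at_k,
    )

    rating_true = pd.DataFrame(
        {"col_user": [1, 1, 2], "col_item": [1, 2, 1], "col_rating": [5, 4, 5]}
    )
    rating_pred = pd.DataFrame(
        {
            "col_user": [1, 1, 2, 2],
            "col_item": [1, 3, 1, 2],
            "col_prediction": [0.9, 0.8, 0.7, 0.6],
        }
    )

    for metric in (map_at_k, ndcg_at_k, precision_at_k, recall_at_k):
        print(metric.__name__, metric(rating_true, rating_pred, k=2))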

beta_rec.utils.evaluation.merge_ranking_true_pred(rating_true, rating_pred, col_user, col_item, col_rating, col_prediction, relevancy_method, k=10, threshold=10)[source]

Filter truth and prediction data frames on common users.

Parameters:
  • rating_true (pd.DataFrame) – True DataFrame.
  • rating_pred (pd.DataFrame) – Predicted DataFrame.
  • col_user (str) – column name for user.
  • col_item (str) – column name for item.
  • col_rating (str) – column name for rating.
  • col_prediction (str) – column name for prediction.
  • relevancy_method (str) – method for determining relevancy [‘top_k’, ‘by_threshold’].
  • k (int) – number of top k items per user (optional).
  • threshold (float) – threshold of top items per user (optional).
Returns:

DataFrame of recommendation hits; DataFrame of hit counts vs. actual relevant items per user; number of unique user ids.

Return type:

pd.DataFrame, pd.DataFrame, int

beta_rec.utils.evaluation.merge_rating_true_pred(rating_true, rating_pred, col_user='col_user', col_item='col_item', col_rating='col_rating', col_prediction='col_prediction')[source]

Join truth and prediction data frames on userID and itemID.

Join the truth and prediction DataFrames on userID and itemID, and return the true and predicted ratings with aligned indices.

Parameters:
  • rating_true (pd.DataFrame) – True data.
  • rating_pred (pd.DataFrame) – Predicted data.
  • col_user (str) – column name for user.
  • col_item (str) – column name for item.
  • col_rating (str) – column name for rating.
  • col_prediction (str) – column name for prediction.
Returns:

np.array – array with the true ratings; np.array – array with the predicted ratings.

Return type:

np.array, np.array

beta_rec.utils.evaluation.ndcg_at_k(rating_true, rating_pred, col_user='col_user', col_item='col_item', col_rating='col_rating', col_prediction='col_prediction', relevancy_method='top_k', k=10, threshold=10)[source]

Compute Normalized Discounted Cumulative Gain (nDCG).

Info: https://en.wikipedia.org/wiki/Discounted_cumulative_gain

Parameters:
  • rating_true (pd.DataFrame) – True DataFrame.
  • rating_pred (pd.DataFrame) – Predicted DataFrame.
  • col_user (str) – column name for user.
  • col_item (str) – column name for item.
  • col_rating (str) – column name for rating.
  • col_prediction (str) – column name for prediction.
  • relevancy_method (str) – method for determining relevancy [‘top_k’, ‘by_threshold’].
  • k (int) – number of top k items per user.
  • threshold (float) – threshold of top items per user (optional).
Returns:

nDCG at k (min=0, max=1).

Return type:

float

beta_rec.utils.evaluation.precision_at_k(rating_true, rating_pred, col_user='col_user', col_item='col_item', col_rating='col_rating', col_prediction='col_prediction', relevancy_method='top_k', k=10, threshold=10)[source]

Precision at K.

Note: We use the same formula to calculate precision@k as Spark does. More details can be found at http://spark.apache.org/docs/2.1.1/api/python/pyspark.mllib.html#pyspark.mllib.evaluation.RankingMetrics.precisionAt. In particular, the maximum achievable precision may be less than 1 if the number of items for a user in rating_pred is less than k.

Parameters:
  • rating_true (pd.DataFrame) – True DataFrame.
  • rating_pred (pd.DataFrame) – Predicted DataFrame.
  • col_user (str) – column name for user.
  • col_item (str) – column name for item.
  • col_rating (str) – column name for rating.
  • col_prediction (str) – column name for prediction.
  • relevancy_method (str) – method for determining relevancy [‘top_k’, ‘by_threshold’].
  • k (int) – number of top k items per user.
  • threshold (float) – threshold of top items per user (optional).
Returns:

precision at k (min=0, max=1).

Return type:

float

beta_rec.utils.evaluation.recall_at_k(rating_true, rating_pred, col_user='col_user', col_item='col_item', col_rating='col_rating', col_prediction='col_prediction', relevancy_method='top_k', k=10, threshold=10)[source]

Recall at K.

Parameters:
  • rating_true (pd.DataFrame) – True DataFrame.
  • rating_pred (pd.DataFrame) – Predicted DataFrame.
  • col_user (str) – column name for user.
  • col_item (str) – column name for item.
  • col_rating (str) – column name for rating.
  • col_prediction (str) – column name for prediction.
  • relevancy_method (str) – method for determining relevancy [‘top_k’, ‘by_threshold’].
  • k (int) – number of top k items per user.
  • threshold (float) – threshold of top items per user (optional).
Returns:

recall at k (min=0, max=1). The maximum value is 1 even when fewer than k items exist for a user in rating_true.

Return type:

float

beta_rec.utils.evaluation.rmse(rating_true, rating_pred, col_user='col_user', col_item='col_item', col_rating='col_rating', col_prediction='col_prediction')[source]

Calculate Root Mean Squared Error.

Parameters:
  • rating_true (pd.DataFrame) – True data. There should be no duplicate (userID, itemID) pairs.
  • rating_pred (pd.DataFrame) – Predicted data. There should be no duplicate (userID, itemID) pairs.
  • col_user (str) – column name for user.
  • col_item (str) – column name for item.
  • col_rating (str) – column name for rating.
  • col_prediction (str) – column name for prediction.
Returns:

Root mean squared error.

Return type:

float

beta_rec.utils.evaluation.rsquared(rating_true, rating_pred, col_user='col_user', col_item='col_item', col_rating='col_rating', col_prediction='col_prediction')[source]

Calculate R squared.

Parameters:
  • rating_true (pd.DataFrame) – True data. There should be no duplicate (userID, itemID) pairs.
  • rating_pred (pd.DataFrame) – Predicted data. There should be no duplicate (userID, itemID) pairs.
  • col_user (str) – column name for user.
  • col_item (str) – column name for item.
  • col_rating (str) – column name for rating.
  • col_prediction (str) – column name for prediction.
Returns:

R squared (min=0, max=1).

Return type:

float

beta_rec.utils.logger module

class beta_rec.utils.logger.Logger(filename='default', stdout=None, stderr=None)[source]

Bases: object

Logger Class.

flush()[source]

Flush the stream.

write(message)[source]

Log out message.

beta_rec.utils.logger.get_logger(filename='default', level='info')[source]

Get logger.

beta_rec.utils.logger.init_logger(log_file_name='log', console=True, error=True, debug=False)[source]

Initialize logger.

beta_rec.utils.logger.init_std_logger(log_file='default')[source]

Initialize std logger.

beta_rec.utils.monitor module

class beta_rec.utils.monitor.Monitor(log_dir, delay=1, gpu_id=0, verbose=False)[source]

Bases: threading.Thread

Monitor Class.

run()[source]

Run the monitor.

stop()[source]

Stop the monitor.

write_cpu_status()[source]

Write CPU status.

write_gpu_status()[source]

Write GPU usage status.

write_mem_status()[source]

Write memory usage status.
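
A usage sketch; since Monitor subclasses threading.Thread, it is started with start() and runs alongside training (the training call is hypothetical):

    from beta_rec.utils.monitor import Monitor

    monitor = Monitor(log_dir="./logs", delay=1, gpu_id=0, verbose=False)
    monitor.start()        # executes run() in a background thread
    try:
        train_model()      # hypothetical training call being monitored
    finally:
        monitor.stop()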

beta_rec.utils.monitor.devices_status()[source]

Print current devices status.

beta_rec.utils.monitor.print_cpu_stat()[source]

Print CPU status.

beta_rec.utils.monitor.print_gpu_stat(gpu_id=None)[source]

Print GPU status.

beta_rec.utils.monitor.print_mem_stat(memoryInfo=None)[source]

Print memory status.

beta_rec.utils.onedrive module

class beta_rec.utils.onedrive.OneDrive(url=None, path=None)[source]

Bases: object

Download shared file/folder to localhost with persisted structure.

Download shared file/folder from OneDrive without authentication.

Parameters:
  • url (str) – URL of the shared OneDrive folder or file.
  • path (str) – local filesystem path to store the download.

Methods: download() -> None – fire an async download of all files found at the URL.

download()[source]

Download files from OneDrive.

Download files from OneDrive with the given share link.

beta_rec.utils.seq_evaluation module

beta_rec.utils.seq_evaluation.count_a_in_b_unique(a, b)[source]

Count the unique elements of a that appear in b.

Parameters:
  • a (List) – list of lists.
  • b (List) – list of lists.
Returns:

number of elements of a in b.

Return type:

count (int)

beta_rec.utils.seq_evaluation.mrr(ground_truth, prediction)[source]

Compute the Mean Reciprocal Rank metric. The Reciprocal Rank is set to 0 if no predicted item is contained in the ground truth.

Parameters:
  • ground_truth (List) – the ground truth set or sequence
  • prediction (List) – the predicted set or sequence
Returns:

the value of the metric

Return type:

rr (float)

beta_rec.utils.seq_evaluation.ndcg(ground_truth, prediction)[source]

Compute Normalized Discounted Cumulative Gain (NDCG) metric.

Parameters:
  • ground_truth (List) – the ground truth set or sequence.
  • prediction (List) – the predicted set or sequence.
Returns:

the value of the metric.

Return type:

ndcg (float)

beta_rec.utils.seq_evaluation.precision(ground_truth, prediction)[source]

Compute Precision metric.

Parameters:
  • ground_truth (List) – the ground truth set or sequence
  • prediction (List) – the predicted set or sequence
Returns:

the value of the metric

Return type:

precision_score (float)

beta_rec.utils.seq_evaluation.recall(ground_truth, prediction)[source]

Compute Recall metric.

Parameters:
  • ground_truth (List) – the ground truth set or sequence
  • prediction (List) – the predicted set or sequence
Returns:

the value of the metric

Return type:

recall_score (float)
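
An illustrative example for the sequential metrics above (the exact set-vs-sequence semantics are assumptions):

    from beta_rec.utils.seq_evaluation import mrr, precision, recall

    ground_truth = [42, 7]
    prediction = [7, 13, 42]

    print(mrr(ground_truth, prediction))        # first hit at rank 1 -> 1.0
    print(precision(ground_truth, prediction))  # 2 hits / 3 predicted ≈ 0.667
    print(recall(ground_truth, prediction))     # 2 hits / 2 relevant = 1.0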

beta_rec.utils.seq_evaluation.remove_duplicates(li)[source]

Remove duplicated items from the list.

beta_rec.utils.triple_sampler module

class beta_rec.utils.triple_sampler.Sampler(df_train, sample_file, n_sample, dump=True, load_save=False)[source]

Bases: object

Sampler Class.

load_triples_from_file(triple_file)[source]

Load triples from file.

sample()[source]

Generate samples.

sample_by_time(time_step)[source]

Generate samples by time.
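
A usage sketch; the interaction columns and the shape of the returned triples are assumptions:

    import pandas as pd
    from beta_rec.utils.triple_sampler import Sampler

    df_train = pd.DataFrame(
        {"col_user": [1, 1, 2], "col_item": [10, 11, 10]}  # hypothetical columns
    )

    sampler = Sampler(
        df_train, sample_file="./triples.csv", n_sample=100, dump=True, load_save=False
    )
    triples = sampler.sample()  # e.g. (user, positive item, negative item) triples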

beta_rec.utils.unigram_table module

class beta_rec.utils.unigram_table.UnigramTable(obj_freq)[source]

Bases: object

UnigramTable Class.

A list of indices of tokens in the vocab following a power law distribution, used to draw negative samples.

sample(count, obj_num=1, no_repeat=False)[source]

Generate samples.

Module contents

Utils Module.