Improving LLMs for Recommendation with Out-Of-Vocabulary Tokens (2024)

Ting-Ji Huang1,2 Jia-Qi Yang1,2 Chunxu Shen3 Kai-Qi Liu4
De-Chuan Zhan1,2 Han-Jia Ye1,2
1School of Artificial Intelligence, Nanjing University
2National Key Laboratory for Novel Software Technology, Nanjing University
3WeChat Technical Architecture Department, Tencent Inc.
4Software Institute, Nanjing University
Corresponding author, email: yehj@lamda.nju.edu.cn.

Abstract

Characterizing users and items through vector representations is crucial for various tasks in recommender systems. Recent approaches attempt to apply Large Language Models (LLMs) to recommendation through a question&answer format, where real users and items (e.g., Item No.2024) are represented with in-vocabulary tokens (e.g., “item”, “20”, “24”). However, since LLMs are typically pretrained on natural language tasks, these in-vocabulary tokens lack the expressive power to distinguish users and items, thereby weakening the recommendation ability even after fine-tuning on recommendation tasks. In this paper, we explore how to effectively tokenize users and items in LLM-based recommender systems. We emphasize the role of out-of-vocabulary (OOV) tokens in addition to the in-vocabulary ones, and argue that good OOV tokens should both memorize the correlations of users/items and remain diverse. By clustering the learned representations from historical user-item interactions, we make users/items with similar properties share the same OOV tokens. Furthermore, integrating these OOV tokens into the LLM’s vocabulary allows for better distinction between users and items and enhanced capture of user-item relationships during fine-tuning on downstream tasks. Our proposed framework outperforms existing state-of-the-art methods across various downstream recommendation tasks.

1 Introduction

Modern recommender systems (RS) play a crucial role in various applications like video recommendation[7], e-commerce[4], and social networking[10]. The recent advent of large language models (LLMs) offers a new direction of exploration in this realm. Models such as T5[29] and LLaMA[37], trained on massive natural language data, achieve impressive language understanding capabilities in text generation and conversation tasks[48]. Their success drives explorations of using pre-trained LLMs as model backbones for handling various recommendation tasks like sequential recommendation, rating, and explanation tasks[3, 42]. In such frameworks, we describe the task input as a query and expect an LLM to directly answer with the task label[6, 13]. After fine-tuning through this Q&A format, we can ask the LLM to recommend something for a user and receive a real item as the answer.


How can we characterize users and items so that LLMs learn their personalized information during fine-tuning? Since tokens are the basic building blocks for LLMs and represent the smallest unit of text the model can understand and process, we should allocate some tokens to represent users/items and let LLMs learn their characteristics through those tokens. While early works[6] directly represent a real item by text tokens (e.g., “blue”, “casual”, “T-shirt”), LLMs might answer with text that does not correspond to a real existing item[17, 18]. Recent advancements explore representing users/items with Identifiers (IDs)[13, 40, 12, 2]. As illustrated in Figure LABEL:fig:teaser_a, a real item (Item No.2024) can be indexed with a sequence of in-vocabulary tokens (“item”, “20”, “24”); we call such a token combination an ID, and a numeric ID indicates that we use numeric tokens (“20”, “24”).

Characterizing users and items with numeric IDs is straightforward but introduces a mapping problem: it is hard to align the limited numeric tokens with thousands of items in recommender systems, and token combinations are prone to language conflicts[2, 17]. As shown in Figure LABEL:fig:teaser_b, using numeric tokens to represent items leads to similar representations of distinctive items. One solution is to create OOV tokens specifically allocated for IDs to allow distinction, as in the previous work on Collaborative ID (CID)[17]. However, our experiments indicate that although the OOV tokens defined in CID may improve diversity, it is hard for the LLM to learn the correlations of users/items during fine-tuning (see Section 3.3). Therefore, we should utilize redefined OOV tokens capable of capturing user/item correlations (memorization) while also distinguishing different items (diversity), two dimensions that are positively correlated with the performance of recommendation tasks.

In this paper, we present META ID (META-path-guided IDentifier), a framework for characterizing users/items using out-of-vocabulary (OOV) tokens for LLM-based recommendations. Initially, we generate meta-paths to represent user-item interactions and then obtain user and item representations from a skip-gram model trained on these meta-paths. A meta-path is a sequence that represents interactions between users and items in a graph structure. By clustering these meta-path-based representations, we create hierarchical groups that serve as OOV tokens for constructing user and item IDs. This approach extends beyond previous research, which has predominantly focused on item IDs[13, 30], by making the representations of user/item IDs share the same OOV tokens if they have similar properties. Finally, integrating these OOV tokens into the LLM’s vocabulary allows for better diversity and enhanced memorization during fine-tuning on downstream recommendation tasks. Additionally, we align the token embedding layer of LLMs with a linear transformation layer to enrich the OOV token representations as an augmentation. Our contributions are as follows:

  • We introduce memorization and diversity scores to evaluate ID representations in LLM-based recommender systems, focusing on capturing user/item correlations and ensuring diversity.

  • We develop META ID, which uses out-of-vocabulary tokens to characterize users/items, enhancing the memorization and diversity of their representations for LLMs.

  • The experiments show that META ID improves memorization and diversity scores, which leads to improvements in various recommendation tasks.

2 Related Work

Characterizing Items by IDs. Modern recommendation models usually use unique IDs to represent users and items, which are subsequently converted to embedding vectors as learnable parameters[11]. Common approaches include matrix factorization[21, 32], two-tower models[38], and deep neural networks[35, 19, 49, 46, 43, 39], which make predictions by examining historical user-item interactions to identify behavioral patterns and enable collaborative recommendations[22]. Some recent approaches adopt the concept of an ID to represent the token combinations that characterize items in LLMs[13, 17]. We follow previous studies and also refer to these token combinations as IDs. This paper aims to combine LLMs with IDs more effectively by proposing a new ID construction method.

Instruction Tuning for Recommendation. The integration of Large Language Models (LLMs) into diverse tasks has seen significant growth recently. Recent works fine-tune pretrained language models on large-scale NLP datasets verbalized via human-readable prompts[33, 41]. These instruction tuning methods design prompts containing detailed task descriptions and adhere more closely to the natural language format. Driven by their exceptional natural language processing capabilities, researchers aim to transfer their linguistic ability to enhance recommender systems[25, 45]. These LLMs process user interactions as sequences of tokens and, through fine-tuning, predict users’ future interests based on past activities[47, 5]. Moreover, some studies reframe tasks like retrieval, rating, and explanation generation as language comprehension tasks[6, 13, 23], allowing LLMs to function as multi-task recommenders, producing recommendations and explanations with a unified architecture. In this paper, we apply LLMs via instruction tuning in multi-task recommendation scenarios.

ID Construction with Tokens. Recent studies explore using in-vocabulary tokens to characterize items in LLMs, where users and items are represented by token combinations called IDs. Early efforts such as P5[13] convert user-item interactions into natural language formats using numeric IDs constructed of in-vocabulary tokens of the T5 model[29]. Further, sequential IDs and collaborative IDs are developed to enhance item information sharing[17]. A relevant study[30] constructs IDs using hierarchical tokens through RQ-VAE. Despite these advancements in ID construction, challenges remain in the lack of ID construction criteria and the focus on item IDs. In contrast, our META ID approach assigns memorization and diversity scores for evaluation and constructs both user and item IDs, which improves the characterization of users and items.

3 Preliminary

We describe recommendation tasks (e.g., sequential recommendation) under the instruction tuning setting, in which all data such as user-item interactions are converted to natural language sequences in a question&answer format (e.g., Input: Please recommend an item for user 2024 considering he has purchased item 2023; Output: item 2024). Details of this format are presented in Appendix C.
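For concreteness, the following is a minimal sketch of this conversion; the function name and template wording are illustrative rather than the paper's exact prompts (those are listed in Appendix C).

```python
# Toy sketch: turn one user-item interaction record into a question&answer pair.
def build_example(user_id: str, history: list[str], target_item: str) -> tuple[str, str]:
    """Convert a user's purchase history into an (input, output) instruction pair."""
    question = (f"Please recommend an item for user {user_id} "
                f"considering he has purchased {', '.join(history)}")
    answer = target_item
    return question, answer

# Input: "... user 2024 considering he has purchased item 2023" -> Output: "item 2024"
question, answer = build_example("2024", ["item 2023"], "item 2024")
```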

3.1 Instruction Tuning for Recommendation

Given a user set $\mathbf{U}$ and an item set $\mathbf{I}$, we formulate a recommendation task as a natural language instruction that pairs an input token sequence $\mathbf{x}=(x_1,\dots,x_n,\boldsymbol{x}_u,\boldsymbol{x}_i)$ with a corresponding label token sequence $\mathbf{y}=(y_1,\dots,y_m,\boldsymbol{x}_u,\boldsymbol{x}_i)$. The goal is to train an LLM $\mathcal{M}_{\boldsymbol{\theta}}$ to generate $\mathbf{y}$ given $\mathbf{x}$. For simplicity, we denote a single token as $x_i$ or $y_i$, represent the ID of a user $u\in\mathbf{U}$ by its token combination $\boldsymbol{x}_u=(x_{u_1},x_{u_2},\dots)$, and use the same representation format for an item ID $\boldsymbol{x}_i$.

The LLM $\mathcal{M}_{\boldsymbol{\theta}}$ employs a token embedding layer $\boldsymbol{E}(\cdot)$ with parameters $\boldsymbol{\theta}_{\boldsymbol{E}}\in\boldsymbol{\theta}$, functioning as a lookup table that transforms each input token into a token embedding. It subsequently predicts the probability distribution of label tokens by forward propagation. The training objective is to minimize the negative log-likelihood of the label tokens conditioned on the input sequence and previously generated tokens, formulated as:

$$\boldsymbol{\theta}^{*}=\operatorname*{arg\,min}_{\boldsymbol{\theta}}\mathcal{L}_{\boldsymbol{\theta}}=-\sum_{j=1}^{|\mathbf{y}|}\log P_{\boldsymbol{\theta}}\big(y_{j}\mid y_{<j},\boldsymbol{E}(\mathbf{x})\big), \qquad (1)$$

where $\boldsymbol{\theta}^{*}$ represents the optimal set of parameters that we aim to learn. This supervised learning approach helps the model internalize personalized information about users and items through the token embeddings of their respective IDs, $\boldsymbol{x}_u$ and $\boldsymbol{x}_i$.
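As a concrete illustration, a hedged sketch of this objective with the Hugging Face Transformers API is shown below; the checkpoint name and prompt are placeholders, and the library averages the per-token losses, which matches Equation (1) up to a constant factor.

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")   # illustrative backbone
model = T5ForConditionalGeneration.from_pretrained("t5-small")

question = "Please recommend an item for user 2024 considering he has purchased item 2023"
answer = "item 2024"

inputs = tokenizer(question, return_tensors="pt")
labels = tokenizer(answer, return_tensors="pt").input_ids

# The model computes -sum_j log P(y_j | y_<j, E(x)) over the label tokens,
# i.e. the negative log-likelihood of Eq. (1), and exposes it as `loss`.
loss = model(input_ids=inputs.input_ids,
             attention_mask=inputs.attention_mask,
             labels=labels).loss
loss.backward()   # an optimizer step over the parameters theta would follow
```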

3.2 Represent Users/Items with In-Vocabulary Tokens

The token embedding layer $\boldsymbol{E}$ in LLMs transforms input tokens into their corresponding token embeddings. Since users and items are also represented by IDs constructed of tokens, the token embeddings of the IDs $\boldsymbol{x}_u$ and $\boldsymbol{x}_i$ are the key to assessing how LLMs capture and represent different items. We define the ID representation of an ID $\boldsymbol{x}_i$ as the average of the embeddings of its token combination:

$$\mathbf{e}_{i}=\frac{1}{|\boldsymbol{x}_{i}|}\sum_{x\in\boldsymbol{x}_{i}}\boldsymbol{E}(x), \qquad (2)$$

where $\boldsymbol{E}(x)$ denotes the embedding vector of a token $x$. For example, an item identified by “item2024” is represented by the average of the embedding vectors of its three tokens (“item”, “20”, “24”).
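A minimal sketch of Equation (2) is given below; it assumes a Transformers-style tokenizer and embedding layer, and the helper name is ours.

```python
import torch

def id_representation(id_tokens, tokenizer, embedding_layer):
    """Average the token embeddings of an ID's token combination (Eq. 2)."""
    token_ids = tokenizer.convert_tokens_to_ids(id_tokens)
    token_embs = embedding_layer(torch.tensor(token_ids))   # shape (|x_i|, d)
    return token_embs.mean(dim=0)                           # shape (d,)

# e.g. e_i = id_representation(["item", "20", "24"], tokenizer, model.get_input_embeddings())
```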


3.3 Metric for Representation Evaluation

As mentioned above, using numeric IDs to represent items leads to similar representations of distinctive items. For an intuitive explanation, we first visualize the cosine similarity matrices between items as heatmaps in Figure 1, where RID and SID produce a large number of similar items due to semantic conflicts, while META ID, constructed of OOV tokens, shows distinguishable similarities closer to the ground truth. We further plot the ID representations using t-SNE visualization in Figure LABEL:fig:score-tsne. It is clear that RID and SID (using in-vocabulary tokens) shrink into a small region relative to CID (using OOV tokens), reflecting that in-vocabulary tokens lack expressive power for distinctive users and items. Details are in Section 5.3.

To quantify this, we introduce two metrics to assess the memorization and diversity of ID representations.

Diversity Score (DS) is a metric designed to quantify the diversity of ID representations within LLMs. Items represented by ID representations should be easily distinguishable for the model. For a pair of items $i$ and $j$, their differentiation can be calculated using the Kullback-Leibler (KL) divergence on the ID representations $\mathbf{e}_i\in\mathbf{E}$ and $\mathbf{e}_j\in\mathbf{E}$ from Equation 2, which reflects how distinct they are in the model’s embedding space:

$$\text{DS}(\mathbf{E})=\frac{1}{2N}\sum_{n=1}^{N}\Big[D_{KL}(\mathbf{e}_{i}\,\|\,\mathbf{e}_{j})+D_{KL}(\mathbf{e}_{j}\,\|\,\mathbf{e}_{i})\Big]. \qquad (3)$$

Here, $N$ represents the number of randomly selected item pairs; we apply this sampling strategy to reduce the computational complexity, and we give a convergence analysis in Figure 6.
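The sketch below illustrates one way to compute DS. Since the KL divergence requires probability distributions and the normalization of the ID representations is not spelled out above, softmax-normalizing each representation is our assumption, as are the pair count and the function name.

```python
import torch
import torch.nn.functional as F

def diversity_score(id_reps: torch.Tensor, num_pairs: int = 10000) -> float:
    """Approximate Eq. (3) on a (num_items, d) matrix of ID representations."""
    n = id_reps.size(0)
    i = torch.randint(0, n, (num_pairs,))        # randomly sampled item pairs
    j = torch.randint(0, n, (num_pairs,))
    p = F.softmax(id_reps[i], dim=-1)            # turn embeddings into distributions (assumption)
    q = F.softmax(id_reps[j], dim=-1)
    log_p, log_q = p.clamp_min(1e-12).log(), q.clamp_min(1e-12).log()
    kl_pq = (p * (log_p - log_q)).sum(-1)        # D_KL(e_i || e_j)
    kl_qp = (q * (log_q - log_p)).sum(-1)        # D_KL(e_j || e_i)
    return (0.5 * (kl_pq + kl_qp)).mean().item()
```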

Memorization Score (MS) quantifies the relationships captured by ID representations by measuring the similarity between items based on users’ ratings. We use the adjusted cosine similarity[34] to provide ground-truth relational values, given by:

$$\text{sim}(i,j)=\frac{\sum_{u\in U}(R_{u,i}-\bar{R}_{u})\cdot(R_{u,j}-\bar{R}_{u})}{\sqrt{\sum_{u\in U}(R_{u,i}-\bar{R}_{u})^{2}\cdot\sum_{u\in U}(R_{u,j}-\bar{R}_{u})^{2}}}, \qquad (4)$$

where $R_{u,i}$ and $R_{u,j}$ denote user $u$’s ratings for items $i$ and $j$, and $\bar{R}_{u}$ is user $u$’s average rating. To assess the relationships captured by the learned ID representations, we compute the mean squared error (MSE) between the cosine similarity of ID representations and the corresponding ground-truth relational values, which forms the basis of the MS:

$$\text{MS}(\mathbf{E})=\frac{2}{|N|(|N|-1)}\sum_{i,j\in N,\,i<j}\left(\frac{\mathbf{e}_{i}^{\top}\mathbf{e}_{j}}{|\mathbf{e}_{i}||\mathbf{e}_{j}|}-\text{sim}(i,j)\right)^{2}. \qquad (5)$$
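A sketch of Equations (4)-(5) follows. It assumes a dense rating matrix with zeros for missing entries and restricts the adjusted cosine similarity to users who rated both items; these simplifications and the function names are ours.

```python
import numpy as np

def adjusted_cosine(ratings: np.ndarray, i: int, j: int) -> float:
    """Adjusted cosine similarity (Eq. 4) on a (num_users, num_items) rating matrix."""
    both = (ratings[:, i] > 0) & (ratings[:, j] > 0)              # users who rated i and j
    user_mean = ratings.sum(1) / np.maximum((ratings > 0).sum(1), 1)
    di = ratings[both, i] - user_mean[both]
    dj = ratings[both, j] - user_mean[both]
    denom = np.sqrt((di ** 2).sum() * (dj ** 2).sum())
    return float((di * dj).sum() / denom) if denom > 0 else 0.0

def memorization_score(id_reps: np.ndarray, ratings: np.ndarray, item_pairs) -> float:
    """Mean squared gap (Eq. 5) between embedding cosine similarity and Eq. (4)."""
    errors = []
    for i, j in item_pairs:
        e_i, e_j = id_reps[i], id_reps[j]
        cos = e_i @ e_j / (np.linalg.norm(e_i) * np.linalg.norm(e_j))
        errors.append((cos - adjusted_cosine(ratings, i, j)) ** 2)
    return float(np.mean(errors))
```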

Our quantitative assessment, based on the introduced metrics, reveals intriguing insights into the quality of these ID representations. Figure LABEL:fig:score illustrates that IDs constructed from in-vocabulary tokens (RID and SID) perform poorly in terms of both diversity and memorization. Although the use of OOV tokens (CID) enhances diversity, its memorization score is unsatisfactory. These findings highlight the need for more effective forms of OOV tokens for ID construction, so that IDs capture user/item correlations while also distinguishing different items.

4 META ID


We now introduce the META ID framework to enhance LLMs in recommendation tasks. META ID involves creating out-of-vocabulary (OOV) tokens for constructing user and item IDs, which provide a rich, expressive space and encapsulate comprehensive, collaborative information. As illustrated in Figure 3, our process begins with sampling meta-paths from user-item interaction history. We then apply a skip-gram model to learn user and item representations from these meta-path sequences. This process ensures that users’ and items’ features are projected into a shared space to capture their interaction relationship better. This is followed by K-Means clustering to group similar users or items, after which unique OOV tokens are assigned to each cluster to construct the META ID.

4.1 Meta-path-based Embedding

We frame user-item interactions within a graph embedding learning paradigm[8], constructing an interaction graph composed of user nodes $U$ and item nodes $I$, linked by the interaction history with ratings $R$. The core of this embedding learning strategy is the meta-path[8], a sequence of connections that reflects composite relationships within the graph. Our primary meta-path, denoted U-I-U, materializes when users consistently rate an item with the same score $R_i$, forming a path:

$$p=\{U_{1}\xrightarrow{R_{i}}I_{2}\xrightarrow{R_{i}}U_{3}\xrightarrow{R_{i}}I_{4}\cdots\xrightarrow{R_{i}}U_{k}\}_{i=1}^{|R|}. \qquad (6)$$

We employ a skip-gram model to capture the interaction dynamics represented by these meta-paths. The model is trained on sequences generated from meta-path-based random walks[8], producing representations $\mathbf{W}_{\mathbf{U}}=[\mathbf{w}_{u_1}^{\top},\dots,\mathbf{w}_{u_m}^{\top}]\in\mathbb{R}^{d\times m}$ and $\mathbf{W}_{\mathbf{I}}=[\mathbf{w}_{i_1}^{\top},\dots,\mathbf{w}_{i_n}^{\top}]\in\mathbb{R}^{d\times n}$, denoting the representations of the $m$ users and $n$ items in a $d$-dimensional space, respectively. Through this process, we acquire deep representations of user and item interactions, which are instrumental in enhancing the accuracy and relevance of our recommendation system.
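A hedged sketch of this step is shown below: it samples U-I-U-... walks that follow edges sharing the same rating and trains a skip-gram model with gensim's Word2Vec (sg=1). The walk lengths, embedding size, and use of gensim are our assumptions; the paper specifies meta-path-based random walks and a skip-gram model but not a particular implementation.

```python
import random
from collections import defaultdict
from gensim.models import Word2Vec

def sample_walks(interactions, walk_len=20, walks_per_node=10):
    """interactions: iterable of (user, item, rating); returns U-I-U-... node walks."""
    user_nbrs = defaultdict(lambda: defaultdict(list))   # rating -> user node -> item nodes
    item_nbrs = defaultdict(lambda: defaultdict(list))   # rating -> item node -> user nodes
    for u, i, r in interactions:
        user_nbrs[r][f"user_{u}"].append(f"item_{i}")
        item_nbrs[r][f"item_{i}"].append(f"user_{u}")
    walks = []
    for r in user_nbrs:                                  # one set of walks per rating value R_i
        for start in user_nbrs[r]:
            for _ in range(walks_per_node):
                walk, node = [start], start
                for _ in range(walk_len - 1):
                    table = user_nbrs[r] if node.startswith("user_") else item_nbrs[r]
                    nbrs = table.get(node, [])
                    if not nbrs:
                        break
                    node = random.choice(nbrs)
                    walk.append(node)
                walks.append(walk)
    return walks

walks = sample_walks([(1, 10, 5), (2, 10, 5), (2, 11, 5)])      # toy interactions
w2v = Word2Vec(sentences=walks, vector_size=64, window=5, sg=1, min_count=1)
# w2v.wv["user_1"] and w2v.wv["item_10"] now live in the shared d-dimensional space.
```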

4.2 OOV Token Generation

To balance diversity and memorization in large-scale recommendation systems, we adopt a cluster-based approach. Representations derived from user-item interactions are organized into a shared embedding space and segmented into $G$ clusters. Each cluster center $\mu_g$ is defined as the average of the representations within that cluster, effectively capturing the collective characteristics of its members. These centroids then categorize each user and item, providing a refined foundation for constructing granular IDs. The generation process is a two-step procedure:

1) Assign coarse-grained tokens based on centroids. We first cluster the learned representations $\mathbf{W}$ and set the number of clusters to $G$. Each cluster centroid $\mu_g$ is calculated as:

$$\mu_{g}=\left\{\frac{1}{|\mathbb{I}(g_{i}=g)|}\sum_{(\mathbf{w}_{i},g_{i})\in(\mathbf{W},G)}\big[\mathbf{w}_{i}\cdot\mathbb{I}(g_{i}=g)\big]\right\}_{g\in[1,G]}, \qquad (7)$$

in which we use $i$ to represent a user/item for simplification. As a coarse-grained distinction between users and items, we assign an OOV token “$\langle CT_i\rangle$” to each centroid.

2) Assign fine-grained tokens based on distance. Within each cluster, we assign a fine-grained token “$\langle y_i\rangle$” to each member in ascending order of its distance from the cluster centroid.

The resulting identifiers, META ID, combine a coarse-grained token with a fine-grained one that uniquely identifies each user or item within its cluster. For example, an item might be represented as “$\langle\text{Item}\rangle\langle CT_i\rangle\langle y_i\rangle$”, labeled with three tokens: “$\langle\text{Item}\rangle$” denotes that it is an item, “$\langle CT_i\rangle$” is its coarse-grained token, and “$\langle y_i\rangle$” is its fine-grained token.

This clustering and labeling process effectively compresses the vocabulary needed for ID representation while preserving the rich information necessary for recommendation tasks. In our implementation, we apply K-Means clustering[1], utilizing cosine similarity to measure the affinity of representations to cluster centers. This method simplifies the complexity of representation space management and proves robust in our experimental validations in Section 5.4.
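The sketch below illustrates the two-step token assignment with scikit-learn's KMeans. Because that implementation uses Euclidean distance, the representations are L2-normalized first so that distances approximate cosine similarity; this trick, the cluster count, and the token spellings are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

def build_meta_ids(names, reps, num_clusters=100):
    """names: list like ['user_1', 'item_10', ...]; reps: (len(names), d) embeddings."""
    reps_n = normalize(reps)                                   # unit length ~ cosine geometry
    km = KMeans(n_clusters=num_clusters, n_init=10, random_state=0).fit(reps_n)
    meta_ids = {}
    for g in range(num_clusters):
        members = np.where(km.labels_ == g)[0]
        dists = np.linalg.norm(reps_n[members] - km.cluster_centers_[g], axis=1)
        # Step 1: coarse token <CT_g> from the centroid; Step 2: fine token <y_rank>
        # from the member's rank by distance to that centroid.
        for rank, idx in enumerate(members[np.argsort(dists)]):
            prefix = "<Item>" if names[idx].startswith("item_") else "<User>"
            meta_ids[names[idx]] = f"{prefix}<CT_{g}><y_{rank}>"
    return meta_ids, km.cluster_centers_
```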

4.3 Integration of META ID with LLMs

The integration of the OOV tokens of META ID, denoted as $\mathbf{x}_{\text{OOV}}$, with LLMs involves expanding the vocabulary. This is achieved by extending the token embedding layer’s parameters from $\boldsymbol{\theta}_{\boldsymbol{E}}\in\mathbb{R}^{N\times d}$ to $\boldsymbol{\theta}_{\boldsymbol{E}'}\in\mathbb{R}^{(N+n)\times d}$, where $N$ is the number of in-vocabulary tokens, $n$ is the number of OOV tokens, and $d$ is the dimension of the token embeddings.

A good initialization helps token embedding learning; here we introduce a representation augmentation approach, in contrast to previous works that use random initialization[13, 17]. As shown in Figure LABEL:fig:method_b, the OOV tokens pass through a linear layer $\boldsymbol{F}(\cdot)$ initialized with the category embeddings $\mu_g$ from Equation 7. Finally, the training objective for integrating META ID into LLMs is reformulated as:

$$\boldsymbol{\theta}'^{*}=\operatorname*{arg\,min}_{(\boldsymbol{\theta}',\boldsymbol{F})}\mathcal{L}_{\boldsymbol{\theta}'}=-\sum_{j=1}^{|\mathbf{y}|}\log P_{\boldsymbol{\theta}'}\big(y_{j}\mid y_{<j},\boldsymbol{E}'(\mathbf{x}),\boldsymbol{F}(\mathbf{x}_{\text{OOV}})\big), \qquad (8)$$

where $\boldsymbol{\theta}'$ now includes $\boldsymbol{F}$, aligning with the modified embedding layer $\boldsymbol{\theta}_{\boldsymbol{E}'}$ to optimize the model’s performance with the OOV tokens $\mathbf{x}_{\text{OOV}}$. This training objective ensures the model can effectively distinguish between diverse users and items, improving its ability to capture user-item relationships. By enhancing the memorization and diversity of OOV tokens, the model achieves better performance in recommendation tasks, leading to more accurate and personalized recommendations in our experiments.
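For orientation, a hedged sketch of the vocabulary extension is shown below using the Transformers API. Initializing each coarse-grained token's embedding from a small linear projection of its cluster centroid is our simplification of the augmentation described above; the placeholder centroids, token counts, and dimensions are illustrative.

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# OOV tokens: user/item markers plus coarse- and fine-grained tokens (illustrative sizes).
oov_tokens = (["<User>", "<Item>"]
              + [f"<CT_{g}>" for g in range(100)]
              + [f"<y_{k}>" for k in range(500)])
tokenizer.add_tokens(oov_tokens)
model.resize_token_embeddings(len(tokenizer))          # theta_E -> theta_E' of size (N+n) x d

d_meta = 64                                            # skip-gram embedding size (assumed)
d_model = model.get_input_embeddings().embedding_dim
proj = torch.nn.Linear(d_meta, d_model)                # the linear layer F(.)
cluster_centers = torch.randn(100, d_meta)             # placeholder for the K-Means centroids mu_g

with torch.no_grad():
    emb = model.get_input_embeddings().weight
    for g in range(100):
        tok_id = tokenizer.convert_tokens_to_ids(f"<CT_{g}>")
        emb[tok_id] = proj(cluster_centers[g])         # centroid-based initialization
```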

Table 1: Performance comparison on the sequential recommendation task (H@K: Hit Ratio, N@K: NDCG). META ID (T) and META ID (L) denote our method with the T5 and LLaMA2-7b backbones, respectively (Section 5.1).

| Methods | Sports H@5 | Sports N@5 | Sports H@10 | Sports N@10 | Beauty H@5 | Beauty N@5 | Beauty H@10 | Beauty N@10 | Toys H@5 | Toys N@5 | Toys H@10 | Toys N@10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Caser [36] | 0.0116 | 0.0072 | 0.0194 | 0.0097 | 0.0205 | 0.0131 | 0.0347 | 0.0176 | 0.0166 | 0.0107 | 0.0270 | 0.0141 |
| HGN [26] | 0.0189 | 0.0120 | 0.0313 | 0.0159 | 0.0325 | 0.0206 | 0.0512 | 0.0266 | 0.0321 | 0.0221 | 0.0497 | 0.0277 |
| GRU4Rec [15] | 0.0129 | 0.0086 | 0.0204 | 0.0110 | 0.0164 | 0.0099 | 0.0283 | 0.0137 | 0.0097 | 0.0059 | 0.0176 | 0.0084 |
| BERT4Rec [35] | 0.0115 | 0.0075 | 0.0191 | 0.0099 | 0.0203 | 0.0124 | 0.0347 | 0.0170 | 0.0116 | 0.0071 | 0.0203 | 0.0099 |
| FDSA [14] | 0.0182 | 0.0122 | 0.0288 | 0.0156 | 0.0267 | 0.0163 | 0.0407 | 0.0208 | 0.0228 | 0.0140 | 0.0381 | 0.0189 |
| SASRec [19] | 0.0233 | 0.0154 | 0.0350 | 0.0192 | 0.0387 | 0.0249 | 0.0605 | 0.0318 | 0.0463 | 0.0306 | 0.0675 | 0.0374 |
| S3-Rec [49] | 0.0251 | 0.0161 | 0.0385 | 0.0204 | 0.0387 | 0.0244 | 0.0647 | 0.0327 | 0.0443 | 0.0294 | 0.0700 | 0.0376 |
| CL4SRec [43] | 0.0219 | 0.0138 | 0.0358 | 0.0182 | 0.0330 | 0.0201 | 0.0546 | 0.0270 | 0.0427 | 0.0244 | 0.0617 | 0.0305 |
| RID [17] | 0.0208 | 0.0122 | 0.0288 | 0.0153 | 0.0213 | 0.0178 | 0.0479 | 0.0277 | 0.0044 | 0.0029 | 0.0062 | 0.0035 |
| SID [13] | 0.0223 | 0.0173 | 0.0294 | 0.0196 | 0.0404 | 0.0299 | 0.0573 | 0.0354 | 0.0050 | 0.0031 | 0.0088 | 0.0043 |
| CID [17] | 0.0269 | 0.0196 | 0.0378 | 0.0231 | 0.0336 | 0.0227 | 0.0507 | 0.0281 | 0.0172 | 0.0109 | 0.0279 | 0.0143 |
| META ID (T) | 0.0322 | 0.0223 | 0.0487 | 0.0277 | 0.0510 | 0.0351 | 0.0753 | 0.0429 | 0.0503 | 0.0352 | 0.0742 | 0.0429 |
| META ID (L) | 0.0392 | 0.0278 | 0.0561 | 0.0332 | 0.0458 | 0.0320 | 0.0678 | 0.0391 | 0.0387 | 0.0264 | 0.0535 | 0.0312 |

5 Experiment

We evaluate META ID on five downstream recommendation tasks: sequential recommendation, direct recommendation, rating prediction, explanation generation, and review-related tasks. We analyze the influence of critical components in META ID and assess the ID representations through visualization and our proposed metrics. Details of task descriptions and pre-processing are in Appendix C.

5.1 Evaluation on Sequential Recommendation

Setups. We evaluate our META ID framework on three public real-world datasets from the Amazon Product Reviews dataset [27], focusing specifically on Sports, Beauty, and Toys. The datasets are processed following the methodology in P5[13].

Baselines. We compare against a variety of established models (described briefly in Appendix C), spanning CNN-based to LLM-based frameworks: Caser [36], HGN [26], GRU4Rec [15], BERT4Rec [35], FDSA [14], SASRec [19], S3-Rec [49], and CL4SRec [43]. In addition, we compare with P5 variants equipped with different ID construction strategies: Sequential ID (SID), Random ID (RID), and Collaborative ID (CID)[17].

Evaluations. We apply the widely accepted metrics top-k Hit Ratio (H@K) and Normalized Discounted Cumulative Gain (N@K) with K = 5, 10 to evaluate recommendation performance.
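As a reference, a minimal sketch of these metrics for the common single-target evaluation protocol is given below; the function name and list-based inputs are ours.

```python
import numpy as np

def hit_and_ndcg(ranked_lists, targets, k=10):
    """ranked_lists: per-user ranked item IDs; targets: the held-out item per user."""
    hits, ndcgs = [], []
    for preds, target in zip(ranked_lists, targets):
        topk = list(preds)[:k]
        if target in topk:
            rank = topk.index(target)                # 0-based position in the top-k list
            hits.append(1.0)
            ndcgs.append(1.0 / np.log2(rank + 2))    # single relevant item, so IDCG = 1
        else:
            hits.append(0.0)
            ndcgs.append(0.0)
    return float(np.mean(hits)), float(np.mean(ndcgs))   # H@K, N@K
```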

Implementation Details. For constructing META IDs, the number of clustering groups is limited to $|G|=100$ (200 for Toys). For LLM fine-tuning, we consider both the encoder-decoder architecture T5-small [28] and the decoder-only architecture LLaMA2-7b [37]. We fully fine-tune the T5 model and employ LoRA [16] to fine-tune LLaMA2-7b. Vocabulary sizes of these models are shown in Table 9. For LLM inference, we use beam search to generate potential items, evaluated under the all-item setting.
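The snippet below sketches this inference step, assuming `model` and `tokenizer` are the fine-tuned LLM and extended tokenizer from the Section 4.3 sketch; the prompt wording, beam width, and number of returned beams are illustrative.

```python
# Beam-search decoding of candidate item identifiers for one user prompt.
prompt = "Please recommend an item for user <User><CT_3><y_7>"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=8,          # enough to emit an "<Item><CT_g><y_k>" identifier
    num_beams=20,              # beam search over candidate continuations
    num_return_sequences=10,   # keep the top beams as the ranked recommendation list
)
candidates = tokenizer.batch_decode(outputs)
```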

Results. Table 1 presents our findings for the sequential recommendation task (the standard errors of the metrics for META ID are reported in Table 11). Our observations are as follows: 1) META ID demonstrates superior performance on all three datasets, underscoring its robustness. 2) IDs constructed of in-vocabulary tokens, RID and SID, underperform on Toys, suggesting limitations in their recommendation efficacy for LLMs. 3) CID shows marked improvements over RID and SID on the Toys dataset, highlighting the benefits of incorporating OOV tokens with collaborative information. 4) While the LLaMA2-7b backbone is better on the Sports dataset, its performance on the Beauty and Toys datasets is not as good as T5’s, which could be linked to the distinct fine-tuning methodologies applied to these models.

5.2 Evaluation on Various Recommendation Tasks

Setups. To validate META ID’s adaptability, we extend our evaluation to direct recommendation, rating prediction, explanation generation, and review tasks, akin to P5[13]. For direct recommendation, the model is asked to recommend items for users directly, without access to the user’s interaction history. For rating prediction, the model predicts a numerical rating between 1 and 5 based on user-item data. For explanation tasks, it generates textual justifications for a user’s preference toward an item, while in review tasks it summarizes lengthy reviews into concise titles.

Baselines. We compare to three different ID construction strategies: RID, SID, and CID.

Evaluations. For direct recommendation, we apply the same metrics as in Section 5.1. For rating prediction, we use the MSE metric. For explanation and review tasks, we employ BLEU-1/4 metrics.

Implementation Details. For inference, we apply greedy decoding for the rating, explanation, and review tasks, and beam search under the all-item setting for the direct recommendation task.

Results. For direct recommendation (Table 2), META ID exceeds other methods across datasets in the all-item setting, suggesting that META ID effectively models the direct relationship between users and items. The results for the other three tasks in Table 3 show that META ID significantly improves the BLEU scores on Sports and Beauty compared to other methods. This result suggests that META ID can improve performance in text relevance tasks, including the interpretation of recommendations.

Table 2: Direct recommendation results under the all-item setting.

| Methods | Sports H@5 | Sports N@5 | Sports H@10 | Sports N@10 | Beauty H@5 | Beauty N@5 | Beauty H@10 | Beauty N@10 | Toys H@5 | Toys N@5 | Toys H@10 | Toys N@10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RID [17] | 0.0030 | 0.0023 | 0.0042 | 0.0027 | 0.0203 | 0.0155 | 0.0276 | 0.0178 | 0.0046 | 0.0030 | 0.0063 | 0.0035 |
| SID [13] | 0.0211 | 0.0169 | 0.0267 | 0.0187 | 0.0296 | 0.0226 | 0.0405 | 0.0261 | 0.0025 | 0.0014 | 0.0041 | 0.0019 |
| CID [17] | 0.0250 | 0.0189 | 0.0342 | 0.0219 | 0.0216 | 0.0147 | 0.0340 | 0.0187 | 0.0076 | 0.0049 | 0.0014 | 0.0070 |
| META ID | 0.0357 | 0.0256 | 0.0520 | 0.0308 | 0.0480 | 0.0336 | 0.0689 | 0.0403 | 0.0564 | 0.0391 | 0.0803 | 0.0468 |

Table 3: Results on rating prediction, explanation generation, and review tasks.

| Task Type | Metric | Sports RID | Sports SID | Sports CID | Sports META ID | Beauty RID | Beauty SID | Beauty CID | Beauty META ID |
|---|---|---|---|---|---|---|---|---|---|
| Rating | RMSE | 1.0382 | 1.0486 | 1.0383 | 1.0327 | 1.2829 | 1.3098 | 1.2819 | 1.2818 |
| Explanation | BLEU-1 | 16.2567 | 16.5825 | 16.6121 | 16.9005 | 18.2299 | 18.3981 | 19.3499 | 19.5106 |
| Explanation | BLEU-4 | 2.1782 | 2.1944 | 2.2332 | 2.3481 | 2.9027 | 2.8071 | 3.0626 | 3.0592 |
| Review | BLEU-1 | 7.6140 | 7.7948 | 7.6586 | 7.8819 | 6.2282 | 6.5055 | 6.5854 | 7.0500 |
| Review | BLEU-4 | 2.3228 | 1.2406 | 2.4109 | 2.6546 | 1.9891 | 1.2406 | 1.9718 | 2.7485 |

5.3 Evaluation of ID Representation

Visualization. The number of numeric tokens available in LLMs is relatively limited, which complicates the establishment of unique one-to-one ID relationships: two unrelated items might share the same tokens in their IDs. For an intuitive explanation, we visualize the cosine similarity matrices between items as heatmaps in Figure 1, where we randomly sample 50 items from the Toys dataset and take their adjusted cosine similarity from Equation 4 as ground truth, compared against RID, SID, and META ID. RID and SID result in a large number of similar items due to semantic conflicts, while META ID shows distinguishable similarities closer to the ground truth. This suggests that using META ID allows LLMs to better capture relationships between users and items.

Quantitative Analysis. We quantitatively assess the quality of these ID representations with the two proposed metrics (Section 3.3): memorization score (MS) and diversity score (DS). Our results, shown in Figure LABEL:fig:score, indicate that IDs constructed of in-vocabulary tokens (RID and SID) perform poorly in the diversity dimension. For an intuitive interpretation, we further employ t-SNE visualization to map the ID representations and observe a tendency for these tokens to cluster narrowly in Figure LABEL:fig:score. META ID shows robust memorization and diversity across all three datasets, reflecting its ability to capture correlations between users and items from historical data while ensuring items remain distinguishable.

Metrics Analysis. Furthermore, we conduct a correlation analysis to explore the relationship between MS/DS and sequential recommendation performance. We sum the MS and DS of different ID strategies and plot their performance on the sequential recommendation task. As shown in Figure 4, the sum of MS and DS positively correlates with NDCG@10, suggesting that memorization and diversity of IDs are two essential properties for recommendation tasks.

5.4 Ablation Studies

We analyze the properties of META ID following the evaluation in Section 5.1, including the impact of grouping methods, the size of the OOV token set, and different indexing ranges on the performance of META ID.

Table 4: Ablation on grouping methods for OOV token generation.

| Methods | Sports H@5 | Sports N@5 | Sports H@10 | Sports N@10 | Beauty H@5 | Beauty N@5 | Beauty H@10 | Beauty N@10 | Toys H@5 | Toys N@5 | Toys H@10 | Toys N@10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DBSCAN [9] | 0.0078 | 0.0043 | 0.0145 | 0.0065 | 0.0257 | 0.0168 | 0.0428 | 0.0223 | 0.0180 | 0.0109 | 0.0304 | 0.0149 |
| Spectral [20] | 0.0199 | 0.0124 | 0.0336 | 0.0167 | 0.0360 | 0.0236 | 0.0588 | 0.0310 | 0.0295 | 0.0184 | 0.0514 | 0.0254 |
| RQ-VAE [44] | 0.0122 | 0.0077 | 0.0171 | 0.0093 | 0.0368 | 0.0254 | 0.0536 | 0.0309 | 0.0511 | 0.0335 | 0.0667 | 0.0395 |
| K-Means [1] | 0.0322 | 0.0223 | 0.0487 | 0.0277 | 0.0510 | 0.0351 | 0.0753 | 0.0429 | 0.0503 | 0.0352 | 0.0742 | 0.0429 |

Token Grouping. We study the importance of different grouping methods for OOV token generation in our framework. In Table 4, we compare the performance of DBSCAN [9] and Spectral Clustering [20] against K-Means clustering [1]. Since the related work [30] has not released its source code, we implement it ourselves by generating OOV tokens with RQ-VAE [44] using meta-path-based embeddings. Our results show that simply applying K-Means outperforms the other grouping methods in most cases.

OOV Token Size. Since varying cluster sizes $G$ result in different numbers of OOV tokens, we also investigate the impact of different cluster sizes for META ID in Figure LABEL:fig:token_a. We find that the granularity of token clusters plays a crucial role in recommendation performance: an excessive token scale can introduce noise, reducing performance. Therefore, finding an optimal token size is vital to ensure that META ID effectively adapts to the nuances of various datasets.

User or Item Indexing. Previous ID strategies for LLMs only consider indexing items, a convention stemming from the fact that users are typically represented by a sequence of interacted items in sequential recommendation[13, 17, 30]. META ID, in contrast, models both users and items; as shown in Table LABEL:fig:token_b, the combined user-item indexing (User&Item) outperforms either user-only or item-only indexing. This result shows the importance of incorporating both user preferences and item attributes for LLMs to enhance the accuracy of the recommendations.

6 Conclusion

This study introduces META ID, a method for enhancing Large Language Models (LLMs) for recommender systems using OOV tokens. Moving beyond constructing IDs with in-vocabulary tokens, META ID incorporates user-item interaction information to align LLMs more effectively with recommendation tasks. We learn representations from user-item interactions via meta-path sampling, and by clustering these representations we generate OOV tokens to construct META ID. This approach ensures that tokens capture correlations between users and items from historical data while remaining distinctive across items. Our experiments across various real-world datasets demonstrate META ID’s robust performance in diverse recommendation tasks, including sequential and direct recommendation, as well as complex tasks requiring detailed textual responses. Essentially, META ID effectively combines the capabilities of LLMs with the nuanced requirements of recommendation scenarios, such as planning highly personalized content for users as virtual shopping assistants.

References

  • Arthur and Vassilvitskii [2007]David Arthur and Sergei Vassilvitskii.k-means++: the advantages of careful seeding.In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2007, New Orleans, Louisiana, USA, January 7-9, 2007, pages 1027–1035. SIAM, 2007.
  • Bao etal. [2023]Keqin Bao, Jizhi Zhang, Yang Zhang, Wenjie Wang, Fuli Feng, and Xiangnan He.Tallrec: An effective and efficient tuning framework to align large language model with recommendation.In Proceedings of the 17th ACM Conference on Recommender Systems, RecSys 2023, Singapore, Singapore, September 18-22, 2023, pages 1007–1014. ACM, 2023.
  • Bi etal. [2022]Qiwei Bi, Jian Li, Lifeng Shang, Xin Jiang, Qun Liu, and Hanfang Yang.Mtrec: Multi-task learning over BERT for news recommendation.In Findings of the Association for Computational Linguistics: ACL 2022, Dublin, Ireland, May 22-27, 2022, pages 2663–2669. Association for Computational Linguistics, 2022.
  • Chen etal. [2019]Qiwei Chen, Huan Zhao, Wei Li, Pipei Huang, and Wenwu Ou.Behavior sequence transformer for e-commerce recommendation in alibaba.In Proceedings of the 1st International Workshop on Deep Learning Practice for High-Dimensional Sparse Data, DLP-KDD ’19. Association for Computing Machinery, 2019.
  • Chen [2023]Zheng Chen.PALR: personalization aware llms for recommendation.CoRR, abs/2305.07622, 2023.
  • Cui etal. [2022]Zeyu Cui, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang.M6-rec: Generative pretrained language models are open-ended recommender systems.CoRR, abs/2205.08084, 2022.
  • Davidson etal. [2010]James Davidson, Benjamin Liebald, Junning Liu, Palash Nandy, TaylorVan Vleet, Ullas Gargi, Sujoy Gupta, YuHe, Mike Lambert, Blake Livingston, and Dasarathi Sampath.The youtube video recommendation system.In Proceedings of the 2010 ACM Conference on Recommender Systems, RecSys 2010, Barcelona, Spain, September 26-30, 2010, pages 293–296. ACM, 2010.
  • Dong etal. [2017]Yuxiao Dong, NiteshV. Chawla, and Ananthram Swami.metapath2vec: Scalable representation learning for heterogeneous networks.In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, August 13 - 17, 2017, pages 135–144. ACM, 2017.
  • Ester etal. [1996]Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu.A density-based algorithm for discovering clusters in large spatial databases with noise.In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), Portland, Oregon, USA, pages 226–231. AAAI Press, 1996.
  • Fan etal. [2019]Wenqi Fan, Yao Ma, Qing Li, Yuan He, YihongEric Zhao, Jiliang Tang, and Dawei Yin.Graph neural networks for social recommendation.In The World Wide Web Conference, WWW 2019, San Francisco, CA, USA, May 13-17, 2019, pages 417–426. ACM, 2019.
  • Fan etal. [2023]Wenqi Fan, Zihuai Zhao, Jiatong Li, Yunqing Liu, Xiaowei Mei, Yiqi Wang, Jiliang Tang, and Qing Li.Recommender systems in the era of large language models (llms).CoRR, abs/2307.02046, 2023.
  • Gao etal. [2023]Yunfan Gao, Tao Sheng, Youlin Xiang, Yun Xiong, Haofen Wang, and Jiawei Zhang.Chat-rec: Towards interactive and explainable llms-augmented recommender system.CoRR, abs/2303.14524, 2023.
  • Geng etal. [2022]Shijie Geng, Shuchang Liu, Zuohui Fu, Yingqiang Ge, and Yongfeng Zhang.Recommendation as language processing (RLP): A unified pretrain, personalized prompt & predict paradigm (P5).In RecSys ’22: Sixteenth ACM Conference on Recommender Systems, Seattle, WA, USA, September 18 - 23, 2022, pages 299–315. ACM, 2022.
  • Hao etal. [2023]Yongjing Hao, Tingting Zhang, Pengpeng Zhao, Yanchi Liu, VictorS. Sheng, Jiajie Xu, Guanfeng Liu, and Xiaofang Zhou.Feature-level deeper self-attention network with contrastive learning for sequential recommendation.IEEE Trans. Knowl. Data Eng., 35(10):10112–10124, 2023.
  • Hidasi etal. [2016]Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk.Session-based recommendations with recurrent neural networks.In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, 2016.
  • Hu etal. [2022]EdwardJ. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, LuWang, and Weizhu Chen.Lora: Low-rank adaptation of large language models.In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022.
  • Hua etal. [2023]Wenyue Hua, Shuyuan Xu, Yingqiang Ge, and Yongfeng Zhang.How to index item ids for recommendation foundation models.pages 195–204, 2023.
  • Ji etal. [2023]Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Yejin Bang, Andrea Madotto, and Pascale Fung.Survey of hallucination in natural language generation.ACM Comput. Surv., 55(12):248:1–248:38, 2023.
  • Kang and McAuley [2018]Wang-Cheng Kang and JulianJ. McAuley.Self-attentive sequential recommendation.In IEEE International Conference on Data Mining, ICDM 2018, Singapore, November 17-20, 2018, pages 197–206. IEEE Computer Society, 2018.
  • Kluger etal. [2003]Yuval Kluger, Ronen Basri, JosephT Chang, and Mark Gerstein.Spectral biclustering of microarray data: coclustering genes and conditions.Genome research, 13(4):703–716, 2003.
  • Koren etal. [2009]Yehuda Koren, RobertM. Bell, and Chris Volinsky.Matrix factorization techniques for recommender systems.Computer, 42(8):30–37, 2009.
  • Koren etal. [2022]Yehuda Koren, Steffen Rendle, and RobertM. Bell.Advances in collaborative filtering.In Recommender Systems Handbook, pages 91–142. Springer US, 2022.
  • Li etal. [2023]Jinming Li, Wentao Zhang, Tian Wang, Guanglei Xiong, Alan Lu, and Gerard Medioni.Gpt4rec: A generative framework for personalized recommendation and user interests interpretation.CoRR, abs/2304.03879, 2023.
  • Li etal. [2020]Lei Li, Yongfeng Zhang, and LiChen.Generate neural template explanations for recommendation.In CIKM ’20: The 29th ACM International Conference on Information and Knowledge Management, Virtual Event, Ireland, October 19-23, 2020, pages 755–764. ACM, 2020.
  • Li etal. [2021]Lei Li, Yongfeng Zhang, and LiChen.Personalized transformer for explainable recommendation.In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, pages 4947–4957. Association for Computational Linguistics, 2021.
  • Ma etal. [2019]Chen Ma, Peng Kang, and Xue Liu.Hierarchical gating networks for sequential recommendation.In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2019, Anchorage, AK, USA, August 4-8, 2019, pages 825–833. ACM, 2019.
  • Ni etal. [2019]Jianmo Ni, Jiacheng Li, and Julian McAuley.Justifying recommendations using distantly-labeled reviews and fine-grained aspects.In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 188–197, 2019.
  • Raffel etal. [2020a]Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and PeterJ. Liu.Exploring the limits of transfer learning with a unified text-to-text transformer.J. Mach. Learn. Res., 21:140:1–140:67, 2020a.
  • Raffel etal. [2020b]Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and PeterJ. Liu.Exploring the limits of transfer learning with a unified text-to-text transformer.J. Mach. Learn. Res., 21:140:1–140:67, 2020b.
  • Rajput etal. [2023]Shashank Rajput, Nikhil Mehta, Anima Singh, RaghunandanH. Keshavan, Trung Vu, Lukasz Heldt, Lichan Hong, YiTay, VinhQ. Tran, Jonah Samost, Maciej Kula, EdH. Chi, and Maheswaran Sathiamoorthy.Recommender systems with generative retrieval.CoRR, abs/2305.05065, 2023.
  • Ren etal. [2020]Ruiyang Ren, Zhaoyang Liu, Yaliang Li, WayneXin Zhao, Hui Wang, Bolin Ding, and Ji-Rong Wen.Sequential recommendation with self-attentive multi-adversarial network.In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval, SIGIR 2020, Virtual Event, China, July 25-30, 2020, pages 89–98. ACM, 2020.
  • Rendle etal. [2009]Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme.BPR: bayesian personalized ranking from implicit feedback.In UAI 2009, Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, Montreal, QC, Canada, June 18-21, 2009, pages 452–461. AUAI Press, 2009.
  • Sanh etal. [2022]Victor Sanh, Albert Webson, Colin Raffel, StephenH. Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Arun Raja, Manan Dey, MSaiful Bari, Canwen Xu, Urmish Thakker, ShanyaSharma Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, NihalV. Nayak, Debajyoti Datta, Jonathan Chang, MikeTian-Jian Jiang, Han Wang, Matteo Manica, Sheng Shen, ZhengXin Yong, Harshit Pandey, Rachel Bawden, Thomas Wang, Trishala Neeraj, Jos Rozen, Abheesht Sharma, Andrea Santilli, Thibault Févry, JasonAlan Fries, Ryan Teehan, TevenLe Scao, Stella Biderman, Leo Gao, Thomas Wolf, and AlexanderM. Rush.Multitask prompted training enables zero-shot task generalization.In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022.
  • Sarwar etal. [2001]BadrulMunir Sarwar, George Karypis, JosephA. Konstan, and John Riedl.Item-based collaborative filtering recommendation algorithms.In Proceedings of the Tenth International World Wide Web Conference, WWW 10, Hong Kong, China, May 1-5, 2001, pages 285–295. ACM, 2001.
  • Sun etal. [2019]Fei Sun, Jun Liu, Jian Wu, Changhua Pei, Xiao Lin, Wenwu Ou, and Peng Jiang.Bert4rec: Sequential recommendation with bidirectional encoder representations from transformer.In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, CIKM 2019, Beijing, China, November 3-7, 2019, pages 1441–1450. ACM, 2019.
  • Tang and Wang [2018]Jiaxi Tang and KeWang.Personalized top-n sequential recommendation via convolutional sequence embedding.In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, WSDM 2018, Marina Del Rey, CA, USA, February 5-9, 2018, pages 565–573. ACM, 2018.
  • Touvron etal. [2023]Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton-Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, PunitSingh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, EricMichael Smith, Ranjan Subramanian, XiaoqingEllen Tan, Binh Tang, Ross Taylor, Adina Williams, JianXiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurélien Rodriguez, Robert Stojnic, Sergey Edunov,and Thomas Scialom.Llama 2: Open foundation and fine-tuned chat models.CoRR, abs/2307.09288, 2023.
  • Wang etal. [2021]Jinpeng Wang, Jieming Zhu, and Xiuqiang He.Cross-batch negative sampling for training two-tower recommenders.In SIGIR ’21: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, Canada, July 11-15, 2021, pages 1632–1636. ACM, 2021.
  • Wang etal. [2023]Qi-Wei Wang, Hongyu Lu, YuChen, Da-Wei Zhou, De-Chuan Zhan, Ming Chen, and Han-Jia Ye.Streaming CTR prediction: Rethinking recommendation task for real-world streaming data.CoRR, abs/2307.07509, 2023.
  • Wang etal. [2022]Wei-Yao Wang, Wei-Wei Du, and Wen-Chih Peng.Recformer: personalized temporal-aware transformer for fair music recommendation.In Proceedings of the CIKM 2022 Workshops co-located with 31st ACM International Conference on Information and Knowledge Management (CIKM 2022), Atlanta, USA, October 17-21, 2022, volume 3318 of CEUR Workshop Proceedings. CEUR-WS.org, 2022.
  • Wei etal. [2022]Jason Wei, Maarten Bosma, VincentY. Zhao, Kelvin Guu, AdamsWei Yu, Brian Lester, Nan Du, AndrewM. Dai, and QuocV. Le.Finetuned language models are zero-shot learners.In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022.
  • Wu etal. [2021]Chuhan Wu, Fangzhao Wu, Tao Qi, and Yongfeng Huang.Empowering news recommendation with pre-trained language models.In SIGIR ’21: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, Canada, July 11-15, 2021, pages 1652–1656. ACM, 2021.
  • Xie etal. [2022]XuXie, Fei Sun, Zhaoyang Liu, Shiwen Wu, Jinyang Gao, Jiandong Zhang, Bolin Ding, and Bin Cui.Contrastive learning for sequential recommendation.In 38th IEEE International Conference on Data Engineering, ICDE 2022, Kuala Lumpur, Malaysia, May 9-12, 2022, pages 1259–1273. IEEE, 2022.
  • Zeghidour etal. [2022]Neil Zeghidour, Alejandro Luebs, Ahmed Omran, Jan Skoglund, and Marco Tagliasacchi.Soundstream: An end-to-end neural audio codec.IEEE ACM Trans. Audio Speech Lang. Process., 30:495–507, 2022.
  • Zhang etal. [2023a]Junjie Zhang, Ruobing Xie, Yupeng Hou, WayneXin Zhao, Leyu Lin, and Ji-Rong Wen.Recommendation as instruction following: A large language model empowered recommendation approach.CoRR, abs/2305.07001, 2023a.
  • Zhang etal. [2023b]Yi-Kai Zhang, Ting-Ji Huang, Yao-Xiang Ding, De-Chuan Zhan, and Han-Jia Ye.Model spider: Learning to rank pre-trained models efficiently.In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023b.
  • Zhang etal. [2021]Yuhui Zhang, HAO DING, Zeren Shui, Yifei Ma, James Zou, Anoop Deoras, and Hao Wang.Language models as recommender systems: Evaluations and limitations.In I (Still) Can’t Believe It’s Not Better! NeurIPS 2021 Workshop, 2021.
  • Zhao etal. [2023]WayneXin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie, and Ji-Rong Wen.A survey of large language models.CoRR, abs/2303.18223, 2023.
  • Zhou etal. [2020]Kun Zhou, Hui Wang, WayneXin Zhao, Yutao Zhu, Sirui Wang, Fuzheng Zhang, Zhongyuan Wang, and Ji-Rong Wen.S3-rec: Self-supervised learning for sequential recommendation with mutual information maximization.In CIKM ’20: The 29th ACM International Conference on Information and Knowledge Management, Virtual Event, Ireland, October 19-23, 2020, pages 1893–1902. ACM, 2020.

Appendix

We provide details omitted in the main paper.

  • Appendix A: Workflow of META ID, encompassing the construction of OOV tokens.

  • Appendix B: Details of the memorization score (MS) and diversity score (DS).

  • Appendix C: Experimental setups and implementation details of META ID.

  • Appendix D: Additional experimental result analysis.

  • Appendix E: Discussions and limitations of META ID.

Appendix A Details of META ID

In Section 4 of the main text, we describe the complete workflow for generating META ID. This process consists of three main steps: (1) extracting the meta-path-based embeddings, (2) generating the OOV tokens, and (3) incorporating META ID into LLMs so that they can handle various downstream recommendation tasks.

A.1 How to extract the meta-path-based embedding

This section supplements the details of Section 4.1, i.e., the user/item representations extracted from a skip-gram model, including the process of sampling meta-paths as training data.

In META ID, we train a skip-gram model to learn effective user/item representations from the sampled meta-paths, i.e., the user representations $\mathbf{W}_U$ and the item representations $\mathbf{W}_I$. The objective of the skip-gram learning paradigm is to map the users and items in the meta-path sequences into a lower-dimensional space, as in [8].

Meta-path sampling. First, we construct node sequences based on random walks over meta-paths. A meta-path $p = P_1 \xrightarrow{R_1} P_2 \xrightarrow{R_2} \cdots \xrightarrow{R_{k-1}} P_k$ is a path defined on a graph, where $R_i$ denotes a composite relation between node types $P$. We define user-item-user (U-I-U) as our meta-path, where a path exists only if two users have given the same rating $R_i$ to an item. We sample 32 rounds starting from each user and item with a sampled length of $k = 64$.
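To make the sampling step concrete, the sketch below builds U-I-U walks from (user, item, rating) triples. The function and parameter names (e.g., `sample_meta_paths`, `walks_per_node`) are ours, and for brevity only walks starting from users are shown; walks starting from items are analogous.

```python
import random
from collections import defaultdict

def sample_meta_paths(ratings, walks_per_node=32, walk_length=64, seed=0):
    """Sample U-I-U random walks in which consecutive users on a walk have given
    the same rating to the connecting item (a sketch of the sampling step)."""
    rng = random.Random(seed)
    items_of_user = defaultdict(list)                        # user -> [(item, rating)]
    users_of_item = defaultdict(lambda: defaultdict(list))   # item -> rating -> [users]
    for user, item, rating in ratings:
        items_of_user[user].append((item, rating))
        users_of_item[item][rating].append(user)

    walks = []
    for start_user in list(items_of_user):
        for _ in range(walks_per_node):
            user, walk = start_user, [f"user_{start_user}"]
            while len(walk) < walk_length:
                item, rating = rng.choice(items_of_user[user])   # hop user -> item
                walk.append(f"item_{item}")
                user = rng.choice(users_of_item[item][rating])   # hop item -> user with equal rating
                walk.append(f"user_{user}")
            walks.append(walk)
    return walks
```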

Skip-gram model training. In the second step, using the meta-paths sampled via random walks as the training corpus, we train a skip-gram model to learn the vector representations $(\mathbf{W}_U, \mathbf{W}_I)$ for all users and items. The objective of the skip-gram model is to maximize the conditional probability $P(n_i \mid v)$ of each neighboring node $n_i \in N_v$ for a node $v \in V$:

\[ \arg\max_{\theta} \sum_{v \in V} \sum_{t \in \{U, I\}} \sum_{n_i \in N_v} \log P(n_i \mid v) \qquad (9) \]

and $P(n_i \mid v)$ is calculated as:

\[ P(n_i \mid v) = \frac{\exp(w_{n_i}^{T} w_v)}{\sum_{j \in V} \exp(w_j^{T} w_v)} \qquad (10) \]

where $w$ denotes the representation of one user or item. In our experiments, we set the number of negative samples to 5 and train the skip-gram model for 10 epochs with a learning rate of 0.001.
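Under these settings, the skip-gram training step could look like the following sketch using gensim's Word2Vec. The window size of 5 is taken from Appendix C.3; the embedding dimension is not stated above, so `vector_size=128` is an assumption.

```python
from gensim.models import Word2Vec

# `walks` is the list of node-token sequences produced by the meta-path sampler,
# e.g., ["user_1", "item_9", "user_4", ...].
w2v = Word2Vec(
    sentences=walks,
    vector_size=128,   # embedding dimension (not specified above; an assumption)
    window=5,          # context window size of 5, as reported in Appendix C.3
    sg=1,              # skip-gram objective
    negative=5,        # 5 negative samples, as reported
    alpha=1e-3,        # learning rate 0.001, as reported
    epochs=10,         # 10 training epochs, as reported
    min_count=1,       # keep every user/item node
)

# W_U / W_I: look up a user or item representation by its node token.
w_user = w2v.wv["user_1"]
w_item = w2v.wv["item_9"]
```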

A.2 How to generate OOV tokens

This section complements Section 4.2, where we generate OOV tokens from the user and item representations to construct META ID. Essentially, we need to build a hierarchical classification scheme for IDs that expresses a wide range of items and users with as few OOV tokens as possible, so that similar items and users fall under the same hierarchical branch.

This hierarchical construction is reminiscent of clustering, which is exactly what we apply in META ID: we use between-cluster indices and in-cluster indices as the two levels of IDs. Although more sophisticated clustering methods for multi-level structures could be applied, in our experiments we use simple K-Means clustering, which suits large-scale data thanks to its simplicity and ease of optimization. We demonstrate the effectiveness of this approach in Table 4.

In experiments, we cluster the user and item representations together. We then create the between-cluster token ⟨CT_i⟩ for cluster $i$, and sort the in-cluster users/items by their cosine distance to the cluster centroid to obtain the in-cluster token ⟨y_j⟩. Finally, we add a prefix token ⟨Item⟩ or ⟨User⟩ to denote the type. For example, an item might be represented as "⟨Item⟩⟨CT_i⟩⟨y_j⟩", labeled with three tokens, where "⟨Item⟩" denotes that it is an item, "⟨CT_i⟩" is its cluster token, and "⟨y_j⟩" is its in-cluster token.
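A minimal sketch of this token-assignment step with scikit-learn's K-Means is given below. It assumes `embeddings` is a NumPy array aligned with `node_keys`; the helper name `build_meta_ids` and the ASCII token spelling are ours and stand in for the special tokens described above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_distances

def build_meta_ids(node_keys, embeddings, n_clusters=100, seed=0):
    """Map each user/item to prefix + between-cluster token + in-cluster token."""
    km = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit(embeddings)
    meta_ids = {}
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        # rank cluster members by cosine distance to their centroid
        dists = cosine_distances(embeddings[members],
                                 km.cluster_centers_[c:c + 1]).ravel()
        for rank, idx in enumerate(members[np.argsort(dists)]):
            key = node_keys[idx]                          # e.g., "item_2024" or "user_18"
            prefix = "<Item>" if key.startswith("item") else "<User>"
            meta_ids[key] = f"{prefix}<CT{c}><{rank}>"    # e.g., "<Item><CT8><24>"
    return meta_ids
```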

It is worth noting that a naive approach to generating user and item IDs is to assign each item and user an independent OOV token as its ID (IID), which must be learned from scratch. However, this is not applicable to modern recommender systems with large numbers of items and users, since training may take too long when a large number of new tokens must be created. We also show this in Table 7, where we initialize the IID tokens with the meta-path-based embeddings passed through a linear projection layer; the result is still ineffective compared with META ID.

Sequential Recommendation
  Task Input: Considering user_2024 has interacted with items item_1, item_2. What is the next recommendation for the user?
  Task Output: item_2024

Direct Recommendation
  Task Input: What should we recommend for user_2024?
  Task Output: item_2024

Rating Prediction
  Task Input: Which star rating will user_2024 give to item item_2? (1 being the lowest and 5 being the highest).
  Task Output: 5

Explanation
  Task Input: According to the feature word quality, generate a 5-star explanation for user_2 about item_2.
  Task Output: Absolutely great product!

Review
  Task Input: Write a short sentence to summarize the following product review from user_2: Absolutely great product. I bought this for …
  Task Output: Perfect!

Methods    Sports                                  Beauty                                  Toys
           HR@5     NDCG@5   HR@10    NDCG@10      HR@5     NDCG@5   HR@10    NDCG@10      HR@5     NDCG@5   HR@10    NDCG@10
IID        0.0114   0.0073   0.0208   0.0103       0.0302   0.0194   0.0494   0.0256       0.0146   0.0091   0.0217   0.0114
META ID    0.0357   0.0256   0.0520   0.0308       0.0480   0.0336   0.0689   0.0403       0.0564   0.0391   0.0803   0.0468

A.3 How to incorporate META ID with LLMs

As mentioned in Section 3.1, we convert every recommendation task into a question&answer template: we describe the task in natural language and fill in the user and item IDs from each dataset, like a cloze test. The full templates for every task follow [13]; we give some examples in Table 5 and Table 6.

Take the rating prediction task as an example. We might ask the LLM, "Which star rating will user ⟨User⟩⟨CT_1⟩⟨18⟩ give to item ⟨Item⟩⟨CT_8⟩⟨24⟩?", and expect the LLM to answer "5".

We construct the fine-tuning and testing datasets for the LLM in this unified way. The LLM can then acquire generalized knowledge across different tasks and capture user and item characteristics through the tokens that constitute their IDs, allowing it to handle different recommendation tasks.
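As an illustration of how one training example might be assembled (the template wording follows the rating-prediction prompt shown above; the helper names and the ASCII token spelling in the comment are ours):

```python
def rating_example(user_key, item_key, rating, meta_ids):
    """Build one rating-prediction Q&A pair with META ID tokens substituted in."""
    u, i = meta_ids[user_key], meta_ids[item_key]
    prompt = (f"Which star rating will user {u} give to item {i}? "
              "(1 being the lowest and 5 being the highest).")
    return {"input": prompt, "output": str(rating)}

# Assuming meta_ids maps these keys to the tokens from the example above:
# rating_example("user_18", "item_2024", 5, meta_ids)
# -> {"input": "Which star rating will user <User><CT1><18> give to item "
#              "<Item><CT8><24>? (1 being the lowest and 5 being the highest).",
#     "output": "5"}
```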


Appendix B Details of memorization score (MS) and diversity score (DS)

This section complements Section 3.3. The token embedding layer of an LLM transforms each input token into a token embedding vector. We use the term ID representation for the token embedding vectors corresponding to an ID, i.e., the representation of an item or user inside the LLM.

The convergence of DS. DS is a metric designed to quantify the diversity of ID representations in LLMs. Given the high computational demand of calculating the KL divergence for all embedding pairs in large datasets, DS employs a random sampling approach, so we present a convergence analysis for DS in Figure 6. DS is stable across both datasets and converges as the number of sampled pairs grows. This sampling strategy reduces the computational complexity from $O(|I|^2 \cdot D)$ to $O(N \cdot D)$, where $|I|$ is the number of items, $N$ is the number of sampled pairs, and $D$ is the dimension of the ID representation, which equals the LLM's token embedding dimension.
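The sampled estimate can be computed as in the sketch below, assuming DS averages the KL divergence between softmax-normalized ID representations over $N$ randomly drawn pairs; the precise definition of DS is given in Section 3.3 of the main text, so treat this as an illustration of the sampling strategy rather than the exact formula.

```python
import numpy as np
from scipy.special import softmax
from scipy.stats import entropy          # entropy(p, q) computes KL(p || q)

def diversity_score(id_reprs, n_pairs=10_000, seed=0):
    """Sampled DS estimate over `n_pairs` random pairs of ID representations."""
    rng = np.random.default_rng(seed)
    probs = softmax(id_reprs, axis=1)     # turn each D-dim representation into a distribution
    n = probs.shape[0]
    i = rng.integers(0, n, size=n_pairs)
    j = rng.integers(0, n, size=n_pairs)
    keep = i != j                         # discard self-pairs
    # column-wise KL divergence over shapes (D, M); cost is O(N * D), not O(|I|^2 * D)
    return float(np.mean(entropy(probs[i[keep]].T, probs[j[keep]].T)))
```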

The approximate value of MS. The adjusted cosine similarity for items is given by:

\[ \text{sim}(i,j) = \frac{\sum_{u \in U} (R_{u,i} - \bar{R}_u) \cdot (R_{u,j} - \bar{R}_u)}{\sqrt{\sum_{u \in U} (R_{u,i} - \bar{R}_u)^2 \cdot \sum_{u \in U} (R_{u,j} - \bar{R}_u)^2}} \qquad (11) \]

where $R_{u,i}$ and $R_{u,j}$ denote user $u$'s ratings for items $i$ and $j$, respectively, while $\bar{R}_u$ is user $u$'s average rating. To enhance computational efficiency, especially for large-scale datasets, we precalculate the rating deviation sums and squared sums for each item and user:

\[ \text{sim}'(i,j) = \frac{\text{Dev}(i) \cdot \text{Dev}(j)}{\sqrt{\text{DevS}(i)} \cdot \sqrt{\text{DevS}(j)}} \qquad (12) \]

where the rating deviation sum and squared deviation sum for each item are:

\[ \text{Dev}(i) = \sum_{u \in U_i} (R_{u,i} - \bar{R}_u), \quad \text{DevS}(i) = \sum_{u \in U_i} (R_{u,i} - \bar{R}_u)^2 \qquad (13) \]

This approach reduces the complexity from $O(|U| \cdot |I|^2)$ to $O(|I|^2)$, a significant improvement for large-scale datasets.
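A direct implementation of Eq. (12)–(13) with a single precomputation pass could look like the following sketch; the input format (a list of (user, item, rating) triples) and the helper names are ours.

```python
import math
from collections import defaultdict

def precompute_deviations(ratings):
    """One pass to obtain Dev(i) and DevS(i) from Eq. (13).
    `ratings` is a list of (user, item, rating) triples."""
    user_sum, user_cnt = defaultdict(float), defaultdict(int)
    for u, _, r in ratings:
        user_sum[u] += r
        user_cnt[u] += 1
    user_mean = {u: user_sum[u] / user_cnt[u] for u in user_sum}

    dev, dev_sq = defaultdict(float), defaultdict(float)
    for u, i, r in ratings:
        d = r - user_mean[u]        # rating deviation from the user's average
        dev[i] += d
        dev_sq[i] += d * d
    return dev, dev_sq

def approx_sim(i, j, dev, dev_sq, eps=1e-12):
    """Approximate adjusted cosine similarity sim'(i, j) from Eq. (12)."""
    return dev[i] * dev[j] / (math.sqrt(dev_sq[i] * dev_sq[j]) + eps)
```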

Appendix C Experimental Setups and Implementation Details

Dataset        Sports    Beauty    Toys
#Users         35,598    22,363    19,412
#Items         18,357    12,101    11,924
#Reviews       296,337   198,502   167,597
Sparsity (%)   0.0453    0.0734    0.0724

C.1 Datasets Descriptions and Preprocessing

We conduct extensive experiments on three real-world datasets. The Amazon datasets are collected from the Amazon platform (https://nijianmo.github.io/amazon) and contain user ratings and reviews on 29 categories of products. In this paper, we adopt three of them to evaluate our method: Sports & Outdoors, Beauty, and Toys & Games. We follow [13] and use transaction records between January 1, 2019 and December 31, 2019. Detailed dataset statistics are available in Table 8.

We divide tasks into ratings, explanations, and reviews, adhering to the data-splitting approaches of similar studies[13, 24, 25]. For both sequential and direct recommendation tasks, we adopt the methodology of [49, 13, 31], using the final item in a user’s interaction sequence for testing while carefully structuring the training data to avoid leakage. For rating, explanation, and review task families, we randomly split each dataset into training (80%), validation (10%) and testing (10%) sets, and ensure that there is at least one instance included in the training set for each user and item.

C.2 Baselines

Our approach is compared with a variety of established models, spanning CNN-based to LLM-based frameworks. Caser [36] applies CNNs to capture high-order Markov chains in sequential recommendation. HGN [26] utilizes hierarchical gating networks to model long- and short-term user interests. GRU4Rec [15] employs GRUs for session-based recommendation, representing items with embedding vectors. BERT4Rec [35], S3-Rec [49], and SASRec [19] employ self-attention mechanisms for sequential recommendation, focusing on bidirectional understanding, self-supervised pre-training, and multi-head attention, respectively. FDSA [14] adopts feature-level self-attention to model feature transitions. CL4SRec [43] introduces contrastive learning with data augmentation into sequential recommendation. P5 [13] is a recent method that uses a pretrained Large Language Model (LLM) to unify different recommendation tasks in a single model, learning them with the same language modeling objective during pretraining and serving as a foundation model for various downstream recommendation tasks. Since there is no open-source code for the recent work [30] yet, we implement its key ID construction ourselves in Section 5.4.

In particular, we compare against P5 and its variations [17], equipped with different ID constructions, namely Sequential IDs (SID), Random IDs (RID), and Collaborative IDs (CID), as a benchmark for exploring the impact of different ID strategies. RID assigns each item a random number as its item ID; the number is further tokenized into a sequence of sub-tokens, as done in P5. For example, an item randomly assigned the number "2024" is represented as the token sequence "20" "24". SID is a straightforward way to leverage collaborative information for item indexing, where items interacted with consecutively by a user are assigned consecutive numerical indices, reflecting their co-occurrence. CID employs spectral clustering based on spectral matrix factorization to generate item IDs, based on the premise that items with more frequent co-occurrence are more similar and should share more overlapping tokens in their IDs. The results for all baselines except P5 and its variations are reproduced with open-source code [49].

C.3 Implementation Details

As described in Appendix A, we generate META IDs for the users and items of each dataset; these IDs are used in all experiments below. To construct META IDs, we sample rating-based meta-paths in each dataset, where adjacent users have assigned equal ratings to an item. We set the sampling path length to 64 and train a skip-gram model with a window size of 5 and a learning rate of $1e^{-3}$. The number of embedding clusters is limited to $|G| = 100$ (200 for Toys) to manage the vocabulary size effectively. The OOV token counts are shown in Table 9.

OOV Tokens Size   Sports   Beauty   Toys
RID               0        0        0
SID               0        0        0
CID               448      437      487
IID               18357    12101    11924
META ID           1600     1319     727

C.3.1 Evaluation on Sequential and Direct Recommendation Tasks

We first evaluate META ID on sequential and direct recommendation tasks following [17]. Our implementation first utilizes T5 [28] as the backbone, with around 60.75 million parameters. As mentioned in Section 4.3, we add a linear layer so that the OOV token embeddings undergo an extra linear transformation before the token embedding layer for better initialization, with $\alpha = 0.1$. We also consider the decoder-only LLaMA2-7B [37] with 7B parameters. For tokenization, we use the default SentencePiece tokenizer extended with the OOV tokens for parsing sub-word units. We use the same sequential and direct recommendation prompts as P5 [13] to convert interaction sequences into text.
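One way this initialization could be wired up with a Hugging Face T5 checkpoint is sketched below. It assumes `meta_embs` is a tensor of meta-path-based embeddings aligned with `oov_tokens`, and the function name is ours; treat it as an illustration of the idea (extend the vocabulary, then seed the new rows with α-scaled, linearly projected embeddings) rather than the exact implementation.

```python
import torch
import torch.nn as nn

def add_and_init_oov_tokens(model, tokenizer, oov_tokens, meta_embs, alpha=0.1):
    """Extend the vocabulary with OOV ID tokens and initialize their embedding
    rows from linearly projected meta-path embeddings scaled by alpha."""
    tokenizer.add_tokens(oov_tokens)                    # register the new tokens
    model.resize_token_embeddings(len(tokenizer))       # grow the embedding matrix
    emb = model.get_input_embeddings()                  # nn.Embedding

    proj = nn.Linear(meta_embs.size(1), emb.embedding_dim, bias=False)
    with torch.no_grad():
        ids = tokenizer.convert_tokens_to_ids(oov_tokens)
        emb.weight[ids] = alpha * proj(meta_embs)       # (num_oov, d_model) rows
    return proj                                         # kept so it can be trained further
```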

For LLM fine-tuning, we train T5 for 10 epochs with the AdamW optimizer on two NVIDIA RTX 3090 GPUs, using a batch size of 64 and a peak learning rate of $1e^{-3}$. We apply learning-rate warm-up over the first 5% of training steps and use a maximum input length of 1024 tokens. We use LoRA [16] to fine-tune the token embedding layer and the linear head layer of LLaMA2-7B for 1 epoch with the AdamW optimizer on two NVIDIA RTX A6000 GPUs, using a batch size of 28, a peak learning rate of $1e^{-5}$, a LoRA attention dimension of 16, and an alpha parameter of 32.

For LLM inference, beam search is utilized to generate a list of potential next items, evaluated under the all-item setting. To prevent the generation of non-existent IDs, we apply a constrained decoding method [17] that sets the generation probability of invalid IDs to zero.
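A common way to realize this constraint with Hugging Face generation utilities is a trie over the tokenized valid IDs passed through `prefix_allowed_tokens_fn`. The sketch below simplifies by assuming a T5-style decoder whose first generated position is the decoder start token, and the beam width in the comment is an assumption.

```python
def build_prefix_fn(valid_id_strings, tokenizer):
    """Restrict beam search so that only tokenizations of real META IDs can be generated."""
    trie = {}
    for id_str in valid_id_strings:
        node = trie
        for tok in tokenizer.encode(id_str, add_special_tokens=False):
            node = node.setdefault(tok, {})

    def prefix_allowed_tokens_fn(batch_id, input_ids):
        node = trie
        for tok in input_ids.tolist()[1:]:         # skip the decoder start token (T5)
            node = node.get(tok)
            if node is None:                       # prefix left the trie: only allow EOS
                return [tokenizer.eos_token_id]
        return list(node.keys()) or [tokenizer.eos_token_id]

    return prefix_allowed_tokens_fn

# outputs = model.generate(**inputs, num_beams=20,   # beam width is an assumption
#                          prefix_allowed_tokens_fn=build_prefix_fn(all_item_ids, tokenizer))
```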

Task Type   Metric    Sports                                    Beauty                                    Toys
                      RID      SID      CID      META ID        RID      SID      CID      META ID        RID      SID      CID      META ID
Rating      RMSE      1.0382   1.0486   1.0383   1.0327         1.2829   1.3098   1.2819   1.2818         1.0725   1.0693   1.0766   1.0770
Explan.     BLEU-1    16.2567  16.5825  16.6121  16.9005        18.2299  18.3981  19.3499  19.5106        19.9858  20.2198  20.4570  20.2270
            BLEU-4    2.1782   2.1944   2.2332   2.3481         2.9027   2.8071   3.0626   3.0592         4.3495   4.3701   4.5844   4.4945
Review      BLEU-1    7.6140   7.7948   7.6586   7.8819         6.2282   6.5055   6.5854   7.0500         8.5336   8.0862   7.3846   8.3080
            BLEU-4    2.3228   1.2406   2.4109   2.6546         1.9891   1.2406   1.9718   2.7485         1.1315   1.8366   1.2128   1.7061

C.3.2 Evaluation on Rating, Explanation and Review Tasks

To validate META ID's adaptability, we extend our evaluation to rating prediction, explanation generation, and review tasks, akin to P5. We use the same prompts as P5 [13] to convert all information into training text.

For LLM fine-tuning, we train T5 for 10 epochs with the AdamW optimizer on two NVIDIA RTX 3090 GPUs, using a batch size of 32 and a peak learning rate of $1e^{-3}$. We apply learning-rate warm-up over the first 5% of training steps, with a maximum input length of 512 tokens and a maximum generation length of 64 tokens.

For LLM inference, greedy decoding is applied to the rating prediction, explanation generation, and review tasks.

The full results are shown in Table 10.

Appendix D Additional Experimental Results


Initialization Approaches. We explore the impact of different token initialization methods on the performance of META ID. Since the LLM's vocabulary already includes the numeric tokens used for linguistic IDs, we first ask whether re-initializing these numeric tokens helps the LLM for recommendation. As shown in Figure 7, contrasting randomly initialized numeric tokens (Random Init.) with keeping T5's original token embeddings (Embedding Init.) reveals that random initialization does not enhance performance and is even detrimental on the Sports and Toys datasets. This suggests that the influence of pre-training on these tokens cannot be effectively removed by simple random initialization, making them unsuitable for building IDs and underscoring the importance of introducing extra tokens for IDs in LLM-based recommender systems. We also compare two initialization approaches for META ID: random initialization (Random Init.) and initializing the OOV tokens using the augmentation described in Section 4.2 (Embedding Init.). Our findings show that the latter substantially improves META ID's performance, underlining how critical the token initialization method is for achieving better results.

Visualization of ID-related tokens. Directly applying in-vocabulary tokens to construct IDs (RID and SID) yields poor performance on the Toys dataset. In Figure 8a, we use t-SNE to visualize the ID token embeddings and observe that these tokens tend to be homogeneous, whereas CID and META ID, which use OOV tokens to construct IDs, exhibit a wider distribution, reflecting the difference and diversity of their representations.

To further illustrate the impact on representation, we visualize the attention patterns during sequential recommendation generation in Figure 8b. SID leads to uniform attention that does not distinguish between different item and user IDs. In contrast, META ID shows distinct attention patterns, successfully differentiating items and emphasizing user IDs, thereby allowing the model to capture more personalized and distinctive information.


Datasets   HR@5               NDCG@5             HR@10              NDCG@10
Beauty     0.0510 ± 0.00038   0.0351 ± 0.00044   0.0752 ± 0.00131   0.0429 ± 0.00075
Sports     0.0322 ± 0.00061   0.0223 ± 0.00060   0.0487 ± 0.00073   0.0277 ± 0.00055
Toys       0.0503 ± 0.00091   0.0352 ± 0.00067   0.0742 ± 0.00138   0.0429 ± 0.00078

Models        per user               per user-item pair       per review
              Sequential   Direct    Rating   Explanation      Summarization   Preference
META ID (T)   74.05        68.60     5.21     17.28            9.67            8.55

Statistics on Training & Inference Time. We provide statistics on the training and inference time of the P5-style models, collected on the Toys dataset. As mentioned in Appendix C, we train and test our models on two RTX 3090 GPUs. For training on the sequential and direct recommendation tasks, the T5 model takes 3.5 hours to finish training. The average inference time of the T5 model on different tasks is presented in Table 12. Sequential and direct recommendation require much longer inference time than the other tasks because of the beam search step. Overall, inference is fast, and it is promising to further reduce training and inference time with efficient Transformer techniques.

Appendix E Discussions

There are two promising directions for META ID. First, META ID relies on a fixed set of users and items, whereas newly appearing items and users have no interaction history; this could be addressed with methods for the cold-start problem. Second, META ID uses two-level tokens to construct IDs, while a deeper hierarchical structure could be considered; META ID could then scale to modern recommender systems containing trillions of users and items.
