Improving LLMs for Recommendation with Out-Of-Vocabulary Tokens (2024)

Ting-Ji Huang1,2 Jia-Qi Yang1,2 Chunxu Shen3 Kai-Qi Liu4
De-Chuan Zhan1,2 Han-Jia Ye1,2
1School of Artificial Intelligence, Nanjing University
2National Key Laboratory for Novel Software Technology, Nanjing University
3WeChat Technical Architecture Department, Tencent Inc.
4Software Institute, Nanjing University
Corresponding author, email: yehj@lamda.nju.edu.cn.

Abstract

Characterizing users and items through vector representations is crucial for various tasks in recommender systems. Recent approaches attempt to apply Large Language Models (LLMs) to recommendation through a question&answer format, where real users and items (e.g., Item No.2024) are represented with in-vocabulary tokens (e.g., “item”, “20”, “24”). However, since LLMs are typically pretrained on natural language tasks, these in-vocabulary tokens lack the expressive power to distinguish users and items, thereby weakening the recommendation ability even after fine-tuning on recommendation tasks. In this paper, we explore how to effectively tokenize users and items in LLM-based recommender systems. We emphasize the role of out-of-vocabulary (OOV) tokens in addition to the in-vocabulary ones, and argue that good OOV tokens should both memorize the correlations of users/items and remain diverse. By clustering the learned representations from historical user-item interactions, we make users/items with similar properties share the same OOV tokens. Furthermore, integrating these OOV tokens into the LLM’s vocabulary allows for better distinction between users and items and enhanced capture of user-item relationships during fine-tuning on downstream tasks. Our proposed framework outperforms existing state-of-the-art methods across various downstream recommendation tasks.

1 Introduction

Modern recommender systems (RS) play a crucial role in various applications like video recommendation[7], e-commerce[4], and social networking[10]. The recent advent of large language models (LLMs) offers a new direction of exploration in this realm. Models such as T5[29] and LLaMA[37], trained on massive natural language data, achieve impressive language understanding capabilities in text generation and conversation tasks[48]. Their success drives explorations of using pre-trained LLMs as model backbones for handling various recommendation tasks like sequential recommendation, rating, and explanation tasks[3, 42]. In such frameworks, we describe the task input as a query and expect an LLM to directly answer with the task label[6, 13]. After fine-tuning through this Q&A format, we can ask the LLM to recommend something for a user and receive a real item as the answer.


How can we characterize users and items so that LLMs learn their personalized information during fine-tuning? Since tokens are the basic building blocks for LLMs and represent the smallest unit of text the model can understand and process, we should allocate some tokens to represent users/items and let LLMs learn their characteristics through those tokens. While early works[6] directly represent a real item by text tokens (e.g., “blue”, “casual”, “T-shirt”), LLMs might answer with text that does not correspond to a real existing item[17, 18]. Recent advancements explore representing users/items with Identifiers (IDs)[13, 40, 12, 2]. As illustrated in Figure LABEL:fig:teaser_a, a real item (Item No.2024) can be indexed with a sequence of in-vocabulary tokens (“item”, “20”, “24”); we call such a token combination an ID, and a numeric ID indicates that we use numeric tokens (“20”, “24”).

Characterizing users and items with numeric IDs is straightforward but introduces a mapping problem: it is hard to align the limited numeric tokens with thousands of items in recommender systems, and token combinations are prone to language conflicts[2, 17]. As shown in Figure LABEL:fig:teaser_b, using numeric tokens to represent items leads to similar representations of distinctive items. One solution is to create OOV tokens specifically allocated for IDs to allow distinction, as in the previous work on Collaborative ID (CID)[17]. However, our experiments indicate that although the OOV tokens defined in CID may improve diversity, it is hard for the LLM to learn the correlations of users/items during fine-tuning (see Section 3.3). Therefore, we should utilize redefined OOV tokens capable of capturing user/item correlations (memorization) while also distinguishing different items (diversity), two dimensions that are positively correlated with the performance of recommendation tasks.

In this paper, we present META ID (META-path-guided IDentifier), a framework for characterizing users/items using out-of-vocabulary (OOV) tokens for LLM-based recommendations. Initially, we generate meta-paths to represent user-item interactions and then obtain user and item representations from a skip-gram model trained on these meta-paths. A meta-path is a sequence that represents interactions between users and items in a graph structure. By clustering these meta-path-based representations, we create hierarchical groups that serve as OOV tokens for constructing user and item IDs. This approach extends beyond previous research, which has predominantly focused on item IDs[13, 30], by making the representations of user/item IDs share the same OOV tokens if they have similar properties. Finally, integrating these OOV tokens into the LLM’s vocabulary allows for better diversity and enhanced memorization during fine-tuning on downstream recommendation tasks. Additionally, we align the token embedding layer of LLMs with a linear transformation layer to enrich the OOV token representations as an augmentation. Our contributions are as follows:

  • We introduce memorization and diversity scores to evaluate ID representations in LLM-based recommender systems, focusing on capturing user/item correlations and ensuring diversity.

  • We develop META ID, which uses out-of-vocabulary tokens to characterize users/items, enhancing the memorization and diversity of their representations for LLMs.

  • The experiments show that META ID improves memorization and diversity scores, which leads to improvements in various recommendation tasks.

2 Related Work

Characterizing Items by IDs. Modern recommendation models usually use unique IDs to represent users and items, which are subsequently converted to embedding vectors as learnable parameters[11]. Common approaches include matrix factorization[21, 32], two-tower models[38], and deep neural networks[35, 19, 49, 46, 43, 39], which make predictions by examining historical user-item interactions to identify behavioral patterns and enable collaborative recommendations[22]. Some recent approaches adopt the concept of an ID to represent the token combinations that characterize items in LLMs[13, 17]. We follow previous studies and also refer to these token combinations as IDs. This paper aims to combine LLMs with IDs more effectively by proposing a new ID construction method.

Instruction Tuning for Recommendation. The integration of Large Language Models (LLMs) into diverse tasks has seen significant growth recently. Recent works fine-tune pretrained language models on large-scale NLP datasets verbalized via human-readable prompts[33, 41]. These instruction tuning methods design prompts containing detailed task descriptions and adhere more closely to the natural language format. Driven by their exceptional natural language processing capabilities, researchers aim to transfer their linguistic ability to enhance recommender systems[25, 45]. These LLMs process user interactions as sequences of tokens and, through fine-tuning, predict users’ future interests based on past activities[47, 5]. Moreover, some studies reframe tasks like retrieval, rating, and explanation generation as language comprehension tasks[6, 13, 23], allowing LLMs to function as multi-task recommenders, producing recommendations and explanations with a unified architecture. In this paper, we apply LLMs via instruction tuning in multi-task recommendation scenarios.

ID Construction with Tokens. Recent studies explore using in-vocabulary tokens to characterize items in LLMs, where users and items are represented by token combinations called IDs. Early efforts such as P5[13] convert user-item interactions into natural language formats using numeric IDs constructed of in-vocabulary tokens of the T5 model[29]. Further, sequential IDs and collaborative IDs are developed to enhance item information sharing[17]. A relevant study[30] constructs IDs using hierarchical tokens through RQ-VAE. Despite these advancements in ID construction, challenges remain in the lack of ID construction criteria and the focus on item IDs. In contrast, our META ID approach assigns memorization and diversity scores for evaluation and constructs both user and item IDs, which improves the characterization of users and items.

3 Preliminary

We describe recommendation tasks (e.g., sequential recommendation) under the instruction tuning setting, in which all data such as user-item interactions are converted to natural language sequences in a question&answer format (e.g., Input: Please recommend an item for user 2024 considering he has purchased item 2023; Output: item 2024). Details of this format are presented in Appendix C.
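For concreteness, the following is a minimal sketch of this conversion; the function name and template wording are illustrative rather than the paper's exact prompts (those are listed in Appendix C).

```python
# Toy sketch: turn one user-item interaction record into a question&answer pair.
def build_example(user_id: str, history: list[str], target_item: str) -> tuple[str, str]:
    """Convert a user's purchase history into an (input, output) instruction pair."""
    question = (f"Please recommend an item for user {user_id} "
                f"considering he has purchased {', '.join(history)}")
    answer = target_item
    return question, answer

# Input: "... user 2024 considering he has purchased item 2023" -> Output: "item 2024"
question, answer = build_example("2024", ["item 2023"], "item 2024")
```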

3.1 Instruction Tuning for Recommendation

Given a user set $\mathbf{U}$ and an item set $\mathbf{I}$, we formulate a recommendation task as a natural language instruction that pairs an input token sequence $\mathbf{x}=(x_1,\dots,x_n,\boldsymbol{x}_u,\boldsymbol{x}_i)$ with a corresponding label token sequence $\mathbf{y}=(y_1,\dots,y_m,\boldsymbol{x}_u,\boldsymbol{x}_i)$. The goal is to train an LLM $\mathcal{M}_{\boldsymbol{\theta}}$ to generate $\mathbf{y}$ given $\mathbf{x}$. For simplicity, we denote a single token as $x_i$ or $y_i$, represent the ID of a user $u\in\mathbf{U}$ by its token combination $\boldsymbol{x}_u=(x_{u_1},x_{u_2},\dots)$, and use the same representation format for an item ID $\boldsymbol{x}_i$.

The LLM $\mathcal{M}_{\boldsymbol{\theta}}$ employs a token embedding layer $\boldsymbol{E}(\cdot)$ with parameters $\boldsymbol{\theta}_{\boldsymbol{E}}\in\boldsymbol{\theta}$, functioning as a lookup table that transforms each input token into a token embedding. It subsequently predicts the probability distribution of label tokens by forward propagation. The training objective is to minimize the negative log-likelihood of the label tokens conditioned on the input sequence and previously generated tokens, formulated as:

$$\boldsymbol{\theta}^{*}=\operatorname*{arg\,min}_{\boldsymbol{\theta}}\mathcal{L}_{\boldsymbol{\theta}}=-\sum_{j=1}^{|\mathbf{y}|}\log P_{\boldsymbol{\theta}}\big(y_{j}\mid y_{<j},\boldsymbol{E}(\mathbf{x})\big), \qquad (1)$$

where $\boldsymbol{\theta}^{*}$ represents the optimal set of parameters that we aim to learn. This supervised learning approach helps the model internalize personalized information about users and items through the token embeddings of their respective IDs, $\boldsymbol{x}_u$ and $\boldsymbol{x}_i$.
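As a concrete illustration, a hedged sketch of this objective with the Hugging Face Transformers API is shown below; the checkpoint name and prompt are placeholders, and the library averages the per-token losses, which matches Equation (1) up to a constant factor.

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")   # illustrative backbone
model = T5ForConditionalGeneration.from_pretrained("t5-small")

question = "Please recommend an item for user 2024 considering he has purchased item 2023"
answer = "item 2024"

inputs = tokenizer(question, return_tensors="pt")
labels = tokenizer(answer, return_tensors="pt").input_ids

# The model computes -sum_j log P(y_j | y_<j, E(x)) over the label tokens,
# i.e. the negative log-likelihood of Eq. (1), and exposes it as `loss`.
loss = model(input_ids=inputs.input_ids,
             attention_mask=inputs.attention_mask,
             labels=labels).loss
loss.backward()   # an optimizer step over the parameters theta would follow
```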

3.2 Represent Users/Items with In-Vocabulary Tokens

The token embedding layer $\boldsymbol{E}$ in LLMs transforms input tokens into their corresponding token embeddings. Since users and items are also represented by IDs constructed of tokens, the token embeddings of the IDs $\boldsymbol{x}_u$ and $\boldsymbol{x}_i$ are the key to assessing how LLMs capture and represent different items. We define the ID representation of an ID $\boldsymbol{x}_i$ as the average of the embeddings of its token combination:

$$\mathbf{e}_{i}=\frac{1}{|\boldsymbol{x}_{i}|}\sum_{x\in\boldsymbol{x}_{i}}\boldsymbol{E}(x), \qquad (2)$$

where $\boldsymbol{E}(x)$ denotes the embedding vector of a token $x$. For example, an item identified by “item2024” is represented by the average of the embedding vectors of its three tokens (“item”, “20”, “24”).
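A minimal sketch of Equation (2) is given below; it assumes a Transformers-style tokenizer and embedding layer, and the helper name is ours.

```python
import torch

def id_representation(id_tokens, tokenizer, embedding_layer):
    """Average the token embeddings of an ID's token combination (Eq. 2)."""
    token_ids = tokenizer.convert_tokens_to_ids(id_tokens)
    token_embs = embedding_layer(torch.tensor(token_ids))   # shape (|x_i|, d)
    return token_embs.mean(dim=0)                           # shape (d,)

# e.g. e_i = id_representation(["item", "20", "24"], tokenizer, model.get_input_embeddings())
```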


3.3 Metric for Representation Evaluation

As mentioned above, using numeric IDs to represent items leads to similar representations of distinctive items. For an intuitive explanation, we first visualize the cosine similarity matrices between items as heatmaps in Figure 1, where RID and SID produce a large number of similar items due to semantic conflicts, while META ID, constructed of OOV tokens, shows distinguishable similarities closer to the ground truth. We further plot the ID representations using t-SNE visualization in Figure LABEL:fig:score-tsne. It is clear that RID and SID (using in-vocabulary tokens) shrink into a small region relative to CID (using OOV tokens), reflecting that in-vocabulary tokens lack expressive power for distinctive users and items. Details are in Section 5.3.

To quantify this, we introduce two metrics to assess the memorization and diversity of ID representations.

Diversity Score (DS) is a metric designed to quantify the diversity of ID representations within LLMs. Items represented by ID representations should be easily distinguishable for the model. For a pair of items $i$ and $j$, their differentiation can be calculated using the Kullback-Leibler (KL) divergence on the ID representations $\mathbf{e}_i\in\mathbf{E}$ and $\mathbf{e}_j\in\mathbf{E}$ from Equation 2, which reflects how distinct they are in the model’s embedding space:

$$\text{DS}(\mathbf{E})=\frac{1}{2N}\sum_{n=1}^{N}\Big[D_{KL}(\mathbf{e}_{i}\,\|\,\mathbf{e}_{j})+D_{KL}(\mathbf{e}_{j}\,\|\,\mathbf{e}_{i})\Big]. \qquad (3)$$

Here, $N$ represents the number of randomly selected item pairs; we apply this sampling strategy to reduce the computational complexity, and we give a convergence analysis in Figure 6.
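The sketch below illustrates one way to compute DS. Since the KL divergence requires probability distributions and the normalization of the ID representations is not spelled out above, softmax-normalizing each representation is our assumption, as are the pair count and the function name.

```python
import torch
import torch.nn.functional as F

def diversity_score(id_reps: torch.Tensor, num_pairs: int = 10000) -> float:
    """Approximate Eq. (3) on a (num_items, d) matrix of ID representations."""
    n = id_reps.size(0)
    i = torch.randint(0, n, (num_pairs,))        # randomly sampled item pairs
    j = torch.randint(0, n, (num_pairs,))
    p = F.softmax(id_reps[i], dim=-1)            # turn embeddings into distributions (assumption)
    q = F.softmax(id_reps[j], dim=-1)
    log_p, log_q = p.clamp_min(1e-12).log(), q.clamp_min(1e-12).log()
    kl_pq = (p * (log_p - log_q)).sum(-1)        # D_KL(e_i || e_j)
    kl_qp = (q * (log_q - log_p)).sum(-1)        # D_KL(e_j || e_i)
    return (0.5 * (kl_pq + kl_qp)).mean().item()
```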

Memorization Score (MS) quantifies the relationships captured by ID representations by measuring the similarity between items based on users’ ratings. We use the adjusted cosine similarity[34] to provide ground-truth relational values, given by:

$$\text{sim}(i,j)=\frac{\sum_{u\in U}(R_{u,i}-\bar{R}_{u})\cdot(R_{u,j}-\bar{R}_{u})}{\sqrt{\sum_{u\in U}(R_{u,i}-\bar{R}_{u})^{2}\cdot\sum_{u\in U}(R_{u,j}-\bar{R}_{u})^{2}}}, \qquad (4)$$

where $R_{u,i}$ and $R_{u,j}$ denote user $u$’s ratings for items $i$ and $j$, and $\bar{R}_{u}$ is user $u$’s average rating. To assess the relationships captured by the learned ID representations, we compute the mean squared error (MSE) between the cosine similarity of ID representations and the corresponding ground-truth relational values, which forms the basis of the MS:

$$\text{MS}(\mathbf{E})=\frac{2}{|N|(|N|-1)}\sum_{i,j\in N,\,i<j}\left(\frac{\mathbf{e}_{i}^{\top}\mathbf{e}_{j}}{|\mathbf{e}_{i}||\mathbf{e}_{j}|}-\text{sim}(i,j)\right)^{2}. \qquad (5)$$
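A sketch of Equations (4)-(5) follows. It assumes a dense rating matrix with zeros for missing entries and restricts the adjusted cosine similarity to users who rated both items; these simplifications and the function names are ours.

```python
import numpy as np

def adjusted_cosine(ratings: np.ndarray, i: int, j: int) -> float:
    """Adjusted cosine similarity (Eq. 4) on a (num_users, num_items) rating matrix."""
    both = (ratings[:, i] > 0) & (ratings[:, j] > 0)              # users who rated i and j
    user_mean = ratings.sum(1) / np.maximum((ratings > 0).sum(1), 1)
    di = ratings[both, i] - user_mean[both]
    dj = ratings[both, j] - user_mean[both]
    denom = np.sqrt((di ** 2).sum() * (dj ** 2).sum())
    return float((di * dj).sum() / denom) if denom > 0 else 0.0

def memorization_score(id_reps: np.ndarray, ratings: np.ndarray, item_pairs) -> float:
    """Mean squared gap (Eq. 5) between embedding cosine similarity and Eq. (4)."""
    errors = []
    for i, j in item_pairs:
        e_i, e_j = id_reps[i], id_reps[j]
        cos = e_i @ e_j / (np.linalg.norm(e_i) * np.linalg.norm(e_j))
        errors.append((cos - adjusted_cosine(ratings, i, j)) ** 2)
    return float(np.mean(errors))
```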

Our quantitative assessment, based on the introduced metrics, reveals intriguing insights into the quality of these ID representations. Figure LABEL:fig:score illustrates that IDs constructed from in-vocabulary tokens (RID and SID) perform poorly in terms of both diversity and memorization. Although the use of OOV tokens (CID) enhances diversity, its memorization score is unsatisfactory. These findings highlight the need for more effective forms of OOV tokens for ID construction, so that IDs capture user/item correlations while also distinguishing different items.

4 META ID


We now introduce the META ID framework to enhance LLMs in recommendation tasks. META ID involves creating out-of-vocabulary (OOV) tokens for constructing user and item IDs, which provide a rich, expressive space and encapsulate comprehensive, collaborative information. As illustrated in Figure 3, our process begins with sampling meta-paths from user-item interaction history. We then apply a skip-gram model to learn user and item representations from these meta-path sequences. This process ensures that users’ and items’ features are projected into a shared space to capture their interaction relationship better. This is followed by K-Means clustering to group similar users or items, after which unique OOV tokens are assigned to each cluster to construct the META ID.

4.1 Meta-path-based Embedding

We frame user-item interactions within a graph embedding learning paradigm[8], constructing an interaction graph composed of user nodes $U$ and item nodes $I$, linked by the interaction history with ratings $R$. The core of this embedding learning strategy is the meta-path[8], a sequence of connections that reflects composite relationships within the graph. Our primary meta-path, denoted U-I-U, materializes when users consistently rate an item with the same score $R_i$, forming a path:

$$p=\{U_{1}\xrightarrow{R_{i}}I_{2}\xrightarrow{R_{i}}U_{3}\xrightarrow{R_{i}}I_{4}\cdots\xrightarrow{R_{i}}U_{k}\}_{i=1}^{|R|}. \qquad (6)$$

We employ a skip-gram model to capture the interaction dynamics represented by these meta-paths. The model is trained on sequences generated from meta-path-based random walks[8], producing representations $\mathbf{W}_{\mathbf{U}}=[\mathbf{w}_{u_1}^{\top},\dots,\mathbf{w}_{u_m}^{\top}]\in\mathbb{R}^{d\times m}$ and $\mathbf{W}_{\mathbf{I}}=[\mathbf{w}_{i_1}^{\top},\dots,\mathbf{w}_{i_n}^{\top}]\in\mathbb{R}^{d\times n}$, denoting the representations of the $m$ users and $n$ items in a $d$-dimensional space, respectively. Through this process, we acquire deep representations of user and item interactions, which are instrumental in enhancing the accuracy and relevance of our recommendation system.
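A hedged sketch of this step is shown below: it samples U-I-U-... walks that follow edges sharing the same rating and trains a skip-gram model with gensim's Word2Vec (sg=1). The walk lengths, embedding size, and use of gensim are our assumptions; the paper specifies meta-path-based random walks and a skip-gram model but not a particular implementation.

```python
import random
from collections import defaultdict
from gensim.models import Word2Vec

def sample_walks(interactions, walk_len=20, walks_per_node=10):
    """interactions: iterable of (user, item, rating); returns U-I-U-... node walks."""
    user_nbrs = defaultdict(lambda: defaultdict(list))   # rating -> user node -> item nodes
    item_nbrs = defaultdict(lambda: defaultdict(list))   # rating -> item node -> user nodes
    for u, i, r in interactions:
        user_nbrs[r][f"user_{u}"].append(f"item_{i}")
        item_nbrs[r][f"item_{i}"].append(f"user_{u}")
    walks = []
    for r in user_nbrs:                                  # one set of walks per rating value R_i
        for start in user_nbrs[r]:
            for _ in range(walks_per_node):
                walk, node = [start], start
                for _ in range(walk_len - 1):
                    table = user_nbrs[r] if node.startswith("user_") else item_nbrs[r]
                    nbrs = table.get(node, [])
                    if not nbrs:
                        break
                    node = random.choice(nbrs)
                    walk.append(node)
                walks.append(walk)
    return walks

walks = sample_walks([(1, 10, 5), (2, 10, 5), (2, 11, 5)])      # toy interactions
w2v = Word2Vec(sentences=walks, vector_size=64, window=5, sg=1, min_count=1)
# w2v.wv["user_1"] and w2v.wv["item_10"] now live in the shared d-dimensional space.
```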

4.2 OOV Token Generation

To balance diversity and memorization in large-scale recommendation systems, we adopt a cluster-based approach. Representations derived from user-item interactions are organized into a shared embedding space and segmented into $G$ clusters. Each cluster center $\mu_g$ is defined as the average of the representations within that cluster, effectively capturing the collective characteristics of its members. These centroids then categorize each user and item, providing a refined foundation for constructing granular IDs. The generation process is a two-step procedure:

1) Assign coarse-grained tokens based on centroids. We first cluster the learned representations $\mathbf{W}$ and set the number of clusters to $G$. Each cluster centroid $\mu_g$ is calculated as:

$$\mu_{g}=\left\{\frac{1}{|\mathbb{I}(g_{i}=g)|}\sum_{(\mathbf{w}_{i},g_{i})\in(\mathbf{W},G)}\big[\mathbf{w}_{i}\cdot\mathbb{I}(g_{i}=g)\big]\right\}_{g\in[1,G]}, \qquad (7)$$

in which we use $i$ to represent a user/item for simplification. As a coarse-grained distinction between users and items, we assign an OOV token “$\langle CT_i\rangle$” to each centroid.

2) Assign fine-grained tokens based on distance. Within each cluster, we assign a fine-grained token “$\langle y_i\rangle$” to each member in ascending order of its distance from the cluster centroid.

The resulting identifiers, META ID, combine a coarse-grained token with a fine-grained one that uniquely identifies each user or item within its cluster. For example, an item might be represented as “$\langle\text{Item}\rangle\langle CT_i\rangle\langle y_i\rangle$”, labeled with three tokens: “$\langle\text{Item}\rangle$” denotes that it is an item, “$\langle CT_i\rangle$” is its coarse-grained token, and “$\langle y_i\rangle$” is its fine-grained token.

This clustering and labeling process effectively compresses the vocabulary needed for ID representation while preserving the rich information necessary for recommendation tasks. In our implementation, we apply K-Means clustering[1], utilizing cosine similarity to measure the affinity of representations to cluster centers. This method simplifies the complexity of representation space management and proves robust in our experimental validations in Section 5.4.
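The sketch below illustrates the two-step token assignment with scikit-learn's KMeans. Because that implementation uses Euclidean distance, the representations are L2-normalized first so that distances approximate cosine similarity; this trick, the cluster count, and the token spellings are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

def build_meta_ids(names, reps, num_clusters=100):
    """names: list like ['user_1', 'item_10', ...]; reps: (len(names), d) embeddings."""
    reps_n = normalize(reps)                                   # unit length ~ cosine geometry
    km = KMeans(n_clusters=num_clusters, n_init=10, random_state=0).fit(reps_n)
    meta_ids = {}
    for g in range(num_clusters):
        members = np.where(km.labels_ == g)[0]
        dists = np.linalg.norm(reps_n[members] - km.cluster_centers_[g], axis=1)
        # Step 1: coarse token <CT_g> from the centroid; Step 2: fine token <y_rank>
        # from the member's rank by distance to that centroid.
        for rank, idx in enumerate(members[np.argsort(dists)]):
            prefix = "<Item>" if names[idx].startswith("item_") else "<User>"
            meta_ids[names[idx]] = f"{prefix}<CT_{g}><y_{rank}>"
    return meta_ids, km.cluster_centers_
```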

4.3 Integration of META ID with LLMs

The integration of the OOV tokens of META ID, denoted as $\mathbf{x}_{\text{OOV}}$, with LLMs involves expanding the vocabulary. This is achieved by extending the token embedding layer’s parameters from $\boldsymbol{\theta}_{\boldsymbol{E}}\in\mathbb{R}^{N\times d}$ to $\boldsymbol{\theta}_{\boldsymbol{E}'}\in\mathbb{R}^{(N+n)\times d}$, where $N$ is the number of in-vocabulary tokens, $n$ is the number of OOV tokens, and $d$ is the dimension of the token embeddings.

A good initialization helps token embedding learning; here we introduce a representation augmentation approach, in contrast to previous works that use random initialization[13, 17]. As shown in Figure LABEL:fig:method_b, the OOV tokens pass through a linear layer $\boldsymbol{F}(\cdot)$ initialized with the category embeddings $\mu_g$ from Equation 7. Finally, the training objective for integrating META ID into LLMs is reformulated as:

$$\boldsymbol{\theta}'^{*}=\operatorname*{arg\,min}_{(\boldsymbol{\theta}',\boldsymbol{F})}\mathcal{L}_{\boldsymbol{\theta}'}=-\sum_{j=1}^{|\mathbf{y}|}\log P_{\boldsymbol{\theta}'}\big(y_{j}\mid y_{<j},\boldsymbol{E}'(\mathbf{x}),\boldsymbol{F}(\mathbf{x}_{\text{OOV}})\big), \qquad (8)$$

where $\boldsymbol{\theta}'$ now includes $\boldsymbol{F}$, aligning with the modified embedding layer $\boldsymbol{\theta}_{\boldsymbol{E}'}$ to optimize the model’s performance with the OOV tokens $\mathbf{x}_{\text{OOV}}$. This training objective ensures the model can effectively distinguish between diverse users and items, improving its ability to capture user-item relationships. By enhancing the memorization and diversity of OOV tokens, the model achieves better performance in recommendation tasks, leading to more accurate and personalized recommendations in our experiments.
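For orientation, a hedged sketch of the vocabulary extension is shown below using the Transformers API. Initializing each coarse-grained token's embedding from a small linear projection of its cluster centroid is our simplification of the augmentation described above; the placeholder centroids, token counts, and dimensions are illustrative.

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# OOV tokens: user/item markers plus coarse- and fine-grained tokens (illustrative sizes).
oov_tokens = (["<User>", "<Item>"]
              + [f"<CT_{g}>" for g in range(100)]
              + [f"<y_{k}>" for k in range(500)])
tokenizer.add_tokens(oov_tokens)
model.resize_token_embeddings(len(tokenizer))          # theta_E -> theta_E' of size (N+n) x d

d_meta = 64                                            # skip-gram embedding size (assumed)
d_model = model.get_input_embeddings().embedding_dim
proj = torch.nn.Linear(d_meta, d_model)                # the linear layer F(.)
cluster_centers = torch.randn(100, d_meta)             # placeholder for the K-Means centroids mu_g

with torch.no_grad():
    emb = model.get_input_embeddings().weight
    for g in range(100):
        tok_id = tokenizer.convert_tokens_to_ids(f"<CT_{g}>")
        emb[tok_id] = proj(cluster_centers[g])         # centroid-based initialization
```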

Table 1: Performance comparison on the sequential recommendation task (H@K: Hit Ratio, N@K: NDCG). META ID (T) and META ID (L) denote our method with the T5 and LLaMA2-7b backbones, respectively (Section 5.1).

| Methods | Sports H@5 | Sports N@5 | Sports H@10 | Sports N@10 | Beauty H@5 | Beauty N@5 | Beauty H@10 | Beauty N@10 | Toys H@5 | Toys N@5 | Toys H@10 | Toys N@10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Caser [36] | 0.0116 | 0.0072 | 0.0194 | 0.0097 | 0.0205 | 0.0131 | 0.0347 | 0.0176 | 0.0166 | 0.0107 | 0.0270 | 0.0141 |
| HGN [26] | 0.0189 | 0.0120 | 0.0313 | 0.0159 | 0.0325 | 0.0206 | 0.0512 | 0.0266 | 0.0321 | 0.0221 | 0.0497 | 0.0277 |
| GRU4Rec [15] | 0.0129 | 0.0086 | 0.0204 | 0.0110 | 0.0164 | 0.0099 | 0.0283 | 0.0137 | 0.0097 | 0.0059 | 0.0176 | 0.0084 |
| BERT4Rec [35] | 0.0115 | 0.0075 | 0.0191 | 0.0099 | 0.0203 | 0.0124 | 0.0347 | 0.0170 | 0.0116 | 0.0071 | 0.0203 | 0.0099 |
| FDSA [14] | 0.0182 | 0.0122 | 0.0288 | 0.0156 | 0.0267 | 0.0163 | 0.0407 | 0.0208 | 0.0228 | 0.0140 | 0.0381 | 0.0189 |
| SASRec [19] | 0.0233 | 0.0154 | 0.0350 | 0.0192 | 0.0387 | 0.0249 | 0.0605 | 0.0318 | 0.0463 | 0.0306 | 0.0675 | 0.0374 |
| S3-Rec [49] | 0.0251 | 0.0161 | 0.0385 | 0.0204 | 0.0387 | 0.0244 | 0.0647 | 0.0327 | 0.0443 | 0.0294 | 0.0700 | 0.0376 |
| CL4SRec [43] | 0.0219 | 0.0138 | 0.0358 | 0.0182 | 0.0330 | 0.0201 | 0.0546 | 0.0270 | 0.0427 | 0.0244 | 0.0617 | 0.0305 |
| RID [17] | 0.0208 | 0.0122 | 0.0288 | 0.0153 | 0.0213 | 0.0178 | 0.0479 | 0.0277 | 0.0044 | 0.0029 | 0.0062 | 0.0035 |
| SID [13] | 0.0223 | 0.0173 | 0.0294 | 0.0196 | 0.0404 | 0.0299 | 0.0573 | 0.0354 | 0.0050 | 0.0031 | 0.0088 | 0.0043 |
| CID [17] | 0.0269 | 0.0196 | 0.0378 | 0.0231 | 0.0336 | 0.0227 | 0.0507 | 0.0281 | 0.0172 | 0.0109 | 0.0279 | 0.0143 |
| META ID (T) | 0.0322 | 0.0223 | 0.0487 | 0.0277 | 0.0510 | 0.0351 | 0.0753 | 0.0429 | 0.0503 | 0.0352 | 0.0742 | 0.0429 |
| META ID (L) | 0.0392 | 0.0278 | 0.0561 | 0.0332 | 0.0458 | 0.0320 | 0.0678 | 0.0391 | 0.0387 | 0.0264 | 0.0535 | 0.0312 |

5 Experiment

We evaluate META ID on five downstream recommendation tasks: sequential recommendation, direct recommendation, rating prediction, explanation generation, and review-related tasks. We analyze the influence of critical components in META ID and assess the ID representations through visualization and our proposed metrics. Details of task descriptions and pre-processing are in Appendix C.

5.1 Evaluation on Sequential Recommendation

Setups. We evaluate our META ID framework on three public real-world datasets from the Amazon Product Reviews dataset [27], focusing specifically on Sports, Beauty, and Toys. The datasets are processed following the methodology in P5[13].

Baselines. We compare against a variety of established models (described briefly in Appendix C), spanning CNN-based to LLM-based frameworks: Caser [36], HGN [26], GRU4Rec [15], BERT4Rec [35], FDSA [14], SASRec [19], S3-Rec [49], and CL4SRec [43]. In addition, we compare with P5 variants equipped with different ID construction strategies: Sequential ID (SID), Random ID (RID), and Collaborative ID (CID)[17].

Evaluations. We apply the widely accepted metrics top-k Hit Ratio (H@K) and Normalized Discounted Cumulative Gain (N@K) with K = 5, 10 to evaluate recommendation performance.
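As a reference, a minimal sketch of these metrics for the common single-target evaluation protocol is given below; the function name and list-based inputs are ours.

```python
import numpy as np

def hit_and_ndcg(ranked_lists, targets, k=10):
    """ranked_lists: per-user ranked item IDs; targets: the held-out item per user."""
    hits, ndcgs = [], []
    for preds, target in zip(ranked_lists, targets):
        topk = list(preds)[:k]
        if target in topk:
            rank = topk.index(target)                # 0-based position in the top-k list
            hits.append(1.0)
            ndcgs.append(1.0 / np.log2(rank + 2))    # single relevant item, so IDCG = 1
        else:
            hits.append(0.0)
            ndcgs.append(0.0)
    return float(np.mean(hits)), float(np.mean(ndcgs))   # H@K, N@K
```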

Implementation Details. For constructing META IDs, the number of clustering groups is limited to $|G|=100$ (200 for Toys). For LLM fine-tuning, we consider both the encoder-decoder architecture T5-small [28] and the decoder-only architecture LLaMA2-7b [37]. We fully fine-tune the T5 model and employ LoRA [16] to fine-tune LLaMA2-7b. Vocabulary sizes of these models are shown in Table 9. For LLM inference, we use beam search to generate potential items, evaluated under the all-item setting.
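The snippet below sketches this inference step, assuming `model` and `tokenizer` are the fine-tuned LLM and extended tokenizer from the Section 4.3 sketch; the prompt wording, beam width, and number of returned beams are illustrative.

```python
# Beam-search decoding of candidate item identifiers for one user prompt.
prompt = "Please recommend an item for user <User><CT_3><y_7>"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=8,          # enough to emit an "<Item><CT_g><y_k>" identifier
    num_beams=20,              # beam search over candidate continuations
    num_return_sequences=10,   # keep the top beams as the ranked recommendation list
)
candidates = tokenizer.batch_decode(outputs)
```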

Results. Table 1 presents our findings for the sequential recommendation task (the standard errors of the metrics for META ID are reported in Table 11). Our observations are as follows: 1) META ID demonstrates superior performance on all three datasets, underscoring its robustness. 2) IDs constructed of in-vocabulary tokens, RID and SID, underperform on Toys, suggesting limitations in their recommendation efficacy for LLMs. 3) CID shows marked improvements over RID and SID on the Toys dataset, highlighting the benefits of incorporating OOV tokens with collaborative information. 4) While the LLaMA2-7b backbone is better on the Sports dataset, its performance on the Beauty and Toys datasets is not as good as T5’s, which could be linked to the distinct fine-tuning methodologies applied to these models.

5.2 Evaluation on Various Recommendation Tasks

Setups. To validate META ID’s adaptability, we extend our evaluation to direct recommendation, rating prediction, explanation generation, and review tasks, akin to P5[13]. For direct recommendation, the model is asked to recommend items for users directly, without access to the user’s interaction history. For rating prediction, the model predicts a numerical rating between 1 and 5 based on user-item data. For explanation tasks, it generates textual justifications for a user’s preference toward an item, while in review tasks it summarizes lengthy reviews into concise titles.

Baselines. We compare to three different ID construction strategies: RID, SID, and CID.

Evaluations. For direct recommendation, we apply the same metrics as in Section 5.1. For rating prediction, we use the MSE metric. For explanation and review tasks, we employ BLEU-1/4 metrics.

Implementation Details. For inference, we apply greedy decoding for the rating, explanation, and review tasks, and beam search under the all-item setting for the direct recommendation task.

Results. For direct recommendation (Table 2), META ID exceeds other methods across datasets in the all-item setting, suggesting that META ID effectively models the direct relationship between users and items. The results for the other three tasks in Table 3 show that META ID significantly improves the BLEU scores on Sports and Beauty compared to other methods. This result suggests that META ID can improve performance in text relevance tasks, including the interpretation of recommendations.

Table 2: Direct recommendation results under the all-item setting.

| Methods | Sports H@5 | Sports N@5 | Sports H@10 | Sports N@10 | Beauty H@5 | Beauty N@5 | Beauty H@10 | Beauty N@10 | Toys H@5 | Toys N@5 | Toys H@10 | Toys N@10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RID [17] | 0.0030 | 0.0023 | 0.0042 | 0.0027 | 0.0203 | 0.0155 | 0.0276 | 0.0178 | 0.0046 | 0.0030 | 0.0063 | 0.0035 |
| SID [13] | 0.0211 | 0.0169 | 0.0267 | 0.0187 | 0.0296 | 0.0226 | 0.0405 | 0.0261 | 0.0025 | 0.0014 | 0.0041 | 0.0019 |
| CID [17] | 0.0250 | 0.0189 | 0.0342 | 0.0219 | 0.0216 | 0.0147 | 0.0340 | 0.0187 | 0.0076 | 0.0049 | 0.0014 | 0.0070 |
| META ID | 0.0357 | 0.0256 | 0.0520 | 0.0308 | 0.0480 | 0.0336 | 0.0689 | 0.0403 | 0.0564 | 0.0391 | 0.0803 | 0.0468 |

Table 3: Results on rating prediction, explanation generation, and review tasks.

| Task Type | Metric | Sports RID | Sports SID | Sports CID | Sports META ID | Beauty RID | Beauty SID | Beauty CID | Beauty META ID |
|---|---|---|---|---|---|---|---|---|---|
| Rating | RMSE | 1.0382 | 1.0486 | 1.0383 | 1.0327 | 1.2829 | 1.3098 | 1.2819 | 1.2818 |
| Explanation | BLEU-1 | 16.2567 | 16.5825 | 16.6121 | 16.9005 | 18.2299 | 18.3981 | 19.3499 | 19.5106 |
| Explanation | BLEU-4 | 2.1782 | 2.1944 | 2.2332 | 2.3481 | 2.9027 | 2.8071 | 3.0626 | 3.0592 |
| Review | BLEU-1 | 7.6140 | 7.7948 | 7.6586 | 7.8819 | 6.2282 | 6.5055 | 6.5854 | 7.0500 |
| Review | BLEU-4 | 2.3228 | 1.2406 | 2.4109 | 2.6546 | 1.9891 | 1.2406 | 1.9718 | 2.7485 |

5.3 Evaluation of ID Representation

Visualization. The number of numeric tokens available in LLMs is relatively limited, which complicates the establishment of unique one-to-one ID relationships: two unrelated items might share the same tokens in their IDs. For an intuitive explanation, we visualize the cosine similarity matrices between items as heatmaps in Figure 1, where we randomly sample 50 items from the Toys dataset and take their adjusted cosine similarity from Equation 4 as ground truth, compared against RID, SID, and META ID. RID and SID result in a large number of similar items due to semantic conflicts, while META ID shows distinguishable similarities closer to the ground truth. This suggests that using META ID allows LLMs to better capture relationships between users and items.

Quantitative Analysis. We quantitatively assess the quality of these ID representations with the two proposed metrics (Section 3.3): memorization score (MS) and diversity score (DS). Our results, shown in Figure LABEL:fig:score, indicate that IDs constructed of in-vocabulary tokens (RID and SID) perform poorly in the diversity dimension. For an intuitive interpretation, we further employ t-SNE visualization to map the ID representations and observe a tendency for these tokens to cluster narrowly in Figure LABEL:fig:score. META ID shows robust memorization and diversity across all three datasets, reflecting its ability to capture correlations between users and items from historical data while ensuring items remain distinguishable.

Metrics Analysis. Furthermore, we conduct a correlation analysis to explore the relationship between MS/DS and sequential recommendation performance. We sum the MS and DS of different ID strategies and plot their performance on the sequential recommendation task. As shown in Figure 4, the sum of MS and DS positively correlates with NDCG@10, suggesting that memorization and diversity of IDs are two essential properties for recommendation tasks.

5.4 Ablation Studies

We analyze the properties of META ID following the evaluation in Section 5.1, including the impact of grouping methods, the size of the OOV token set, and different indexing ranges on the performance of META ID.

Table 4: Ablation on grouping methods for OOV token generation.

| Methods | Sports H@5 | Sports N@5 | Sports H@10 | Sports N@10 | Beauty H@5 | Beauty N@5 | Beauty H@10 | Beauty N@10 | Toys H@5 | Toys N@5 | Toys H@10 | Toys N@10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DBSCAN [9] | 0.0078 | 0.0043 | 0.0145 | 0.0065 | 0.0257 | 0.0168 | 0.0428 | 0.0223 | 0.0180 | 0.0109 | 0.0304 | 0.0149 |
| Spectral [20] | 0.0199 | 0.0124 | 0.0336 | 0.0167 | 0.0360 | 0.0236 | 0.0588 | 0.0310 | 0.0295 | 0.0184 | 0.0514 | 0.0254 |
| RQ-VAE [44] | 0.0122 | 0.0077 | 0.0171 | 0.0093 | 0.0368 | 0.0254 | 0.0536 | 0.0309 | 0.0511 | 0.0335 | 0.0667 | 0.0395 |
| K-Means [1] | 0.0322 | 0.0223 | 0.0487 | 0.0277 | 0.0510 | 0.0351 | 0.0753 | 0.0429 | 0.0503 | 0.0352 | 0.0742 | 0.0429 |

Token Grouping. We study the importance of different grouping methods for OOV token generation in our framework. In Table 4, we compare the performance of DBSCAN [9] and Spectral Clustering [20] against K-Means clustering [1]. Since the related work [30] has not released its source code, we implement it ourselves by generating OOV tokens with RQ-VAE [44] using meta-path-based embeddings. Our results show that simply applying K-Means outperforms the other grouping methods in most cases.

OOV Token Size. Since varying cluster sizes $G$ result in different numbers of OOV tokens, we also investigate the impact of different cluster sizes for META ID in Figure LABEL:fig:token_a. We find that the granularity of token clusters plays a crucial role in recommendation performance: an excessive token scale can introduce noise, reducing performance. Therefore, finding an optimal token size is vital to ensure that META ID effectively adapts to the nuances of various datasets.

User or Item Indexing. Previous ID strategies for LLMs only consider indexing items, a convention stemming from the fact that users are typically represented by a sequence of interacted items in sequential recommendation[13, 17, 30]. META ID, in contrast, models both users and items; as shown in Table LABEL:fig:token_b, the combined user-item indexing (User&Item) outperforms either user-only or item-only indexing. This result shows the importance of incorporating both user preferences and item attributes for LLMs to enhance the accuracy of the recommendations.

6 Conclusion

This study introduces META ID, a method for enhancing Large Language Models (LLMs) for recommender systems using OOV tokens. Moving beyond constructing IDs with in-vocabulary tokens, META ID incorporates user-item interaction information to align LLMs more effectively with recommendation tasks. We learn representations from user-item interactions via meta-path sampling, and by clustering these representations we generate OOV tokens to construct META ID. This approach ensures that tokens capture correlations between users and items from historical data while remaining distinctive across items. Our experiments across various real-world datasets demonstrate META ID’s robust performance in diverse recommendation tasks, including sequential and direct recommendation, as well as complex tasks requiring detailed textual responses. Essentially, META ID effectively combines the capabilities of LLMs with the nuanced requirements of recommendation scenarios, such as planning highly personalized content for users as virtual shopping assistants.

References

  • Arthur and Vassilvitskii [2007]David Arthur and Sergei Vassilvitskii.k-means++: the advantages of careful seeding.In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2007, New Orleans, Louisiana, USA, January 7-9, 2007, pages 1027–1035. SIAM, 2007.
  • Bao etal. [2023]Keqin Bao, Jizhi Zhang, Yang Zhang, Wenjie Wang, Fuli Feng, and Xiangnan He.Tallrec: An effective and efficient tuning framework to align large language model with recommendation.In Proceedings of the 17th ACM Conference on Recommender Systems, RecSys 2023, Singapore, Singapore, September 18-22, 2023, pages 1007–1014. ACM, 2023.
  • Bi etal. [2022]Qiwei Bi, Jian Li, Lifeng Shang, Xin Jiang, Qun Liu, and Hanfang Yang.Mtrec: Multi-task learning over BERT for news recommendation.In Findings of the Association for Computational Linguistics: ACL 2022, Dublin, Ireland, May 22-27, 2022, pages 2663–2669. Association for Computational Linguistics, 2022.
  • Chen etal. [2019]Qiwei Chen, Huan Zhao, Wei Li, Pipei Huang, and Wenwu Ou.Behavior sequence transformer for e-commerce recommendation in alibaba.In Proceedings of the 1st International Workshop on Deep Learning Practice for High-Dimensional Sparse Data, DLP-KDD ’19. Association for Computing Machinery, 2019.
  • Chen [2023]Zheng Chen.PALR: personalization aware llms for recommendation.CoRR, abs/2305.07622, 2023.
  • Cui etal. [2022]Zeyu Cui, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang.M6-rec: Generative pretrained language models are open-ended recommender systems.CoRR, abs/2205.08084, 2022.
  • Davidson etal. [2010]James Davidson, Benjamin Liebald, Junning Liu, Palash Nandy, TaylorVan Vleet, Ullas Gargi, Sujoy Gupta, YuHe, Mike Lambert, Blake Livingston, and Dasarathi Sampath.The youtube video recommendation system.In Proceedings of the 2010 ACM Conference on Recommender Systems, RecSys 2010, Barcelona, Spain, September 26-30, 2010, pages 293–296. ACM, 2010.
  • Dong etal. [2017]Yuxiao Dong, NiteshV. Chawla, and Ananthram Swami.metapath2vec: Scalable representation learning for heterogeneous networks.In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, August 13 - 17, 2017, pages 135–144. ACM, 2017.
  • Ester etal. [1996]Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu.A density-based algorithm for discovering clusters in large spatial databases with noise.In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), Portland, Oregon, USA, pages 226–231. AAAI Press, 1996.
  • Fan etal. [2019]Wenqi Fan, Yao Ma, Qing Li, Yuan He, YihongEric Zhao, Jiliang Tang, and Dawei Yin.Graph neural networks for social recommendation.In The World Wide Web Conference, WWW 2019, San Francisco, CA, USA, May 13-17, 2019, pages 417–426. ACM, 2019.
  • Fan etal. [2023]Wenqi Fan, Zihuai Zhao, Jiatong Li, Yunqing Liu, Xiaowei Mei, Yiqi Wang, Jiliang Tang, and Qing Li.Recommender systems in the era of large language models (llms).CoRR, abs/2307.02046, 2023.
  • Gao etal. [2023]Yunfan Gao, Tao Sheng, Youlin Xiang, Yun Xiong, Haofen Wang, and Jiawei Zhang.Chat-rec: Towards interactive and explainable llms-augmented recommender system.CoRR, abs/2303.14524, 2023.
  • Geng etal. [2022]Shijie Geng, Shuchang Liu, Zuohui Fu, Yingqiang Ge, and Yongfeng Zhang.Recommendation as language processing (RLP): A unified pretrain, personalized prompt & predict paradigm (P5).In RecSys ’22: Sixteenth ACM Conference on Recommender Systems, Seattle, WA, USA, September 18 - 23, 2022, pages 299–315. ACM, 2022.
  • Hao etal. [2023]Yongjing Hao, Tingting Zhang, Pengpeng Zhao, Yanchi Liu, VictorS. Sheng, Jiajie Xu, Guanfeng Liu, and Xiaofang Zhou.Feature-level deeper self-attention network with contrastive learning for sequential recommendation.IEEE Trans. Knowl. Data Eng., 35(10):10112–10124, 2023.
  • Hidasi etal. [2016]Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk.Session-based recommendations with recurrent neural networks.In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, 2016.
  • Hu etal. [2022]EdwardJ. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, LuWang, and Weizhu Chen.Lora: Low-rank adaptation of large language models.In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022.
  • Hua etal. [2023]Wenyue Hua, Shuyuan Xu, Yingqiang Ge, and Yongfeng Zhang.How to index item ids for recommendation foundation models.pages 195–204, 2023.
  • Ji etal. [2023]Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Yejin Bang, Andrea Madotto, and Pascale Fung.Survey of hallucination in natural language generation.ACM Comput. Surv., 55(12):248:1–248:38, 2023.
  • Kang and McAuley [2018]Wang-Cheng Kang and JulianJ. McAuley.Self-attentive sequential recommendation.In IEEE International Conference on Data Mining, ICDM 2018, Singapore, November 17-20, 2018, pages 197–206. IEEE Computer Society, 2018.
  • Kluger etal. [2003]Yuval Kluger, Ronen Basri, JosephT Chang, and Mark Gerstein.Spectral biclustering of microarray data: coclustering genes and conditions.Genome research, 13(4):703–716, 2003.
  • Koren etal. [2009]Yehuda Koren, RobertM. Bell, and Chris Volinsky.Matrix factorization techniques for recommender systems.Computer, 42(8):30–37, 2009.
  • Koren etal. [2022]Yehuda Koren, Steffen Rendle, and RobertM. Bell.Advances in collaborative filtering.In Recommender Systems Handbook, pages 91–142. Springer US, 2022.
  • Li etal. [2023]Jinming Li, Wentao Zhang, Tian Wang, Guanglei Xiong, Alan Lu, and Gerard Medioni.Gpt4rec: A generative framework for personalized recommendation and user interests interpretation.CoRR, abs/2304.03879, 2023.
  • Li etal. [2020]Lei Li, Yongfeng Zhang, and LiChen.Generate neural template explanations for recommendation.In CIKM ’20: The 29th ACM International Conference on Information and Knowledge Management, Virtual Event, Ireland, October 19-23, 2020, pages 755–764. ACM, 2020.
  • Li etal. [2021]Lei Li, Yongfeng Zhang, and LiChen.Personalized transformer for explainable recommendation.In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, pages 4947–4957. Association for Computational Linguistics, 2021.
  • Ma etal. [2019]Chen Ma, Peng Kang, and Xue Liu.Hierarchical gating networks for sequential recommendation.In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2019, Anchorage, AK, USA, August 4-8, 2019, pages 825–833. ACM, 2019.
  • Ni etal. [2019]Jianmo Ni, Jiacheng Li, and Julian McAuley.Justifying recommendations using distantly-labeled reviews and fine-grained aspects.In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 188–197, 2019.
  • Raffel etal. [2020a]Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and PeterJ. Liu.Exploring the limits of transfer learning with a unified text-to-text transformer.J. Mach. Learn. Res., 21:140:1–140:67, 2020a.
  • Raffel etal. [2020b]Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and PeterJ. Liu.Exploring the limits of transfer learning with a unified text-to-text transformer.J. Mach. Learn. Res., 21:140:1–140:67, 2020b.
  • Rajput etal. [2023]Shashank Rajput, Nikhil Mehta, Anima Singh, RaghunandanH. Keshavan, Trung Vu, Lukasz Heldt, Lichan Hong, YiTay, VinhQ. Tran, Jonah Samost, Maciej Kula, EdH. Chi, and Maheswaran Sathiamoorthy.Recommender systems with generative retrieval.CoRR, abs/2305.05065, 2023.
  • Ren etal. [2020]Ruiyang Ren, Zhaoyang Liu, Yaliang Li, WayneXin Zhao, Hui Wang, Bolin Ding, and Ji-Rong Wen.Sequential recommendation with self-attentive multi-adversarial network.In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval, SIGIR 2020, Virtual Event, China, July 25-30, 2020, pages 89–98. ACM, 2020.
  • Rendle etal. [2009]Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme.BPR: bayesian personalized ranking from implicit feedback.In UAI 2009, Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, Montreal, QC, Canada, June 18-21, 2009, pages 452–461. AUAI Press, 2009.
  • Sanh etal. [2022]Victor Sanh, Albert Webson, Colin Raffel, StephenH. Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Arun Raja, Manan Dey, MSaiful Bari, Canwen Xu, Urmish Thakker, ShanyaSharma Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, NihalV. Nayak, Debajyoti Datta, Jonathan Chang, MikeTian-Jian Jiang, Han Wang, Matteo Manica, Sheng Shen, ZhengXin Yong, Harshit Pandey, Rachel Bawden, Thomas Wang, Trishala Neeraj, Jos Rozen, Abheesht Sharma, Andrea Santilli, Thibault Févry, JasonAlan Fries, Ryan Teehan, TevenLe Scao, Stella Biderman, Leo Gao, Thomas Wolf, and AlexanderM. Rush.Multitask prompted training enables zero-shot task generalization.In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022.
  • Sarwar etal. [2001]BadrulMunir Sarwar, George Karypis, JosephA. Konstan, and John Riedl.Item-based collaborative filtering recommendation algorithms.In Proceedings of the Tenth International World Wide Web Conference, WWW 10, Hong Kong, China, May 1-5, 2001, pages 285–295. ACM, 2001.
  • Sun etal. [2019]Fei Sun, Jun Liu, Jian Wu, Changhua Pei, Xiao Lin, Wenwu Ou, and Peng Jiang.Bert4rec: Sequential recommendation with bidirectional encoder representations from transformer.In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, CIKM 2019, Beijing, China, November 3-7, 2019, pages 1441–1450. ACM, 2019.
  • Tang and Wang [2018]Jiaxi Tang and KeWang.Personalized top-n sequential recommendation via convolutional sequence embedding.In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, WSDM 2018, Marina Del Rey, CA, USA, February 5-9, 2018, pages 565–573. ACM, 2018.
  • Touvron etal. [2023]Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton-Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, PunitSingh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, EricMichael Smith, Ranjan Subramanian, XiaoqingEllen Tan, Binh Tang, Ross Taylor, Adina Williams, JianXiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurélien Rodriguez, Robert Stojnic, Sergey Edunov,and Thomas Scialom.Llama 2: Open foundation and fine-tuned chat models.CoRR, abs/2307.09288, 2023.
  • Wang etal. [2021]Jinpeng Wang, Jieming Zhu, and Xiuqiang He.Cross-batch negative sampling for training two-tower recommenders.In SIGIR ’21: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, Canada, July 11-15, 2021, pages 1632–1636. ACM, 2021.
  • Wang etal. [2023]Qi-Wei Wang, Hongyu Lu, YuChen, Da-Wei Zhou, De-Chuan Zhan, Ming Chen, and Han-Jia Ye.Streaming CTR prediction: Rethinking recommendation task for real-world streaming data.CoRR, abs/2307.07509, 2023.
  • Wang etal. [2022]Wei-Yao Wang, Wei-Wei Du, and Wen-Chih Peng.Recformer: personalized temporal-aware transformer for fair music recommendation.In Proceedings of the CIKM 2022 Workshops co-located with 31st ACM International Conference on Information and Knowledge Management (CIKM 2022), Atlanta, USA, October 17-21, 2022, volume 3318 of CEUR Workshop Proceedings. CEUR-WS.org, 2022.
  • Wei etal. [2022]Jason Wei, Maarten Bosma, VincentY. Zhao, Kelvin Guu, AdamsWei Yu, Brian Lester, Nan Du, AndrewM. Dai, and QuocV. Le.Finetuned language models are zero-shot learners.In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022.
  • Wu etal. [2021]Chuhan Wu, Fangzhao Wu, Tao Qi, and Yongfeng Huang.Empowering news recommendation with pre-trained language models.In SIGIR ’21: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, Canada, July 11-15, 2021, pages 1652–1656. ACM, 2021.
  • Xie etal. [2022]XuXie, Fei Sun, Zhaoyang Liu, Shiwen Wu, Jinyang Gao, Jiandong Zhang, Bolin Ding, and Bin Cui.Contrastive learning for sequential recommendation.In 38th IEEE International Conference on Data Engineering, ICDE 2022, Kuala Lumpur, Malaysia, May 9-12, 2022, pages 1259–1273. IEEE, 2022.
  • Zeghidour etal. [2022]Neil Zeghidour, Alejandro Luebs, Ahmed Omran, Jan Skoglund, and Marco Tagliasacchi.Soundstream: An end-to-end neural audio codec.IEEE ACM Trans. Audio Speech Lang. Process., 30:495–507, 2022.
  • Zhang etal. [2023a]Junjie Zhang, Ruobing Xie, Yupeng Hou, WayneXin Zhao, Leyu Lin, and Ji-Rong Wen.Recommendation as instruction following: A large language model empowered recommendation approach.CoRR, abs/2305.07001, 2023a.
  • Zhang etal. [2023b]Yi-Kai Zhang, Ting-Ji Huang, Yao-Xiang Ding, De-Chuan Zhan, and Han-Jia Ye.Model spider: Learning to rank pre-trained models efficiently.In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023b.
  • Zhang etal. [2021]Yuhui Zhang, HAO DING, Zeren Shui, Yifei Ma, James Zou, Anoop Deoras, and Hao Wang.Language models as recommender systems: Evaluations and limitations.In I (Still) Can’t Believe It’s Not Better! NeurIPS 2021 Workshop, 2021.
  • Zhao etal. [2023]WayneXin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie, and Ji-Rong Wen.A survey of large language models.CoRR, abs/2303.18223, 2023.
  • Zhou etal. [2020]Kun Zhou, Hui Wang, WayneXin Zhao, Yutao Zhu, Sirui Wang, Fuzheng Zhang, Zhongyuan Wang, and Ji-Rong Wen.S3-rec: Self-supervised learning for sequential recommendation with mutual information maximization.In CIKM ’20: The 29th ACM International Conference on Information and Knowledge Management, Virtual Event, Ireland, October 19-23, 2020, pages 1893–1902. ACM, 2020.

Appendix

We provide details omitted in the main paper.

  • Appendix A: Workflow of META ID, encompassing the construction of OOV tokens.

  • Appendix B: Details of the memorization score (MS) and diversity score (DS).

  • Appendix C: Experimental setups and implementation details of META ID.

  • Appendix D: Additional experimental result analysis.

  • Appendix E: Discussions and limitations of META ID.

Appendix A Details of META ID

In Section 4 of the main text, we describe the complete workflow for generating META ID. This process consists of three main steps: (1) extracting the meta-path-based embeddings, (2) generating the OOV tokens, and (3) incorporating META ID into LLMs so that they can handle various downstream recommendation tasks.

A.1 How to extract the meta-path-based embedding

This section supplements the details of Section 4.1, i.e., the user/item representations extracted from a skip-gram model, including the process of sampling meta-paths as training data.

In META ID, we train a skip-gram model to learn effective user/item representations from the sampled meta-paths, i.e., the user representations $\mathbf{W}_U$ and the item representations $\mathbf{W}_I$. The objective of the skip-gram learning paradigm is to map the users and items in the meta-path sequences into a lower-dimensional space, as in [8].

Meta-path sampling. First, we construct node sequences based on random walks over meta-paths. A meta-path $p = P_1 \xrightarrow{R_1} P_2 \xrightarrow{R_2} \cdots \xrightarrow{R_{k-1}} P_k$ is a path defined on a graph, where $R_i$ denotes a composite relation between node types $P$. We define user-item-user (U-I-U) as our meta-path, where a path exists only if two users have given the same rating $R_i$ to an item. We sample 32 rounds starting from each user and item with a sampled length of $k = 64$.
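To make the sampling step concrete, the sketch below builds U-I-U walks from (user, item, rating) triples. The function and parameter names (e.g., `sample_meta_paths`, `walks_per_node`) are ours, and for brevity only walks starting from users are shown; walks starting from items are analogous.

```python
import random
from collections import defaultdict

def sample_meta_paths(ratings, walks_per_node=32, walk_length=64, seed=0):
    """Sample U-I-U random walks in which consecutive users on a walk have given
    the same rating to the connecting item (a sketch of the sampling step)."""
    rng = random.Random(seed)
    items_of_user = defaultdict(list)                        # user -> [(item, rating)]
    users_of_item = defaultdict(lambda: defaultdict(list))   # item -> rating -> [users]
    for user, item, rating in ratings:
        items_of_user[user].append((item, rating))
        users_of_item[item][rating].append(user)

    walks = []
    for start_user in list(items_of_user):
        for _ in range(walks_per_node):
            user, walk = start_user, [f"user_{start_user}"]
            while len(walk) < walk_length:
                item, rating = rng.choice(items_of_user[user])   # hop user -> item
                walk.append(f"item_{item}")
                user = rng.choice(users_of_item[item][rating])   # hop item -> user with equal rating
                walk.append(f"user_{user}")
            walks.append(walk)
    return walks
```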

Skip-gram model training. In the second step, using the meta-paths sampled via random walks as the training corpus, we train a skip-gram model to learn the vector representations $(\mathbf{W}_U, \mathbf{W}_I)$ for all users and items. The objective of the skip-gram model is to maximize the conditional probability $P(n_i \mid v)$ of each neighboring node $n_i \in N_v$ for a node $v \in V$:

\[ \arg\max_{\theta} \sum_{v \in V} \sum_{t \in \{U, I\}} \sum_{n_i \in N_v} \log P(n_i \mid v) \qquad (9) \]

and $P(n_i \mid v)$ is calculated as:

\[ P(n_i \mid v) = \frac{\exp(w_{n_i}^{T} w_v)}{\sum_{j \in V} \exp(w_j^{T} w_v)} \qquad (10) \]

where $w$ denotes the representation of one user or item. In our experiments, we set the number of negative samples to 5 and train the skip-gram model for 10 epochs with a learning rate of 0.001.
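Under these settings, the skip-gram training step could look like the following sketch using gensim's Word2Vec. The window size of 5 is taken from Appendix C.3; the embedding dimension is not stated above, so `vector_size=128` is an assumption.

```python
from gensim.models import Word2Vec

# `walks` is the list of node-token sequences produced by the meta-path sampler,
# e.g., ["user_1", "item_9", "user_4", ...].
w2v = Word2Vec(
    sentences=walks,
    vector_size=128,   # embedding dimension (not specified above; an assumption)
    window=5,          # context window size of 5, as reported in Appendix C.3
    sg=1,              # skip-gram objective
    negative=5,        # 5 negative samples, as reported
    alpha=1e-3,        # learning rate 0.001, as reported
    epochs=10,         # 10 training epochs, as reported
    min_count=1,       # keep every user/item node
)

# W_U / W_I: look up a user or item representation by its node token.
w_user = w2v.wv["user_1"]
w_item = w2v.wv["item_9"]
```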

A.2 How to generate OOV tokens

This section complements Section 4.2, where we generate OOV tokens from the user and item representations to construct META ID. Essentially, we need to build a hierarchical classification scheme for IDs that expresses a wide range of items and users with as few OOV tokens as possible, so that similar items and users fall under the same hierarchical branch.

This hierarchical construction is reminiscent of clustering, which is exactly what we apply in META ID: we use between-cluster indices and in-cluster indices as the two levels of IDs. Although more sophisticated clustering methods for multi-level structures could be applied, in our experiments we use simple K-Means clustering, which suits large-scale data thanks to its simplicity and ease of optimization. We demonstrate the effectiveness of this approach in Table 4.

In experiments, we cluster the user and item representations together. We then create the between-cluster token ⟨CT_i⟩ for cluster $i$, and sort the in-cluster users/items by their cosine distance to the cluster centroid to obtain the in-cluster token ⟨y_j⟩. Finally, we add a prefix token ⟨Item⟩ or ⟨User⟩ to denote the type. For example, an item might be represented as "⟨Item⟩⟨CT_i⟩⟨y_j⟩", labeled with three tokens, where "⟨Item⟩" denotes that it is an item, "⟨CT_i⟩" is its cluster token, and "⟨y_j⟩" is its in-cluster token.
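A minimal sketch of this token-assignment step with scikit-learn's K-Means is given below. It assumes `embeddings` is a NumPy array aligned with `node_keys`; the helper name `build_meta_ids` and the ASCII token spelling are ours and stand in for the special tokens described above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_distances

def build_meta_ids(node_keys, embeddings, n_clusters=100, seed=0):
    """Map each user/item to prefix + between-cluster token + in-cluster token."""
    km = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit(embeddings)
    meta_ids = {}
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        # rank cluster members by cosine distance to their centroid
        dists = cosine_distances(embeddings[members],
                                 km.cluster_centers_[c:c + 1]).ravel()
        for rank, idx in enumerate(members[np.argsort(dists)]):
            key = node_keys[idx]                          # e.g., "item_2024" or "user_18"
            prefix = "<Item>" if key.startswith("item") else "<User>"
            meta_ids[key] = f"{prefix}<CT{c}><{rank}>"    # e.g., "<Item><CT8><24>"
    return meta_ids
```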

It is worth noting that a naive approach to generating user and item IDs is to assign each item and user an independent OOV token as its ID (IID), which must be learned from scratch. However, this is not applicable to modern recommender systems with large numbers of items and users, since training may take too long when a large number of new tokens must be created. We also show this in Table 7, where we initialize the IID tokens with the meta-path-based embeddings passed through a linear projection layer; the result is still ineffective compared with META ID.

Sequential Recommendation
  Task Input: Considering user_2024 has interacted with items item_1, item_2. What is the next recommendation for the user?
  Task Output: item_2024

Direct Recommendation
  Task Input: What should we recommend for user_2024?
  Task Output: item_2024

Rating Prediction
  Task Input: Which star rating will user_2024 give to item item_2? (1 being the lowest and 5 being the highest).
  Task Output: 5

Explanation
  Task Input: According to the feature word quality, generate a 5-star explanation for user_2 about item_2.
  Task Output: Absolutely great product!

Review
  Task Input: Write a short sentence to summarize the following product review from user_2: Absolutely great product. I bought this for …
  Task Output: Perfect!

Methods    Sports                                  Beauty                                  Toys
           HR@5     NDCG@5   HR@10    NDCG@10      HR@5     NDCG@5   HR@10    NDCG@10      HR@5     NDCG@5   HR@10    NDCG@10
IID        0.0114   0.0073   0.0208   0.0103       0.0302   0.0194   0.0494   0.0256       0.0146   0.0091   0.0217   0.0114
META ID    0.0357   0.0256   0.0520   0.0308       0.0480   0.0336   0.0689   0.0403       0.0564   0.0391   0.0803   0.0468

A.3 How to incorporate META ID with LLMs

As mentioned in Section 3.1, we convert every recommendation task into a question&answer template: we describe the task in natural language and fill in the user and item IDs from each dataset, like a cloze test. The full templates for every task follow [13]; we give some examples in Table 5 and Table 6.

Take the rating prediction task as an example. We might ask the LLM, "Which star rating will user ⟨User⟩⟨CT_1⟩⟨18⟩ give to item ⟨Item⟩⟨CT_8⟩⟨24⟩?", and expect the LLM to answer "5".

We construct the fine-tuning and testing datasets for the LLM in this unified way. The LLM can then acquire generalized knowledge across different tasks and capture user and item characteristics through the tokens that constitute their IDs, allowing it to handle different recommendation tasks.
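As an illustration of how one training example might be assembled (the template wording follows the rating-prediction prompt shown above; the helper names and the ASCII token spelling in the comment are ours):

```python
def rating_example(user_key, item_key, rating, meta_ids):
    """Build one rating-prediction Q&A pair with META ID tokens substituted in."""
    u, i = meta_ids[user_key], meta_ids[item_key]
    prompt = (f"Which star rating will user {u} give to item {i}? "
              "(1 being the lowest and 5 being the highest).")
    return {"input": prompt, "output": str(rating)}

# Assuming meta_ids maps these keys to the tokens from the example above:
# rating_example("user_18", "item_2024", 5, meta_ids)
# -> {"input": "Which star rating will user <User><CT1><18> give to item "
#              "<Item><CT8><24>? (1 being the lowest and 5 being the highest).",
#     "output": "5"}
```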


Appendix B Details of memorization score (MS) and diversity score (DS)

This section complements Section 3.3. The token embedding layer of an LLM transforms each input token into a token embedding vector. We use the term ID representation for the token embedding vectors corresponding to an ID, i.e., the representation of an item or user inside the LLM.

The convergence of DS. DS is a metric designed to quantify the diversity of ID representations in LLMs. Given the high computational demand of calculating the KL divergence for all embedding pairs in large datasets, DS employs a random sampling approach, so we present a convergence analysis for DS in Figure 6. DS is stable across both datasets and converges as the number of sampled pairs grows. This sampling strategy reduces the computational complexity from $O(|I|^2 \cdot D)$ to $O(N \cdot D)$, where $|I|$ is the number of items, $N$ is the number of sampled pairs, and $D$ is the dimension of the ID representation, which equals the LLM's token embedding dimension.
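The sampled estimate can be computed as in the sketch below, assuming DS averages the KL divergence between softmax-normalized ID representations over $N$ randomly drawn pairs; the precise definition of DS is given in Section 3.3 of the main text, so treat this as an illustration of the sampling strategy rather than the exact formula.

```python
import numpy as np
from scipy.special import softmax
from scipy.stats import entropy          # entropy(p, q) computes KL(p || q)

def diversity_score(id_reprs, n_pairs=10_000, seed=0):
    """Sampled DS estimate over `n_pairs` random pairs of ID representations."""
    rng = np.random.default_rng(seed)
    probs = softmax(id_reprs, axis=1)     # turn each D-dim representation into a distribution
    n = probs.shape[0]
    i = rng.integers(0, n, size=n_pairs)
    j = rng.integers(0, n, size=n_pairs)
    keep = i != j                         # discard self-pairs
    # column-wise KL divergence over shapes (D, M); cost is O(N * D), not O(|I|^2 * D)
    return float(np.mean(entropy(probs[i[keep]].T, probs[j[keep]].T)))
```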

The approximate value of MS. The adjusted cosine similarity for items is given by:

\[ \text{sim}(i,j) = \frac{\sum_{u \in U} (R_{u,i} - \bar{R}_u) \cdot (R_{u,j} - \bar{R}_u)}{\sqrt{\sum_{u \in U} (R_{u,i} - \bar{R}_u)^2 \cdot \sum_{u \in U} (R_{u,j} - \bar{R}_u)^2}} \qquad (11) \]

where $R_{u,i}$ and $R_{u,j}$ denote user $u$'s ratings for items $i$ and $j$, respectively, while $\bar{R}_u$ is user $u$'s average rating. To enhance computational efficiency, especially for large-scale datasets, we precalculate the rating deviation sums and squared sums for each item and user:

\[ \text{sim}'(i,j) = \frac{\text{Dev}(i) \cdot \text{Dev}(j)}{\sqrt{\text{DevS}(i)} \cdot \sqrt{\text{DevS}(j)}} \qquad (12) \]

where the rating deviation sum and squared deviation sum for each item are:

\[ \text{Dev}(i) = \sum_{u \in U_i} (R_{u,i} - \bar{R}_u), \quad \text{DevS}(i) = \sum_{u \in U_i} (R_{u,i} - \bar{R}_u)^2 \qquad (13) \]

This approach reduces the complexity from $O(|U| \cdot |I|^2)$ to $O(|I|^2)$, a significant improvement for large-scale datasets.
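A direct implementation of Eq. (12)–(13) with a single precomputation pass could look like the following sketch; the input format (a list of (user, item, rating) triples) and the helper names are ours.

```python
import math
from collections import defaultdict

def precompute_deviations(ratings):
    """One pass to obtain Dev(i) and DevS(i) from Eq. (13).
    `ratings` is a list of (user, item, rating) triples."""
    user_sum, user_cnt = defaultdict(float), defaultdict(int)
    for u, _, r in ratings:
        user_sum[u] += r
        user_cnt[u] += 1
    user_mean = {u: user_sum[u] / user_cnt[u] for u in user_sum}

    dev, dev_sq = defaultdict(float), defaultdict(float)
    for u, i, r in ratings:
        d = r - user_mean[u]        # rating deviation from the user's average
        dev[i] += d
        dev_sq[i] += d * d
    return dev, dev_sq

def approx_sim(i, j, dev, dev_sq, eps=1e-12):
    """Approximate adjusted cosine similarity sim'(i, j) from Eq. (12)."""
    return dev[i] * dev[j] / (math.sqrt(dev_sq[i] * dev_sq[j]) + eps)
```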

Appendix C Experimental Setups and Implementation Details

Dataset        Sports    Beauty    Toys
#Users         35,598    22,363    19,412
#Items         18,357    12,101    11,924
#Reviews       296,337   198,502   167,597
Sparsity (%)   0.0453    0.0734    0.0724

C.1 Datasets Descriptions and Preprocessing

We conduct extensive experiments on three real-world datasets. The Amazon datasets are collected from the Amazon platform (https://nijianmo.github.io/amazon) and contain user ratings and reviews on 29 categories of products. In this paper, we adopt three of them to evaluate our method: Sports & Outdoors, Beauty, and Toys & Games. We follow [13] and use transaction records between January 1, 2019 and December 31, 2019. Detailed dataset statistics are available in Table 8.

We divide tasks into ratings, explanations, and reviews, adhering to the data-splitting approaches of similar studies[13, 24, 25]. For both sequential and direct recommendation tasks, we adopt the methodology of [49, 13, 31], using the final item in a user’s interaction sequence for testing while carefully structuring the training data to avoid leakage. For rating, explanation, and review task families, we randomly split each dataset into training (80%), validation (10%) and testing (10%) sets, and ensure that there is at least one instance included in the training set for each user and item.

C.2 Baselines

Our approach is compared with a variety of established models, spanning CNN-based to LLM-based frameworks. Caser [36] applies CNNs to capture high-order Markov chains in sequential recommendation. HGN [26] utilizes hierarchical gating networks to model long- and short-term user interests. GRU4Rec [15] employs GRUs for session-based recommendation, representing items with embedding vectors. BERT4Rec [35], S3-Rec [49], and SASRec [19] employ self-attention mechanisms for sequential recommendation, focusing on bidirectional understanding, self-supervised pre-training, and multi-head attention, respectively. FDSA [14] adopts feature-level self-attention to model feature transitions. CL4SRec [43] introduces contrastive learning with data augmentation into sequential recommendation. P5 [13] is a recent method that uses a pretrained Large Language Model (LLM) to unify different recommendation tasks in a single model, learning them with the same language modeling objective during pretraining and serving as a foundation model for various downstream recommendation tasks. Since there is no open-source code for the recent work [30] yet, we implement its key ID construction ourselves in Section 5.4.

In particular, we compare against P5 and its variations [17], equipped with different ID constructions, namely Sequential IDs (SID), Random IDs (RID), and Collaborative IDs (CID), as a benchmark for exploring the impact of different ID strategies. RID assigns each item a random number as its item ID; the number is further tokenized into a sequence of sub-tokens, as done in P5. For example, an item randomly assigned the number "2024" is represented as the token sequence "20" "24". SID is a straightforward way to leverage collaborative information for item indexing, where items interacted with consecutively by a user are assigned consecutive numerical indices, reflecting their co-occurrence. CID employs spectral clustering based on spectral matrix factorization to generate item IDs, based on the premise that items with more frequent co-occurrence are more similar and should share more overlapping tokens in their IDs. The results for all baselines except P5 and its variations are reproduced with open-source code [49].

C.3 Implementation Details

As described in Appendix A, we generate META IDs for the users and items of each dataset; these IDs are used in all experiments below. To construct META IDs, we sample rating-based meta-paths in each dataset, where adjacent users have assigned equal ratings to an item. We set the sampling path length to 64 and train a skip-gram model with a window size of 5 and a learning rate of $1e^{-3}$. The number of embedding clusters is limited to $|G| = 100$ (200 for Toys) to manage the vocabulary size effectively. The OOV token counts are shown in Table 9.

OOV Tokens Size   Sports   Beauty   Toys
RID               0        0        0
SID               0        0        0
CID               448      437      487
IID               18357    12101    11924
META ID           1600     1319     727

C.3.1 Evaluation on Sequential and Direct Recommendation Tasks

We first evaluate META ID on sequential and direct recommendation tasks following [17]. Our implementation first utilizes T5 [28] as the backbone, with around 60.75 million parameters. As mentioned in Section 4.3, we add a linear layer so that the OOV token embeddings undergo an extra linear transformation before the token embedding layer for better initialization, with $\alpha = 0.1$. We also consider the decoder-only LLaMA2-7B [37] with 7B parameters. For tokenization, we use the default SentencePiece tokenizer extended with the OOV tokens for parsing sub-word units. We use the same sequential and direct recommendation prompts as P5 [13] to convert interaction sequences into text.
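One way this initialization could be wired up with a Hugging Face T5 checkpoint is sketched below. It assumes `meta_embs` is a tensor of meta-path-based embeddings aligned with `oov_tokens`, and the function name is ours; treat it as an illustration of the idea (extend the vocabulary, then seed the new rows with α-scaled, linearly projected embeddings) rather than the exact implementation.

```python
import torch
import torch.nn as nn

def add_and_init_oov_tokens(model, tokenizer, oov_tokens, meta_embs, alpha=0.1):
    """Extend the vocabulary with OOV ID tokens and initialize their embedding
    rows from linearly projected meta-path embeddings scaled by alpha."""
    tokenizer.add_tokens(oov_tokens)                    # register the new tokens
    model.resize_token_embeddings(len(tokenizer))       # grow the embedding matrix
    emb = model.get_input_embeddings()                  # nn.Embedding

    proj = nn.Linear(meta_embs.size(1), emb.embedding_dim, bias=False)
    with torch.no_grad():
        ids = tokenizer.convert_tokens_to_ids(oov_tokens)
        emb.weight[ids] = alpha * proj(meta_embs)       # (num_oov, d_model) rows
    return proj                                         # kept so it can be trained further
```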

For LLM fine-tuning, we train T5 for 10 epochs with the AdamW optimizer on two NVIDIA RTX 3090 GPUs, using a batch size of 64 and a peak learning rate of $1e^{-3}$. We apply learning-rate warm-up over the first 5% of training steps and use a maximum input length of 1024 tokens. We use LoRA [16] to fine-tune the token embedding layer and the linear head layer of LLaMA2-7B for 1 epoch with the AdamW optimizer on two NVIDIA RTX A6000 GPUs, using a batch size of 28, a peak learning rate of $1e^{-5}$, a LoRA attention dimension of 16, and an alpha parameter of 32.

For LLM inference, beam search is utilized to generate a list of potential next items, evaluated under the all-item setting. To prevent the generation of non-existent IDs, we apply a constrained decoding method [17] that sets the generation probability of invalid IDs to zero.
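A common way to realize this constraint with Hugging Face generation utilities is a trie over the tokenized valid IDs passed through `prefix_allowed_tokens_fn`. The sketch below simplifies by assuming a T5-style decoder whose first generated position is the decoder start token, and the beam width in the comment is an assumption.

```python
def build_prefix_fn(valid_id_strings, tokenizer):
    """Restrict beam search so that only tokenizations of real META IDs can be generated."""
    trie = {}
    for id_str in valid_id_strings:
        node = trie
        for tok in tokenizer.encode(id_str, add_special_tokens=False):
            node = node.setdefault(tok, {})

    def prefix_allowed_tokens_fn(batch_id, input_ids):
        node = trie
        for tok in input_ids.tolist()[1:]:         # skip the decoder start token (T5)
            node = node.get(tok)
            if node is None:                       # prefix left the trie: only allow EOS
                return [tokenizer.eos_token_id]
        return list(node.keys()) or [tokenizer.eos_token_id]

    return prefix_allowed_tokens_fn

# outputs = model.generate(**inputs, num_beams=20,   # beam width is an assumption
#                          prefix_allowed_tokens_fn=build_prefix_fn(all_item_ids, tokenizer))
```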

Task Type   Metric    Sports                                    Beauty                                    Toys
                      RID      SID      CID      META ID        RID      SID      CID      META ID        RID      SID      CID      META ID
Rating      RMSE      1.0382   1.0486   1.0383   1.0327         1.2829   1.3098   1.2819   1.2818         1.0725   1.0693   1.0766   1.0770
Explan.     BLEU-1    16.2567  16.5825  16.6121  16.9005        18.2299  18.3981  19.3499  19.5106        19.9858  20.2198  20.4570  20.2270
            BLEU-4    2.1782   2.1944   2.2332   2.3481         2.9027   2.8071   3.0626   3.0592         4.3495   4.3701   4.5844   4.4945
Review      BLEU-1    7.6140   7.7948   7.6586   7.8819         6.2282   6.5055   6.5854   7.0500         8.5336   8.0862   7.3846   8.3080
            BLEU-4    2.3228   1.2406   2.4109   2.6546         1.9891   1.2406   1.9718   2.7485         1.1315   1.8366   1.2128   1.7061

C.3.2 Evaluation on Rating, Explanation and Review Tasks

To validate META ID's adaptability, we extend our evaluation to rating prediction, explanation generation, and review tasks, akin to P5. We use the same prompts as P5 [13] to convert all information into training text.

For LLM fine-tuning, we train T5 for 10 epochs with the AdamW optimizer on two NVIDIA RTX 3090 GPUs, using a batch size of 32 and a peak learning rate of $1e^{-3}$. We apply learning-rate warm-up over the first 5% of training steps, with a maximum input length of 512 tokens and a maximum generation length of 64 tokens.

For LLM inference, greedy decoding is applied to the rating prediction, explanation generation, and review tasks.

The full results are shown in Table 10.

Appendix D Additional Experimental Results


Initialization Approaches. We explore the impact of different token initialization methods on the performance of META ID. Since the LLM's vocabulary already includes the numeric tokens used for linguistic IDs, we first ask whether re-initializing these numeric tokens helps the LLM for recommendation. As shown in Figure 7, contrasting randomly initialized numeric tokens (Random Init.) with keeping T5's original token embeddings (Embedding Init.) reveals that random initialization does not enhance performance and is even detrimental on the Sports and Toys datasets. This suggests that the influence of pre-training on these tokens cannot be effectively removed by simple random initialization, making them unsuitable for building IDs and underscoring the importance of introducing extra tokens for IDs in LLM-based recommender systems. We also compare two initialization approaches for META ID: random initialization (Random Init.) and initializing the OOV tokens using the augmentation described in Section 4.2 (Embedding Init.). Our findings show that the latter substantially improves META ID's performance, underlining how critical the token initialization method is for achieving better results.

Visualization of ID-related tokens. Directly applying in-vocabulary tokens to construct IDs (RID and SID) yields poor performance on the Toys dataset. In Figure 8a, we use t-SNE to visualize the ID token embeddings and observe that these tokens tend to be homogeneous, whereas CID and META ID, which use OOV tokens to construct IDs, exhibit a wider distribution, reflecting the difference and diversity of their representations.

To further illustrate the impact on representation, we visualize the attention patterns during sequential recommendation generation in Figure 8b. SID leads to uniform attention that does not distinguish between different item and user IDs. In contrast, META ID shows distinct attention patterns, successfully differentiating items and emphasizing user IDs, thereby allowing the model to capture more personalized and distinctive information.


Datasets   HR@5               NDCG@5             HR@10              NDCG@10
Beauty     0.0510 ± 0.00038   0.0351 ± 0.00044   0.0752 ± 0.00131   0.0429 ± 0.00075
Sports     0.0322 ± 0.00061   0.0223 ± 0.00060   0.0487 ± 0.00073   0.0277 ± 0.00055
Toys       0.0503 ± 0.00091   0.0352 ± 0.00067   0.0742 ± 0.00138   0.0429 ± 0.00078

Models        per user               per user-item pair       per review
              Sequential   Direct    Rating   Explanation      Summarization   Preference
META ID (T)   74.05        68.60     5.21     17.28            9.67            8.55

Statistics on Training & Inference Time. We provide statistics on the training and inference time of the P5-style models, collected on the Toys dataset. As mentioned in Appendix C, we train and test our models on two RTX 3090 GPUs. For training on the sequential and direct recommendation tasks, the T5 model takes 3.5 hours to finish training. The average inference time of the T5 model on different tasks is presented in Table 12. Sequential and direct recommendation require much longer inference time than the other tasks because of the beam search step. Overall, inference is fast, and it is promising to further reduce training and inference time with efficient Transformer techniques.

Appendix E Discussions

There are two promising directions for META ID. First, META ID relies on a fixed set of users and items, whereas newly appearing items and users have no interaction history; this could be addressed with methods for the cold-start problem. Second, META ID uses two-level tokens to construct IDs, while a deeper hierarchical structure could be considered; META ID could then scale to modern recommender systems containing trillions of users and items.
