Inducing Generalization across Languages and Tasks using Featurized Low-Rank Mixtures (2024)

Chu-Cheng Lin*, Xinyi Wang*, Jonathan H. Clark, Han Lu, Yun Zhu, Chenxi Whitehouse, Hongkun Yu

Google

*Equal contribution. {kitsing,xinyiwang}@google.com

Abstract

Adapting pretrained large language models (LLMs) to various downstream tasks in tens or hundreds of human languages is computationally expensive. Parameter-efficient fine-tuning (PEFT) significantly reduces the adaptation cost by tuning only a small number of parameters. However, common PEFT methods such as LoRA (Hu et al., 2022) suffer from suboptimal performance on diverse dataset mixtures, due to aggressive parameter tying and negative interference among different datasets. In this work, we propose Featurized Low-rank Mixtures (FLix), a novel PEFT method designed for effective multitask multilingual adaptation. FLix associates each unique dataset feature, such as the dataset's language or task, with its own low-rank weight update parameters. By composing feature-specific parameters for each dataset, FLix can accommodate diverse dataset mixtures and generalize better to unseen datasets. Our experiments show that FLix leads to significant improvements across a variety of tasks in both supervised learning and zero-shot settings, with gains of up to 14.2 exact-match points in zero-shot semantic parsing.

1 Introduction

Large language models (LLMs) have shown impressive performance on various real-world applications in many different human languages (Brown et al., 2020; Soltan et al., 2022; Google et al., 2023). While there have been notable successes in aligning an LLM to become a generalist that can follow human instructions to perform different tasks (Ouyang et al., 2022), there is also significant interest in adapting an LLM into specialists, each of which works on a specific task that is known a priori.

Intuitively, LLM adaptation can be done by continued training (or fine-tuning) on target languages and datasets. However, fine-tuning all model parameters on every dataset can be computationally and financially prohibitive. Parameter-efficient fine-tuning (PEFT) methods such as LoRA (Hu et al., 2022) and prompt tuning (Lester et al., 2021) reduce the computational costs of adapting LLMs to a downstream task. They parametrize LLM fine-tuning with a small set of trainable parameters, keeping the majority of LLM parameters frozen.

While PEFT has been widely used to adapt LLMs to a single dataset, few prior works have studied best practices for adapting the model jointly to many different use cases. Vu et al. (2022) proposed adding a multilingual pretraining stage to prompt tuning, using multilingual unlabeled data to improve zero-shot summarization. Wang et al. (2023b) proposed a multitask prompt tuning method that learns a single soft prompt which can then be adapted to other target tasks. However, these methods only consider either multilingual or multitask datasets, while we generally would like to adapt the model such that it generalizes along multiple axes (i.e., tasks and languages). Moreover, they require multiple tuning stages, which limits their applicability in practice.

In this paper, we propose Featurized Low-rank Mixtures (FLix), an extension of LoRA for modeling diverse dataset mixtures. Compared to LoRA, which applies the same low-rank adaptation to all inputs, FLix parametrizes the weight updates to decompose linearly as a sum of feature-specific low-rank adaptations, each associated with an active dataset feature, such as language or task ID. Under FLix, different adaptations can be learned for different features. The compositional nature of FLix also provides an inductive bias for learning generalizable adaptations. Moreover, FLix is generally computationally efficient: it only needs to activate a tiny fraction of its trainable parameters for each input, making both tuning and deployment efficient. FLix is related to prior works on sub-network composition (Lin et al., 2021; Ilharco et al., 2022), which show that models fine-tuned on different tasks can be composed together into a single model.

In this article, we contribute:

  • a modeling formulation that disentangles learning of tasks and languages in a way that improves generalization quality while remaining highly efficient;

  • experimental evidence that the model improves generalization quality across four very different tasks: named entity recognition, semantic parsing, in-language question answering, and cross-lingual question answering; and

  • evidence for the hypothesis that imbuing models with meta-data such as task and language improves quality, even when compared with powerful modern adaptation methods such as LoRA.

The rest of this paper is structured as follows: In §2, we first formulate the multitask multilingual learning setting and discuss the challenges associated with current PEFT methods. In §3, we describe the design of FLix and how FLix adapts to zero-shot scenarios. We also propose to train FLix with feature dropout, which encourages positive transfer and generalization. We evaluate FLix on joint multilingual multitask tuning and zero-shot generalization (§5.1), as well as on multitask-only or multilingual-only learning (§5.2). The experimental results and ablations show that FLix brings significant improvements over standard PEFT methods in all settings, and that it is especially effective on very diverse training mixtures and for zero-shot generalization (§5, §6).

[Figure 1: An overview of the training and inference processes of FLix (two panels).]

2 Multitask multilingual learning

We consider a multitask learning (Caruana, 1997) setting where datasets are indexed using task-language tuples. We assume that there are $N$ tasks $\{w_1 \ldots w_N\}$ and $M$ languages $\{\ell_1 \ldots \ell_M\}$, and our goal is to serve fine-tuned LLMs for each of these $N \cdot M$ task-language combinations. It is worth noting that two special cases of this setup are multilingual modeling ($N = 1$) and multitask modeling ($M = 1$).

Parameter-efficient tuning (PEFT) is a popular way to adapt a pretrained LLM to downstream tasks without incurring large computational costs. In this work, we want to use PEFT to support $O(M \cdot N)$ languages and tasks.

2.1 Low-rank Adaptation

We focus on Low-rank Adaptation (LoRA) (Hu et al., 2022), an effective PEFT method that incurs minimal additional inference cost. Let $W_0 \in \mathbb{R}^{d \times k}$ be a pretrained weight matrix. LoRA keeps the much larger weight matrix $W_0$ unchanged, and instead only optimizes the low-rank factorized weight adaptation matrix $\Delta W = \phi^A \phi^B$. The final weight matrix is

$$W = W_0 + \Delta W = W_0 + \phi^A \phi^B, \qquad (1)$$

where $\phi^A \in \mathbb{R}^{d \times r}$ and $\phi^B \in \mathbb{R}^{r \times k}$. Empirically, LoRA often compares competitively with full fine-tuning, even when $r \ll \min(d, k)$. LoRA can thus significantly reduce the number of trainable parameters.
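To make the factorization concrete, here is a minimal NumPy sketch of a LoRA-adapted linear map. The sizes d, k, and r are illustrative choices, not values from the paper; as in LoRA, one factor is zero-initialized so that training starts exactly from the pretrained weights.

```python
import numpy as np

d, k, r = 1024, 1024, 6                  # illustrative sizes; r << min(d, k)

rng = np.random.default_rng(0)
W0 = rng.normal(size=(d, k))             # frozen pretrained weight
phi_A = 0.01 * rng.normal(size=(d, r))   # trainable low-rank factor A
phi_B = np.zeros((r, k))                 # trainable factor B; zero init => W = W0 at step 0

def lora_forward(x):
    """Apply W = W0 + phi_A @ phi_B (equation 1) to a batch of inputs."""
    return x @ (W0 + phi_A @ phi_B)

x = rng.normal(size=(2, d))
print(lora_forward(x).shape)             # (2, k)
print(r * (d + k), "trainable vs.", d * k, "frozen parameters")
```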

2.2 Challenges

PEFT methods such as LoRA have been shown to be very effective for tuning LLMs, achieving results comparable to full fine-tuning while incurring only a fraction of the computational cost. However, multitask multilingual learning poses several challenges that current PEFT methods might not address.

Interference among different datasets

Multitask multilingual training with PEFT requires one to fine-tune an LLM on a diverse data mixture, from up to $N \cdot M$ different datasets spanning $N$ tasks and $M$ languages. This approach has significantly lower overhead than modeling each dataset individually, and allows positive transfer. However, training a single set of PEFT parameters over all datasets could also lead to negative interference among tasks and languages that are dissimilar.

Generalizing to unseen tasks and languages

Publicly available wide-coverage multitask and multilingual datasets are often incomplete. For example, many task-language combinations are missing in XTREME-UP. Moreover, some underrepresented languages may be missing from such datasets altogether. Standard PEFT methods could have difficulty generalizing to unseen task-language combinations and unseen languages: they simply optimize a single set of parameters on all tasks, without explicit modeling assumptions that capture the relationships and similarities between different datasets. It is also unclear how to transform such PEFT parameters to a new task or language at inference time.

3 Featurized Low-Rank Mixtures

We propose Featurized Low-Rank Mixtures (FLix), an effective multitask PEFT method that supports diverse training data mixtures and excels at zero-shot generalization to new task-language combinations. Under FLix, NLP tasks and languages are featurized as discrete features, and each feature is associated with a low-rank weight update matrix. Figure 1 shows the training and inference processes of FLix.

3.1 Model Architecture

Given a diverse data mixture of $N$ tasks, where each task may appear in $M$ languages, we first define a set of $D = N + M$ features, where each feature can represent a task, a language, or any other data attribute. We assign a low-rank factorized weight matrix $\phi^A_i \phi^B_i$ to each feature $i \in [1, D]$.

Let $\mathbf{x}$ be an input to the model, and let $f(\mathbf{x})$ represent the features of $\mathbf{x}$, where $f_i(\mathbf{x}) = 1$ indicates that $\mathbf{x}$ has feature $i$, and $f_i(\mathbf{x}) = 0$ otherwise. The input-dependent weight is then

$$W(\mathbf{x}) = W_0 + \sum_{i=1}^{D} f_i(\mathbf{x}) \, \phi^A_i \phi^B_i, \qquad (2)$$

where $\phi^A_i \in \mathbb{R}^{d \times r_i}$, $\phi^B_i \in \mathbb{R}^{r_i \times k}$, and $r_i$ is the maximum rank of the $i$-th feature's adaptation matrix. Note that compared to LoRA in equation 1, which applies the same $\Delta W$ (and therefore the same $W$) to all inputs, FLix uses different adaptation matrices based on the features $f(\mathbf{x})$ of the input data.
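As a sketch of equation 2 (NumPy again; the feature names, ranks, and sizes below are hypothetical illustrations, not the paper's configuration), the input-dependent weight sums the low-rank updates of exactly the active features:

```python
import numpy as np

d, k = 1024, 1024
# Hypothetical feature vocabulary: task features and language features.
ranks = {"task:qa": 2, "task:ner": 2, "lang:sw": 4, "lang:bn": 4}

rng = np.random.default_rng(0)
W0 = rng.normal(size=(d, k))  # frozen pretrained weight
phi_A = {i: 0.01 * rng.normal(size=(d, r)) for i, r in ranks.items()}
phi_B = {i: np.zeros((r, k)) for i, r in ranks.items()}

def flix_weight(active_features):
    """W(x) = W0 + sum_i f_i(x) * phi_A[i] @ phi_B[i] (equation 2).
    Only the adapters of active features are touched, so per-input
    compute stays constant as the total number of features D grows."""
    return W0 + sum(phi_A[i] @ phi_B[i] for i in active_features)

# A Swahili QA input activates exactly its task feature and its language feature.
W = flix_weight({"task:qa", "lang:sw"})
```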

Feature dropout.

One potential problem with training FLix is that the model might become overly reliant on the feature annotations of the training datasets, limiting positive transfer and making the model brittle to inputs that differ from the training distribution. We therefore randomly turn off a subset of active features at training time, with a predetermined feature dropout probability. Experiments in §6 show that feature dropout brings consistent gains to FLix.
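The sketch below shows one plausible reading of this regularizer; independently masking each active feature with probability p is our assumption, as the text does not pin down the exact masking scheme:

```python
import numpy as np

rng = np.random.default_rng(0)

def feature_dropout(active_features, p=0.7):
    """At training time, independently drop each active feature with
    probability p, so the model cannot over-rely on any single
    dataset annotation."""
    return {f for f in active_features if rng.random() >= p}

# With p = 0.7, a {task, language} pair often shrinks to one feature or none,
# pushing the shared pretrained weights to carry more of the behavior.
print(feature_dropout({"task:qa", "lang:sw"}))
```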

Exploiting feature sparsity for low training and serving costs.

Note that equation 2 implies that, on input $\mathbf{x}$, $W(\mathbf{x})$ is not a function of $\{\phi^A_i, \phi^B_i\}$ whenever $f_i(\mathbf{x}) = 0$. In other words, the weights of features unused by input $\mathbf{x}$ are not needed at either training or serving time. While the number of trainable parameters under FLix grows linearly with the number of features $D$, the feature count of each input $\mathbf{x}$ remains constant, and it is a relatively small value in our multitask multilingual learning settings. Therefore, the compute cost of FLix can remain constant when scaling to increasingly many tasks and languages.

3.2 Zero-shot Composition

We find FLix to be particularly effective at zero-shot generalization, likely because of its explicit feature composition assumption. While previous work proposed using language-specific modules at pretraining time to enhance cross-lingual transfer (Vu et al., 2022; Pfeiffer et al., 2023), our work shows that a sparse, modularized PEFT architecture is also effective for directly adapting dense pretrained models to unseen datasets.

In this paper, we consider how FLix adapts to unseen datasets in two different zero-shot scenarios:

Unseen combinations.

We want to do inference on a dataset whose combination of active features did not appear in the training mixture. For example, say the training data mixture contains QA data in French and semantic parsing data in Bengali, and we want to test the model on QA in Bengali. FLix naturally supports such feature combinations; no change is required when applying equation 2.

Unseen languages.

The test data could have a subset of features that are not covered by any of the datasets in the training mixture. Specifically, we focus on the setting where the test data is from a new language unseen during training. In this case, we only use the task feature of the data to calculate the model weights in equation 2.

In §5, we show that FLix significantly outperforms the baselines for both types of zero-shot settings.
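Continuing the flix_weight sketch from §3.1 (feature names remain hypothetical), the two scenarios differ only in which feature set is activated at inference time:

```python
# Unseen combination: QA and Bengali were each seen at training time,
# just never together, so both trained adapters are composed directly.
W_unseen_combination = flix_weight({"task:qa", "lang:bn"})

# Unseen language: no adapter exists for the test language,
# so only the task feature contributes to the weight update.
W_unseen_language = flix_weight({"task:qa"})
```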

4 Experiments

For all experiments, we use instruction fine-tuned PaLM 2 small (Google et al., 2023) as the base model. We evaluate our method on a variety of tasks, languages, and data mixtures to verify that it generalizes to different use cases.

Datasets.

We use the XTREME-UP dataset (Ruder et al., 2023) and format the input data with the language and task information (Wang et al., 2023a). We use different language and task subsets from XTREME-UP. Dataset details are described in §5; a list of all experiments' training and evaluation tasks can also be found in Appendix B.

Metrics.

We report F1 scores for the in-language QA, cross-lingual QA, and NER tasks. For semantic parsing, we report exact-match accuracy. All numbers are normalized to $[0, 100]$. We report average normalized metric scores for (monolingual) multitask experiments.

Hyperparameters.

We use a batch size of 512 during training for all experiments. For FLix, we set the feature dropout probability to 0.7.

Ranks.

Compute-matched LoRA baselines have rank-6 adaptation matrices in all experiments (strictly speaking, their ranks are $\leq 6$; we use this terminology loosely to reduce clutter). In this paper, every dataset has both task and language features, and we allocate feature-specific parameter counts such that the trainable parameters of active features under FLix always match the compute-matched baselines. More specifically, we let task and language features have either rank-2 or rank-4 adaptation matrices. We allocate smaller (rank-2) matrices to task features in multitask experiments (§5.2), and larger ones in both multilingual (§5.2) and joint multitask-multilingual (§5.1) experiments. We adjust the ranks of the language features' adaptation matrices accordingly, to ensure that every dataset's adaptation matrices have a total rank of 6.

Model selection.

We evaluate on validation splits every 200 iterations, and train for a maximum of 2000 steps. For every multilingual and multitask setting, we choose the checkpoint with the best average metric across languages or tasks, and subsequently evaluate on the test splits.

Baselines.

We compare to the vanilla LoRA method under several different settings to ensure a fair comparison:

  • Compute-matched sets the rank $r$ of the vanilla LoRA model to be equivalent to the maximum sum of the ranks of feature-specific adaptations under FLix. This ensures that LoRA and FLix use comparable computation during training and inference.

  • Param-matched sets the rank $r$ of the vanilla LoRA model such that the total number of trainable parameters is the same as its FLix counterpart. (The number of trainable parameters in FLix grows linearly with the number of features in a dataset mixture; the param-matched counterpart's rank can therefore be a relatively large number, from around 20 to 90, depending on the task.) A worked example of both matching schemes follows this list.
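As a worked example of the two matching schemes (the layer sizes and mixture counts below are hypothetical; only the per-dataset rank budget of 6 follows the Ranks paragraph above):

```python
d = k = 1024                     # hypothetical layer size
task_rank, lang_rank = 4, 2      # each dataset activates one task and one language feature
n_tasks, n_langs = 5, 30         # hypothetical mixture: 5 tasks, 30 languages

per_rank_params = d + k          # a rank-1 factor pair contributes d + k parameters

# Compute-matched: vanilla LoRA whose rank equals the sum of ranks active per input.
compute_matched_rank = task_rank + lang_rank                         # = 6

# Param-matched: vanilla LoRA whose trainable parameter count equals FLix's total.
flix_total_params = (n_tasks * task_rank + n_langs * lang_rank) * per_rank_params
param_matched_rank = flix_total_params // per_rank_params            # = 5*4 + 30*2 = 80
print(compute_matched_rank, param_matched_rank)
```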

5 Results

5.1 Study A: Joint Multitask Multilingual Learning

Table 1: Test results for joint multitask multilingual tuning (first four columns) and zero-shot generalization (rightmost three columns).

Method | NER | Semantic Parsing | QA InLang | QA CrossLang | Zero-shot Unseen Comb: QA CrossLang | Zero-shot Unseen Comb: Semantic Parsing | Zero-shot Unseen Lang: QA CrossLang | Avg.
Compute-matched | 76.3 | 35.4 | 86.6 | 83.1 | 82.1 | 17.7 | 74.6 | 65.1
Param-matched | 81.6 | 47.1 | 87.3 | 83.2 | 80.0 | 28.4 | 74.7 | 68.9
FLix | 84.0 | 45.6 | 89.4 | 85.2 | 82.8 | 42.6 | 77.2 | 72.4

In this section, we evaluate the performance of FLix and baselines on a diverse data mixture, where the tuning data contain different languages and tasks.

5.1.1 Multitask multilingual tuning

We conduct multitask multilingual tuning with both vanilla LoRA and our proposed FLix model. This setting shows one of the primary strengths of FLix: the ability to simultaneously generalize along multiple axes (tasks and languages) without suffering from negative task interference. We evaluate on four tasks from XTREME-UP covering a wide variety of use cases and languages: cross-lingual QA, in-language QA, semantic parsing, and NER. In addition, we add training data from machine translation to the data mixture, since it has the best language coverage, which allows cross-lingual transfer. We use all languages in in-language QA, semantic parsing, and NER, and a subset of languages in cross-lingual QA as training data. (We only use the languages included in the original XOR-TyDi QA (Asai et al., 2021): Arabic, Bengali, Japanese, Finnish, Korean, Russian, and Telugu. The rest of the cross-lingual QA languages, all Indic languages, are evaluated for zero-shot generalization in §5.1.2.) In addition, we include the corresponding machine translation datasets of the languages from these 4 datasets.

FLix significantly outperforms baselines under multitask multilingual tuning.

The overall results are listed in Table 1. We can see that FLix outperforms the best LoRA baseline on all tasks other than semantic parsing. While it loses to param-matched LoRA on semantic parsing, FLix has a significantly lower computational cost in comparison, and it significantly outperforms the vanilla LoRA with the same computational cost.

5.1.2 Zero-shot Generalization

To evaluate both types of zero-shot generalization (§3.2), we prepare two different training datasets:

  • Holding out languages in cross-lingual QA. We reuse the training dataset from the joint multitask multilingual learning setup (§5.1) to evaluate both unseen feature combinations and unseen languages in the cross-lingual QA task. For unseen combinations, we evaluate on the set of held-out languages that were present in other multilingual tasks in the training dataset; for unseen languages, we evaluate on the set of languages that do not appear in any other multilingual task in the training dataset.

  • Holding out languages in semantic parsing. In this scenario, we use portions of the cross-lingual QA, in-language QA, semantic parsing, NER, and machine translation datasets from XTREME-UP as our training dataset. We include the full cross-lingual QA, in-language QA, and NER datasets, but we do not include underrepresented languages in the semantic parsing portion of the training data. We also only include machine translation portions of languages that are already available in the other 4 subsets, as in §5.1. We then evaluate the unseen-combination performance on all held-out languages for semantic parsing. (These held-out languages are unseen combinations, rather than unseen languages, since they are available in the machine translation datasets of XTREME-UP.)

Our method is very effective at both types of zero-shot generalization.

The comparison of FLix and the baseline methods can be found in the rightmost 3 columns of Table 1. We can see that FLix brings significant improvements on both unseen combinations and unseen languages, for both cross-lingual QA and semantic parsing. This is likely because FLix lets one select the subset of parameters most relevant to the features of the new test data, enabling more effective zero-shot generalization.

5.2 Study B: Multitask or multilingual learning

In this section, we examine the performance of our method and the baselines on data mixtures with a single type of feature. That is, we train and evaluate on datasets of a single task in several different languages (multilingual learning), or datasets of a single language with different tasks (multitask learning). We also add additional baselines for this setting, where we train a separate LoRA model for each individual dataset (denoted as Single-Lang and Single-Task in Table 2). This approach alleviates the capacity constraint of vanilla LoRA on diverse datasets, but adds more engineering overhead (as we briefly noted in §1) and cannot leverage positive transfer between datasets.

Specifically, we use subsets of the XTREME-UP dataset to construct these datasets:

  • Multilingual learning. We experiment on four tasks: NER, semantic parsing, in-language QA, and cross-lingual QA. For each task, we train and evaluate on all languages available in the XTREME-UP dataset. Unlike §5.1, we train a separate multilingual model for each task in this setting.

  • Multitask learning. We evaluate on two under-represented languages: Swahili and Bengali. We use the subset of tasks available for each language in XTREME-UP. For Swahili, we use semantic parsing, QA (in-language), and NER. For Bengali, we use semantic parsing and QA (both in-language and cross-lingual). (The mismatch in task choices between these two languages is due to the sparsity of available datasets in XTREME-UP.) Unlike §5.1, we train a separate multitask model for each language in this setting.

Table 2: Test results for multilingual learning (NER, Semantic Parsing, QA InLang, QA CrossLang; one model per task) and multitask learning (Swahili, Bengali; one model per language).

Method | NER | Semantic Parsing | QA InLang | QA CrossLang | Swahili | Bengali | Avg.
Compute-matched | 81.3 | 40.5 | 88.5 | 77.3 | 66.0 | 63.8 | 69.6
Param-matched | 83.2 | 47.2 | 89.0 | 76.2 | 66.8 | 65.8 | 71.3
FLix | 84.3 | 45.0 | 89.4 | 77.6 | 70.2 | 69.0 | 72.6

No baseline method consistently outperforms the others.

We report the performance of our method and the baselines in Table 2. For each experiment, we list the average result over all languages or tasks in the datasets. First, we find that no baseline method consistently outperforms the others. Specifically, param-matched LoRA tends to have an advantage in multitask learning and on tasks that are very different from pretraining (semantic parsing and NER), while compute-matched LoRA appears to be superior on multilingual QA tasks. We suspect that param-matched LoRA benefits from its higher capacity when learning to generate targets that are very different from the pretraining data, which also helps it support generation for diverse target tasks. However, param-matched LoRA has a significantly higher number of trainable parameters, which could lead to much higher computational overhead than compute-matched LoRA. Moreover, we observe that compute-matched LoRA is actually more competitive on tasks such as cross-lingual QA, likely because it reduces over-fitting by tuning a much smaller number of parameters.

FLix achieves much better performance than vanilla LoRA with the same computation budget.

FLix consistently outperforms the compute-matched LoRA baseline in all settings, achieving significant gains without adding serving cost. Furthermore, our proposed method also outperforms param-matched LoRA on five of the six data mixtures we evaluated.

We hypothesize that param-matched LoRA is better than FLix at semantic parsing, but worse at other tasks, because semantic parsing requires the LLM to generate structured outputs that are very different from the pretraining data distribution, which might require particularly large model capacity to learn. In fact, Table 2 shows that param-matched LoRA, with its higher rank, is much better than compute-matched LoRA at semantic parsing, while being worse or comparable on the question answering tasks. While param-matched LoRA could be particularly helpful for the semantic parsing task, it requires more computational resources to scale to a large number of datasets. These results indicate that our method is an effective and computationally efficient strategy for tuning LLMs on diverse data mixtures.

6 Analysis and Ablations

6.1 Effect of feature dropout

Table 3: Effect of feature dropout strength on multilingual in-language QA.

Feature dropout strength | Validation | Test
p = 0.7 | 87.6 | 89.3
p = 0.5 | 87.3 (-0.3) | 88.8 (-0.5)
p = 0.3 | 87.1 (-0.5) | 88.7 (-0.6)
p = 0.0 | 86.9 (-0.7) | 88.9 (-0.4)

In this section, we evaluate the effectiveness of feature dropout for FLix. We compare the performance of FLix with and without feature dropout on the in-language QA task using multilingual training; the results are in Table 3. We can see that removing feature dropout leads to a significant performance drop on both the validation and test sets. We hypothesize that this is because feature dropout is an effective regularizer that encourages FLix to utilize the shared parameters for positive transfer between tasks and languages.

Table 4: Effect of reducing the rank of FLix's feature-specific adaptation matrices.

Rank | QA CrossLang | QA InLang | Semantic Parsing | NER
2 | 77.6 | 89.4 | 45.0 | 84.3
1 | 76.5 (-1.1) | 89.3 (-0.1) | 45.2 (+0.2) | 83.1 (-1.2)

6.2 Effect of rank for FLix

In the multilingual experiments (Table 2), the language features only have rank-2 feature-specific weight update matrices. Such low-rank configurations help reduce the overall parameter count. In this section, we examine how well FLix performs under an even smaller budget. Table 4 shows the results of FLix on multilingual tuning with the rank set to 2 and 1. We can see that reducing the capacity of the feature-specific weight update matrices leads to only a small drop in performance on most tasks, indicating that FLix can be effective even under restrictive computational requirements.

6.3 FLix performs increasingly better on a diverse training dataset

[Figure 2: Change in task performance when moving from per-task multilingual training to the joint multitask multilingual mixture, for FLix and the LoRA baselines.]

In §5 we examined the performance of FLix and the baselines under each data mixture. Here we compare how the different methods perform as the training mixture becomes increasingly diverse. Specifically, we compare the change in task performance when using the multitask multilingual mixture, as opposed to using only the multilingual data for each task. Figure 2 shows the results of FLix and the baselines. We can see that vanilla LoRA suffers from negative transfer (Wang et al., 2018). In particular, the compute-matched version with a smaller rank shows a significant decrease in performance on the diverse multitask multilingual training dataset. On the other hand, FLix generally maintains similar or slightly higher performance when using the more diverse data mixture. This result indicates that FLix is a superior PEFT method for scaling to large numbers of tasks and languages.

7 Related Work

Multilingual/multitask PEFT

While most prior work on parameter-efficient tuning focuses on tuning the model for a single task, some recent works propose post-hoc composition of pretrained PEFT modules to generalize to new tasks or languages (Pfeiffer et al., 2020; Huang et al., 2023; Wang et al., 2021; Chronopoulou et al., 2023). These methods generally require separate training runs for each dataset, while our work proposes a parameter-efficient tuning method that adapts the LLM on a diverse data mixture in a single training run. Vu et al. (2022) proposed adding a multilingual pretraining stage to prompt tuning, which shows some improvements in zero-shot generalization when adapting to a cross-lingual summarization task. In comparison, our proposed FLix focuses on tuning with multilingual data on many downstream tasks, without additional training on unlabeled pretraining data. Wang et al. (2023b) proposed a multitask prompt tuning method that learns a single prompt which can be adapted to many other target tasks. Similarly, this method requires multiple training stages, while our method trains end-to-end. Both of these methods are built upon prompt tuning, using either multilingual training on a single task or multitask tuning in English, while our method supports diverse datasets spanning different tasks, languages, or any other arbitrary features.

Mixture-of-experts Models

Mixture-of-experts models (MoEs) (Shazeer et al., 2017; Lepikhin et al., 2020; Jawahar et al., 2023) are effective at increasing model capacity by adding multiple expert parameters that can be activated differently to support different inputs. This architecture has been used to improve both pretrained models (Lepikhin et al., 2020; Jawahar et al., 2023) and parameter-efficient tuning methods (Zadouri et al., 2023; Zhu et al., 2023). Since MoEs often add computational cost, many works try to reduce the cost and improve the effectiveness of the model through either task-level routing (e.g., the Task MoEs proposed by Kudugunta et al. (2021)) or by encouraging sparsity among the experts (Shazeer et al., 2017; Lepikhin et al., 2020). FLix resembles Task MoEs in that FLix leverages task information as well, but FLix has additional composition capabilities thanks to its featurization.

Modularized Pretrained Models

Previous work proposed adding language-specific parameters to multilingual pretrained models (Pfeiffer et al., 2022; 2023) or machine translation models (Zhang et al., 2020). These works showed that language-specific modules often bring significant improvements to multilingual tasks, and that they are especially helpful for zero-shot cross-lingual transfer. While prior works added language-specific modules during multilingual pretraining, in this paper we focus on the problem of adapting a pretrained model to a diverse mixture with many tasks and languages.

8 Future Work

There are several promising future directions for our work. While FLix achieves good zero-shot performance, it could potentially benefit from methods that automatically learn to select and compose parameters trained on different features for unseen data. It would also be interesting to examine other applications of FLix: we encoded task and language information as features in this work, but other properties, such as modality, could also be featurized under FLix.

9 Conclusion

In this paper, we propose Featurized Low-Rank Mixtures (FLix), an effective parameter-efficient tuning method for fine-tuning pretrained LLMs on data mixtures containing datasets in diverse tasks and languages. Our experiments show that FLix leads to significantly better performance in multitask multilingual fine-tuning compared to standard LoRA, with little computational overhead. We also find that FLix achieves much better zero-shot generalization to new languages and to task-language combinations unseen at training time.

References

  • Asai et al. (2021) Akari Asai, Jungo Kasai, Jonathan Clark, Kenton Lee, Eunsol Choi, and Hannaneh Hajishirzi. XOR QA: Cross-lingual open-retrieval question answering. In Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, and Yichao Zhou (eds.), NAACL, Online, June 2021. Association for Computational Linguistics.
  • Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 1877–1901. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
  • Caruana (1997) Rich Caruana. Multitask learning. Machine Learning, 28:41–75, 1997. URL https://api.semanticscholar.org/CorpusID:45998148.
  • Chronopoulou et al. (2023) Alexandra Chronopoulou, Jonas Pfeiffer, Joshua Maynez, Xinyi Wang, Sebastian Ruder, and Priyanka Agrawal. Language and task arithmetic with parameter-efficient layers for zero-shot summarization, 2023.
  • Google et al. (2023) Google, Rohan Anil, Andrew M. Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, Eric Chu, Jonathan H. Clark, Laurent El Shafey, Yanping Huang, Kathy Meier-Hellstern, Gaurav Mishra, Erica Moreira, Mark Omernick, Kevin Robinson, Sebastian Ruder, Yi Tay, Kefan Xiao, Yuanzhong Xu, Yujing Zhang, Gustavo Hernandez Abrego, Junwhan Ahn, Jacob Austin, Paul Barham, Jan Botha, James Bradbury, Siddhartha Brahma, Kevin Brooks, Michele Catasta, Yong Cheng, Colin Cherry, Christopher A. Choquette-Choo, Aakanksha Chowdhery, Clément Crepy, Shachi Dave, Mostafa Dehghani, Sunipa Dev, Jacob Devlin, Mark Díaz, Nan Du, Ethan Dyer, Vlad Feinberg, Fangxiaoyu Feng, Vlad Fienber, Markus Freitag, Xavier Garcia, Sebastian Gehrmann, Lucas Gonzalez, Guy Gur-Ari, Steven Hand, Hadi Hashemi, Le Hou, Joshua Howland, Andrea Hu, Jeffrey Hui, Jeremy Hurwitz, Michael Isard, Abe Ittycheriah, Matthew Jagielski, Wenhao Jia, Kathleen Kenealy, Maxim Krikun, Sneha Kudugunta, Chang Lan, Katherine Lee, Benjamin Lee, Eric Li, Music Li, Wei Li, YaGuang Li, Jian Li, Hyeontaek Lim, Hanzhao Lin, Zhongtao Liu, Frederick Liu, Marcello Maggioni, Aroma Mahendru, Joshua Maynez, Vedant Misra, Maysam Moussalem, Zachary Nado, John Nham, Eric Ni, Andrew Nystrom, Alicia Parrish, Marie Pellat, Martin Polacek, Alex Polozov, Reiner Pope, Siyuan Qiao, Emily Reif, Bryan Richter, Parker Riley, Alex Castro Ros, Aurko Roy, Brennan Saeta, Rajkumar Samuel, Renee Shelby, Ambrose Slone, Daniel Smilkov, David R. So, Daniel Sohn, Simon Tokumine, Dasha Valter, Vijay Vasudevan, Kiran Vodrahalli, Xuezhi Wang, Pidong Wang, Zirui Wang, Tao Wang, John Wieting, Yuhuai Wu, Kelvin Xu, Yunhan Xu, Linting Xue, Pengcheng Yin, Jiahui Yu, Qiao Zhang, Steven Zheng, Ce Zheng, Weikang Zhou, Denny Zhou, Slav Petrov, and Yonghui Wu. PaLM 2 technical report, 2023.
  • Hu et al. (2022) Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=nZeVKeeFYf9.
  • Huang et al. (2023) Chengsong Huang, Qian Liu, Bill Yuchen Lin, Tianyu Pang, Chao Du, and Min Lin. LoraHub: Efficient cross-task generalization via dynamic LoRA composition. arXiv preprint arXiv:2307.13269, 2023.
  • Ilharco et al. (2022) Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Suchin Gururangan, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. Editing models with task arithmetic. arXiv preprint arXiv:2212.04089, 2022.
  • Jawahar et al. (2023) Ganesh Jawahar, Subhabrata Mukherjee, Xiaodong Liu, Young Jin Kim, Muhammad Abdul-Mageed, Laks Lakshmanan, V.S., Ahmed Hassan Awadallah, Sebastien Bubeck, and Jianfeng Gao. AutoMoE: Heterogeneous mixture-of-experts with adaptive computation for efficient neural machine translation. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (eds.), Findings of the Association for Computational Linguistics: ACL 2023, pp. 9116–9132, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-acl.580. URL https://aclanthology.org/2023.findings-acl.580.
  • Kudugunta et al. (2021) Sneha Kudugunta, Yanping Huang, Ankur Bapna, Maxim Krikun, Dmitry Lepikhin, Minh-Thang Luong, and Orhan Firat. Beyond distillation: Task-level mixture-of-experts for efficient inference. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (eds.), Findings of the Association for Computational Linguistics: EMNLP 2021, pp. 3577–3599, Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.findings-emnlp.304. URL https://aclanthology.org/2021.findings-emnlp.304.
  • Lepikhin et al. (2020) Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. GShard: Scaling giant models with conditional computation and automatic sharding, 2020.
  • Lester et al. (2021) Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. URL https://aclanthology.org/2021.emnlp-main.243.
  • Lin et al. (2021) Zehui Lin, Liwei Wu, Mingxuan Wang, and Lei Li. Learning language specific sub-network for multilingual machine translation. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli (eds.), ACL, Online, August 2021. Association for Computational Linguistics.
  • Ouyang et al. (2022) Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke E. Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Francis Christiano, Jan Leike, and Ryan J. Lowe. Training language models to follow instructions with human feedback. ArXiv, abs/2203.02155, 2022. URL https://api.semanticscholar.org/CorpusID:246426909.
  • Pfeiffer et al. (2020) Jonas Pfeiffer, Ivan Vulić, Iryna Gurevych, and Sebastian Ruder. MAD-X: An adapter-based framework for multi-task cross-lingual transfer. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu (eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 7654–7673, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.617. URL https://aclanthology.org/2020.emnlp-main.617.
  • Pfeiffer et al. (2022) Jonas Pfeiffer, Naman Goyal, Xi Lin, Xian Li, James Cross, Sebastian Riedel, and Mikel Artetxe. Lifting the curse of multilinguality by pre-training modular transformers. In Marine Carpuat, Marie-Catherine de Marneffe, and Ivan Vladimir Meza Ruiz (eds.), Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 3479–3495, Seattle, United States, July 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.naacl-main.255. URL https://aclanthology.org/2022.naacl-main.255.
  • Pfeiffer et al. (2023) Jonas Pfeiffer, Francesco Piccinno, Massimo Nicosia, Xinyi Wang, Machel Reid, and Sebastian Ruder. mmT5: Modular multilingual pre-training solves source language hallucinations, 2023.
  • Ruder et al. (2023) Sebastian Ruder, Jonathan H. Clark, Alexander Gutkin, Mihir Kale, Min Ma, Massimo Nicosia, Shruti Rijhwani, Parker Riley, Jean-Michel A. Sarr, Xinyi Wang, John Wieting, Nitish Gupta, Anna Katanova, Christo Kirov, Dana L. Dickinson, Brian Roark, Bidisha Samanta, Connie Tao, David I. Adelani, Vera Axelrod, Isaac Caswell, Colin Cherry, Dan Garrette, Reeve Ingle, Melvin Johnson, Dmitry Panteleev, and Partha Talukdar. XTREME-UP: A user-centric scarce-data benchmark for under-represented languages, 2023.
  • Shazeer et al. (2017) Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer, 2017.
  • Soltan et al. (2022) Saleh Soltan, Shankar Ananthakrishnan, Jack G. M. FitzGerald, Rahul Gupta, Wael Hamza, Haidar Khan, Charith S. Peris, Stephen Rawls, Andrew Rosenbaum, Anna Rumshisky, Chandan Prakash, Mukund Sridhar, Fabian Triefenbach, Apurv Verma, Gokhan Tur, and Premkumar Natarajan. AlexaTM 20B: Few-shot learning using a large-scale multilingual seq2seq model. ArXiv, abs/2208.01448, 2022. URL https://api.semanticscholar.org/CorpusID:251253416.
  • Vu et al. (2022) Tu Vu, Aditya Barua, Brian Lester, Daniel Cer, Mohit Iyyer, and Noah Constant. Overcoming catastrophic forgetting in zero-shot cross-lingual generation. In EMNLP, 2022.
  • Wang et al. (2021) Xinyi Wang, Yulia Tsvetkov, Sebastian Ruder, and Graham Neubig. Efficient test time adapter ensembling for low-resource language varieties. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (eds.), EMNLP, Punta Cana, Dominican Republic, November 2021.
  • Wang et al. (2023a) Xinyi Wang, John Wieting, and Jonathan H. Clark. FIAT: Fusing learning paradigms with instruction-accelerated tuning, 2023a.
  • Wang et al. (2023b) Zhen Wang, Rameswar Panda, Leonid Karlinsky, Rogerio Feris, Huan Sun, and Yoon Kim. Multitask prompt tuning enables parameter-efficient transfer learning. In ICLR, 2023b.
  • Wang et al. (2018) Zirui Wang, Zihang Dai, Barnabás Póczos, and Jaime G. Carbonell. Characterizing and avoiding negative transfer. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11285–11294, 2018. URL https://api.semanticscholar.org/CorpusID:53748459.
  • Zadouri et al. (2023) Ted Zadouri, Ahmet Üstün, Arash Ahmadian, Beyza Ermiş, Acyr Locatelli, and Sara Hooker. Pushing mixture of experts to the limit: Extremely parameter efficient MoE for instruction tuning, 2023.
  • Zhang et al. (2020) Biao Zhang, Ankur Bapna, Rico Sennrich, and Orhan Firat. Share or not? Learning to schedule language-specific capacity for multilingual translation. In ICLR, 2020.
  • Zhu et al. (2023) Yun Zhu, Nevan Wichers, Chu-Cheng Lin, Xinyi Wang, Tianlong Chen, Lei Shu, Han Lu, Canoee Liu, Liangchen Luo, Jindong Chen, et al. SiRA: Sparse mixture of low rank adaptation. arXiv preprint arXiv:2311.09179, 2023.

Appendix A Constant parameter sharing hurts FLix

Table 5: Effect of adding an always-active (constantly shared) dummy feature to FLix on the joint multitask multilingual mixture.

Configuration | QA InLang | QA CrossLang | Semantic Parsing | NER
FLix | 89.4 | 85.2 | 45.6 | 84.0
+ shared parameters | 88.2 | 84.0 | 41.9 | 80.7
Compute-matched LoRA | 86.6 | 83.1 | 35.4 | 76.3

Our proposed FLix routes each input to its feature-specific adaptations, based on the dataset features. While constant parameter sharing happens when some features are always active for all datasets in a mixture (e.g., in the experiments in §5.2), this is not generally true. In this section, we look into the effects of enforcing constant parameter sharing in FLix. Specifically, we design a dummy feature that is always active for all datasets in the joint multitask multilingual training mixture used in §5.1.1, which implies constant parameter sharing. We then compare FLix models both without and with this dummy feature, and also against a vanilla LoRA baseline, all under comparable compute budgets. (Specifically, we rearrange the rank sizes of all features such that their sum is 6: the dummy feature has rank 4, and the language and task features have rank 1. The compute-matched LoRA baseline does not make use of dataset features, and has rank 6.)

In Table 5, we can see that adding shared parameters actually leads to worse performance for FLix at similar computational cost. However, FLix with constantly shared features still outperforms compute-matched LoRA, which does not leverage task or language features at all.

Appendix B Training and evaluation datasets

B.1 Multilingual learning

The (mono-task) multilingual learning experiments described in §5.2 train and evaluate on the same languages. Their language and locale codes are:

Semantic parsing

am, be, bn, fi, ha, hu, ja, pt_br, ru, sw, ta, tr, yo, zu, de_localized, en, de, es, fr, hi, th

In-language QA

ar, bn, fi, id, ko, ru, sw, te, en

Cross-lingual QA

ar, bn, fi, ko, ru, te, as, bho, brx, gbm, gom, gu, hi, hne, kn, mai, ml, mni, mr, mwr, or, pa, ps, sa, ta, ur

NER

am, bm, bbj, ee, ha, ig, rw, lg, luo, mos, ny, pcm, sn, sw, tn, tw, wo, xh, yo, zu

B.2 Multitask learning

The (mono-lingual) multitask learning experiments described in §5.2 train and evaluate on 3 different tasks, for each of the 2 languages we evaluate on. They are:

Swahili (sw)

Semantic parsing, NER, in-language QA

Bengali (bn)

Semantic parsing, in-language QA, cross-lingual QA

B.3 Joint multitask multilingual learning

B.3.1 Training

The joint multitask multilingual learning experiments in §5.1 use the union of the following datasets:

Semantic parsing

am, be, bn, fi, ha, hu, ja, pt_br, ru, sw, ta, tr, yo, zu, de_localized, en, de, es, fr, hi, th

In-language QA

ar, bn, fi, id, ko, ru, sw, te, en

Cross-lingual QA

ar, bn, fi, ko, ru, te

NER

am, bm, bbj, ee, ha, ig, rw, lg, luo, mos, ny, pcm, sn, sw, tn, tw, wo, xh, yo, zu

Machine translation

id, es, hi, yo, ja, lg, ny, ru, be, ar, de, bn, fr, tr, ig, th, fi, zu, te, ko, sw, xh, hu, ha, sn, ta, am

B.3.2 Multilingual evaluation

We evaluate models trained on the dataset described in §B.3.1 on the 4 multilingual subsets of the training dataset. The results are reported in the first 4 columns of Table 1.

B.4 Zero-shot generalization

B.4.1 Training

For zero-shot evaluations on cross-lingual QA datasets in §5.1.2, we reuse the joint multitask multilingual training dataset described in §B.3.1. For the evaluation of zero-shot unseen combinations on semantic parsing datasets in the same section, we use the union of the following datasets:

Semantic parsing

fi, hu, ja, pt_br, ru, tr, de_localized, en, de, es, fr, hi

In-language QA

ar, bn, fi, id, ko, ru, sw, te, en

Cross-lingual QA

ar, bn, fi, ko, ru, te, as, bho, brx, gbm, gom, gu, hi, hne, kn, mai, ml, mni, mr, mwr, or, pa, ps, sa, ta, ur

NER

am, bm, bbj, ee, ha, ig, rw, lg, luo, mos, ny, pcm, sn, sw, tn, tw, wo, xh, yo, zu

Machine translation

de, mni, mr, ig, id, mwr, ko, fi, ta, bn, ar, es, ja, zu, be, gbm, tr, as, bho, yo, ml, sw, hi, am, bbj, lg, pcm, ny, tw, hu, luo, rw, brx, pt_br, gu, or, gom, te, sn, wo, fr, ps, ha, hne, xh, en, sa, tn, de_localized, ur, bm, ee, th, ru, pa, kn, mai, mos

B.4.2 Zero-shot unseen combinations in cross-lingual QA

In the 'Zero-shot Unseen Comb: QA CrossLang' column of Table 1, we report the average cross-lingual QA F1 scores of languages that are missing from the cross-lingual QA portion of the joint multitask multilingual training set (§B.3.1), but present in other datasets of the same training set. The language and locale codes of these languages are: hi, ta.

B.4.3 Zero-shot unseen languages in cross-lingual QA

In the 'Zero-shot Unseen Lang: QA CrossLang' column of Table 1, we report the average cross-lingual QA F1 scores of languages that are missing from the joint multitask multilingual training set. The language and locale codes of these languages are: bho, brx, gbm, gom, hne, mai, mni, mr, mwr, sa, as, gu, kn, ml, or, pa, ps, ur.

B.4.4 Zero-shot unseen combinations in semantic parsing

In the 'Zero-shot Unseen Comb: Semantic Parsing' column of Table 1, we report the average exact-match accuracy scores of languages that are designated as underrepresented in the XTREME-UP dataset and held out from the training set in this experiment. These languages do, however, appear in the machine translation portion of the training dataset. The language and locale codes of these languages are: am, be, bn, ha, sw, ta, yo, zu, th.

Appendix C Full results of Table 2

In addition to the test results reported in Table 2, we include validation results as well.

C.1 Multilingual experiments

C.1.1 Cross-lingual QA

Single-Lang

Language / Locale ID | Validation | Test
ar | 81.4 | 80.8
bn | 77.9 | 83.7
fi | 81.2 | 79.8
ko | 85.0 | 83.9
ru | 78.0 | 80.5
te | 79.8 | 81.5
as | 78.5 | 78.7
bho | 74.6 | 75.8
brx | 54.4 | 47.4
gbm | 74.1 | 73.4
gom | 78.3 | 75.8
gu | 77.9 | 80.2
hi | 82.2 | 83.5
hne | 74.1 | 77.1
kn | 79.2 | 80.1
mai | 77.6 | 77.0
ml | 77.0 | 79.3
mni | 63.4 | 63.1
mr | 76.1 | 78.1
mwr | 76.0 | 75.9
or | 77.4 | 77.7
pa | 78.0 | 79.8
ps | 76.5 | 77.4
sa | 74.9 | 78.3
ta | 78.4 | 78.5
ur | 76.8 | 76.7
Average | 76.5 | 77.1

Compute-matched

Language / Locale ID | Validation | Test
ar | 82.4 | 83.4
bn | 78.9 | 83.3
fi | 82.8 | 82.6
ko | 85.3 | 88.3
ru | 79.6 | 83.2
te | 81.8 | 81.5
as | 80.7 | 78.9
bho | 79.1 | 75.5
brx | 54.8 | 47.5
gbm | 78.4 | 73.3
gom | 78.6 | 76.8
gu | 79.5 | 78.8
hi | 83.0 | 83.7
hne | 78.8 | 77.9
kn | 80.0 | 79.9
mai | 79.0 | 75.6
ml | 79.7 | 79.0
mni | 60.4 | 60.5
mr | 78.0 | 77.4
mwr | 77.3 | 75.6
or | 79.2 | 78.1
pa | 79.1 | 78.6
ps | 78.0 | 77.9
sa | 77.0 | 77.2
ta | 79.6 | 78.7
ur | 78.5 | 76.1
Average | 78.1 | 77.3

Param-matched

Language / Locale ID | Validation | Test
ar | 81.3 | 82.8
bn | 79.7 | 83.5
fi | 80.7 | 81.8
ko | 84.8 | 87.0
ru | 78.6 | 82.0
te | 79.8 | 82.2
as | 79.1 | 78.3
bho | 77.7 | 75.2
brx | 58.1 | 51.9
gbm | 76.9 | 73.2
gom | 76.9 | 71.5
gu | 78.2 | 76.6
hi | 84.2 | 83.2
hne | 78.1 | 75.5
kn | 79.7 | 77.7
mai | 80.3 | 75.8
ml | 79.7 | 77.7
mni | 63.4 | 59.5
mr | 78.5 | 77.0
mwr | 77.3 | 75.9
or | 76.3 | 73.9
pa | 77.6 | 76.4
ps | 74.8 | 71.7
sa | 78.8 | 76.6
ta | 80.3 | 78.5
ur | 78.9 | 75.5
Average | 77.7 | 76.2

FLix

Language / Locale ID | Validation | Test
ar | 83.0 | 83.6
bn | 79.2 | 84.6
fi | 82.7 | 83.4
ko | 85.9 | 87.4
ru | 79.6 | 84.3
te | 79.9 | 80.3
as | 81.7 | 77.8
bho | 78.8 | 76.3
brx | 56.9 | 50.6
gbm | 78.1 | 73.6
gom | 78.4 | 76.3
gu | 79.7 | 77.1
hi | 82.2 | 83.6
hne | 78.1 | 76.3
kn | 81.2 | 81.0
mai | 78.7 | 77.6
ml | 80.4 | 78.7
mni | 62.5 | 63.4
mr | 78.3 | 77.7
mwr | 76.3 | 74.8
or | 79.5 | 78.1
pa | 80.2 | 78.2
ps | 78.6 | 77.0
sa | 77.9 | 78.0
ta | 80.8 | 78.8
ur | 79.5 | 77.6
Average | 78.4 | 77.6

C.1.2 In-language QA

Single-Lang

Language / Locale ID | Validation | Test
ar | 87.3 | 88.5
bn | 85.8 | 86.4
fi | 89.5 | 90.2
id | 85.3 | 86.0
ko | 81.7 | 84.6
ru | 87.0 | 85.1
sw | 84.3 | 87.5
te | 90.4 | 92.4
en | 85.3 | 87.1
Average | 86.3 | 87.5

Compute-matched

Language / Locale ID | Validation | Test
ar | 86.4 | 87.6
bn | 89.0 | 85.9
fi | 88.9 | 89.1
id | 87.8 | 88.7
ko | 83.4 | 86.2
ru | 86.6 | 87.8
sw | 86.6 | 89.6
te | 91.1 | 93.1
en | 85.3 | 88.3
Average | 87.2 | 88.5

Param-matched

Language / Locale ID | Validation | Test
ar | 86.4 | 87.7
bn | 90.7 | 87.7
fi | 89.6 | 89.7
id | 87.9 | 88.7
ko | 81.6 | 87.5
ru | 86.2 | 88.7
sw | 87.2 | 90.5
te | 91.0 | 92.9
en | 84.8 | 87.9
Average | 87.3 | 89.0

FLix

Language / Locale ID | Validation | Test
ar | 87.2 | 89.1
bn | 90.2 | 89.1
fi | 90.1 | 89.7
id | 87.8 | 88.9
ko | 82.5 | 85.9
ru | 87.8 | 89.1
sw | 86.5 | 90.2
te | 91.0 | 93.5
en | 87.4 | 88.9
Average | 87.8 | 89.4

C.1.3 Semantic parsing

Language / Locale ID | Validation | Test
am | 25.5 | 14.7
be | 34.7 | 25.8
bn | 38.5 | 25.7
fi | 33.9 | 25.0
ha | 28.9 | 22.5
hu | 31.0 | 23.4
ja | 36.8 | 25.9
pt-br | 36.8 | 25.5
ru | 40.6 | 28.0
sw | 34.7 | 23.1
ta | 31.4 | 28.8
tr | 40.2 | 24.4
yo | 20.5 | 13.1
zu | 25.1 | 15.2
de-localized | 31.2 | 21.4
en | 37.7 | 25.3
de | 33.9 | 25.5
es | 35.3 | 25.7
fr | 30.1 | 21.8
hi | 35.5 | 14.0
th | 36.5 | 18.3
Average | 33.3 | 22.5

Language / Locale ID | Validation | Test
am | 41.0 | 31.8
be | 52.3 | 43.3
bn | 51.0 | 40.6
fi | 53.1 | 45.2
ha | 46.0 | 35.9
hu | 50.6 | 38.5
ja | 52.3 | 38.4
pt-br | 57.7 | 45.8
ru | 60.7 | 45.4
sw | 51.0 | 39.9
ta | 47.3 | 39.5
tr | 53.1 | 39.9
yo | 39.3 | 28.8
zu | 41.0 | 31.7
de-localized | 57.4 | 44.1
en | 54.8 | 45.3
de | 54.8 | 43.0
es | 57.8 | 45.1
fr | 62.8 | 48.9
hi | 54.8 | 37.0
th | 55.3 | 42.4
Average | 52.1 | 40.5

Language / Locale ID | Validation | Test
am | 48.5 | 36.0
be | 59.0 | 50.3
bn | 60.7 | 47.4
fi | 62.3 | 51.9
ha | 53.6 | 45.2
hu | 60.3 | 48.4
ja | 56.1 | 44.1
pt-br | 61.9 | 52.9
ru | 64.4 | 51.1
sw | 51.9 | 43.8
ta | 57.3 | 45.9
tr | 59.4 | 45.6
yo | 43.5 | 36.4
zu | 44.8 | 37.6
de-localized | 65.8 | 50.7
en | 64.0 | 53.3
de | 58.2 | 49.2
es | 63.0 | 52.8
fr | 65.8 | 53.9
hi | 65.8 | 44.8
th | 65.3 | 50.7
Average | 58.6 | 47.2

Language / Locale ID | Validation | Test
am | 47.7 | 37.4
be | 58.6 | 47.4
bn | 54.4 | 46.7
fi | 59.4 | 50.3
ha | 49.4 | 41.3
hu | 56.5 | 46.3
ja | 56.9 | 44.3
pt-br | 60.7 | 48.8
ru | 60.7 | 48.4
sw | 58.6 | 42.7
ta | 55.6 | 46.5
tr | 58.2 | 44.6
yo | 44.4 | 34.0
zu | 37.2 | 37.6
de-localized | 61.9 | 44.4
en | 64.9 | 49.9
de | 54.4 | 44.1
es | 61.3 | 50.6
fr | 62.8 | 50.3
hi | 62.6 | 42.5
th | 59.4 | 47.5
Average | 56.4 | 45.0

C.1.4 Named-entity recognition (NER)

Language / Locale ID | Validation | Test
am | 74.8 | 72.8
bm | 73.4 | 68.9
bbj | 60.2 | 62.2
ee | 84.9 | 82.4
ha | 89.6 | 86.5
ig | 82.0 | 82.2
rw | 77.9 | 76.1
lg | 87.0 | 83.8
luo | 66.1 | 71.4
mos | 63.8 | 65.7
ny | 86.8 | 88.3
pcm | 82.1 | 81.6
sn | 89.3 | 88.6
sw | 89.6 | 88.8
tn | 77.1 | 84.4
tw | 72.9 | 74.0
wo | 81.5 | 76.5
xh | 83.7 | 81.1
yo | 79.1 | 80.7
zu | 79.2 | 82.2
Average | 79.1 | 78.9

Language / Locale ID | Validation | Test
am | 77.3 | 73.7
bm | 72.4 | 67.8
bbj | 54.5 | 66.1
ee | 82.9 | 82.6
ha | 90.9 | 90.0
ig | 84.4 | 85.3
rw | 80.0 | 77.4
lg | 87.5 | 86.2
luo | 77.2 | 76.3
mos | 67.6 | 66.0
ny | 87.8 | 89.2
pcm | 82.8 | 86.3
sn | 90.4 | 90.2
sw | 90.7 | 90.9
tn | 80.8 | 86.6
tw | 76.7 | 79.1
wo | 80.8 | 78.5
xh | 83.5 | 83.8
yo | 80.3 | 83.0
zu | 84.7 | 87.3
Average | 80.6 | 81.3

Language / Locale ID | Validation | Test
am | 79.1 | 77.4
bm | 79.9 | 73.8
bbj | 65.7 | 68.3
ee | 88.0 | 86.7
ha | 93.6 | 91.3
ig | 86.8 | 85.2
rw | 82.1 | 78.0
lg | 89.2 | 86.3
luo | 76.7 | 77.1
mos | 72.1 | 70.6
ny | 88.8 | 89.2
pcm | 87.4 | 88.2
sn | 92.7 | 92.6
sw | 91.7 | 90.8
tn | 82.9 | 88.6
tw | 81.5 | 81.8
wo | 81.1 | 80.8
xh | 85.4 | 86.2
yo | 82.2 | 84.7
zu | 85.7 | 87.3
Average | 83.6 | 83.2

Language / Locale ID | Validation | Test
am | 81.8 | 80.0
bm | 79.2 | 73.2
bbj | 66.5 | 72.1
ee | 87.2 | 86.1
ha | 93.6 | 91.1
ig | 85.9 | 86.5
rw | 83.5 | 80.0
lg | 90.2 | 88.1
luo | 78.6 | 80.4
mos | 73.0 | 73.7
ny | 89.5 | 91.8
pcm | 87.2 | 87.9
sn | 93.2 | 93.2
sw | 91.5 | 91.1
tn | 83.1 | 88.1
tw | 81.2 | 79.6
wo | 84.7 | 81.3
xh | 86.7 | 87.0
yo | 83.5 | 86.0
zu | 86.5 | 89.0
Average | 84.3 | 84.3

C.2 Multitask experiments

Task | Validation | Test
SemParse | 34.7 | 23.0
QA-InLang | 84.2 | 87.3
NER | 88.6 | 88.7
Average | 69.2 | 66.3

Task | Validation | Test
SemParse | 37.7 | 23.0
QA-InLang | 83.9 | 86.4
NER | 88.6 | 88.7
Average | 70.1 | 66.0

Task | Validation | Test
SemParse | 38.5 | 26.4
QA-InLang | 82.5 | 85.2
NER | 89.7 | 88.8
Average | 70.3 | 66.8

Task | Validation | Test
SemParse | 47.7 | 32.7
QA-InLang | 85.2 | 87.5
NER | 91.0 | 90.3
Average | 74.6 | 70.2

Task | Validation | Test
SemParse | 37.2 | 25.7
QA-InLang | 87.6 | 86.7
QA-CrossLang | 78.6 | 83.2
Average | 67.8 | 65.2

Task | Validation | Test
SemParse | 36.4 | 24.3
QA-InLang | 88.4 | 85.4
QA-CrossLang | 76.6 | 81.9
Average | 67.2 | 63.8

Task | Validation | Test
SemParse | 44.4 | 29.4
QA-InLang | 89.1 | 83.6
QA-CrossLang | 76.1 | 84.4
Average | 69.8 | 65.8

Task | Validation | Test
SemParse | 52.7 | 37.0
QA-InLang | 90.0 | 85.9
QA-CrossLang | 77.3 | 84.1
Average | 73.3 | 69.0

Appendix D Full results of Table 1

As in Appendix C, we include validation results along with test results.
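The Average rows throughout these appendices are consistent with unweighted (macro) averages over the per-language or per-task scores, so every language contributes equally regardless of dataset size. As a minimal illustrative sketch (the dictionary below simply copies the validation column of the first cross-lingual QA table in D.1.1), such a row can be reproduced as:

```python
# Minimal sketch: reproducing an "Average" row as an unweighted (macro) mean
# over per-language scores, i.e. every language counts equally.
# Scores are taken from the validation column of the first table in D.1.1.
validation_scores = {
    "ar": 83.6, "bn": 79.6, "fi": 81.4,
    "ko": 85.0, "ru": 79.8, "te": 80.3,
}

macro_average = sum(validation_scores.values()) / len(validation_scores)
print(f"Average: {macro_average:.1f}")  # -> Average: 81.6, matching the table
```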

D.1 Supervised learning results

D.1.1 Cross-lingual QA

Language / Locale ID | Validation | Test
ar | 83.6 | 82.3
bn | 79.6 | 82.9
fi | 81.4 | 82.2
ko | 85.0 | 85.2
ru | 79.8 | 81.7
te | 80.3 | 84.4
Average | 81.6 | 83.1

Language / Locale ID | Validation | Test
ar | 82.1 | 83.6
bn | 79.3 | 84.5
fi | 81.8 | 81.3
ko | 84.5 | 86.0
ru | 79.5 | 82.7
te | 81.0 | 81.0
Average | 81.4 | 83.2

Language / Locale ID | Validation | Test
ar | 85.6 | 84.3
bn | 82.2 | 86.2
fi | 82.6 | 83.2
ko | 85.5 | 88.2
ru | 80.5 | 84.0
te | 82.1 | 85.0
Average | 83.1 | 85.2

D.1.2 In-language QA

Language / Locale ID | Validation | Test
ar | 83.8 | 84.9
bn | 88.1 | 83.9
fi | 86.7 | 87.4
id | 86.5 | 88.1
ko | 80.9 | 85.0
ru | 82.6 | 84.1
sw | 85.2 | 88.2
te | 89.2 | 91.7
en | 83.3 | 85.9
Average | 85.2 | 86.6

Language / Locale ID | Validation | Test
ar | 85.2 | 86.9
bn | 90.8 | 85.8
fi | 87.0 | 87.5
id | 85.4 | 87.1
ko | 80.4 | 85.3
ru | 84.9 | 86.2
sw | 84.8 | 88.7
te | 89.4 | 91.7
en | 84.5 | 86.7
Average | 85.8 | 87.3

Language / Locale ID | Validation | Test
ar | 86.4 | 88.7
bn | 90.4 | 91.1
fi | 90.0 | 89.5
id | 88.6 | 89.5
ko | 83.2 | 86.1
ru | 85.4 | 88.5
sw | 87.4 | 90.1
te | 91.5 | 93.6
en | 85.0 | 87.9
Average | 87.5 | 89.4

D.1.3 Semantic parsing

Language / Locale ID | Validation | Test
am | 36.0 | 24.5
be | 47.7 | 38.6
bn | 51.5 | 34.3
fi | 53.6 | 40.1
ha | 39.3 | 30.0
hu | 50.2 | 36.9
ja | 46.4 | 33.9
pt-br | 53.6 | 41.1
ru | 54.0 | 40.2
sw | 46.4 | 34.3
ta | 43.5 | 31.4
tr | 45.2 | 33.7
yo | 29.7 | 24.0
zu | 33.1 | 29.3
de-localized | 56.4 | 40.3
en | 50.6 | 39.9
de | 47.7 | 38.0
es | 54.9 | 42.2
fr | 55.6 | 41.9
hi | 49.7 | 31.4
th | 54.7 | 37.5
Average | 47.6 | 35.4

Language / Locale ID | Validation | Test
am | 43.1 | 35.5
be | 58.6 | 50.4
bn | 59.0 | 47.4
fi | 62.3 | 53.0
ha | 51.0 | 42.1
hu | 58.2 | 47.2
ja | 59.0 | 45.9
pt-br | 64.0 | 51.7
ru | 63.2 | 51.3
sw | 55.6 | 44.1
ta | 54.0 | 47.8
tr | 58.6 | 46.6
yo | 43.5 | 34.3
zu | 46.9 | 39.7
de-localized | 61.9 | 49.8
en | 61.5 | 53.1
de | 56.1 | 48.6
es | 64.7 | 51.7
fr | 62.8 | 53.7
hi | 66.5 | 46.4
th | 63.5 | 49.7
Average | 57.8 | 47.1

Language / Locale ID | Validation | Test
am | 46.9 | 38.3
be | 59.8 | 48.8
bn | 58.2 | 47.1
fi | 59.8 | 50.9
ha | 50.6 | 41.0
hu | 55.2 | 44.2
ja | 54.0 | 45.5
pt-br | 61.5 | 47.6
ru | 62.8 | 50.6
sw | 55.2 | 42.6
ta | 53.1 | 45.7
tr | 57.7 | 45.4
yo | 43.5 | 34.1
zu | 41.4 | 37.6
de-localized | 60.4 | 45.6
en | 61.1 | 50.7
de | 57.7 | 45.5
es | 60.7 | 50.4
fr | 65.8 | 52.5
hi | 63.9 | 45.2
th | 61.8 | 48.2
Average | 56.7 | 45.6

D.1.4 NER

Language / Locale ID | Validation | Test
am | 68.8 | 70.8
bm | 67.3 | 63.5
bbj | 52.4 | 61.2
ee | 80.5 | 78.6
ha | 87.5 | 84.8
ig | 78.2 | 82.1
rw | 77.2 | 71.6
lg | 85.4 | 82.4
luo | 61.7 | 66.9
mos | 62.9 | 61.6
ny | 83.4 | 85.5
pcm | 79.9 | 81.3
sn | 84.1 | 84.4
sw | 87.2 | 87.1
tn | 77.8 | 84.3
tw | 74.0 | 74.8
wo | 75.2 | 73.9
xh | 73.3 | 77.2
yo | 71.4 | 75.0
zu | 75.4 | 78.7
Average | 75.2 | 76.3

Language / Locale ID | Validation | Test
am | 74.4 | 76.7
bm | 75.1 | 71.0
bbj | 61.7 | 66.1
ee | 85.1 | 83.5
ha | 92.1 | 90.5
ig | 85.4 | 86.1
rw | 81.8 | 76.8
lg | 87.3 | 85.3
luo | 77.1 | 75.9
mos | 66.7 | 66.4
ny | 87.1 | 87.9
pcm | 85.4 | 86.9
sn | 91.8 | 90.6
sw | 90.7 | 90.1
tn | 82.0 | 87.6
tw | 78.4 | 77.9
wo | 79.4 | 78.9
xh | 82.4 | 85.5
yo | 79.5 | 81.2
zu | 82.9 | 86.1
Average | 81.3 | 81.6

Language / Locale ID | Validation | Test
am | 81.1 | 79.4
bm | 80.4 | 73.7
bbj | 63.6 | 70.5
ee | 87.0 | 86.1
ha | 92.7 | 91.2
ig | 85.4 | 86.9
rw | 83.9 | 80.2
lg | 89.1 | 87.2
luo | 74.7 | 78.9
mos | 71.2 | 72.5
ny | 89.4 | 90.4
pcm | 86.8 | 88.9
sn | 91.9 | 92.7
sw | 91.6 | 91.4
tn | 83.2 | 88.9
tw | 79.4 | 78.2
wo | 83.3 | 81.9
xh | 85.5 | 86.0
yo | 83.0 | 85.8
zu | 84.9 | 88.7
Average | 83.4 | 84.0

D.2 Zero-shot results

D.2.1 Zero-shot unseen combinations: cross-lingual QA

Language / Locale ID | Validation | Test
hi | 82.1 | 83.2
ta | 79.3 | 80.1
Average | 80.7 | 81.6

Language / Locale ID | Validation | Test
hi | 84.4 | 84.4
ta | 80.7 | 79.8
Average | 82.5 | 82.1

Language / Locale ID | Validation | Test
hi | 84.9 | 85.3
ta | 81.9 | 80.4
Average | 83.4 | 82.8

D.2.2 Zero-shot unseen combinations: semantic parsing

Language / Locale ID | Validation | Test
am | 13.8 | 10.2
be | 35.6 | 30.1
bn | 31.0 | 21.6
ha | 20.1 | 16.2
sw | 34.3 | 22.0
ta | 20.5 | 11.4
yo | 7.5 | 8.9
zu | 17.2 | 16.0
th | 37.6 | 23.0
Average | 24.2 | 17.7

Language / Locale ID | Validation | Test
am | 24.7 | 17.6
be | 54.0 | 42.8
bn | 45.2 | 35.6
ha | 30.5 | 25.5
sw | 43.1 | 32.2
ta | 30.5 | 23.1
yo | 21.3 | 16.1
zu | 27.6 | 24.8
th | 55.9 | 38.4
Average | 37.0 | 28.4

Language / Locale ID | Validation | Test
am | 46.9 | 38.3
be | 59.8 | 48.8
bn | 58.2 | 47.1
ha | 50.6 | 41.0
sw | 55.2 | 42.6
ta | 53.1 | 45.7
yo | 43.5 | 34.1
zu | 41.4 | 37.6
th | 61.8 | 48.2
Average | 52.3 | 42.6

D.2.3 Zero-shot unseen languages

Language / Locale ID | Validation | Test
as | 80.7 | 79.6
bho | 78.9 | 76.0
brx | 45.7 | 41.0
gbm | 74.6 | 73.1
gom | 77.8 | 75.3
gu | 80.9 | 79.8
hne | 77.1 | 78.5
kn | 80.9 | 80.5
mai | 79.3 | 76.2
ml | 81.0 | 81.4
mni | 52.3 | 52.9
mr | 78.7 | 78.7
mwr | 80.4 | 76.9
or | 77.8 | 77.5
pa | 79.3 | 80.4
ps | 77.0 | 78.2
sa | 78.1 | 80.5
ur | 79.3 | 78.5
Average | 75.5 | 74.7

Language / Locale ID | Validation | Test
as | 81.3 | 79.8
bho | 79.0 | 77.0
brx | 46.5 | 41.2
gbm | 73.4 | 72.5
gom | 77.7 | 74.7
gu | 81.9 | 80.0
hne | 78.1 | 77.2
kn | 82.7 | 80.1
mai | 81.3 | 77.5
ml | 81.6 | 80.9
mni | 48.8 | 51.1
mr | 80.0 | 80.3
mwr | 77.7 | 75.8
or | 77.8 | 79.2
pa | 80.0 | 79.8
ps | 77.0 | 78.5
sa | 77.5 | 78.9
ur | 78.3 | 77.7
Average | 75.6 | 74.6

Language / Locale ID | Validation | Test
as | 82.6 | 82.5
bho | 78.7 | 78.7
brx | 50.3 | 47.0
gbm | 77.5 | 75.4
gom | 80.3 | 78.6
gu | 82.7 | 82.6
hne | 79.0 | 79.8
kn | 82.3 | 82.5
mai | 81.8 | 79.9
ml | 82.1 | 81.0
mni | 56.2 | 59.0
mr | 79.7 | 80.7
mwr | 79.5 | 77.9
or | 81.4 | 81.1
pa | 79.9 | 81.7
ps | 78.9 | 80.6
sa | 80.1 | 81.6
ur | 80.2 | 79.9
Average | 77.4 | 77.2