Inducing Generalization across Languages and Tasks using Featurized Low-Rank Mixtures (2024)

Chu-Cheng Lin , Xinyi Wang¹¹footnotemark: 1 , Jonathan H. Clark, Han Lu, &Yun Zhu, Chenxi Whitehouse, Hongkun Yu

Google

equal contribution, {kitsing,xinyiwang}@google.com

Abstract

Adapting pretrained large language models(LLMs) to various downstream tasks in tens or hundreds of human languages is computationally expensive.Parameter-efficient fine-tuning(PEFT) significantly reduces the adaptation cost, by tuning only a small amount of parameters. However, common PEFT methods LoRA(Hu etal., 2022) suffer from suboptimal performance on diverse dataset mixtures, due to aggressive parameter tying and negative interference among different datasets.In this work, we propose Featurized Low-rank Mixtures(FLix), a novel PEFT method designed for effective multitask multilingual adaptation.FLix associates each unique dataset feature, such as the dataset’s language or task, with its own low-rank weight update parameters.By composing feature-specific parameters for each dataset, FLix can accommodate diverse dataset mixtures and generalize better to unseen datasets. Our experiments show that FLix leads to significant improvements over a variety of tasks for both supervised learning and zero-shot settings with gains of up to $14.2$ in exact match points in zero-shot semantic parsing.

1 Introduction

Large language models(LLMs) have shown impressive performance on various real world applications in many different human languages (Brown etal., 2020; Soltan etal., 2022; Google etal., 2023).While there have been notable successes aligning an LLM to become a generalist which can follow human instructions to perform different tasks (Ouyang etal., 2022), there are also significant interests in adapting an LLM into specialists, each of which works on its specific task that is known a priori.

Intuitively, LLM adaptation can be done by continued training (or fine-tuning) on target languages and datasets. However, fine-tuning all model parameters on every dataset can be computationally and financially prohibitive.Parameter-efficient fine-tuning methods(PEFT) such as LoRA (Hu etal., 2022) and prompt tuning (Lester etal., 2021) reduce the computational costs of adapting LLMs to a downstream task. They parametrize LLM fine-tuning with a small set of trainable parameters, keeping the majority of LLM parameters frozen.

While PEFT has been widely used to adapt LLMs to a single dataset, very few prior works studied the best practices for adapting the model jointly on many different use cases.Vu etal. (2022) proposed adding a multilingual pretraining stage to prompt tuning using multilingual unlabeled data to improve zero-shot summarization. Wang etal. (2023b) proposed a multitask prompt tuning method that learns a single soft prompt which could be used to be adapt to other target tasks. However, these methods only considered either multilingual or multitask datasets, while we generally would like to adapt the model such that it generalizes along multiple axes(i.e. tasks and languages). Moreover, they require multiple tuning stages, which limits their applicability in practice.

In this paper, we propose Featurized Low-rank Mixtures (FLix), an extension of LoRA for modeling diverse dataset mixtures. Compared to LoRA which applies the same low-rank adaptation for all inputs, FLix parametrizes such updates to decompose linearly as a sum of feature-specific low-rank adaptations, each associated with an active dataset feature, such as language or task ID.Under FLix, different adaptations can be learned for different features. The compositional nature of FLix also provides an inductive bias for learning generalizable adaptations. Moreover,FLix is generally computationally efficient: it only needs to activate a tiny fraction of trainable parameters for each input, making both tuning and deployment efficient.FLix is related to prior works on sub-network composition(Lin etal., 2021; Ilharco etal., 2022), which show that models fine-tuned on different tasks could be composed together as a single model.

In this article, we contribute:

•
a modeling formulation that disentangles learning of tasks and languages in a way that improves generalization quality while remaining highly efficient;
•
experimental evidence that the model improves generalization quality across four very different tasks: named entity recognition, semantic parsing, in-language question answering, and cross-lingual question answering; and
•
evidence for the hypothesis that imbuing models with meta-data such as task and language improves quality, even when compared with powerful modern adaptation methods such as LoRA.

The rest of this paper is structured as follows: In §2, we first formulate the multitask multilingual learning setting, and discuss the challenges associated with the current PEFT methods.In §3, we describe the design of FLix, and how FLix adapts to zero-shot scenarios.We also propose to train FLix with feature dropout, which encourages positive transfer and generalization. We evaluate FLix on multitask or multilingual learning setting(§5.2), and later on joint multilingual multitask tuning and zero-shot generalization(§5.1). The experiment results and ablations show that FLix brings significant improvements over standard PEFT methods for all settings, and it is especially effective when used with very diverse training mixture and at zero-shot generalizations (§5, §6).

Inducing Generalization across Languages and Tasks using Featurized Low-Rank Mixtures (1)

Inducing Generalization across Languages and Tasks using Featurized Low-Rank Mixtures (2)

2 Multitask multilingual learning

We consider a multitask learning (Caruana, 1997) setting where datasets are indexed using task-language tuples.We assume that there are $N$ tasks $\{w_{1}\ldots w_{N}\}$ and $M$ languages $\{\ell_{1}\ldots\ell_{M}\}$ . And our goal is to serve fine-tuned LLMs for each of these $N\cdot M$ task-language combinations. It is worth noting that two special cases of this setup are multilingual modeling ( $N=1$ ) and multitask modeling ( $M=1$ ).

Parameter-efficient tuning(PEFT) is a popular method to adapt a pretrained LLM to downstream tasks without incurring large computational cost. In this work, we want to use PEFT to support $O(M\cdot N)$ languages and tasks.

2.1 Low-rank Adaptation

We focus on Low-rank Adaptation(LoRA)(Hu etal., 2022), an effective PEFT method that incurs minimal additional inference cost.Let $W_{0}\in\mathbb{R}^{d\times k}$ be the pretrained weight matrix, LoRA keeps the much larger weight matrix $W_{0}$ unchanged, and instead only optimizes low-rank factorized weight adaptation matrix $\Delta W={\phi}^{A}{\phi}^{B}$ . The final weight matrix is

\displaystyle W=W_{0}+\Delta W=W_{0}+{\phi}^{A}{\phi}^{B},

(1)

where ${\phi}^{A}\in\mathbb{R}^{d\times r}$ , ${\phi}^{B}\in\mathbb{R}^{r\times k}$ . Empirically LoRA often compares competitively with full fine-tuning, even when $r\ll\min(d,k)$ . LoRA can thus significantly reduce the size of trainable parameters.

2.2 Challenges

PEFT methods such as LoRA have been shown to be very effective for tuning LLMs, achieving comparable results to full fine-tuning while incurring only a fraction of the computational cost.However, there are several challenges with the multitask multilingual learning problem that the current PEFT methods might not be able to address.

Interference among different datasets

Multitask multilingual training with PEFT requires one to fine-tune an LLM on a diverse data mixture, from up to $N\cdot M$ different datasets spanning $N$ tasks and $M$ languages. This approach has significantly lower overhead than modeling each dataset individually, and allows positive transfer. However, training a single set of PEFT parameters over all datasets could also lead to negative interference among tasks and languages that are dissimilar.

Generalizing to unseen task and language

Publicly available wide-coverage multitask and multilingual datasets are often incomplete. For example, many task-language combinations are missing in XTREME-UP. Moreover, some underrepresented languages may still be missing from such datasets.Standard PEFT methods could have difficulty generalizing to unseen task-language combinations and unseen languages: they simply optimize a single set of parameters on all tasks without explicit modeling assumptions that capture the relationships and similarities between different datasets. And it is not clear how to transform such PEFT parameters to a new task or language at inference time.

3 Featurized Low-Rank Mixtures

We propose Featurized Low-Rank Mixtures(FLix), an effective multitask PEFT method that supports diverse training data mixture, and excels at zero-shot generalization to new task-language combinations.Under FLix, NLP tasks and languages are featurized as discrete features. And each feature is associated with a low-rank weight update matrix. Figure1 shows the training and inference processes of FLix.

3.1 Model Architecture

Given a diverse data mixture from $N$ tasks where each tasks are in $M$ languages, first we define a set of $D=N+M$ features where each feature could represent either a task, a language, or any other data attribute. We assign a low-rank factorized weight matrix ${\phi}^{A}_{i}{\phi}^{B}_{i}$ for each feature $i\in[1,D]$ .

Let ${\bm{\mathbf{x}}}$ be an input to the model, let $f({\bm{\mathbf{x}}})$ represent the features of ${\bm{\mathbf{x}}}$ , where $f_{i}({\bm{\mathbf{x}}})=1$ indicates that ${\bm{\mathbf{x}}}$ has feature $i$ , and $f_{i}({\bm{\mathbf{x}}})=0$ otherwise.

\displaystyle W({\bm{\mathbf{x}}})

\displaystyle=W_{0}+\sum_{i=1}^{D}{f_{i}({\bm{\mathbf{x}}})}{\phi}^{A}_{i}{%\phi}^{B}_{i},

(2)

where ${\phi}^{A}_{i}\in\mathbb{R}^{d\times r_{i}}$ , ${\phi}^{B}_{i}\in\mathbb{R}^{r_{i}\times k}$ , and $r_{i}$ is the maximum rank of the the $i$ -th feature’s adaptation matrix.Note that compared to LoRA in equation1 that applies the same $\Delta W$ (and therefore same $W$ ) for all inputs, FLix uses different adaptation matrices based on the features of the input data $f({\bm{\mathbf{x}}})$ .

Feature dropout.

One potential problem of training FLix is that the model might become overly reliant on the feature annotation of the training dataset, limiting positive transfer and making the model brittle to inputs different from training distribution.We randomly turn off a subset of active features at training time with a predetermined feature dropout probability.Experiments in §6 show that feature dropout brings consistent gains to FLix.

Exploiting feature sparsity for low training and serving costs.

Note that equation2 implies that on input ${\bm{\mathbf{x}}}$ , $W({\bm{\mathbf{x}}})$ is not a function of $\{\phi^{A}_{i},\phi^{B}_{i}\}$ if $f_{i}({\bm{\mathbf{x}}})=0$ . In other words, weights of unused features of input ${\bm{\mathbf{x}}}$ are not needed in either training or serving time. While the number of trainable parameters under FLix grows linearly with the number of features $D$ , the feature count of each input ${\bm{\mathbf{x}}}$ remains a constant; and this is a relatively small value in our multitask multilingual learning settings. Therefore, the compute costs of FLix could still remain constant when scaling to increasingly more tasks and languages.

3.2 Zero-shot Composition

We find FLix to be particularly effective at zero-shot generalization, likely because of the explicit feature composition assumption. While previous work proposed using language-specific modules at pretraining time to enhance crosslingual transfer (Vu etal., 2022; Pfeiffer etal., 2023), our work shows that sparse modularized PEFT architecture is also effective for directly adapting dense pretrained models to unseen datasets.

In this paper, we consider how FLix adapts to unseen datasets in two different zero-shot scenarios:

Unseen combinations.: We want to do inference for a dataset that has a combination of active features that did not appear in the training mixture. For example, say the training data mixture contains QA data in French and semantic parsing data in Bengali; and we want to test the model on QA in Bengali. FLix naturally supports such features; and no change is required while applying equation2.
Unseen languages.: The test data could have a subset of features that are not covered by any of the dataset in the training mixture. Specifically, we focus on the setting where the test data is from a new language unseen during training. In this case, we only use the task feature of the data to calculate the model weights in equation2.

In §5, we show that FLix significantly outperforms the baselines for both types of zero-shot settings.

4 Experiments

For all experiments, we use instruction fine-tuned PaLM-2 small Google etal. (2023) as the base model. We evaluate our method on a variety of tasks, languages, and data mixtures to verify that it generalizes to different use cases.

Datasets.

We use the XTREME-UP dataset (Ruder etal., 2023) and format the input data with the language and task information(Wang etal., 2023a).We use different language and task subsets from XTREME-UP. Dataset details are described in §5.¹¹1A list of all experiments’ training and evaluation tasks can also be found in AppendixB.

Metrics.

We report F1 scores for both in- and cross-lingual QA tasks, and NER tasks. For semantic parsing, we report exact-match accuracy. All numbers are normalized between $[0,100]$ . We report average normalized metric scores for (monolingual) multitask experiments.

Hyperparameters.

We use a batch size of $512$ during training for all experiments. For FLix, we also set the feature dropout probability to be $0.7$ .

Ranks.

Compute-matched LoRA baselines have rank- $6$ adaptation matrices²²2Strictly speaking their ranks $\leq 6$ ; we use this terminology loosely to reduce clutter. in all experiments. In this paper, every dataset has both task and language features; and we allocate feature-specific parameter counts such that the trainable parameters of active features under FLix always match compute-matched baselines. More specifically, we let task and language features have either rank- $2$ or rank- $4$ adaptation matrices. We allocate smaller matrices (rank- $2$ ) for task features in multitask experiments (§5.2), and larger matrices in both multilingual (§5.2) and joint multitask-multilingual (§5.1) experiments. We adjust the ranks of language features’ adaptation matrices accordingly, to ensure that every dataset’s adaptation matrices have a total rank $=6$ .

Model selection.

We evaluate on validation splits every 200 iterations, and train for a maximum of 2000 steps. For every multilingual and multitask task, we choose the checkpoint that has the best averaged metric numbers across languages or tasks, and subsequently evaluate on the test splits.

Baselines.

We compare to the vanilla LoRA method under several different settings to ensure the fairness of comparison:

•
Compute-matched sets the rank $r$ of the vanilla LoRA model to be equivalent to the maximum sum of the rank of feature-specific adaptations under FLix. This ensures LoRA and FLix uses comparable computation during training and inference.
•
Param-matched sets the rank $r$ of the vanilla LoRA model such that the total number of trainable parameters is the same as its FLix counterpart.³³3The number of trainable parameters in FLix grows linearly with the number of features in a dataset. Therefore the param-matched counterpart’s rank can be a relatively large number from around 20 to 90, depending on the task.

5 Results

5.1 Study A: Joint Multitask Multilingual Learning

Method	NER	Semantic Parsing	QA InLang	QA CrossLang	Zero-shot Unseen Comb: QA CrossLang	Zero-shot Unseen Comb: Semantic parsing	Zero-shot Unseen Lang: QA CrossLang	Avg.
Compute-matched	76.3	35.4	86.6	83.1	$82.114\,952\,09$	$17.707\,970\,3$	$74.568\,217\,38$	65.1
Param-matched	$81.645\,797\,73$	47.1	87.3	$83.189\,236\,96$	$80.025\,490\,57$	$28.435\,936\,19$	$74.716\,568\,42$	68.9
FLix	84.0	$45.590\,648\,65$	89.4	85.2	82.8	42.6	77.2	72.4

In this section, we evaluate the performance of FLix and baselines on a diverse data mixture, where the tuning data contain different languages and tasks.

5.1.1 Multitask multilingual tuning

We conduct multitask multilingual tuning on both vanilla LoRA and our proposed FLix model. This setting shows one of the primary strengths of FLix: The ability to simultaneously generalize along multiple axes (tasks and languages) without suffering from negative task interference. We evaluate on four tasks from XTREME-UP covering a wide variety of use cases and languages: crosslingual QA, in-language QA, semantic parsing, NER. In addition, we also add training data from machine translation to the data mixture since it has the best language coverage which allows cross-lingual transfer.We use all languages in in-language QA, semantic parsing, NER, and a subset of languages in cross-lingual QA as training data.⁴⁴4We only use the languages included in the original XOR-TyDi QA(Asai etal., 2021): Arabic, Bengali, Japanese, Finish, Korean, Russian, Telugu. The rest of the cross-lingual QA languages — all Indic languages — are evaluated for zero-shot generalization in §5.1.2. In addition, we also include the corresponding machine translation datasets of languages from these 4 datasets.

FLix significantly outperforms baselines under multitask multilingual tuning.

The overall results are listed in Table1. We can see that FLix out-performs the the best LoRA baseline for all tasks other than semantic parsing. While it loses to param-matched LoRA on semantic parsing, FLixhas significantly less computational cost in comparison and it significantly out-performs the vanilla LoRA with the same computational cost.

5.1.2 Zero-shot Generalization

To evaluate both zero-shot generalizations (§3.2) we prepare two different training datasets:

•
Holding out languages in cross-lingual QA. We reuse the training dataset from the joint multitask multilingual learning setup (§5.1) to evaluate both unseen feature combinations and unseen languages in the cross-lingual QA task. For unseen combinations, we evaluate the performance of the set of languages where the languages were present in other multilingual tasks in the training dataset; and for unseen languages, we evaluate on the set of languages that do not appear in other multilingual tasks in the training dataset.
•
Holding out languages in semantic parsing. In this scenario, we use portions of crosslingual QA, in-language QA, semantic parsing, NER, and machine translation datasets from XTREME-UP as our training dataset. We include full crosslingual QA, in-language QA, NER datasets; but we do not include underrepresented languages in the semantic parsing portion of the training data. We also only include machine translation portions of languages that are already available in the other $4$ subsets, as in §5.1. And we evaluate the unseen combination performance of all held-out languages under semantic parsing.⁵⁵5These held-out languages are unseen combinations, rather than unseen languages, since they are available in the machine translation datasets of XTREME-UP.

Our method is very effective at both types of zero-shot generalization.

The comparison of FLix and baseline methods can be found in the rightmost 3 columns in Table1. We can see that FLix brings significant improvements to both unseen combinations and unseen languages on both cross-lingual QA and semantic parsing. This is likely because FLix allows one to select the subset of parameters most relevant to the features of the new test data, allowing more effective zero-shot generalization.

5.2 Study B: Multitask or multilingual learning

In this section, we examine the performance of our method and the baselines on data mixtures with a single type of feature. That is, we train and evaluate on datasets of a single task from several different languages(multilingual learning), or datasets of a single language with different tasks(multitask learning). We also add additional baselines for this setting where we train a separate LoRA model for each individual dataset (denoted as Single-Lang and Single-Task in Table2). This approach alleviates the capacity constraint of vanilla LoRA on diverse datasets, but adds more engineering overhead (as we briefly noted in §1) and cannot leverage positive transfer between datasets.

Specifically, we use subsets of the XTREME-UP dataset to construct these datasets:

•
Multilingual learning. We experiment on four tasks: NER, semantic parsing, in-language QA, and cross-lingual QA. For each task, we train and evaluate on all languages available in the XTREME-UP dataset. Unlike §5.1, we train a separate multilingual model for each task in this setting.
•
Multitask learning. We evaluate on two under-represented languages: Swahili and Bengali. We use the subset of tasks available for each language in XTREME-UP. For Swahili, we use semantic parsing, QA (in-language), and NER. For Bengali we use semantic parsing and QA (both in-language and cross-lingual).⁶⁶6The mismatch between task choices between these two languages is due to the sparsity of available datasets in XTREME-UP. Unlike §5.1, we train a separate multitask model for each language in this setting.

Method	Multilingual Learning				Multitask Learning		Avg.
Method	NER	Semantic Parsing	QA InLang	QA CrossLang	Swahili	Bengali	Avg.
Compute-matched	$81.297\,913\,79$	$40.491\,851\,81$	$88.469\,852\,87$	$77.274\,442\,82$	$66.044\,353\,56$	$63.836\,740\,49$	69.6
Param-matched	$83.228\,062\,69$	47.2	$89.040\,080\,6$	$76.191\,510\,71$	$66.778\,000\,91$	$65.786\,265\,69$	71.3
FLix	84.3	$45.030\,633\,47$	89.4	77.6	70.2	69.0	72.6

No baseline method consistently out-performs other baselines.

We report the performance of our method and the baselines in Table2. For each experiment, we list the average result over all languages or tasks in the datasets. First, we find that among the different baseline methods, there is no method that consistently out-performs others. Specifically, param-matched LoRA tends to have advantage for multitask learning and tasks that are very different from pretraining(semantic parsing and NER), while compute-matched LoRA appears to be superior on multilingual QA tasks. We suspect that this is because param-matched LoRA benefits from its higher capacity for the model to learn to generate into targets that are very different from the pretraining data, and it is also helpful for supporting generation into diverse target tasks. However, param-matched LoRA has significantly higher numbers of trainable parameters. This could lead to much higher computational overhead compared to compute-matched LoRA. Moreover, we observe that compute-matched LoRA is actually more competitive on tasks such as crosslingual QA, likely because it reduces over-fitting by tuning a much smaller number of parameters.

FLix achieves much better performance than vanilla LoRA with the same computation budgets.

FLix consistently outperforms compute-matched LoRA baseline on all settings, achieving significant gains without adding additional serving cost. Furthermore, our proposed method also outperforms param-matched LoRA on five out of the six data mixtures we evaluated.

We hypothesize that param-matched LoRA is better than FLix at semantic parsing but worse at other tasks because semantic parsing requires the LLM to generate into structured data that are very different from the pretraining data distribution, which might require particularly large model capacity to learn. In fact, Table2 shows that param-matched LoRA with a higher rank is much better than compute-matched LoRA for semantic parsing, while being worse or comparable on question answering tasks. While param-matched LoRA could be particularly helpful for the semantic parsing task, it requires more computational resources to scale to large number of datasets. These results indicate that our method is an effective and computationally-efficient strategy for tuning LLMs on diverse data mixtures.

6 Analysis and Ablations

6.1 Effect of feature dropout

Feature dropout strength	Validation	Test
$p=0.7$	$87.572\,028\,27$	$89.293\,668\,96$
$p=0.5$	$87.310\,785\,93$ ( $-0.3$ )	$88.794\,375\,95$ ( $-0.5$ )
$p=0.3$	$87.106\,421\,58$ ( $-0.5$ )	$88.668\,604\,53$ ( $-0.6$ )
$p=0.0$	$86.893\,094\,38$ ( $-0.7$ )	$88.923\,055\,86$ ( $-0.4$ )

In this section, we evaluate the effectiveness of feature dropout for FLix. We compare the performance of FLix with and without feature dropout for in-language QA task using multilingual training, and the results are in Table3. We can see that removing feature dropout leads to significant performance drop for both validation and test set. We hypothesize that this is because feature dropout is an effective regularization that encourages FLix to utilize the shared parameters for positive transfer between tasks and languages.

Rank	QA CrossLang	QA InLang	Semantic Parsing	NER
$2$	77.6	89.4	45.0	84.3
$1$	76.5( $-1.1$ )	89.3( $-0.1$ )	45.2( $+0.2$ )	83.1( $-1.2$ )

6.2 Effect of rank for FLix

In multilingual experiments (Table2) the language features only have rank- $2$ feature-specific weight update matrices.Such low rank configurations help reduce the overall parameter count.In this section, we examine how well FLix performs under an even smaller budget. Table4 shows the results of FLix on multitask multilingual tuning with rank set to 4 and 2. We can see that reducing the capacity of the feature-specific weight update matrices actually only lead to a small drop in performance on most tasks, indicating that FLix could be effective under even restrictive computational requirements.

6.3 FLix performs increasingly better on a diverse training dataset

Inducing Generalization across Languages and Tasks using Featurized Low-Rank Mixtures (3)

In §5 we examined the performance of FLix and baselines under each data mixtures. Here we want to compare how different methods perform when using increasingly more diverse data mixtures. Specifically, we compare the change in task performance when using the multitask multilingual mixture as opposed to using only the multilingual data for each task. Figure2 shows the results of FLix and the baselines. We can see that vanilla LoRA suffers from negative transfer Wang etal. (2018). In particular, the compute-matched version with a smaller rank has significant decrease in performance on a diverse multitask multilingual training dataset. On the other hand, FLix is able to generally maintain similar or slightly higher performance when using the more diverse data mixture. This result indicates that FLix is a superior PEFT method to scale to large number of tasks and languages.

7 Related Work

Multilingual/multitask PEFT

While most prior works on parameter-efficient tuning focus on tuning the model for a single task, some recent works propose post-hoc composition pretrained PEFT modules to generalize to new task or language(Pfeiffer etal., 2020; Huang etal., 2023; Wang etal., 2021; Chronopoulou etal., 2023). These methods generally require separate training runs for each dataset, while our work propose a parameter-efficient tuning method that adapts the LLM using diverse data mixture in a single training run. Vu etal. (2022) proposed to add a multilingual pretraining stage to prompt tuning, which shows some improvements for zero-shot generalization when adapting to cross-lingual summarization task. In comparison, our proposed FLix focuses on tuning using multilingual data in many downstream tasks without additional training on unlabeled pretrainining data. Wang etal. (2023b) proposed a multitask prompt tuning method to learn a single prompt which could be used to adapt to many other target tasks. Similarly, this method requires multiple training stage while our method allows training end-to-end. Both of these methods are built upon prompt tuning using either multilingual training in a single task or multitask tuning in English, while our method supports diverse datasets in different tasks languages or any other arbitrary features.

Mixture-of-experts Models

Mixture-of-experts models (MoEs)(Shazeer etal., 2017; Lepikhin etal., 2020; Jawahar etal., 2023) are effective at increasing the model capacity by adding multiple expert parameters which could be activated differently to support different inputs. This architecture has been used to improve both pretrained models(Lepikhin etal., 2020; Jawahar etal., 2023) and parameter-efficient tuning methods(Zadouri etal., 2023; Zhu etal., 2023). Since MoEs often adds more computational cost, many works try to reduce the cost and improve the effectiveness of the model by either task-level routing (e.g., Task MoEs proposed by Kudugunta etal. (2021)) or encouraging sparsity of the experts(Shazeer etal., 2017; Lepikhin etal., 2020). FLix resembles Task MoEs in that FLix leverages task information as well. But FLix has additional composition capabilities thanks to its featurization.

Modularized Pretrained Models

Previous work proposed to add language-specific parameters to multilingual pretrained models(Pfeiffer etal., 2022; 2023) or machine translation models(Zhang etal., 2020). These works showed that language-specific modules often bring significant improvements to multilingual tasks. And they are especially helpful for zero-shot cross-lingual transfer. While prior works added language-specific modules during multilingual pretraining,in this paper we focus on the problem of adapting a pretrained model to a diverse mixture with many tasks and languages.

8 Future Work

There are several promising future directions for our work. While FLix achieves good zero-shot performance, it could potentially benefit from methods that automatically learn to select and compose parameters pretrained on different features for unseen data. It is also interesting to examine other applications of FLix: we encoded task and language informations as features in this work. But potentially other properties, such as modality, could also be featurized under FLix.

9 Conclusion

In this paper, we propose Featurized Low-Rank Mixtures(FLix), an effective parameter-efficient tuning method to fine-tune pretrained LLMs to data mixtures containing datasets in diverse tasks and languages. Our experiments show that FLix leads to significantly better performance on multitask multilingual fine-tuning compared to standard LoRA with little computational overhead. We also find that FLix achieves much better zero-shot generalization to new languages and task language combination unseen at training time.

References

Asai etal. (2021)Akari Asai, Jungo Kasai, Jonathan Clark, Kenton Lee, Eunsol Choi, and HannanehHajishirzi.XOR QA: Cross-lingual open-retrieval question answering.In Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, DilekHakkani-Tur, IzBeltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty,and Yichao Zhou (eds.), NAACL, Online, June 2021. Association forComputational Linguistics.
Brown etal. (2020)Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, JaredD Kaplan, PrafullaDhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell,Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, RewonChild, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, ChrisHesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess,Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever,and Dario Amodei.Language models are few-shot learners.In H.Larochelle, M.Ranzato, R.Hadsell, M.F. Balcan, and H.Lin(eds.), Advances in Neural Information Processing Systems, volume33,pp. 1877–1901. Curran Associates, Inc., 2020.URLhttps://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
Caruana (1997)Rich Caruana.Multitask learning.Machine Learning, 28:41–75, 1997.URL https://api.semanticscholar.org/CorpusID:45998148.
Chronopoulou etal. (2023)Alexandra Chronopoulou, Jonas Pfeiffer, Joshua Maynez, Xinyi Wang, SebastianRuder, and Priyanka Agrawal.Language and task arithmetic with parameter-efficient layers forzero-shot summarization, 2023.
Google etal. (2023)Google, Rohan Anil, AndrewM. Dai, Orhan Firat, Melvin Johnson, DmitryLepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey,Zhifeng Chen, Eric Chu, JonathanH. Clark, LaurentEl Shafey, Yanping Huang,Kathy Meier-Hellstern, Gaurav Mishra, Erica Moreira, Mark Omernick, KevinRobinson, Sebastian Ruder, YiTay, Kefan Xiao, Yuanzhong Xu, Yujing Zhang,GustavoHernandez Abrego, Junwhan Ahn, Jacob Austin, Paul Barham, Jan Botha,James Bradbury, Siddhartha Brahma, Kevin Brooks, Michele Catasta, Yong Cheng,Colin Cherry, ChristopherA. Choquette-Choo, Aakanksha Chowdhery, ClémentCrepy, Shachi Dave, Mostafa Dehghani, Sunipa Dev, Jacob Devlin, Mark Díaz,Nan Du, Ethan Dyer, Vlad Feinberg, Fangxiaoyu Feng, Vlad Fienber, MarkusFreitag, Xavier Garcia, Sebastian Gehrmann, Lucas Gonzalez, Guy Gur-Ari,Steven Hand, Hadi Hashemi, LeHou, Joshua Howland, Andrea Hu, Jeffrey Hui,Jeremy Hurwitz, Michael Isard, Abe Ittycheriah, Matthew Jagielski, WenhaoJia, Kathleen Kenealy, Maxim Krikun, Sneha Kudugunta, Chang Lan, KatherineLee, Benjamin Lee, Eric Li, Music Li, Wei Li, YaGuang Li, Jian Li, HyeontaekLim, Hanzhao Lin, Zhongtao Liu, Frederick Liu, Marcello Maggioni, AromaMahendru, Joshua Maynez, Vedant Misra, Maysam Moussalem, Zachary Nado, JohnNham, Eric Ni, Andrew Nystrom, Alicia Parrish, Marie Pellat, Martin Polacek,Alex Polozov, Reiner Pope, Siyuan Qiao, Emily Reif, Bryan Richter, ParkerRiley, AlexCastro Ros, Aurko Roy, Brennan Saeta, Rajkumar Samuel, ReneeShelby, Ambrose Slone, Daniel Smilkov, DavidR. So, Daniel Sohn, SimonTokumine, Dasha Valter, Vijay Vasudevan, Kiran Vodrahalli, Xuezhi Wang,Pidong Wang, Zirui Wang, Tao Wang, John Wieting, Yuhuai Wu, Kelvin Xu, YunhanXu, Linting Xue, Pengcheng Yin, Jiahui Yu, Qiao Zhang, Steven Zheng,CeZheng, Weikang Zhou, Denny Zhou, Slav Petrov, and Yonghui Wu.Palm 2 technical report, 2023.
Hu etal. (2022)EdwardJ Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, SheanWang, LuWang, and Weizhu Chen.LoRA: Low-rank adaptation of large language models.In International Conference on Learning Representations, 2022.URL https://openreview.net/forum?id=nZeVKeeFYf9.
Huang etal. (2023)Chengsong Huang, Qian Liu, BillYuchen Lin, Tianyu Pang, Chao Du, and Min Lin.Lorahub: Efficient cross-task generalization via dynamic loracomposition.arXiv preprint arXiv:2307.13269, 2023.
Ilharco etal. (2022)Gabriel Ilharco, MarcoTulio Ribeiro, Mitchell Wortsman, Suchin Gururangan,Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi.Editing models with task arithmetic.arXiv preprint arXiv:2212.04089, 2022.
Jawahar etal. (2023)Ganesh Jawahar, Subhabrata Mukherjee, Xiaodong Liu, YoungJin Kim, MuhammadAbdul-Mageed, Laks Lakshmanan, V.S., AhmedHassan Awadallah, SebastienBubeck, and Jianfeng Gao.AutoMoE: Heterogeneous mixture-of-experts with adaptivecomputation for efficient neural machine translation.In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (eds.),Findings of the Association for Computational Linguistics: ACL 2023,pp. 9116–9132, Toronto, Canada, July 2023. Association for ComputationalLinguistics.doi: 10.18653/v1/2023.findings-acl.580.URL https://aclanthology.org/2023.findings-acl.580.
Kudugunta etal. (2021)Sneha Kudugunta, Yanping Huang, Ankur Bapna, Maxim Krikun, Dmitry Lepikhin,Minh-Thang Luong, and Orhan Firat.Beyond distillation: Task-level mixture-of-experts for efficientinference.In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and ScottWen-tau Yih (eds.), Findings of the Association for ComputationalLinguistics: EMNLP 2021, pp. 3577–3599, Punta Cana, Dominican Republic,November 2021. Association for Computational Linguistics.doi: 10.18653/v1/2021.findings-emnlp.304.URL https://aclanthology.org/2021.findings-emnlp.304.
Lepikhin etal. (2020)Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, YanpingHuang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen.Gshard: Scaling giant models with conditional computation andautomatic sharding, 2020.
Lester etal. (2021)Brian Lester, Rami Al-Rfou, and Noah Constant.The power of scale for parameter-efficient prompt tuning.In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and ScottWen-tau Yih (eds.), Proceedings of the 2021 Conference on EmpiricalMethods in Natural Language Processing, Online and Punta Cana, DominicanRepublic, November 2021. Association for Computational Linguistics.URL https://aclanthology.org/2021.emnlp-main.243.
Lin etal. (2021)Zehui Lin, Liwei Wu, Mingxuan Wang, and Lei Li.Learning language specific sub-network for multilingual machinetranslation.In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli (eds.),ACL, Online, August 2021. Association for Computational Linguistics.
Ouyang etal. (2022)Long Ouyang, Jeff Wu, XuJiang, Diogo Almeida, CarrollL. Wainwright, PamelaMishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, JohnSchulman, Jacob Hilton, Fraser Kelton, LukeE. Miller, Maddie Simens, AmandaAskell, Peter Welinder, PaulFrancis Christiano, Jan Leike, and RyanJ. Lowe.Training language models to follow instructions with human feedback.ArXiv, abs/2203.02155, 2022.URL https://api.semanticscholar.org/CorpusID:246426909.
Pfeiffer etal. (2020)Jonas Pfeiffer, Ivan Vulić, Iryna Gurevych, and Sebastian Ruder.MAD-X: An Adapter-Based Framework for Multi-TaskCross-Lingual Transfer.In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu (eds.),Proceedings of the 2020 Conference on Empirical Methods in NaturalLanguage Processing (EMNLP), pp. 7654–7673, Online, November 2020.Association for Computational Linguistics.doi: 10.18653/v1/2020.emnlp-main.617.URL https://aclanthology.org/2020.emnlp-main.617.
Pfeiffer etal. (2022)Jonas Pfeiffer, Naman Goyal, XiLin, Xian Li, James Cross, Sebastian Riedel,and Mikel Artetxe.Lifting the curse of multilinguality by pre-training modulartransformers.In Marine Carpuat, Marie-Catherine deMarneffe, and IvanVladimirMezaRuiz (eds.), Proceedings of the 2022 Conference of the NorthAmerican Chapter of the Association for Computational Linguistics: HumanLanguage Technologies, pp. 3479–3495, Seattle, United States, July 2022.Association for Computational Linguistics.doi: 10.18653/v1/2022.naacl-main.255.URL https://aclanthology.org/2022.naacl-main.255.
Pfeiffer etal. (2023)Jonas Pfeiffer, Francesco Piccinno, Massimo Nicosia, Xinyi Wang, Machel Reid,and Sebastian Ruder.mmt5: Modular multilingual pre-training solves source languagehallucinations, 2023.
Ruder etal. (2023)Sebastian Ruder, JonathanH. Clark, Alexander Gutkin, Mihir Kale, Min Ma,Massimo Nicosia, Shruti Rijhwani, Parker Riley, Jean-MichelA. Sarr, XinyiWang, John Wieting, Nitish Gupta, Anna Katanova, Christo Kirov, DanaL.Dickinson, Brian Roark, Bidisha Samanta, Connie Tao, DavidI. Adelani, VeraAxelrod, Isaac Caswell, Colin Cherry, Dan Garrette, Reeve Ingle, MelvinJohnson, Dmitry Panteleev, and Partha Talukdar.Xtreme-up: A user-centric scarce-data benchmark for under-representedlanguages, 2023.
Shazeer etal. (2017)Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le,Geoffrey Hinton, and Jeff Dean.Outrageously large neural networks: The sparsely-gatedmixture-of-experts layer, 2017.
Soltan etal. (2022)Saleh Soltan, Shankar Ananthakrishnan, Jack G.M. FitzGerald, Rahul Gupta, WaelHamza, Haidar Khan, CharithS. Peris, Stephen Rawls, Andrew Rosenbaum, AnnaRumshisky, Chandan Prakash, Mukund Sridhar, Fabian Triefenbach, Apurv Verma,Gokhan Tur, and Premkumar Natarajan.Alexatm 20b: Few-shot learning using a large-scale multilingualseq2seq model.ArXiv, abs/2208.01448, 2022.URL https://api.semanticscholar.org/CorpusID:251253416.
Vu etal. (2022)TuVu, Aditya Barua, Brian Lester, Daniel Cer, Mohit Iyyer, and Noah Constant.Overcoming catastrophic forgetting in zero-shot cross-lingualgeneration.In EMNLP, 2022.
Wang etal. (2021)Xinyi Wang, Yulia Tsvetkov, Sebastian Ruder, and Graham Neubig.Efficient test time adapter ensembling for low-resource languagevarieties.In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and ScottWen-tau Yih (eds.), EMNLP, Punta Cana, Dominican Republic, November2021.
Wang etal. (2023a)Xinyi Wang, John Wieting, and JonathanH. Clark.Fiat: Fusing learning paradigms with instruction-accelerated tuning,2023a.
Wang etal. (2023b)Zhen Wang, Rameswar Panda, Leonid Karlinsky, Rogerio Feris, Huan Sun, and YoonKim.Multitask prompt tuning enables parameter-efficient transferlearning.In ICLR, 2023b.
Wang etal. (2018)Zirui Wang, Zihang Dai, Barnabás Póczos, and JaimeG. Carbonell.Characterizing and avoiding negative transfer.2019 IEEE/CVF Conference on Computer Vision and PatternRecognition (CVPR), pp. 11285–11294, 2018.URL https://api.semanticscholar.org/CorpusID:53748459.
Zadouri etal. (2023)Ted Zadouri, Ahmet Üstün, Arash Ahmadian, Beyza Ermiş, Acyr Locatelli, andSara Hooker.Pushing mixture of experts to the limit: Extremely parameterefficient moe for instruction tuning, 2023.
Zhang etal. (2020)Biao Zhang, Ankur Bapna, Rico Sennrich, and Orhan Firat.Share or not? learning to schedule language-specific capacity formultilingual translation.In ICLR, 2020.
Zhu etal. (2023)Yun Zhu, Nevan Wichers, Chu-Cheng Lin, Xinyi Wang, Tianlong Chen, Lei Shu, HanLu, Canoee Liu, Liangchen Luo, Jindong Chen, etal.Sira: Sparse mixture of low rank adaptation.arXiv preprint arXiv:2311.09179, 2023.

Appendix A Constant parameter sharing hurts FLix

Configuration	QA InLang	QA CrossLang	Semantic parsing	NER
FLix	$89.443\,008\,42$	$85.2$	$45.590\,648\,65$	$84.0$
+ shared parameters	$88.238\,650\,85$	$83.972\,558\,34$	$41.894\,719\,44$	$80.676\,577\,99$
Compute-matched LoRA	$86.6$	$83.1$	$35.4$	$76.3$

Our proposed FLix routes each input to their feature-specific adaptations, based on the dataset features.While constant parameter sharing happens when some features are always active for all datasets in a mixture (e.g., in experiments in §5.2), this is not generally true.In this section, we look into the effects of enforcing constant parameter-sharing in FLix. Specifically, we design a dummy feature that is always active to all datasets in the joint multitask multilingual training mixture used in §5.1.1, which implies constant parameter sharing.We then compare FLix models both without and with this dummy feature, and also with a vanilla LoRA baseline, all under comparable compute budgets.⁷⁷7Specifically, we rearrange the rank sizes of all features such that the sum is $6$ . The dummy feature has rank $=4$ , and the language and task features have rank $=1$ . The compute-matched LoRA baseline does not make use of dataset features, and has rank $=6$ .

In Table5, we can see that adding a shared parameter actually leads to worse performance for FLix with similar computational cost. However, FLix with constantly shared features still outperforms computed-matched LoRA, which does not leverage task or language features at all.

Appendix B Training and evaluation datasets

B.1 Multilingual learning

The (mono-task) multilingual learning experiments described in §5.2 train and evaluate on the same languages. Their language and locale codes are:

Semantic parsing: am, be, bn, fi, ha, hu, ja, pt_br, ru, sw, ta, tr, yo, zu, de_localized, en, de, es, fr, hi, th
In-language QA: ar, bn, fi, id, ko, ru, sw, te, en
Cross-lingual QA: ar, bn, fi, ko, ru, te, as, bho, brx, gbm, gom, gu, hi, hne, kn, mai, ml, mni, mr, mwr, or, pa, ps, sa, ta, ur
NER: am, bm, bbj, ee, ha, ig, rw, lg, luo, mos, ny, pcm, sn, sw, tn, tw, wo, xh, yo, zu

B.2 Multitask learning

The (mono-lingual) multitask learning experiments described in §5.2 train and evaluate on $3$ different tasks, for the $2$ languages we evaluate on respectively. They are:

Swahili (sw): Semantic parsing, NER, in-language QA
Bengali (bn): Semantic parsing, in-language QA, cross-lingual QA

B.3 Joint multitask multilingual learning

B.3.1 Training

The joint multitask multilingual learning experiments in §5.1 use the union of the following datasets:

Semantic parsing: am, be, bn, fi, ha, hu, ja, pt_br, ru, sw, ta, tr, yo, zu, de_localized, en, de, es, fr, hi, th
In-language QA: ar, bn, fi, id, ko, ru, sw, te, en
Cross-lingual QA: ar, bn, fi, ko, ru, te
NER: am, bm, bbj, ee, ha, ig, rw, lg, luo, mos, ny, pcm, sn, sw, tn, tw, wo, xh, yo, zu
Machine translation: id, es, hi, yo, ja, lg, ny, ru, be, ar, de, bn, fr, tr, ig, th, fi, zu, te, ko, sw, xh, hu, ha, sn, ta, am

B.3.2 Multilingual evaluation

We evaluate models trained on the dataset described in §B.3.1 on the $4$ multilingual subsets of the training dataset. The results are reported in the first $4$ columns of Table1.

B.4 Zero-shot generalization

B.4.1 Training

For zero-shot evaluations on cross-lingual QA datasets in §5.1.2, we reuse the joint multitask multilingual training dataset described in §B.3.1.For evaluation of zero-shot unseen combinations on semantic parsing datasets in the same section, we use the union of the following datasets:

Semantic parsing: fi, hu, ja, pt_br, ru, tr, de_localized, en, de, es, fr, hi
In-language QA: ar, bn, fi, id, ko, ru, sw, te, en
Cross-lingual QA: ar, bn, fi, ko, ru, te, as, bho, brx, gbm, gom, gu, hi, hne, kn, mai, ml, mni, mr, mwr, or, pa, ps, sa, ta, ur
NER: am, bm, bbj, ee, ha, ig, rw, lg, luo, mos, ny, pcm, sn, sw, tn, tw, wo, xh, yo, zu
Machine translation: de, mni, mr, ig, id, mwr, ko, fi, ta, bn, ar, es, ja, zu, be, gbm, tr, as, bho, yo, ml, sw, hi, am, bbj, lg, pcm, ny, tw, hu, luo, rw, brx, pt_br, gu, or, gom, te, sn, wo, fr, ps, ha, hne, xh, en, sa, tn, de_localized, ur, bm, ee, th, ru, pa, kn, mai, mos

B.4.2 Zero-shot unseen combinations in cross-lingual QA

In the ‘Zero-shot Unseen Comb: QA CrossLang’ column of Table1 we report the average cross-lingual QA F1 scores of languages that are missing from the cross-lingual QA portion of the joint multitask multilingual training set (§B.3.1), but present in other datasets of the same training set. The language and locale codes of these languages are: hi, ta.

B.4.3 Zero-shot unseen languages in cross-lingual QA

In the ‘Zero-shot Unseen Lang’ column of Table1 we report the average cross-lingual QA F1 scores of languages that are missing from the joint multitask multilingual training set. The language and locale codes of these languages are: bho, brx, gbm, gom, hne, mai, mni, mr, mwr, sa, as, gu, kn, ml, or, pa, ps, ur.

B.4.4 Zero-shot unseen combinations in semantic parsing

In the ‘Zero-shot Unseen Comb: Semantic parsing’ column of Table1 we report the average exact match accuracy scores of languages that are designated as underrepresented in the XTREME-UP dataset, and held out from the training set in this experiment. These languages however appear in the machine translation portion of the training dataset. The language and locale codes of these languages are: am, be, bn, ha, sw, ta, yo, zu, th.

Appendix C Full results of Table2

In addition to the test results reported in Table2, we include validation results as well.

C.1 Multilingual experiments

C.1.1 Cross-lingual QA

Language / Locale ID	Validation	Test
ar	$81.4$	$80.8$
bn	$77.9$	$83.7$
fi	$81.2$	$79.8$
ko	$85.0$	$83.9$
ru	$78.0$	$80.5$
te	$79.8$	$81.5$
as	$78.5$	$78.7$
bho	$74.6$	$75.8$
brx	$54.4$	$47.4$
gbm	$74.1$	$73.4$
gom	$78.3$	$75.8$
gu	$77.9$	$80.2$
hi	$82.2$	$83.5$
hne	$74.1$	$77.1$
kn	$79.2$	$80.1$
mai	$77.6$	$77.0$
ml	$77.0$	$79.3$
mni	$63.4$	$63.1$
mr	$76.1$	$78.1$
mwr	$76.0$	$75.9$
or	$77.4$	$77.7$
pa	$78.0$	$79.8$
ps	$76.5$	$77.4$
sa	$74.9$	$78.3$
ta	$78.4$	$78.5$
ur	$76.8$	$76.7$
Average	$76.5$	$77.1$

Language / Locale ID	Validation	Test
ar	$82.4$	$83.4$
bn	$78.9$	$83.3$
fi	$82.8$	$82.6$
ko	$85.3$	$88.3$
ru	$79.6$	$83.2$
te	$81.8$	$81.5$
as	$80.7$	$78.9$
bho	$79.1$	$75.5$
brx	$54.8$	$47.5$
gbm	$78.4$	$73.3$
gom	$78.6$	$76.8$
gu	$79.5$	$78.8$
hi	$83.0$	$83.7$
hne	$78.8$	$77.9$
kn	$80.0$	$79.9$
mai	$79.0$	$75.6$
ml	$79.7$	$79.0$
mni	$60.4$	$60.5$
mr	$78.0$	$77.4$
mwr	$77.3$	$75.6$
or	$79.2$	$78.1$
pa	$79.1$	$78.6$
ps	$78.0$	$77.9$
sa	$77.0$	$77.2$
ta	$79.6$	$78.7$
ur	$78.5$	$76.1$
Average	$78.1$	$77.3$

Language / Locale ID	Validation	Test
ar	$81.3$	$82.8$
bn	$79.7$	$83.5$
fi	$80.7$	$81.8$
ko	$84.8$	$87.0$
ru	$78.6$	$82.0$
te	$79.8$	$82.2$
as	$79.1$	$78.3$
bho	$77.7$	$75.2$
brx	$58.1$	$51.9$
gbm	$76.9$	$73.2$
gom	$76.9$	$71.5$
gu	$78.2$	$76.6$
hi	$84.2$	$83.2$
hne	$78.1$	$75.5$
kn	$79.7$	$77.7$
mai	$80.3$	$75.8$
ml	$79.7$	$77.7$
mni	$63.4$	$59.5$
mr	$78.5$	$77.0$
mwr	$77.3$	$75.9$
or	$76.3$	$73.9$
pa	$77.6$	$76.4$
ps	$74.8$	$71.7$
sa	$78.8$	$76.6$
ta	$80.3$	$78.5$
ur	$78.9$	$75.5$
Average	$77.7$	$76.2$

Language / Locale ID	Validation	Test
ar	$83.0$	$83.6$
bn	$79.2$	$84.6$
fi	$82.7$	$83.4$
ko	$85.9$	$87.4$
ru	$79.6$	$84.3$
te	$79.9$	$80.3$
as	$81.7$	$77.8$
bho	$78.8$	$76.3$
brx	$56.9$	$50.6$
gbm	$78.1$	$73.6$
gom	$78.4$	$76.3$
gu	$79.7$	$77.1$
hi	$82.2$	$83.6$
hne	$78.1$	$76.3$
kn	$81.2$	$81.0$
mai	$78.7$	$77.6$
ml	$80.4$	$78.7$
mni	$62.5$	$63.4$
mr	$78.3$	$77.7$
mwr	$76.3$	$74.8$
or	$79.5$	$78.1$
pa	$80.2$	$78.2$
ps	$78.6$	$77.0$
sa	$77.9$	$78.0$
ta	$80.8$	$78.8$
ur	$79.5$	$77.6$
Average	$78.4$	$77.6$

C.1.2 In-language QA

Language / Locale ID	Validation	Test
ar	$87.3$	$88.5$
bn	$85.8$	$86.4$
fi	$89.5$	$90.2$
id	$85.3$	$86.0$
ko	$81.7$	$84.6$
ru	$87.0$	$85.1$
sw	$84.3$	$87.5$
te	$90.4$	$92.4$
en	$85.3$	$87.1$
Average	$86.3$	$87.5$

Language / Locale ID	Validation	Test
ar	$86.4$	$87.6$
bn	$89.0$	$85.9$
fi	$88.9$	$89.1$
id	$87.8$	$88.7$
ko	$83.4$	$86.2$
ru	$86.6$	$87.8$
sw	$86.6$	$89.6$
te	$91.1$	$93.1$
en	$85.3$	$88.3$
Average	$87.2$	$88.5$

Language / Locale ID	Validation	Test
ar	$86.4$	$87.7$
bn	$90.7$	$87.7$
fi	$89.6$	$89.7$
id	$87.9$	$88.7$
ko	$81.6$	$87.5$
ru	$86.2$	$88.7$
sw	$87.2$	$90.5$
te	$91.0$	$92.9$
en	$84.8$	$87.9$
Average	$87.3$	$89.0$

Language / Locale ID	Validation	Test
ar	$87.2$	$89.1$
bn	$90.2$	$89.1$
fi	$90.1$	$89.7$
id	$87.8$	$88.9$
ko	$82.5$	$85.9$
ru	$87.8$	$89.1$
sw	$86.5$	$90.2$
te	$91.0$	$93.5$
en	$87.4$	$88.9$
Average	$87.8$	$89.4$

C.1.3 Semantic parsing

Language / Locale ID	Validation	Test
am	$25.5$	$14.7$
be	$34.7$	$25.8$
bn	$38.5$	$25.7$
fi	$33.9$	$25.0$
ha	$28.9$	$22.5$
hu	$31.0$	$23.4$
ja	$36.8$	$25.9$
pt-br	$36.8$	$25.5$
ru	$40.6$	$28.0$
sw	$34.7$	$23.1$
ta	$31.4$	$28.8$
tr	$40.2$	$24.4$
yo	$20.5$	$13.1$
zu	$25.1$	$15.2$
de-localized	$31.2$	$21.4$
en	$37.7$	$25.3$
de	$33.9$	$25.5$
es	$35.3$	$25.7$
fr	$30.1$	$21.8$
hi	$35.5$	$14.0$
th	$36.5$	$18.3$
Average	$33.3$	$22.5$

Language / Locale ID	Validation	Test
am	$41.0$	$31.8$
be	$52.3$	$43.3$
bn	$51.0$	$40.6$
fi	$53.1$	$45.2$
ha	$46.0$	$35.9$
hu	$50.6$	$38.5$
ja	$52.3$	$38.4$
pt-br	$57.7$	$45.8$
ru	$60.7$	$45.4$
sw	$51.0$	$39.9$
ta	$47.3$	$39.5$
tr	$53.1$	$39.9$
yo	$39.3$	$28.8$
zu	$41.0$	$31.7$
de-localized	$57.4$	$44.1$
en	$54.8$	$45.3$
de	$54.8$	$43.0$
es	$57.8$	$45.1$
fr	$62.8$	$48.9$
hi	$54.8$	$37.0$
th	$55.3$	$42.4$
Average	$52.1$	$40.5$

Language / Locale ID	Validation	Test
am	$48.5$	$36.0$
be	$59.0$	$50.3$
bn	$60.7$	$47.4$
fi	$62.3$	$51.9$
ha	$53.6$	$45.2$
hu	$60.3$	$48.4$
ja	$56.1$	$44.1$
pt-br	$61.9$	$52.9$
ru	$64.4$	$51.1$
sw	$51.9$	$43.8$
ta	$57.3$	$45.9$
tr	$59.4$	$45.6$
yo	$43.5$	$36.4$
zu	$44.8$	$37.6$
de-localized	$65.8$	$50.7$
en	$64.0$	$53.3$
de	$58.2$	$49.2$
es	$63.0$	$52.8$
fr	$65.8$	$53.9$
hi	$65.8$	$44.8$
th	$65.3$	$50.7$
Average	$58.6$	$47.2$

Language / Locale ID	Validation	Test
am	$47.7$	$37.4$
be	$58.6$	$47.4$
bn	$54.4$	$46.7$
fi	$59.4$	$50.3$
ha	$49.4$	$41.3$
hu	$56.5$	$46.3$
ja	$56.9$	$44.3$
pt-br	$60.7$	$48.8$
ru	$60.7$	$48.4$
sw	$58.6$	$42.7$
ta	$55.6$	$46.5$
tr	$58.2$	$44.6$
yo	$44.4$	$34.0$
zu	$37.2$	$37.6$
de-localized	$61.9$	$44.4$
en	$64.9$	$49.9$
de	$54.4$	$44.1$
es	$61.3$	$50.6$
fr	$62.8$	$50.3$
hi	$62.6$	$42.5$
th	$59.4$	$47.5$
Average	$56.4$	$45.0$

C.1.4 Named-entity recognition (NER)

Language / Locale ID	Validation	Test
am	$74.8$	$72.8$
bm	$73.4$	$68.9$
bbj	$60.2$	$62.2$
ee	$84.9$	$82.4$
ha	$89.6$	$86.5$
ig	$82.0$	$82.2$
rw	$77.9$	$76.1$
lg	$87.0$	$83.8$
luo	$66.1$	$71.4$
mos	$63.8$	$65.7$
ny	$86.8$	$88.3$
pcm	$82.1$	$81.6$
sn	$89.3$	$88.6$
sw	$89.6$	$88.8$
tn	$77.1$	$84.4$
tw	$72.9$	$74.0$
wo	$81.5$	$76.5$
xh	$83.7$	$81.1$
yo	$79.1$	$80.7$
zu	$79.2$	$82.2$
Average	$79.1$	$78.9$

Language / Locale ID	Validation	Test
am	$77.3$	$73.7$
bm	$72.4$	$67.8$
bbj	$54.5$	$66.1$
ee	$82.9$	$82.6$
ha	$90.9$	$90.0$
ig	$84.4$	$85.3$
rw	$80.0$	$77.4$
lg	$87.5$	$86.2$
luo	$77.2$	$76.3$
mos	$67.6$	$66.0$
ny	$87.8$	$89.2$
pcm	$82.8$	$86.3$
sn	$90.4$	$90.2$
sw	$90.7$	$90.9$
tn	$80.8$	$86.6$
tw	$76.7$	$79.1$
wo	$80.8$	$78.5$
xh	$83.5$	$83.8$
yo	$80.3$	$83.0$
zu	$84.7$	$87.3$
Average	$80.6$	$81.3$

Language / Locale ID	Validation	Test
am	$79.1$	$77.4$
bm	$79.9$	$73.8$
bbj	$65.7$	$68.3$
ee	$88.0$	$86.7$
ha	$93.6$	$91.3$
ig	$86.8$	$85.2$
rw	$82.1$	$78.0$
lg	$89.2$	$86.3$
luo	$76.7$	$77.1$
mos	$72.1$	$70.6$
ny	$88.8$	$89.2$
pcm	$87.4$	$88.2$
sn	$92.7$	$92.6$
sw	$91.7$	$90.8$
tn	$82.9$	$88.6$
tw	$81.5$	$81.8$
wo	$81.1$	$80.8$
xh	$85.4$	$86.2$
yo	$82.2$	$84.7$
zu	$85.7$	$87.3$
Average	$83.6$	$83.2$

Language / Locale ID	Validation	Test
am	$81.8$	$80.0$
bm	$79.2$	$73.2$
bbj	$66.5$	$72.1$
ee	$87.2$	$86.1$
ha	$93.6$	$91.1$
ig	$85.9$	$86.5$
rw	$83.5$	$80.0$
lg	$90.2$	$88.1$
luo	$78.6$	$80.4$
mos	$73.0$	$73.7$
ny	$89.5$	$91.8$
pcm	$87.2$	$87.9$
sn	$93.2$	$93.2$
sw	$91.5$	$91.1$
tn	$83.1$	$88.1$
tw	$81.2$	$79.6$
wo	$84.7$	$81.3$
xh	$86.7$	$87.0$
yo	$83.5$	$86.0$
zu	$86.5$	$89.0$
Average	$84.3$	$84.3$

C.2 Multitask experiments

Task	Validation	Test
SemParse	$34.7$	$23.0$
QA-InLang	$84.2$	$87.3$
NER	$88.6$	$88.7$
Average	$69.2$	$66.3$

Task	Validation	Test
SemParse	$37.7$	$23.0$
QA-InLang	$83.9$	$86.4$
NER	$88.6$	$88.7$
Average	$70.1$	$66.0$

Task	Validation	Test
SemParse	$38.5$	$26.4$
QA-InLang	$82.5$	$85.2$
NER	$89.7$	$88.8$
Average	$70.3$	$66.8$

Task	Validation	Test
SemParse	$47.7$	$32.7$
QA-InLang	$85.2$	$87.5$
NER	$91.0$	$90.3$
Average	$74.6$	$70.2$

Task	Validation	Test
SemParse	$37.2$	$25.7$
QA-InLang	$87.6$	$86.7$
QA-CrossLang	$78.6$	$83.2$
Average	$67.8$	$65.2$

Task	Validation	Test
SemParse	$36.4$	$24.3$
QA-InLang	$88.4$	$85.4$
QA-CrossLang	$76.6$	$81.9$
Average	$67.2$	$63.8$

Task	Validation	Test
SemParse	$44.4$	$29.4$
QA-InLang	$89.1$	$83.6$
QA-CrossLang	$76.1$	$84.4$
Average	$69.8$	$65.8$

Task	Validation	Test
SemParse	$52.7$	$37.0$
QA-InLang	$90.0$	$85.9$
QA-CrossLang	$77.3$	$84.1$
Average	$73.3$	$69.0$

Appendix D Full results of Table1

As in AppendixC, we include validation results along with test results.

D.1 Supervised learning results

D.1.1 Cross-lingual QA

Language / Locale ID	Validation	Test
ar	$83.6$	$82.3$
bn	$79.6$	$82.9$
fi	$81.4$	$82.2$
ko	$85.0$	$85.2$
ru	$79.8$	$81.7$
te	$80.3$	$84.4$
Average	$81.6$	$83.1$

Language / Locale ID	Validation	Test
ar	$82.1$	$83.6$
bn	$79.3$	$84.5$
fi	$81.8$	$81.3$
ko	$84.5$	$86.0$
ru	$79.5$	$82.7$
te	$81.0$	$81.0$
Average	$81.4$	$83.2$

Language / Locale ID	Validation	Test
ar	$85.6$	$84.3$
bn	$82.2$	$86.2$
fi	$82.6$	$83.2$
ko	$85.5$	$88.2$
ru	$80.5$	$84.0$
te	$82.1$	$85.0$
Average	$83.1$	$85.2$

D.1.2 In-language QA

Language / Locale ID	Validation	Test
ar	$83.8$	$84.9$
bn	$88.1$	$83.9$
fi	$86.7$	$87.4$
id	$86.5$	$88.1$
ko	$80.9$	$85.0$
ru	$82.6$	$84.1$
sw	$85.2$	$88.2$
te	$89.2$	$91.7$
en	$83.3$	$85.9$
Average	$85.2$	$86.6$

Language / Locale ID	Validation	Test
ar	$85.2$	$86.9$
bn	$90.8$	$85.8$
fi	$87.0$	$87.5$
id	$85.4$	$87.1$
ko	$80.4$	$85.3$
ru	$84.9$	$86.2$
sw	$84.8$	$88.7$
te	$89.4$	$91.7$
en	$84.5$	$86.7$
Average	$85.8$	$87.3$

Language / Locale ID	Validation	Test
ar	$86.4$	$88.7$
bn	$90.4$	$91.1$
fi	$90.0$	$89.5$
id	$88.6$	$89.5$
ko	$83.2$	$86.1$
ru	$85.4$	$88.5$
sw	$87.4$	$90.1$
te	$91.5$	$93.6$
en	$85.0$	$87.9$
Average	$87.5$	$89.4$

D.1.3 Semantic parsing

Language / Locale ID	Validation	Test
am	$36.0$	$24.5$
be	$47.7$	$38.6$
bn	$51.5$	$34.3$
fi	$53.6$	$40.1$
ha	$39.3$	$30.0$
hu	$50.2$	$36.9$
ja	$46.4$	$33.9$
pt-br	$53.6$	$41.1$
ru	$54.0$	$40.2$
sw	$46.4$	$34.3$
ta	$43.5$	$31.4$
tr	$45.2$	$33.7$
yo	$29.7$	$24.0$
zu	$33.1$	$29.3$
de-localized	$56.4$	$40.3$
en	$50.6$	$39.9$
de	$47.7$	$38.0$
es	$54.9$	$42.2$
fr	$55.6$	$41.9$
hi	$49.7$	$31.4$
th	$54.7$	$37.5$
Average	$47.6$	$35.4$

Language / Locale ID	Validation	Test
am	$43.1$	$35.5$
be	$58.6$	$50.4$
bn	$59.0$	$47.4$
fi	$62.3$	$53.0$
ha	$51.0$	$42.1$
hu	$58.2$	$47.2$
ja	$59.0$	$45.9$
pt-br	$64.0$	$51.7$
ru	$63.2$	$51.3$
sw	$55.6$	$44.1$
ta	$54.0$	$47.8$
tr	$58.6$	$46.6$
yo	$43.5$	$34.3$
zu	$46.9$	$39.7$
de-localized	$61.9$	$49.8$
en	$61.5$	$53.1$
de	$56.1$	$48.6$
es	$64.7$	$51.7$
fr	$62.8$	$53.7$
hi	$66.5$	$46.4$
th	$63.5$	$49.7$
Average	$57.8$	$47.1$

Language / Locale ID	Validation	Test
am	$46.9$	$38.3$
be	$59.8$	$48.8$
bn	$58.2$	$47.1$
fi	$59.8$	$50.9$
ha	$50.6$	$41.0$
hu	$55.2$	$44.2$
ja	$54.0$	$45.5$
pt-br	$61.5$	$47.6$
ru	$62.8$	$50.6$
sw	$55.2$	$42.6$
ta	$53.1$	$45.7$
tr	$57.7$	$45.4$
yo	$43.5$	$34.1$
zu	$41.4$	$37.6$
de-localized	$60.4$	$45.6$
en	$61.1$	$50.7$
de	$57.7$	$45.5$
es	$60.7$	$50.4$
fr	$65.8$	$52.5$
hi	$63.9$	$45.2$
th	$61.8$	$48.2$
Average	$56.7$	$45.6$

D.1.4 NER

Language / Locale ID	Validation	Test
am	$68.8$	$70.8$
bm	$67.3$	$63.5$
bbj	$52.4$	$61.2$
ee	$80.5$	$78.6$
ha	$87.5$	$84.8$
ig	$78.2$	$82.1$
rw	$77.2$	$71.6$
lg	$85.4$	$82.4$
luo	$61.7$	$66.9$
mos	$62.9$	$61.6$
ny	$83.4$	$85.5$
pcm	$79.9$	$81.3$
sn	$84.1$	$84.4$
sw	$87.2$	$87.1$
tn	$77.8$	$84.3$
tw	$74.0$	$74.8$
wo	$75.2$	$73.9$
xh	$73.3$	$77.2$
yo	$71.4$	$75.0$
zu	$75.4$	$78.7$
Average	$75.2$	$76.3$

Language / Locale ID	Validation	Test
am	$74.4$	$76.7$
bm	$75.1$	$71.0$
bbj	$61.7$	$66.1$
ee	$85.1$	$83.5$
ha	$92.1$	$90.5$
ig	$85.4$	$86.1$
rw	$81.8$	$76.8$
lg	$87.3$	$85.3$
luo	$77.1$	$75.9$
mos	$66.7$	$66.4$
ny	$87.1$	$87.9$
pcm	$85.4$	$86.9$
sn	$91.8$	$90.6$
sw	$90.7$	$90.1$
tn	$82.0$	$87.6$
tw	$78.4$	$77.9$
wo	$79.4$	$78.9$
xh	$82.4$	$85.5$
yo	$79.5$	$81.2$
zu	$82.9$	$86.1$
Average	$81.3$	$81.6$

Language / Locale ID	Validation	Test
am	$81.1$	$79.4$
bm	$80.4$	$73.7$
bbj	$63.6$	$70.5$
ee	$87.0$	$86.1$
ha	$92.7$	$91.2$
ig	$85.4$	$86.9$
rw	$83.9$	$80.2$
lg	$89.1$	$87.2$
luo	$74.7$	$78.9$
mos	$71.2$	$72.5$
ny	$89.4$	$90.4$
pcm	$86.8$	$88.9$
sn	$91.9$	$92.7$
sw	$91.6$	$91.4$
tn	$83.2$	$88.9$
tw	$79.4$	$78.2$
wo	$83.3$	$81.9$
xh	$85.5$	$86.0$
yo	$83.0$	$85.8$
zu	$84.9$	$88.7$
Average	$83.4$	$84.0$

D.2 Zero-shot results

D.2.1 Zero-shot unseen combinations: cross-lingual QA

Language / Locale ID	Validation	Test
hi	$82.1$	$83.2$
ta	$79.3$	$80.1$
Average	$80.7$	$81.6$

Language / Locale ID	Validation	Test
hi	$84.4$	$84.4$
ta	$80.7$	$79.8$
Average	$82.5$	$82.1$

Language / Locale ID	Validation	Test
hi	$84.9$	$85.3$
ta	$81.9$	$80.4$
Average	$83.4$	$82.8$

D.2.2 Zero-shot unseen combinations: semantic parsing

Language / Locale ID	Validation	Test
am	$13.8$	$10.2$
be	$35.6$	$30.1$
bn	$31.0$	$21.6$
ha	$20.1$	$16.2$
sw	$34.3$	$22.0$
ta	$20.5$	$11.4$
yo	$7.5$	$8.9$
zu	$17.2$	$16.0$
th	$37.6$	$23.0$
Average	$24.2$	$17.7$

Language / Locale ID	Validation	Test
am	$24.7$	$17.6$
be	$54.0$	$42.8$
bn	$45.2$	$35.6$
ha	$30.5$	$25.5$
sw	$43.1$	$32.2$
ta	$30.5$	$23.1$
yo	$21.3$	$16.1$
zu	$27.6$	$24.8$
th	$55.9$	$38.4$
average	$37.0$	$28.4$

Language / Locale ID	Validation	Test
am	$46.9$	$38.3$
be	$59.8$	$48.8$
bn	$58.2$	$47.1$
ha	$50.6$	$41.0$
sw	$55.2$	$42.6$
ta	$53.1$	$45.7$
yo	$43.5$	$34.1$
zu	$41.4$	$37.6$
th	$61.8$	$48.2$
Average	$52.3$	$42.6$

D.2.3 Zero-shot unseen languages

Language / Locale ID	Validation	Test
as	$80.7$	$79.6$
bho	$78.9$	$76.0$
brx	$45.7$	$41.0$
gbm	$74.6$	$73.1$
gom	$77.8$	$75.3$
gu	$80.9$	$79.8$
hne	$77.1$	$78.5$
kn	$80.9$	$80.5$
mai	$79.3$	$76.2$
ml	$81.0$	$81.4$
mni	$52.3$	$52.9$
mr	$78.7$	$78.7$
mwr	$80.4$	$76.9$
or	$77.8$	$77.5$
pa	$79.3$	$80.4$
ps	$77.0$	$78.2$
sa	$78.1$	$80.5$
ur	$79.3$	$78.5$
Average	$75.5$	$74.7$

Language / Locale ID	Validation	Test
as	$81.3$	$79.8$
bho	$79.0$	$77.0$
brx	$46.5$	$41.2$
gbm	$73.4$	$72.5$
gom	$77.7$	$74.7$
gu	$81.9$	$80.0$
hne	$78.1$	$77.2$
kn	$82.7$	$80.1$
mai	$81.3$	$77.5$
ml	$81.6$	$80.9$
mni	$48.8$	$51.1$
mr	$80.0$	$80.3$
mwr	$77.7$	$75.8$
or	$77.8$	$79.2$
pa	$80.0$	$79.8$
ps	$77.0$	$78.5$
sa	$77.5$	$78.9$
ur	$78.3$	$77.7$
Average	$75.6$	$74.6$

Language / Locale ID	Validation	Test
as	$82.6$	$82.5$
bho	$78.7$	$78.7$
brx	$50.3$	$47.0$
gbm	$77.5$	$75.4$
gom	$80.3$	$78.6$
gu	$82.7$	$82.6$
hne	$79.0$	$79.8$
kn	$82.3$	$82.5$
mai	$81.8$	$79.9$
ml	$82.1$	$81.0$
mni	$56.2$	$59.0$
mr	$79.7$	$80.7$
mwr	$79.5$	$77.9$
or	$81.4$	$81.1$
pa	$79.9$	$81.7$
ps	$78.9$	$80.6$
sa	$80.1$	$81.6$
ur	$80.2$	$79.9$
Average	$77.4$	$77.2$