The data linked from this page is made available under the Creative Commons BY-SA (Attribution-ShareAlike) license unless specified otherwise in the documentation. By downloading the data, you acknowledge the terms and conditions of the license. If you use the resources, please cite the papers indicated in their respective documentation.


  • Instantiation / hypernymy datasets (Boleda, Gupta, Padó, EACL 2017). Download them here.
  • The LAMBADA dataset (Paperno, Kruszewski, Lazaridou, Pham, Bernardi, Pezzelle, Baroni, Boleda, and Fernández, ACL 2016) for word prediction requiring a broad discourse context. Download it from its webpage.
  • Intensional / non-intensional modification dataset (Boleda, Baroni, Pham, McNally, IWCS 2013). Download it here.
  • Regular polysemy evaluation dataset (Boleda, Padó, Utt, *SEM 2012). Download it here.
  • Intersective, subsective, and intensional adjective-noun phrases (Boleda, Vecchi, Cornudella, McNally, EMNLP 2012). Download it here.
  • Datasets for color terms (Bruni, Boleda, Baroni, Tran, ACL 2012). Download them here.
  • The Database of Catalan Adjectives (DCA) (Sanromà and Boleda, LREC 2012). The DCA consists of 2,296 Catalan adjective lemmata enriched with morphological, syntactic and semantic information. Download it from its page at the ACL data and code repository.


  • Wikicorpus (Reese, Boleda, Cuadros, Padró, Rigau, LREC 2012). The Wikicorpus is a trilingual corpus (Catalan, Spanish, English) that contains large portions of Wikipedia (based on a 2006 dump) and has been automatically enriched with linguistic information. In its present version, it contains over 750 million words. For more information and download options, visit the project's page. It is also available via a web interface here.
  • CUCWeb (Boleda, Bott, Castillo, Meza, Badia, López, EACL WaC workshop 2006). CUCWeb is a 166-million-word corpus of Catalan built by crawling the Web. If you are interested in obtaining it, get in touch with me.


  • POS tagger for Old Spanish (Sánchez-Marco, Boleda, Padró, LaTeCH 2011). It is part of FreeLing, an open-source suite of language analyzers.
  • CatCG (Alsina, Badia, Boleda, Bott, Gil, Quixal, Valentín, LREC 2002). CatCG is a Constraint Grammar tagger and shallow parser for Catalan. If you are looking for a freely available NLP tool to process Catalan, consider FreeLing.