Corpora

Ponyland houses various corpora useful for language and speech research, including CGN, SoNaR and BNC. They are all available at /vol/bigdata/corpora. This is the full list, automatically updated daily:

acorns
ANP
BASILEX
BNC
buckeye_modified
CALLFRIEND
CELEX
CGN2
CMC
CommonCrawl
Cornetto_2.0
COW
DPC
du
DutchSemCor
dutchsemcor.tar
elex1.1
EMEA
europarl3
europarl7
fisher
FrogData
GOOGLE
Google-Web1T-5gram
Google-Web1T-5gram-10-European-Languages
IWSLT06
IWSLT2012-MT-TEDTALKS
JASMIN
JRC-Acquis-2.2
JRC-Acquis-3.0
JRC-moses
LDC
MERIT
MultiUN
NewsCommentary
NewsCrawl
NIST
OpenSubtitles2011
OpenSubtitles2012
OPUS
PASCAL
SONAR1
SoNaR500
SoNaR500.Curated
swb1
swb2p1
swb2p3
swbcellp1
swbcellp2
Taaltest
TIME
tweet_etks
tweetsForAZGames
TWENTE
UMBC
ValkuilData
Wikicorpus
WikiLeaks
WSJ0
Yandex
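
Since the list above is only refreshed once a day, it can be handy to check the current contents of the folder yourself. The following is a minimal sketch in Python, assuming you run it on a Ponyland machine where /vol/bigdata/corpora is mounted and readable; the function name list_corpora is made up for illustration and is not an existing Ponyland tool. It sorts names case-insensitively, which appears to be the ordering used in the list above.

```python
import os

# Path taken from this page; adjust it if the corpora live elsewhere.
CORPORA_ROOT = "/vol/bigdata/corpora"


def list_corpora(root=CORPORA_ROOT):
    """Return the names of all entries directly under root,
    sorted case-insensitively (hypothetical helper)."""
    return sorted(os.listdir(root), key=str.lower)


if __name__ == "__main__":
    for name in list_corpora():
        print(name)
```

The same function can be pointed at another folder, for example list_corpora("/vol/bigdata/datasets").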

A separate folder, /vol/bigdata/datasets, holds smaller, more specific, personally collected 'corpora'.

wiki/ponyland/corpora.txt · Last modified: 2016/03/18 11:54 by Wessel Stoop