Nick Halstead
9th December 2010

We have updated the language detection service within DataSift, because the previous version did not identify the language of interactions as reliably or efficiently as we had hoped. If you’re wondering where to find it, go to Create Stream or Edit Stream; it’s in the CSDL Language Help area, at the bottom of the list.

Thanks to the improved efficiency, we have added support for ten more languages, which are marked as new in the table below. If there is a language that we do not currently support, please point us towards some sample text in that language to train the detector on: 200+KB of a book or web articles is ideal, but dictionaries and lists should be avoided. You can do this by raising a suggestion here, and we will try to include it in our next release.

For those interested in the technical detail of what has changed, here’s the lowdown. Our system still uses an n-gram based approach to language detection, but it now uses fixed-length trigrams (blocks of 3 characters) instead of variable-length n-grams. This improves processing efficiency, as we no longer loop over the interaction text once for each length of n-gram. Instead, we generate all the trigrams in a single pass. We also now generate trigrams that include word boundaries, rather than looking at each word in isolation.
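
To make the single-pass idea concrete, here is a minimal Python sketch of trigram generation that keeps word boundaries. It is an illustration only, not DataSift's implementation: the function name `trigram_counts`, the lowercasing, and the choice to represent each word boundary as a single space (with one space of padding at each end) are all assumptions.

```python
from collections import Counter


def trigram_counts(text: str) -> Counter:
    """Count character trigrams in one pass over the text.

    Hypothetical helper: lowercasing and space padding are assumptions,
    not details taken from the DataSift service.
    """
    # Collapse runs of whitespace to one space and pad both ends, so that
    # trigrams can contain the boundary before, between, and after words.
    normalised = " " + " ".join(text.lower().split()) + " "

    counts = Counter()
    # A single sliding window of width 3; no separate loop per n-gram length.
    for i in range(len(normalised) - 2):
        counts[normalised[i:i + 3]] += 1
    return counts


if __name__ == "__main__":
    # Trigrams such as "o b" cross the boundary between two words.
    for gram, n in sorted(trigram_counts("to be or not to be").items()):
        print(repr(gram), n)
```

Because the window slides over the whole normalised string, trigrams such as "o b" in "to be" capture which characters tend to end one word and start the next, which a word-by-word approach would miss.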

Language      language.tag code (ISO 639-1)
Afrikaans     af
Bulgarian     bg (new)
Czech         cs (new)
Danish        da
German        de
Greek         el (new)
English       en
Spanish       es
Finnish       fi
French        fr
Hebrew        he (new)
Hungarian     hu
Icelandic     is (new)
Italian       it
Japanese      ja
Latin         la (new)
Dutch         nl
Norwegian     no
Polish        pl (new)
Portuguese    pt
Romanian      ro (new)
Russian       ru (new)
Swedish       sv
Tagalog       tl (new)
Chinese       zh
