Jinfo: Combat Transcription Errors With Linguistic Identity Matching

Jinfo Blog

8th October 2013

Abstract

Accurate linguistic identity matching is essential for compliance professionals. Victoria Meyer examines the main sources of variation in names and highlights what to look for when choosing a compliance screening system. She focuses on transcription, where the phonetics of the languages involved can produce many different spellings of the same name. The ideas presented here are taken from "Linguistic Identity Matching" by Lisbach and Meyer (2013).

Linguistic identity matching is the modern approach to maximising both the recall (the share of true matches that are found) and the precision (the share of returned matches that are correct) of identity search systems. It largely resolves the problems of earlier fuzzy-matching techniques, which were widely recognised to deliver extremely poor precision without adequately addressing search risks.

This article looks at the sources of variation in names, and goes on to consider the most effective ways to work around them in a search. Further details are available in “Linguistic Identity Matching” (Lisbach and Meyer, 2013).

The main sources of variation in names are:

Transcription variants
Phonetic misspellings
Derivative name forms
Typos and OCR errors.

This feature focuses on transcription, which can be considered the Achilles heel of many matching systems.

Transcription Variants

Transcription is the process of converting a name from one language or script to another so that native speakers of the two languages would pronounce it the same way. Because the target languages spell the same sounds differently, many different transcription variants of one name are often possible.

The name Ельцин is a good example of this. The most common English transcription, Yeltsin, sounds the same as the common German transcription, Jelzin, because the German “J” is pronounced like the English “y” and the German “z” like “ts”. The following examples show how widely transcriptions from Cyrillic can vary.

Original Russian Cyrillic	English Transcription	German Transcription	French Transcription
Никита Сергеевич Хрущёв	Nikita Sergeyevich Khrushchev	Nikita Sergejewitsch Chruschtschow	Nikita Sergueïevitch Khrouchtchev
Михаил Сергеевич Горбачёв	Mikhail Sergeyevich Gorbachev	Michail Sergejewitsch Gorbatschow	Mikhaïl Sergueïevitch Gorbatchev

The Arabic example:

Transcriptions from Arabic vary even more widely, partly because of the number of Arabic dialects in use. Word boundaries often change in transcription. For example, عبد الرحمن can be transcribed as one part (Abdourrahmane), as two parts (Abdul Rahman) or as three parts (Abd ur Rahman).
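The word-boundary part of this problem can be illustrated with a minimal Python sketch. The function name `collapse_boundaries` and the extra spellings in `variants` are assumptions chosen for illustration, not part of any particular screening product. The sketch removes spaces, hyphens and apostrophes so that one-part, two-part and three-part forms of the same spelling compare as equal.

```python
import re

def collapse_boundaries(name: str) -> str:
    """Lower-case a name and strip spaces, hyphens and apostrophes."""
    return re.sub(r"[\s\-'’‘]+", "", name.lower())

# Illustrative spellings that differ only in where the word boundaries fall.
variants = ["Abd ur Rahman", "Abdur Rahman", "Abdurrahman", "Abd-ur-Rahman"]
keys = {collapse_boundaries(v) for v in variants}
print(keys)  # {'abdurrahman'}

# A different spelling still produces a different key.
print(collapse_boundaries("Abdul Rahman"))  # abdulrahman
```

Collapsing boundaries only removes one source of variation. Abdul Rahman and Abdourrahmane still yield different strings, so the spelling differences have to be handled by a separate step.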

The following examples show how varied transcriptions from Arabic can be.

محمد	Muhammad, Mohamed, Muhammed, Mhamed, Mouhammad, Mukhamed
جمعة	Jumaa, Joumah, Djum’a, Dschum’aa, Djomaah, Joum’aa
سليمان	Suleyman, Soliman, Souleymane, Suleiman, Soulaiman
عز الدين	Izzadin, Ezedine, Ezzedin, Izz al-Din, Ez aldeen, Izzuddine
عبيد	Ubaid, ‘Ubeid, ‘Obayd, Oubeyd, Abid, Abeed, ‘Abeed
القذافي	Qaddafi, Kadhafi, Gadhafi, Al Quathafi, Khadafy, Kazzafi, Gheddafi, El Quathafi

The Chinese example:

The number of different transcription standards in use around the Chinese-speaking world, and the fact that many Japanese, Korean and Vietnamese names are also written in Chinese characters, mean that transcriptions of names written in Chinese can vary widely.

Chinese	Hanyu Pinyin	Jyutping	Other variants used in PR China, Taiwan and Singapore	Other variants used in Japan, Korea and Vietnam
李	Li	Lei	La, Lee, Lii	Lee, Ly, Rhee, Ri
王	Wang	Wong	Heng, Ong, Vong, Wong, Yu, Yuh	O, Vuong, Wang
黃	Huang	Wong	Hwang, Wang, Vong, Ng, Eng, Wee, Oei, Ooi, Bong, Uy, Ung	Hoang, Huynh, Hwang, Ko
吳	Wu	Ng	Eng, Go, Goh, Gouw, Ngo, Ung, Woo	Go, Kure, Ngo, Oh

Solving the Transcription Issue

Linguistic similarity keys are the only practical approach to the transcription issue. The approach resembles older phonetic matching algorithms, but uses many more rules and, most importantly, a different rule set for each combination of source and target language.
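As a rough illustration of the idea, the Python sketch below builds a toy similarity key. The rule tables, the `similarity_key` function and the decision to drop vowels are all assumptions made for this example. They are not taken from Lisbach and Meyer or from any product, and a production rule set would be far larger and more careful.

```python
# Hypothetical rule sets for names transcribed from Russian: each maps a
# spelling pattern of the target language to a shared sound code.
RULES = {
    "en": [("shch", "SHCH"), ("kh", "KH"), ("sh", "SH"), ("ch", "CH"),
           ("ts", "TS"), ("y", "Y"), ("v", "V")],
    "de": [("schtsch", "SHCH"), ("tsch", "CH"), ("sch", "SH"), ("ch", "KH"),
           ("z", "TS"), ("j", "Y"), ("w", "V")],
}
VOWELS = set("aeiou")

def similarity_key(name: str, lang: str) -> str:
    """Reduce a transcribed name to sound codes using the rules for lang."""
    # Try longer patterns first so "schtsch" wins over "sch" and "ch".
    rules = sorted(RULES[lang], key=lambda rule: -len(rule[0]))
    text = name.lower()
    codes = []
    i = 0
    while i < len(text):
        for pattern, code in rules:
            if text.startswith(pattern, i):
                codes.append(code)
                i += len(pattern)
                break
        else:
            # No rule applies: keep consonants, drop vowels (a crude choice).
            ch = text[i]
            if ch.isalpha() and ch not in VOWELS:
                codes.append(ch.upper())
            i += 1
    return "-".join(codes)

print(similarity_key("Yeltsin", "en"))        # Y-L-TS-N
print(similarity_key("Jelzin", "de"))         # Y-L-TS-N
print(similarity_key("Khrushchev", "en"))     # KH-R-SHCH-V
print(similarity_key("Chruschtschow", "de"))  # KH-R-SHCH-V
print(similarity_key("Jelzin", "en"))         # J-L-Z-N
```

The last line shows why the rule set has to match the language of the transcription: reading the German spelling with the English rules produces a key that no longer agrees with Yeltsin.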

Selecting a system that will accurately address all the linguistic search issues is an extremely complicated task, made more so by the fact that many market-leading compliance tools do not use the most effective search techniques. As a result, it is vital that any investment in compliance technology is preceded by thorough testing of the search component.
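One way to approach such testing is to run the search component against name pairs whose correct outcome is already known and to measure precision and recall. The Python sketch below is a minimal harness under stated assumptions: `evaluate`, `exact_match` and the small `TEST_PAIRS` list are hypothetical, and a real test set would need to be much larger and drawn from the names an organisation actually screens.

```python
def evaluate(matcher, labelled_pairs):
    """Return (precision, recall) for (name_a, name_b, same_person) triples."""
    tp = fp = fn = 0
    for name_a, name_b, same_person in labelled_pairs:
        hit = matcher(name_a, name_b)
        if hit and same_person:
            tp += 1   # correct match
        elif hit:
            fp += 1   # false alarm
        elif same_person:
            fn += 1   # missed match
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def exact_match(a, b):
    """Baseline matcher: names match only if identical apart from case."""
    return a.casefold() == b.casefold()

# Hypothetical labelled pairs built from variants listed in this article.
TEST_PAIRS = [
    ("Yeltsin", "yeltsin", True),
    ("Yeltsin", "Jelzin", True),
    ("Qaddafi", "Kadhafi", True),
    ("Suleiman", "Soliman", True),
    ("Li", "Wang", False),
]

precision, recall = evaluate(exact_match, TEST_PAIRS)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=1.00 recall=0.25
```

The exact-match baseline never raises a false alarm but misses three of the four true matches, which is the kind of gap a test like this is meant to expose before a system is purchased.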

Editor's Note: Compliance in Context

This article is part of the FreePint Topic Series: Compliance in Context, which runs from September to October 2013.