Combat Transcription Errors With Linguistic Identity Matching
Jinfo Blog

8th October 2013

By Victoria Meyer

Abstract

Accurate linguistic identity matching is essential for compliance professionals. Victoria Meyer examines the main sources of variation in names and highlights what to look for when choosing a compliance screening system. She focuses on transcription, where the phonetics of the languages involved can produce many different variants of the same name. The ideas presented here are taken from "Linguistic Identity Matching" by Lisbach and Meyer (2013).

Compliance in Context

Linguistic identity matching is the modern approach to maximising the recall and precision of identity search systems. It has largely overcome the problems of earlier fuzzy-matching techniques, which were widely recognised to provide extremely poor precision without appropriately addressing search risks.

This article looks at the sources of variation in names, and goes on to consider the most effective ways to get round these in a search. Further details are available in “Linguistic Identity Matching” (Lisbach and Meyer, 2013).

The main sources of variation in names are:

  • Transcription variants
  • Phonetic misspellings
  • Derivative name forms
  • Typos and OCR errors.

This feature focuses on transcription, which can be considered the Achilles heel of many matching systems.

Transcription Variants

Transcription is the process of converting names from one language or script to another, such that native speakers of the two languages would pronounce the names the same. The phonetics of the two languages involved often mean that many different transcription variants are possible.

The name Ельцин is a good example of this. The most common English transcription, Yeltsin, sounds the same as the common German transcription, Jelzin, because the German “J” is pronounced like the English “y” and the German “z” like “ts”. The following examples show how widely transcriptions from Cyrillic can vary.

| Original Russian Cyrillic | English Transcription | German Transcription | French Transcription |
| Никита Сергеевич Хрущёв | Nikita Sergeyevich Khrushchev | Nikita Sergejewitsch Chruschtschow | Nikita Sergeïevitch Khrouchtchev |
| Михаил Сергеевич Горбачёв | Mikhail Sergeyevich Gorbachev | Michail Sergejewitsch Gorbatschow | Mikhaïl Sergueïevitch Gorbatchev |

The Arabic example:

Transcriptions from Arabic vary even more widely, partly due to the number of Arabic dialects in use. Word boundaries often change in transcription: for example, عبد الرحمن can be transcribed as one word (Abdourrahmane), as two words (Abdul Rahman) or as three words (Abd ur Rahman).

The following examples show how varied transcriptions from Arabic can be.

| Arabic | Transcription variants |
| محمد | Muhammad, Mohamed, Muhammed, Mhamed, Mouhammad, Mukhamed |
| جمعة | Jumaa, Joumah, Djum’a, Dschum’aa, Djomaah, Joum’aa |
| سليمان | Suleyman, Soliman, Souleymane, Suleiman, Soulaiman |
| عز الدين | Izzadin, Ezedine, Ezzedin, Izz al-Din, Ez aldeen, Izzuddine |
| عبيد | Ubaid, ‘Ubeid, ‘Obayd, Oubeyd, Abid, Abeed, ‘Abeed |
| القذافي | Qaddafi, Kadhafi, Gadhafi, Al Quathafi, Khadafy, Kazzafi, Gheddafi, El Quathafi |

The Chinese example:

The number of different transcription standards in use around the Chinese-speaking world, and the fact that many Japanese, Korean and Vietnamese names are also written in Chinese characters, mean that transcriptions of names written in Chinese can vary widely.

| Chinese | Hanyu Pinyin | Jyutping | Other variants used in PR China, Taiwan and Singapore | Other variants used in Japan, Korea and Vietnam |
|  | Li | Lei | La, Lee, Lii | Lee, Ly, Rhee, Ri |
|  | Wang | Wong | Heng, Ong, Vong, Wong, Yu, Yuh | O, Vuong, Wang |
|  | Huang | Wong | Hwang, Wang, Vong, Ng, Eng, Wee, Oei, Ooi, Bong, Uy, Ung | Hoang, Huynh, Hwang, Ko |
|  | Wu | Ng | Eng, Go, Goh, Gouw, Ngo, Ung, Woo | Go, Kure, Ngo, Oh |

Solving the Transcription Issue

Linguistic similarity keys are the only practical approach to the transcription issue. The approach is similar to the old phonetic matching algorithms, but uses many more rules and, most importantly, uses different rule sets for each combination of source and target language.
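As a minimal sketch of the idea, assuming a toy set of hand-written rules rather than the rule sets of any real product, the Python below derives a similarity key using a different rule set for each combination of source language and transcription target language. The names `RULE_SETS` and `similarity_key`, the single-letter symbols and the vowel-dropping step are all illustrative assumptions; a production system would need far more rules.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny rule sets for Russian names, keyed by
# (source language, transcription target language). Each rule maps a
# letter group to a shared symbol for the underlying sound.
RULE_SETS = {
    ("ru", "en"): {"shch": "Q", "kh": "X", "ch": "C", "ts": "Z", "ye": "e", "v": "V"},
    ("ru", "de"): {"schtsch": "Q", "tsch": "C", "ch": "X", "z": "Z", "je": "e", "w": "V"},
    ("ru", "fr"): {"chtch": "Q", "tch": "C", "kh": "X", "ts": "Z", "v": "V"},
}


def similarity_key(name, source_lang, target_lang):
    """Reduce a transcribed name to a key shared by its transcription variants."""
    rules = RULE_SETS[(source_lang, target_lang)]
    # Lowercase and strip diacritics (for example "ï" becomes "i").
    decomposed = unicodedata.normalize("NFKD", name.lower())
    text = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Apply the rules in a single pass, trying the longest letter groups first.
    groups = sorted(rules, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(g) for g in groups))
    text = pattern.sub(lambda m: rules[m.group(0)], text)
    # Drop the remaining vowels, which vary most between transcriptions.
    return re.sub(r"[aeiou]", "", text)


print(similarity_key("Yeltsin", "ru", "en"))       # lZn
print(similarity_key("Jelzin", "ru", "de"))        # lZn
print(similarity_key("Khrushchev", "ru", "en"))    # XrQV
print(similarity_key("Chruschtschow", "ru", "de")) # XrQV
print(similarity_key("Khrouchtchev", "ru", "fr"))  # XrQV
```

With these toy rules, the English and German transcriptions of Ельцин reduce to the same key, as do the three transcriptions of Хрущёв. Note that the German rule set reads "ch" as the sound the English one writes "kh", which is why a single shared rule set would not work.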

Selecting a system that will accurately address all the linguistic search issues is an extremely complicated task, made more so by the fact that many market-leading compliance tools do not use the most effective search techniques. As a result, it is vital that any investment in compliance technology is preceded by thorough testing of the search component.
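One way to structure such testing, sketched here under the assumption that a small set of name pairs has been labelled by hand, is to measure the precision and recall of each candidate matcher on the same pairs. The pair list, the `evaluate` helper and both matchers are hypothetical stand-ins; a real evaluation would call the search component under test and use a much larger test set.

```python
# Hypothetical hand-labelled test pairs: (query, candidate, same identity?).
LABELLED_PAIRS = [
    ("Yeltsin", "Yeltsin", True),
    ("Yeltsin", "Jelzin", True),
    ("Muhammad", "Mohamed", True),
    ("Qaddafi", "Gheddafi", True),
    ("Yeltsin", "Gorbachev", False),
    ("Muhammad", "Mikhail", False),
]


def evaluate(matcher, labelled_pairs):
    """Return (precision, recall) of a matcher over labelled name pairs."""
    true_pos = false_pos = false_neg = 0
    for query, candidate, same_identity in labelled_pairs:
        hit = matcher(query, candidate)
        if hit and same_identity:
            true_pos += 1
        elif hit:
            false_pos += 1
        elif same_identity:
            false_neg += 1
    reported = true_pos + false_pos
    relevant = true_pos + false_neg
    precision = true_pos / reported if reported else 0.0
    recall = true_pos / relevant if relevant else 0.0
    return precision, recall


def exact_match(a, b):
    """Strict matcher: case-insensitive string equality."""
    return a.lower() == b.lower()


def same_initial(a, b):
    """Deliberately crude loose matcher: names share a first letter."""
    return a[0].lower() == b[0].lower()


print(evaluate(exact_match, LABELLED_PAIRS))  # (1.0, 0.25)
precision, recall = evaluate(same_initial, LABELLED_PAIRS)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

On this toy data the exact matcher never reports a wrong hit but misses three of the four true matches, while the looser matcher finds one more true match at the cost of a false hit. Running every candidate system against the same labelled pairs makes that trade-off visible before any purchase.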

FreePint Subscribers can read the full version of Victoria's article "Linguistic Identity Matching".


Editor's Note: Compliance in Context

This article is part of the FreePint Topic Series: Compliance in Context, which runs from September to October 2013. Register your interest, and you'll get pre-notification of when registration opens for any webinars in this series, as well as a free copy of the FreePint Report: Buyer's Guide on Regulatory Compliance when we publish in October.
