Combat Transcription Errors With Linguistic Identity Matching
Jinfo Blog
8th October 2013
Abstract
Accurate linguistic identity matching is essential for compliance professionals. Victoria Meyer examines the main sources of variation in names and highlights what to look for when choosing a compliance screening system. She focuses on transcription, where phonetic differences between languages can produce many variants of the same name. The ideas presented here are taken from "Linguistic Identity Matching" by Lisbach and Meyer (2013).
Linguistic identity matching is the modern approach to maximising the recall and precision of identity search systems. It has largely resolved the problems of earlier fuzzy-matching techniques, which were widely recognised as providing extremely poor precision without adequately addressing search risks.
This article looks at the sources of variation in names, and goes on to consider the most effective ways to get round these in a search. Further details are available in “Linguistic Identity Matching” (Lisbach and Meyer, 2013).
The main sources of variation in names are:
- Transcription variants
- Phonetic misspellings
- Derivative name forms
- Typos and OCR errors.
This feature focuses on transcription, which can be considered the Achilles heel of many matching systems.
Transcription Variants
Transcription is the process of converting names from one language or script to another, such that native speakers of the two languages would pronounce the names the same. The phonetics of the two languages involved often mean that many different transcription variants are possible.
The name Ельцин is a good example of this. The most common English transcription, Yeltsin, sounds the same as the common German transcription, Jelzin, because the German “J” is pronounced like the English “Y” and the German “z” like “ts”. The following examples show how widely transcriptions from Cyrillic can vary.
Original Russian Cyrillic | English Transcription | German Transcription | French Transcription
Никита Сергеевич Хрущёв | Nikita Sergeyevich Khrushchev | Nikita Sergejewitsch Chruschtschow | Nikita Sergeïevitch Khrouchtchev
Михаил Сергеевич Горбачёв | Mikhail Sergeyevich Gorbachev | Michail Sergejewitsch Gorbatschow | Mikhaïl Sergueïevitch Gorbatchev
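To make the Yeltsin example concrete, the following minimal Python sketch applies two heavily simplified, letter-by-letter rule tables to the same Cyrillic name. The names `YELTSIN`, `RULES` and `transcribe` are illustrative assumptions that cover only the six letters of this one name. Real transcription conventions are context-sensitive and far larger.

```python
# Illustrative sketch only: these rule tables are assumptions that cover
# just the letters of one name, not a real transcription standard.
YELTSIN = "Ельцин"

# Pair each Cyrillic letter of the name with a Latin rendering per target language.
RULES = {
    "en": dict(zip(YELTSIN, ["Ye", "l", "", "ts", "i", "n"])),
    "de": dict(zip(YELTSIN, ["Je", "l", "", "z", "i", "n"])),
}


def transcribe(name: str, target: str) -> str:
    """Transcribe letter by letter using the rule table for `target`."""
    table = RULES[target]
    return "".join(table[letter] for letter in name)


print(transcribe(YELTSIN, "en"))  # Yeltsin
print(transcribe(YELTSIN, "de"))  # Jelzin
```

The two outputs differ in three of their letters even though both aim at the same pronunciation, which is why an exact string comparison fails on transcribed names.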
The Arabic example:
Transcriptions from Arabic vary even more widely, partly due to the number of Arabic dialects in use. Word boundaries often change in transcription: for example, عبد الرحمن can be transcribed in one part as in Abdourrahmane, in two parts as in Abdul Rahman, or in three parts as Abd ur Rahman.
The following examples show how varied transcriptions from Arabic can be.
محمد | Muhammad, Mohamed, Muhammed, Mhamed, Mouhammad, Mukhamed
جمعة | Jumaa, Joumah, Djum’a, Dschum’aa, Djomaah, Joum’aa
سليمان | Suleyman, Soliman, Souleymane, Suleiman, Soulaiman
عز الدين | Izzadin, Ezedine, Ezzedin, Izz al-Din, Ez aldeen, Izzuddine
عبيد | Ubaid, ‘Ubeid, ‘Obayd, Oubeyd, Abid, Abeed, ‘Abeed
القذافي | Qaddafi, Kadhafi, Gadhafi, Al Quathafi, Khadafy, Kazzafi, Gheddafi, El Quathafi
The Chinese example:
The number of different transcription standards in use around the Chinese-speaking world, and the fact that many Japanese, Korean and Vietnamese names are also written in Chinese characters, mean that transcriptions of names written in Chinese can vary widely.
Chinese | Hanyu Pinyin | Jyutping | Other variants used in PR China, Taiwan and Singapore | Other variants used in Japan, Korea and Vietnam
李 | Li | Lei | La, Lee, Lii | Lee, Ly, Rhee, Ri
王 | Wang | Wong | Heng, Ong, Vong, Wong, Yu, Yuh | O, Vuong, Wang
黃 | Huang | Wong | Hwang, Wang, Vong, Ng, Eng, Wee, Oei, Ooi, Bong, Uy, Ung | Hoang, Huynh, Hwang, Ko
吳 | Wu | Ng | Eng, Go, Goh, Gouw, Ngo, Ung, Woo | Go, Kure, Ngo, Oh
Solving the Transcription Issue
Linguistic similarity keys are the only practical approach to the transcription issue. The approach is similar to the old phonetic matching algorithms, but uses many more rules and, most importantly, uses different rule sets for each combination of source and target language.
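The idea of one rule set per language combination can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the method described by Lisbach and Meyer. The names `KEY_RULES` and `similarity_key` are hypothetical, and the handful of rules shown covers only the two Russian surnames from the table above.

```python
import re

# Hypothetical rule sets, one per (source language, transcription language) pair.
# Each rule maps a lower-case spelling pattern to an upper-case key symbol, so
# later rules never re-match text that an earlier rule has already consumed.
KEY_RULES = {
    ("ru", "en"): [("shch", "Q"), ("kh", "X"), ("ch", "C"), ("ev$", "OV")],
    ("ru", "de"): [("schtsch", "Q"), ("tsch", "C"), ("ch", "X"), ("ow$", "OV")],
    ("ru", "fr"): [("chtch", "Q"), ("tch", "C"), ("kh", "X"), ("ou", "U"), ("ev$", "OV")],
}


def similarity_key(name: str, source: str, target: str) -> str:
    """Reduce a transcribed name to a key using the rules for its language pair."""
    key = name.lower()
    for pattern, symbol in KEY_RULES[(source, target)]:
        key = re.sub(pattern, symbol, key)
    return key.upper()


# Three transcriptions of the same surname collapse to one key.
print(similarity_key("Khrushchev", "ru", "en"))
print(similarity_key("Chruschtschow", "ru", "de"))
print(similarity_key("Khrouchtchev", "ru", "fr"))
```

Applying the English rules to the German spelling yields a different key, which illustrates why a single language-independent rule set, as used by older phonetic algorithms, is not enough.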
Selecting a system that will accurately address all the linguistic search issues is an extremely complicated task, made more so by the fact that many market-leading compliance tools do not use the most effective search techniques. As a result, it is vital that any investment in compliance technology is preceded by thorough testing of the search component.
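One way to approach such testing is to measure precision and recall on a small set of labelled name pairs. The sketch below is a hypothetical harness, not a vendor tool. The function names and the test pairs (drawn from the examples in this article) are assumptions, and `exact_match` stands in for whatever search component is under evaluation.

```python
def precision_recall(matcher, labelled_pairs):
    """Return (precision, recall) of `matcher` over (query, candidate, label) triples."""
    tp = fp = fn = 0
    for query, candidate, same_identity in labelled_pairs:
        predicted = matcher(query, candidate)
        if predicted and same_identity:
            tp += 1  # correct hit
        elif predicted:
            fp += 1  # false alarm
        elif same_identity:
            fn += 1  # missed match
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def exact_match(query: str, candidate: str) -> bool:
    """Naive baseline: case-insensitive string equality."""
    return query.lower() == candidate.lower()


# Hypothetical labelled pairs: True means both names denote the same identity.
TEST_PAIRS = [
    ("Yeltsin", "Jelzin", True),
    ("Muhammad", "Mohamed", True),
    ("Wang", "Wang", True),
    ("Li", "Wu", False),
]

precision, recall = precision_recall(exact_match, TEST_PAIRS)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

On these pairs the exact-match baseline makes no false hits but misses two of the three true matches, the kind of gap that transcription-aware matching is meant to close.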
FreePint Subscribers can read the full version of Victoria's article "Linguistic Identity Matching".
Editor's Note: Compliance in Context
This article is part of the FreePint Topic Series: Compliance in Context, which runs from September to October 2013. Register your interest, and you'll get pre-notification of when registration opens for any webinars in this series, as well as a free copy of the FreePint Report: Buyer's Guide on Regulatory Compliance when we publish in October.