Voice deepfakes may be hindered by real-time detection models

Jung Souhwan, professor of electronic engineering at Soongsil University and head of South Korea’s AI security research center (AISRC). Photo provided by Jung Souhwan

By Dain Oh, The Readable
Dec. 21, 2023 9:25PM GMT+9 Updated Dec. 23, 2023 12:11AM GMT+9

Okinawa, Japan ― MobiSec 2023 ― While phone scammers are taking advantage of cutting-edge artificial intelligence technology to exploit their victims more adroitly, security researchers have joined forces to identify the weakest points in the scammers’ deceitful process and to create countermeasures against this latest threat.

Jung Souhwan, professor of electronic engineering at Soongsil University and head of South Korea’s AI security research center (AISRC), shared the latest research findings in detecting voice deepfakes at the Seventh International Conference on Mobile Internet Security (MobiSec 2023), which took place in Okinawa from December 19 to December 21.

During his keynote speech, Jung explained how sophisticated voice deepfakes have become. For example, scammers cloned a Korean woman’s voice using AI and attempted to extort money from her mother last August, according to a local news outlet, which confirmed that the voice was AI-generated with Jung’s help.

Okinawa, Japan ― Jung Souhwan, professor of electronic engineering at Soongsil University and head of South Korea’s AI security research center (AISRC), is delivering a keynote speech at the Seventh International Conference on Mobile Internet Security (MobiSec 2023) on December 19. Photo by Dain Oh, The Readable

The expert refers to voice deepfakes as “deep-voice.” Generating a deep-voice has become faster and easier than ever. Jung highlighted that fraudsters no longer require lengthy samples of a target’s voice to create a believable fake; they can now train a deep-voice model in mere seconds from a bare minimum of source material.

Jung explained multiple technologies involved in deep-voice attacks and their countermeasures, including text-to-speech (TTS), voice conversion (VC), and automatic speaker verification (ASV). The most recent advance is the breathing-talking-silence encoder (BTS-E), the latest deep-voice detection model. BTS-E utilizes a human speaker’s breathing, talking, and silence signals in the sound segmentation stage of model training. Earlier detectors focus merely on vocalized linguistic content and disregard non-speech segments, whereas BTS-E also takes the unspoken aspects of human speech into account.
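To illustrate the idea of separating speech from non-speech segments, the following is a minimal, hypothetical sketch of energy-based audio segmentation, the kind of preprocessing step a detector like BTS-E might rely on before encoding each segment. It is not Jung’s implementation; the function name, frame size, and threshold are illustrative assumptions, not values from the paper.

```python
# Hypothetical sketch: label fixed-size frames of an audio signal as
# "talking" or "silence" by their short-time energy. A model such as
# BTS-E would then encode these segments (including breaths and pauses)
# rather than discarding them. Parameters below are illustrative only.

def segment_by_energy(samples, frame_size=400, threshold=0.01):
    """Return a 'talking'/'silence' label for each full frame of samples."""
    labels = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        energy = sum(s * s for s in frame) / frame_size  # mean power of the frame
        labels.append("talking" if energy > threshold else "silence")
    return labels

# Toy example: a burst of loud "speech" followed by near-silence.
signal = [0.5] * 800 + [0.0] * 800
print(segment_by_energy(signal))  # ['talking', 'talking', 'silence', 'silence']
```

Real systems would use overlapping windows and learned features rather than a fixed threshold, but the sketch shows why non-speech segments carry usable structure: their placement and duration are part of the signal, not noise.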

These research outcomes were published at ICASSP 2023 under the title “BTS-E: Audio Deepfake Detection Using Breathing-Talking-Silence Encoder.” Jung also co-authored another paper, “On the Defense of Spoofing Countermeasures Against Adversarial Attack,” which was published by IEEE this year.

“In the field of security, a defender must be far more powerful than an offender,” said Jung. “If a defender uses the same technology as an offender, such as AI, the offender has the upper hand because both parties’ capacities become equal,” elaborated the expert.

He made this statement while questioning the effectiveness of the AI regulations that have dominated global discussion recently. This year, for instance, international pleas for responsible AI development were eclipsed by rapid progress in AI research and technology. Jung did not say that governments’ efforts to regulate AI development were meaningless; he did stress, however, that governments must match advances in the underlying technology if there is to be any hope of defeating fraudsters at their own game.

“It is an arms race between AI generation and AI detection,” stressed the professor. “Regulations and governance are not enough. Technical transparency, realistic data, and regulation should be aligned concurrently to turn AI into an opportunity,” added Jung.

ohdain@thereadable.co

This article was copyedited by Arthur Gregory Willers.


Dain Oh is a distinguished journalist based in South Korea, recognized for her exceptional contributions to the field. As the founder and editor-in-chief of The Readable, she has demonstrated her expertise in leading media outlets to success. Prior to establishing The Readable, Dain was a journalist for The Electronic Times, a prestigious IT newspaper in Korea. During her tenure, she extensively covered the cybersecurity industry, delivering groundbreaking reports. Her work included exclusive stories, such as the revelation of incident response information sharing by the National Intelligence Service. These accomplishments led to her receiving the Journalist of the Year Award in 2021 by the Korea Institute of Information Security and Cryptology, a well-deserved accolade bestowed upon her through a unanimous decision. Dain has been invited to speak at several global conferences, including the APEC Women in STEM Principles and Actions, which was funded by the U.S. State Department. Additionally, she is an active member of the Asian American Journalists Association, further exhibiting her commitment to journalism.