Voice deepfakes may be hindered by real-time detection models

Jung Souhwan, professor of electronic engineering at Soongsil University and head of South Korea’s AI security research center (AISRC). Photo provided by Jung Souhwan

By Dain Oh, The Readable
Dec. 21, 2023 9:25PM GMT+9 Updated Dec. 23, 2023 12:11AM GMT+9

Okinawa, Japan ― MobiSec 2023 ― While phone scammers are taking advantage of cutting-edge artificial intelligence technology to exploit their victims more adroitly, security researchers have joined forces to identify the weakest points in the scammers’ deceitful process and to create countermeasures against this latest threat.

Jung Souhwan, professor of electronic engineering at Soongsil University and head of South Korea’s AI security research center (AISRC), shared the latest research findings in detecting voice deepfakes at the Seventh International Conference on Mobile Internet Security (MobiSec 2023), which took place in Okinawa from December 19 to December 21.

During his keynote speech, Jung explained how sophisticated voice deepfakes have become. For example, scammers cloned a Korean woman’s voice using AI and attempted to extort money from her mother last August, according to a local news outlet, which confirmed that the voice was AI-generated with Jung’s help.

Okinawa, Japan ― Jung Souhwan, professor of electronic engineering at Soongsil University and head of South Korea’s AI security research center (AISRC), is delivering a keynote speech at the Seventh International Conference on Mobile Internet Security (MobiSec 2023) on December 19. Photo by Dain Oh, The Readable

The expert refers to voice deepfakes as “deep-voice.” Generating a deep-voice has become faster and easier than ever. Jung highlighted that fraudsters no longer require lengthy samples of a target’s voice to create a believable fake; they can now train a deep-voice model in mere seconds from a bare minimum of source material.

Jung explained multiple technologies involved in deep-voice attacks and their countermeasures, including text-to-speech (TTS), voice conversion (VC), and automatic speaker verification (ASV). The most recent advance is the breathing-talking-silence encoder (BTS-E), the latest deep-voice detection model. BTS-E utilizes a human speaker’s breathing, talking, and silence signals in the sound segmentation stage of model training. Earlier detectors focus merely on vocalized linguistic content and disregard non-speech segments, whereas BTS-E also takes the unspoken aspects of human speech into account.
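To illustrate the idea of separating speech from non-speech segments, the following is a minimal, hypothetical sketch of energy-based audio segmentation, the kind of preprocessing step a detector like BTS-E might rely on before encoding each segment. It is not Jung’s implementation; the function name, frame size, and threshold are illustrative assumptions, not values from the paper.

```python
# Hypothetical sketch: label fixed-size frames of an audio signal as
# "talking" or "silence" by their short-time energy. A model such as
# BTS-E would then encode these segments (including breaths and pauses)
# rather than discarding them. Parameters below are illustrative only.

def segment_by_energy(samples, frame_size=400, threshold=0.01):
    """Return a 'talking'/'silence' label for each full frame of samples."""
    labels = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        energy = sum(s * s for s in frame) / frame_size  # mean power of the frame
        labels.append("talking" if energy > threshold else "silence")
    return labels

# Toy example: a burst of loud "speech" followed by near-silence.
signal = [0.5] * 800 + [0.0] * 800
print(segment_by_energy(signal))  # ['talking', 'talking', 'silence', 'silence']
```

Real systems would use overlapping windows and learned features rather than a fixed threshold, but the sketch shows why non-speech segments carry usable structure: their placement and duration are part of the signal, not noise.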

These research outcomes were published at ICASSP 2023 under the title “BTS-E: Audio Deepfake Detection Using Breathing-Talking-Silence Encoder.” Jung also co-authored another paper, “On the Defense of Spoofing Countermeasures Against Adversarial Attack,” which was published by IEEE this year.

“In the field of security, a defender must be far more powerful than an offender,” said Jung. “If a defender uses the same technology as an offender, such as AI, the offender has the upper hand because both parties’ capacities become equal,” elaborated the expert.

He made this statement while questioning the effectiveness of the AI regulations that have dominated global discussion recently. This year, for instance, international pleas for responsible AI development were eclipsed by rapid progress in AI research and technology. Jung did not say that governments’ efforts to regulate AI development were meaningless; he did stress, however, that governments must match advances in the underlying technology if there is to be any hope of defeating fraudsters at their own game.

“It is an arms race between AI generation and AI detection,” stressed the professor. “Regulations and governance are not enough. Technical transparency, realistic data, and regulation should be aligned concurrently to turn AI into an opportunity,” added Jung.

ohdain@thereadable.co

This article was copyedited by Arthur Gregory Willers.


Dain Oh is a distinguished journalist based in South Korea, recognized for her exceptional contributions to the field. As the founder and editor-in-chief of The Readable, she has demonstrated her expertise in leading media outlets to success. Prior to establishing The Readable, Dain was a journalist for The Electronic Times, a prestigious IT newspaper in Korea. During her tenure, she extensively covered the cybersecurity industry, delivering groundbreaking reports. Her work included exclusive stories, such as the revelation of incident response information sharing by the National Intelligence Service. These accomplishments led to her receiving the Journalist of the Year Award in 2021 by the Korea Institute of Information Security and Cryptology, a well-deserved accolade bestowed upon her through a unanimous decision. Dain has been invited to speak at several global conferences, including the APEC Women in STEM Principles and Actions, which was funded by the U.S. State Department. Additionally, she is an active member of the Asian American Journalists Association, further exhibiting her commitment to journalism.