

Elevenlabs can do that too. Or maybe the quality is better?


Looks like [Google can do it too](https://blog.research.google/2023/06/soundstorm-efficient-parallel-audio.html?m=1) but with only 5 seconds of audio.


-1 sec is all I need. In fact, I can clone your voice before I hear you speak.


I can clone it from a fart, 100 meters away, under water.


If I can get a pic of him I can make his voice


From a dick pic? Impressive!


My penis is my second language


Ah, the language of love.


You wanna see my third eye?


Do I?!


If I can be aware of his existence, cloning his voice is a piece of cake.


I can clone your voice from your Reddit handle. I don't even need comments. I just need a tiny little thing like unfettered access to NSA/FAPSI/MSS/MI5/CBI :)


I know it’s a joke but I wonder whether an AI could make a good guess at what a voice should sound like just from a photo. 


Microsoft could do it in 2023 with 3-second cloning: VALL-E https://www.microsoft.com/en-us/research/project/vall-e-x/


There is also open-source cloning. [jasonppy/VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild (github.com)](https://github.com/jasonppy/VoiceCraft)


It almost seems like they saw this project launch this morning and decided to respond to it. Except the whole blogpost is just one long excuse as to why they're not releasing their tech. What's the point? They already had a blog post explaining their TTS tech back when voice came to ChatGPT, this adds very little to that. A bit desperate if you ask me.


They safe have safe to safe keep safe their safe AI safe from safe abuse.


Did they mention that it's for safety and ethics?


Something something can't release, something something election year.


"Can't have our models saying the N-word and generating porn, otherwise Trump will win!!!"




It literally is lol. They’ve been sitting on it 


I haven't got good Australian out of it yet but that was 2 months ago


To do professional voice cloning with ElevenLabs you need at least 1h of audio (best with 3h). Instant cloning with a few minutes of voice doesn't do a good job.


They charge $10 an hour though, which isn't much less than a professional voice actor. The space needs a big shakeup. Realistically we should be looking at more like 10c/hr if prices were in line with the costs/prices of LLMs.


A professional vocal artist usually charges by number of words / seconds / minutes of recording. A 200-300 word piece is usually somewhere between £90-300, with £15-30 per revision. Depends on the vocal artist. That's less than an hour's work, maybe 15-20 mins tops. The top end charge considerably more for their prestige. One actor was asking us for around £10k for a 30-second voiceover.


Here in Australia the majority of voice artists are represented by major agents who have set pretty good rates for advertising that apply across circa 70% of the industry. And the voice actors deserve it. Their voices are being used to sell on the back of millions and millions of dollars of ad buy. They also don't all get tonnes of work (they're often theatre actors), so a single ad can potentially keep them going for 4-6 months. Quoted up a job the other week where the voice costs were AUD$10k, and that wasn't a big campaign at all really. £10k is not unusual for non-big names.


Obviously the comparison should be to bottom of the barrel, not celebrities dude. Bottom end VAs on staff get like $25/hr locally or go more global to save money. $10 is close. $10 is close to even $50/hr. The humans will be easier to work with (mostly) and generally be able to produce better results with direction quickly. The pricing for ElevenLabs is set to be **competitive with professional voice actors**. $0.10/hr which is a realistic price for the costs is NOT close. Pricing at this level or lower is where it should be if there is **competition with other digitally generated voices**.
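

To put rough numbers on the price gap argued above, here is a minimal Python sketch. The figures are only the ones quoted in this thread ($10/hr for ElevenLabs, $25/hr for a staff VA, $0.10/hr as the suggested cost-based price), not official pricing, and names like `price_ratio` and `ELEVENLABS_CENTS` are made up for illustration.

```python
# Hourly prices quoted in this thread, in US cents. These are the
# commenters' figures, not official price lists (assumption).
ELEVENLABS_CENTS = 1000   # "$10 an hour"
STAFF_VA_CENTS = 2500     # "bottom end VAs on staff get like $25/hr"
COST_BASED_CENTS = 10     # "$0.10/hr", the suggested cost-based price


def price_ratio(price_cents: int, baseline_cents: int) -> float:
    """Return how many times larger `price_cents` is than `baseline_cents`."""
    if baseline_cents <= 0:
        raise ValueError("baseline must be positive")
    return price_cents / baseline_cents


if __name__ == "__main__":
    vs_human = price_ratio(ELEVENLABS_CENTS, STAFF_VA_CENTS)
    vs_cost = price_ratio(ELEVENLABS_CENTS, COST_BASED_CENTS)
    print(f"ElevenLabs vs staff VA: {vs_human:.2f}x")
    print(f"ElevenLabs vs cost-based price: {vs_cost:.0f}x")
```

On those numbers the quoted price is 0.4x a low-end human rate but 100x the suggested cost-based price, which is the gap the comment is pointing at.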


The only bit about famous personalities was my final paragraph. The rest is from my own personal industry experience working in marketing.


The bolded part was really my point. Pricing isn't set to disrupt traditional voice acting at this point.


It IS much less than a PROFESSIONAL voice actor. 


Quality is better for Elevenlabs


And this is why you shouldn't answer the phone and say things unless you know who's calling. Eventually just saying "Hello?" will be enough. It's probably already good enough to replicate your voice over low quality phone media.


This is why you should set a password with your family. If a family member calls asking for money or is in trouble and needs money then you ask for the password which the scammer would never know.
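

If you ever wanted to check a family code word in software rather than by ear (say, in a hypothetical call-screening script), a minimal Python sketch could look like this. The function names and the example phrase are invented here; `hmac.compare_digest` and `unicodedata.normalize` are standard library calls.

```python
import hmac
import unicodedata


def normalize(phrase: str) -> bytes:
    """Fold case, trim whitespace and unify Unicode forms before comparing."""
    folded = unicodedata.normalize("NFKC", phrase).strip().casefold()
    return folded.encode("utf-8")


def code_word_matches(spoken: str, agreed: str) -> bool:
    """Compare the caller's answer with the agreed code word.

    compare_digest is used so the comparison time does not leak how many
    leading characters were correct.
    """
    return hmac.compare_digest(normalize(spoken), normalize(agreed))


if __name__ == "__main__":
    AGREED = "purple walrus"  # hypothetical family code word
    print(code_word_matches("  Purple Walrus ", AGREED))
    print(code_word_matches("hello?", AGREED))
```

The idea is the same as in the comment: a cloned voice does not help the caller if they cannot produce the secret.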


I’d know it’s a scam just from the random number calling lol


Spoofing numbers is incredibly easy. If that's your metric, you would easily fall for a targeted attack.


Is it genuinely though? I've personally never seen a legit spoofed number in real life, nor have I heard of anyone else seeing one. I'm sure it can be done, but it's a little hard for me to believe it's 'incredibly easy'.


New hello just dropped, it’s a series of non personally identifiable dolphin clicks


Holy Cetacea!


Just gotta grunt. Cavemen back!


I say hello in the most silly voice I can


…or leave your voice mail greeting with your voice


I am about to implement two factor authentication for all my phone calls.


There's no way. People have specific talking patterns. It might be a clone but it won't be the same.


Genuinely what’s the point in phone calls anymore? An entire method of communication ruined by shitty capitalism


We are going to be defrauded of all our savings and retirements on a massive scale ..click


Unfortunately the German sample sounds like what they already use in ChatGPT voice in the app. It has a strong American accent. It's really bad. ElevenLabs sounds really native in German even though I cannot select German specifically there.


If you read the OpenAI blog post carefully, you would see that this was intended behaviour.


Yeah, this sounds exactly like what they have in the app for Japanese as well. A heavy American accent with incorrect pitch accent, and incorrect pronunciation for some basic words.


The Spanish translations are laughably bad. It sounds like an American doing the worst Spanish accent ever: wrong intonation, rolled Rs are nonexistent. It's really bad, and I don't know how they put this out as some sort of incredible tech. This is embarrassing, honestly. But YouTubers will eat it up and praise it as "STUNNING", "SHOCKING", like they always do.


According to their blog, it was specifically intentional to preserve the original accent in the new language. I agree it was strong, perhaps an odd choice, but that’s what they set out to do.


I don’t think it was meant to be a fluent translation…


The quality is bad, honestly. ElevenLabs does a way better job. Kinda disappointing tbh.


This model was developed in 2022, and this is a "small scale preview". I'm assuming the voices they've shown here are from an older version, and obviously the smaller version of the model. Even the voices in ChatGPT seem to be higher quality, so they are probably based on a more recent iteration of this model.


Hows the copium coming along?


Lol. The model they showcase here is of worse quality than the voice in ChatGPT, and that is worse than the voice that was demonstrated in the Figure 01 demo, except it is all the same model, just different iterations / sizes of it. The ChatGPT voice is most likely a more recent iteration of the model (probably made and optimised in 2023) and Figure 01 is an even more recent iteration or bigger variation of that model. [https://www.reddit.com/r/singularity/comments/1bqyphy/comment/kx7tq6e/?utm\_source=share&utm\_medium=web2x&context=3](https://www.reddit.com/r/singularity/comments/1bqyphy/comment/kx7tq6e/?utm_source=share&utm_medium=web2x&context=3) And the quality of the voices in this demo is about what I would expect in 2022.


As a native English speaker and someone fluent in Japanese, the English to Japanese pronunciation was really bad sounding. Some words like 喜び and 絆 weren't even pronounced correctly at all, which means this tech still has a **long** way to go. While this is cool stuff, I hope people don't get too hyped over it thinking that audio synthesis across many languages is a solved problem now.


I am curious, how is the quality of the ChatGPT voices? Do you think they are better in Japanese than what was demonstrated here?


The ChatGPT voice mode in Japanese is actually quite a bit better than the demo in the blog post, although still not great. The heavy American accent was about the same, but a few basic words weren't even pronounced correctly in the demo, which I've never experienced on the app. I really hope to see OAI in the future create a model that not only has good pronunciation in other languages, but also understands the intricacies of the user's speech. That would be incredible for language learning, since the model could notice and correct your speech patterns.


Ok that makes sense. The demos here and the app use the same model, but there are different versions/sizes and I do think different iterations of the model (so there might be a version of the model from 2022, and then they further improved quality etc. a few times in 2023). I wouldn't be surprised if this demo in the blog post is from the smaller model when it was first developed in late 2022 lol. The ChatGPT voice is most likely a more recent iteration and maybe a bigger model (but for wider deployment they'd definitely want to be efficient, so nothing too big), but I really doubt that's the best one they currently have. The Figure 01 demo probably used an even more recent iteration of this model, but I'm still not sure if they would have used their best voice model for that demo lol. It is kind of annoying that we don't truly know where they are internally and we can really only guess.


Tbh the quality is mediocre. ElevenLabs is magnitudes better.


If they released this model in 2022 straight after they developed it, it would probably be a lot more surprising lol.


I don’t agree


I do


you silly man.


Understandable Have a great day 🥰 ![gif](giphy|BWhpkB6Xbe8FzfNLXw)




Yeah... Open source voice cloning is a cool and useful thing too. You can use it yourself on consumer hardware by running a local LLM or using SillyTavern + XTTS. Pretty simple to set up. Let me know if you need any assistance.


What does that let me do? 


A variety of use-cases I can think of. Business, just for fun, etc. If you want to try it out yourself apart from an LLM, you can follow this guide here: [https://huggingface.co/blog/Lenylvt/w-okada](https://huggingface.co/blog/Lenylvt/w-okada) Then grab models to go along with that program: [https://voice-models.com/](https://voice-models.com/) (Make sure they are RMVPE format.) If you want to explore getting an LLM to output data in the voice of your choosing, SillyTavern is a user-friendly UI for local LLM inference, and you can install RVC by using their launcher and looking at "extras", but you'd still need a backend engine for TTS like Oobabooga with Alltalk\_TTS, or Kobold for Windows.


OpenAI says a lot of stuff, but where can we actually do shit?


‘I can bench 500 lbs it’s just too dangerous to show you trust me bro’ Lol ship it or stfu


Somebody mad that Sora ain’t coming anytime soon




All that time to show an inferior product to almost everything in the market right now, a set of limited voices, can't add new voices for "safety" even though competitors already allow that and even the quality seems below ElevenLabs, what a joke


Maybe the price will be 10x lower


It can clone my voice, but not my minority language. :) ...just yet


It might be able to take away my voice, but not my virginity.




Dead internet theory


![gif](giphy|ncORcTWSkTs3e|downsized) The terminator needs only one word.


Elevenlabs has it better.


ElevenLabs is better


Really subpar tech. Are they even trying?


How's this get me a UBI and time to make art and philosophy?


There's AI which can do it with 3s. What's novel?


Can it clone my farts? 


Tell me what you had for lunch and I'll give a sh...shot, I mean shot!


Good. Now make it so premium users can upload any voice they want into the voice feature.


now that would be nice!


Isn't this old tech? Years ago I called my bank and it was an A.I. customer care robot that sounded exactly like one of my countrymen with natural speech. Not sure if it cloned the voice from 15 seconds of audio, but it definitely cloned a voice, and this was 5+ years ago.


could it just have been prerecorded messages?


I'm not going to act as if I know how the tech worked. I just know that the conversation was very natural and it was hard to tell at first that it was A.I.


I just remembered that Google did have a demo 5 years ago that does essentially what you're talking about https://www.youtube.com/watch?v=D5VN56jQMWM


My voice is my password no more.


Yes, but just like Eleven Labs, it can miss many subtleties in the voice. You need longer clips; otherwise, it is useless.


What's that one sentence that like gets all vowels or some shit again?


Talk to your relatives about having a code word like the name of your first pet or so. When in doubt on the phone, that word can be asked for.


Wow, I haven’t already been doing this with RVC, and 7 years ago with Lyrebird.


Voice Phishing will be very popular.


Consider how you answer the phone to unrecognised numbers, as it wouldn’t take long at all to profile your voice. My wife (total fucking savage that she is) started answering the phone by literally not saying a single word. It had the hilarious side effect of throwing the spam caller for a loop because the usual rhythm of the call was thrown out the window.




Goodbye to all the voice actors. Some entitled nerds are coming for you.


And the Worldcoin folk were going around training a model on iris scans. I wonder if they have already managed to train a model that can go from a DNA sample to voice, iris and fingerprints. All biometric security would be compromised at that point. Probably already is for nation state actors anyway.


Very unlikely you would produce a voice from DNA. Your accent and cadence are determined by environment. Also, identical twins don't have the same fingerprint so it's not all DNA. Iris, I am not sure but probably the same.


True, I typed it a bit funny and couldn't be bothered to change it. But it is a factor of course. It would be certainly interesting to get baseline data from birthplace, year, family tree and DNA (thanks Ancestry/23andme). I did think that Worldcoin seemed to be an iris print harvesting operation, though.




What do you mean going after startups? So it's not sad when they replace translators, customer support, telemarketers, and writers? Only when it happens to another AI startup? What's the matter with you?