RealDash — How I Make Audios (/mlp/ con)


Hello there, anons! I am RealDash, one of the many anons responsible for some of the more in-depth audios surrounding the Pony Preservation Project, inspired by folks like BGM, SpoopyAnon, Clipper, and others who've remained nameless. I'm not here to participate directly (due to a mix of wanting to stay anon, having no microphone, and a busy personal life), but I still wanted to share some of my thoughts on how and why I create my audio work the way I do.

But I guess I should start at the beginning.

WHEN AND WHY I JOINED THE PPP

I joined the PPP around October 2019, just after the series finale. I was browsing /mlp/ and wanted to see what everyone's thoughts were. I quickly came across the PPP thread, one of the first few at the time, and gave it a quick look. Some examples like Tacotron and TwiBot were in their early stages, and they already sounded interesting, but nothing got me hooked because I thought it would still be years before they sounded exactly like the characters.

Two days later, 15 joined the game and presented his work. Extremely rough, but showing promise. I stayed aboard and lurked the threads just to see what was developing. Months passed, and by March 2020, 15's website was released with the Mane Six as the only voices. I was hooked, insanely impressed with what was going on, and I didn't feel like leaving.

WORKING ON AUDIOS

I uploaded my first audio to the PPP thread on March 31, 2020 (https://desuarchive.org/mlp/thread/35129047/#35147842 ), an audio about Applejack. It was something simple and I didn't really have much planned for it. I just wanted to see what I could do with 15ai in its state at that point. I was impressed, but I wanted to see if I could make the voices sound more... genuine. I generated even a single line several times to try to get something close to what I wanted, then spliced two or more pieces together to get something that sounded more realistic.

After seeing everybody's reaction to how realistic it sounded, I wanted to see if I could continue that concept. So I did another audio two days later, on April 2, this time focusing on Rainbow Dash (https://desuarchive.org/mlp/thread/35149462/#35160481, and https://desuarchive.org/mlp/thread/35163852/#35172486 for the updated version afterwards). I had the whole scene in my head, and I spent a few hours generating 2-3 variations of a line and putting them together, all because some parts sounded really, REALLY good. As in, when I put two pieces together, it didn't sound like a TTS line; it sounded like Rainbow Dash was talking to me. THAT good.

It sounded so real. And that's why I'm called RealDash. I came up with the name just as a joke, but it stuck with me and I've kept it since because of my goal, which was to make the best use of this project and make these characters sound as realistic and natural as I can get them, so every anon feels just a little bit closer to their waifu... well, in this case, fans of Rainbow Dash.

HOW I MAKE AUDIOS

How I've made audios up to now hasn't really changed all that much. Add some sound effects, some potential reverb, audio control, voices, and voilà. However, prior to October 2020, my audios were pretty much "make it up as I go" deals, where I had the plot in my head and it was just a matter of figuring out what to make them say. After I finished Part 2 of my Rainbow Dash audio in October, someone gave me a critique saying I should make scripts next time to avoid another "bad ending" (which yeah... it was (I still plan to fix that though)). Ever since then, I've done just that.

-SCRIPT

Whenever I start to work on an audio, the first thing I do is work on a script. Once I have the character in mind, I play the scene in my head and write out the lines I want them to say. Sometimes this takes ten minutes, other times more than a day (especially if I'm lazy). But once I know what I want them to say, I head over to 15ai (or if it's down, the Ngroks whenever a link is open), and put the line into the text box. I usually play around with the composition to get a different emotion from them. Usually I put "..." or ",,,,," at the start of the sentence, then copy-paste the line a few times to get them into a soft tone of voice instead of their usual "default" tone.
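To make the text-box trick above concrete, here is a small Python sketch that builds a few input variants from one scripted line. The function name, the space between the prefix and the sentence, and the sample line are all assumptions for illustration; this only prepares strings to paste in and does not talk to 15ai in any way.

```python
def prompt_variants(line, prefixes=("...", ",,,,,"), max_repeats=3):
    """Build text-box inputs: each prefix followed by the line repeated 1..max_repeats times.

    Hypothetical helper: the exact spacing and repeat counts are guesses to experiment with.
    """
    variants = []
    for prefix in prefixes:
        for n in range(1, max_repeats + 1):
            variants.append(prefix + " " + " ".join([line] * n))
    return variants


# Print every variant so each one can be pasted in and auditioned by ear.
for text in prompt_variants("I was worried about you.", max_repeats=2):
    print(text)
```

Which variant gives the softer read still has to be judged by listening; the code only saves some typing.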

-SPLICING

Getting a natural tone in a character's sentence usually means splicing several generations together. In 15ai's earliest versions, a single sentence was spliced at most 3 times. In V5.0 or some later versions, at the most I'd have to split a sentence into 5 different pieces. The latest version (V15.3 at the time of writing) has improved in terms of emotive quality, but I usually still splice at the beginning or end of a sentence because of how it starts or ends. If I want them to sound like they're making a statement, but it ends sounding like they're asking a question, I have to splice that with another variation, which could take several attempts to get one that doesn't sound too fast or too slow. But in the end, it makes them sound more like what you want them to sound like.
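Splicing itself happens in an audio editor, but the idea can be sketched in code. The example below is a minimal illustration, not the actual workflow: it assumes each clip is a plain list of float samples and joins two pieces with a short linear crossfade so the seam doesn't click. The crossfade and the `overlap` parameter are assumptions; the text above doesn't say how the joins are smoothed.

```python
def crossfade_splice(a, b, overlap):
    """Join clip a to clip b, blending the last `overlap` samples of a
    with the first `overlap` samples of b (linear crossfade)."""
    # Never overlap more samples than either clip actually has.
    overlap = max(0, min(overlap, len(a), len(b)))
    head = list(a[:len(a) - overlap])   # untouched start of the first clip
    tail = list(b[overlap:])            # untouched end of the second clip
    blend = []
    for i in range(overlap):
        t = (i + 1) / (overlap + 1)     # fade position, strictly between 0 and 1
        blend.append(a[len(a) - overlap + i] * (1 - t) + b[i] * t)
    return head + blend + tail


# Keep the start of one generation and the ending of another.
statement = crossfade_splice([0.5] * 8, [0.2] * 8, overlap=4)
print(len(statement))
```

Chaining the function piece by piece gives the three-to-five-part splices described above.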

-EMOTION

Emotion is a bit tricky with 15ai. While the models have vastly improved in emotion over the last few versions, getting them to deliver dialogue the way you hear it in your head sometimes requires a few things: splicing, generating, and saying it yourself. Splicing and generating are simple enough to explain, so we'll skip to "saying it yourself." While I'm generating a specific line for a specific point in the audio, I say the line out loud the way I want it to sound, then as I repeat it aloud, I listen to the generations that were produced. If there's even a single point in one of the generations that sounds exactly how I say it, I save that particular piece and repeat.

EXAMPLE 1: https://u.smutty.horse/mazzvhgyuyr.mp3

This excerpt from my Fluttershy audio is built from a few separate voice clips. However, the phrase "I was SURE it wasn't supposed to be here until the weekend" is actually just two pieces. The only splice? "Wee-kend." The rest of that is all one clip. Here, I wanted her to put emphasis on "sure", as in "I was SURE that wasn't supposed to happen." The same applies to the phrase "Maybe they had to change it," which carries a sense of surprise in her voice. This is meant to have her sound surprised and curious at the same time, and when these are combined together, it nails the inflection perfectly.

EXAMPLE 2: https://u.smutty.horse/mazzwjwtezq.mp3
EXAMPLE 3: https://u.smutty.horse/mazzwjvkmjd.mp3

These two excerpts are from my first RD audio, which at the time was my favorite one to work on because of how good the emotions sounded. In both examples, Rainbow is talking to the viewer, and each example depicts different emotions: worry and frustration, with a hint of disbelief.

"You're my friend! Of COURSE I'm gonna be worried," sounds exactly as I was trying to get it to, where Rainbow places an emphasis on "course" as if saying, "Well duh, I'm obviously worried," while also expressing a form of disbelief, as in she can't understand why the viewer, in the context of the audio, thinks she isn't a worried friend.

Example 3 conveys more of a sense of frustration and annoyance, as if saying, "Look at you! You can't even stand properly!" The only splice, if I remember right, is "been here," as the original clip ended sounding like a question.

-ENVIRONMENT/ATMOSPHERE/SETTING

By far one of the biggest pieces of my audio making is the environmental setting, or atmosphere. Once you've got the emotions right and the lines spliced and set up the way you want them to be, the next big thing to do is work up the setting and immerse the listener in each and every circumstance.

In my very first RD audio (https://u.smutty.horse/luipuhqfzfb.mp3 ), the general setting was simple: the main character's (in this case, (you)) house. In any interior, there's going to be reverb, but you don't want so much that it sounds like you're in the mountains, or so little that the room sounds dead. You want the reverb at a minimum, and the wetness at a medium or low setting. Enough to mimic walls being nearby, but not so much that it overshadows the voices. If a character is speaking from another room, what I do is copy the settings of that reverb, but turn down the dryness and turn up the wetness a bit, giving the illusion that they're in another room and their voice is still echoing outward.
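The wet/dry idea above can be sketched in a few lines of Python. This is a toy model under loud assumptions: the "reverb" is just a handful of decaying echoes, clips are lists of float samples, and the level numbers are made up to show the direction of the change, not settings taken from any real plugin.

```python
def simple_reverb(dry, delay, decay, taps=4):
    """Crude stand-in for a reverb: `taps` echoes, each `delay` samples
    later and `decay` times quieter than the one before."""
    wet = [0.0] * (len(dry) + delay * taps)
    for k in range(1, taps + 1):
        gain = decay ** k
        for i, s in enumerate(dry):
            wet[i + k * delay] += s * gain
    return wet


def mix(dry, wet, dry_level, wet_level):
    """Blend the untouched (dry) and reverberated (wet) signals."""
    out = []
    for i in range(max(len(dry), len(wet))):
        d = dry[i] if i < len(dry) else 0.0
        w = wet[i] if i < len(wet) else 0.0
        out.append(d * dry_level + w * wet_level)
    return out


voice = [1.0, 0.0, 0.0, 0.0]                       # placeholder for a real voice clip
wet = simple_reverb(voice, delay=2, decay=0.5, taps=2)
same_room = mix(voice, wet, dry_level=1.0, wet_level=0.2)   # mostly dry, hint of walls
other_room = mix(voice, wet, dry_level=0.4, wet_level=0.5)  # less dry, more wet
```

The same reverb is reused for both rooms; only the balance changes, which matches the "copy the settings, then shift dry down and wet up" approach.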

EXAMPLE 4: https://u.smutty.horse/mbaleqwfxpz.mp3

In this example, you first hear Rainbow Dash talking directly in front of you, then her footsteps drop in volume and dryness, and when she speaks again, her voice sounds distant but still reverberating. This gives the listener the illusion that Rainbow Dash has left the current room and is now standing in another room.

Here, you can also hear the voice and footsteps panning from one side to the other, which also helps sell her position to the listener. Instead of being in front of you or nearby, she's now to your right and far away.

Also, since I'd practiced a bit more with reverb by this point, her close-up voice here is not as echoey as in the first audio, which, listening back while writing this, was a mistake on my part.

Another trick for environmental sounds that I learned to integrate later on is changing the direction of sound effects too. In my Fluttershy audio, the listener is sleeping in Fluttershy's bedroom, which has a burning fireplace to the left and a blizzard outside to the right where the window is. Throughout the audio, these sounds pan left and right to indicate the listener is moving around. It's touches like this that add polish and can ground the listener in the environment.
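Panning like this is normally done with a pan knob or automation in an editor. As a rough sketch of what happens underneath, the Python below uses a constant-power pan law, which is a common choice but an assumption here; the function names and the -1.0 to 1.0 position scale are hypothetical.

```python
import math


def pan_gains(position):
    """Left/right gains for a pan position: -1.0 hard left, 0.0 center, 1.0 hard right.

    Constant-power law: left^2 + right^2 stays at 1, so loudness holds steady mid-move.
    """
    angle = (position + 1.0) * math.pi / 4.0
    return math.cos(angle), math.sin(angle)


def pan_sweep(samples, start, end):
    """Move a mono clip gradually from pan position `start` to `end`.
    Returns (left_channel, right_channel)."""
    left, right = [], []
    last = max(len(samples) - 1, 1)
    for i, s in enumerate(samples):
        lg, rg = pan_gains(start + (end - start) * i / last)
        left.append(s * lg)
        right.append(s * rg)
    return left, right


# A voice that starts in front of the listener and ends up on the right.
left, right = pan_sweep([1.0] * 5, start=0.0, end=1.0)
```

Using the same start and end position keeps a sound pinned in place, for example a fireplace held on the left.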

[THIS NEXT PART MENTIONS MY ONLY EQUESTRIA GIRLS BASED AUDIO, SO IF YOU DON'T LIKE IT, SKIP PAST IT, NO TROUBLE!]

In another audio focusing on Rainbow Dash, I changed the setting to the Equestria Girls-verse inside a car, where the listener is the driver and Rainbow is a passenger. Right away, Rainbow Dash's voice comes from the right, and in the background, you can hear the car driving and stopping at different points. The car sound is muffled slightly. Later on in the audio, when the situation gets serious, you hear traffic jams and helicopters. To mimic the listener approaching these sound cues, they're slowly faded in. As the listener drives away, they fade out. Mimicking vehicles passing by rapidly requires a quick pan of sound, where the car is audible on both sides, then quickly pans to one side or the other, and as it moves away, gradually fades out. This creates a sense that you're in a car, you're driving, and there are other cars around you going crazy.
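The fade-in and fade-out described above boil down to a gain that changes over time. Here is a minimal Python sketch assuming linear fades and clips stored as lists of float samples; `fade` and `drive_past` are made-up names, and a real editor would likely offer smoother curves.

```python
def fade(samples, start_gain, end_gain):
    """Scale a clip with a gain that moves linearly from start_gain to end_gain."""
    last = max(len(samples) - 1, 1)
    return [s * (start_gain + (end_gain - start_gain) * i / last)
            for i, s in enumerate(samples)]


def drive_past(samples):
    """Fade in over the first half (approaching the sound),
    then fade out over the second half (leaving it behind)."""
    mid = len(samples) // 2
    return fade(samples[:mid], 0.0, 1.0) + fade(samples[mid:], 1.0, 0.0)


helicopter = [1.0] * 6          # placeholder for a real sound effect
print(drive_past(helicopter))
```

A shorter clip gives a faster pass-by; a longer one feels like a slow approach.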

[OKAY, END OF (>no hooves) TALK]

-TIMING(?)

Another big issue to focus on is timing. Timing is literally everything; get it wrong and something may throw the listener off. This ranges from questions and answers to actions themselves.

-THE LISTENER

My audios were inspired by the posts in /r/GoneWildAudio on Reddit (I know, /thread), so in what would normally be a two-sided conversation, the listener's turn to speak is a moment of silence for you to fill in. To time this, I say aloud what the listener's response would be, note how long it takes to say, then add the AI voice a moment after.
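The timing trick above is simple arithmetic: the gap is however long the listener's reply takes to say, plus a short beat. A minimal Python sketch follows, assuming clips are lists of float samples; the function names, the 0.5 second beat, and the 44100 Hz sample rate are illustrative assumptions.

```python
def silence(seconds, sample_rate=44100):
    """A run of zero samples lasting `seconds`."""
    return [0.0] * int(round(seconds * sample_rate))


def leave_room_for_listener(line_before, line_after, reply_seconds,
                            beat_seconds=0.5, sample_rate=44100):
    """Put a gap between two voice lines: the time it takes to say the
    listener's reply out loud, plus a short beat before the next line."""
    gap = silence(reply_seconds + beat_seconds, sample_rate)
    return list(line_before) + gap + list(line_after)


# The reply took about 2 seconds to say aloud (timed by hand).
scene = leave_room_for_listener([0.1] * 10, [0.2] * 10, reply_seconds=2.0)
print(len(scene))
```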

-ACTIONS

For every action a character makes, you have to take into account the setting. Are they in a giant room, or a small one? A temple or a bedroom? How long does it take for them to walk (or fly/teleport) from Point A to Point B? How long does it take for them to feasibly do X compared to Y? In my latest audio focusing on Daring Do ( https://u.smutty.horse/mavbsqleala.mp3 ), you hear the characters do a lot of walking, specific silences, and then sound effects to imply actions. My only method is to act out the scene in my head and aloud, from a character rummaging through a bag to punching a bad guy.

Timing gives each character and each action weight, like you're hearing a character gearing up their punch or moving through a large room and taking time to get through it, or taking a few seconds to dive into water from a high area. Time it too fast or too slow, and the pacing feels wonky. Time it just right, and you make the listener feel more in tune with what he/she is hearing, and especially with what YOU'RE hearing.

-SPLICING EXAMPLES

As I've said before, the realistic-ish emotion I get from the characters is the result of generating multiple variants of the same line, separating out all the bits and pieces that sound the way I was going for, then stitching them together, which creates a far more natural form of dialogue, more or less. There are occasional moments where the majority of a generated line sounds perfect, but the last word or two may end as a question or with them suddenly shouting. I generate a few extra times until they end it as a statement or calmly, then piece it together.

Here are some examples from my first Rainbow Dash audio. What you hear first will be the original generated audio clips, followed by the stitched-together clip at the end.

"I'm not just gonna sit around and watch you die!" -------> https://u.smutty.horse/mbjvbuvuanw.mp3
"That's bullshit, and you know it!" -------> https://u.smutty.horse/mbjvbuvbows.mp3
"You've been struggling to stand the whole time I've been here!" -------> https://u.smutty.horse/mbjvbuvfbxa.mp3

This is the typical routine for multiple lines throughout a single audio. Even one sentence may need to be spliced a few times in order to get the right tone. In the third one, you can also hear just how much of the right emotion a single generation can provide without much changing. It's just a matter of listening and playing with what you've got. When you know, you just -know-.

-SCREAMING

This was a bit of a rushed segment I added last minute, but I wanted to mention the topic of characters genuinely screaming in audios. Depending on the setting, you may have characters screaming almost bloody murder, or at least that's how it's meant to be. But in the current stage of the models, both 15ai and Ngrok, getting the characters to scream is very... tedious.

For 15ai, it's next to impossible to get them to scream in a genuinely scared manner, but it's possible to get them to sound angry. If you want terrified or mortified screams, the Ngrok is your best bet for now.

In my Part 2 Rainbow Dash audio ( https://u.smutty.horse/lxqsexzftwb.mp3 ), I had Rainbow shout in an angry/sad manner where it sounds like she begins to burst into tears by the end. I did this in the Ngrok, typing the phrase and ending it with a !!??. I'd then copy/paste that sentence a couple of times, which helps give the generation a breathy, ragged, and sometimes broken delivery. This makes them sound more genuine in their upset.

Shortly before this, however, I used this same tactic primarily for Queen Chrysalis and Fluttershy in another side audio ( https://u.smutty.horse/lxkuomurqqp.mp3 ). Here, Chrysalis sounds frustrated and furious, like she's truly had enough of everything. At the same time, you can hear Fluttershy in the background, screaming her goddamn head off and begging Discord to wake up. She sounds terrified, she's yelling, she's crying. It all makes her sound genuine.

Something like this is very tedious to generate, but much easier to do when using the Ngroks. I'd say it may be a while before 15ai's models are trained to do this too.

-CLOSING

Unless I'm missing anything else, that should pretty much cover everything that goes into making my audios! My audios aren't always going to be perfect: sometimes getting the perfect generation isn't possible, or a specific sound effect you need for a specific action may not even exist, but at the end of the day the most important thing to keep in mind is the vision. Even if it takes you a week or a year to get the work done, the most important thing is that you tried. Boiling it all down to the goal of the PPP, I like to share the potential that this project provides, and give listeners a glimpse into what it CAN be.

And I hope the work I've done regarding stitching these works together helps show everybody that the future of this project is very bright. And I'm hopping along for the ride.

Thanks to Clipper, BGM, SpoopyAnon, 15, and many more for their contributions to this project, and here's to hoping for many more years!

-RealDash