Synthesising the sound of a car engine based on envelope decomposition and overlap smoothing

The synthesised sound of a car engine is used to alert people to the approach of an electric vehicle, to personalise the sound of an engine and in virtual reality. A methodology for synthesising engine sound based on concatenating samples is proposed. First, using filtering, the engine sound is decomposed into low-frequency harmonics that depend on the engine speed and high-frequency narrowband amplitude-modulated signals. The high-frequency signals are modulated by harmonics that also depend on the engine speed. The carrier and envelope of each amplitude-modulated signal are extracted with a Hilbert transform. The decomposed segments are concatenated by overlap smoothing, and all the concatenated segments are assembled to form the synthesised sound. Finally, the synthesised sound is evaluated using the cepstrum distance and a subjective listening experiment, and it is compared with the raw engine sound and with sounds synthesised by other methods.


Introduction
Generated car engine sounds are useful in many situations. Electric vehicles are so quiet that an additional warning sound may be necessary [1], and the sound of an internal combustion engine has been suggested for this purpose [2][3][4]. Some high-performance cars require a customised engine sound to meet customers' needs [5,6]. Synthetic engine sounds are also used in driving simulators and virtual reality systems for driver training [7].
Existing approaches for generating an engine sound include spectral modelling and sampling [8]. Spectral modelling considers sound as a composition of a deterministic signal and a stochastic signal [9][10][11][12][13][14]. In this method, the sound model is simplified to reduce computational complexity. Engine sounds are regarded as a combination of low-frequency harmonics that depend on the engine speed and high-frequency stochastic noise. This method can effectively simulate the low-frequency part of the engine sound to produce a target sound. Spectral modelling can be used to personalise the sound, improve detectability and decrease the annoyance of a warning sound [2,9]. A disadvantage of this methodology is that it requires detailed and complex calculations to reduce the monotony [8].
The other main method of synthesising engine sound is based on sampling [3][15][16][17][18], an approach borrowed from speech synthesis. Real engine sounds are recorded as samples, and segments of them are chosen for the synthesis. This method can replicate most features of the sound. Concatenating different sound segments is key to the synthesis. A typical method for engine sound synthesis based on sampling is PSOLA (Pitch Synchronous Overlap and Add) [15]. PSOLA assumes that a sound is composed of independent segments, the widths of which are determined by the fundamental frequency. The gaps between sound samples must be considered. This method can synthesise low-frequency components well. However, it neglects the high-frequency components, and audible artifacts may be induced when looping the same sound sample.
In this study, sampling-based synthesis of engine sound is discussed, since the use of sampled engine sounds is widespread in practice [3,8]. The auditory characteristics of synthesised engine sound can be improved by minimising the discontinuities at the concatenation points. The high frequencies are also smoothed at the concatenation points because of their contribution to the auditory characteristics. In this paper, the high-frequency part of engine sound is considered to be composed of different carriers and envelopes, which are separated by 1/3 octave filtering and a Hilbert transform. Overlap smoothing is used to decrease the discontinuities due to the concatenation. Subjective and objective evaluations show that the synthesised signal is closer to the raw engine sound than the signals produced by the methods compared.
The structure of the paper is as follows. Section 2 succinctly introduces engine sound, and a model of engine sound is built. In Section 3, the decomposition and synthesis algorithm of engine sound are discussed. In Section 4, the process of decomposition and synthesis is detailed. In Section 5, the validity of the proposed approach is verified through subjective and objective evaluations. Finally, our conclusions are summarised in Section 6.

Acoustic characteristics of engine sound
Engine sound comes from various sources, including the movement of pistons and valves, combustion in cylinders, the friction of belts and bearings, the intake and the exhaust. The various sources have different frequency distributions, all of which are related to the rotational speed of the engine. The fundamental frequency of engine sound, which is caused by in-cylinder pressure changes, also depends on the rotational speed. Therefore, the low-frequency part of engine sound is dominated by the fundamental frequency, with lower amplitudes at the first and second orders. The higher orders are masked by noise, as shown in Fig. 1. The fundamental frequency for a four-stroke internal combustion engine is given by:

$$f_0 = \frac{N n}{120},$$

where $N$ is the number of cylinders and $n$ is the rotational speed in RPM.
The high-frequency part of the engine sound is amplitude modulated by the fundamental frequency and its orders [19,20]. Amplitude modulation plays an important role in sound recognition [21,22]. Using only amplitude modulation, the sound recognition score can be over 85 % in a quiet environment [23,24]. Yasui noted that the amplitude envelope affects the detectability of a vehicle's warning sound [25]. Therefore, the amplitude modulation of engine sound should be considered in detail at high frequencies. The envelope spectrum of a narrowband component at high frequencies is extracted, as shown in Fig. 2. The peaks of the envelope spectrum are at the fundamental frequency and its orders.
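As a quick numerical check of the relation between engine speed and fundamental frequency, the sketch below assumes the usual four-stroke relation $f_0 = N n / 120$ (each cylinder fires once every two crankshaft revolutions). The function name is illustrative only. The two calls use the engine speeds quoted later in the paper for the four-cylinder test engine.

```python
def fundamental_frequency(rpm: float, cylinders: int) -> float:
    """Firing (fundamental) frequency in Hz of a four-stroke engine.

    Assumes f0 = N * n / 120: rpm / 60 revolutions per second, and
    each cylinder fires once every two revolutions.
    """
    if rpm < 0 or cylinders <= 0:
        raise ValueError("rpm must be non-negative and cylinders positive")
    return cylinders * rpm / 120.0


# Speeds quoted in the paper for the four-cylinder engine.
f0_max_speed = fundamental_frequency(2400, 4)
f0_test_speed = fundamental_frequency(1500, 4)
print(f0_max_speed, f0_test_speed)
```

The results agree with the 80 Hz and 50 Hz fundamentals stated for 2400 RPM and 1500 RPM.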
According to the above analysis, the low-frequency part of engine sound is a deterministic signal composed of engine orders, whereas the high-frequency part is composed of amplitude-modulated signals. The modulating frequencies are also engine orders. Zeller suggested that orders up to 18 are relevant [26]. However, orders above the third are masked by other noise, as shown in Figs. 1 and 2. Therefore, engine sound can be modelled as:

$$s(t) = \sum_{i} A_i \sin(2\pi i f_0 t + \varphi_i) + \sum_{j} B_j \left[1 + m_j e(t)\right] \cos(2\pi f_j t + \theta_j),$$

where $i$ represents the engine orders, which include 1, 2 and 3, $f_0$ is the fundamental frequency, and $A_i$ and $\varphi_i$ are the amplitudes and initial phases of the harmonics, respectively. $\cos(2\pi f_j t + \theta_j)$ is the narrowband high-frequency signal, where $f_j$ is the centre frequency of the narrowband signal. $B_j$ is the amplitude of the modulated signal, and $m_j$ is the modulation factor. $e(t)$ is the modulating signal, a normalised sum of sinusoids $\sin(2\pi k f_0 t + \psi_k)$ over the engine orders $k$, which again include 1, 2 and 3.

Decomposition
The low-frequency part is composed of the fundamental frequency and its harmonics. It is obtained by a low-pass filter.
The high-frequency components are amplitude modulated. For a high-frequency component $x(t)$, the Hilbert transform is defined as:

$$\hat{x}(t) = \frac{1}{\pi} \int_{-\infty}^{\infty} \frac{x(\tau)}{t - \tau}\, d\tau = x(t) * \frac{1}{\pi t}.$$

The Hilbert transform of $x(t)$ can be regarded as the output of an all-pass filter with unit gain: the positive-frequency components are phase-shifted by −90° and the negative-frequency components are phase-shifted by +90°. The Hilbert transform is therefore suitable for extracting the envelope, but only of a narrowband signal.
An amplitude-modulated signal can be described as:

$$x(t) = a(t) \cos(2\pi f_c t + \theta),$$

where $a(t)$ is the amplitude, which varies with time, and $\cos(2\pi f_c t + \theta)$ is a cosine signal.
To satisfy the narrowband condition, $a(t)$ should be a slowly varying signal compared with $\cos(2\pi f_c t)$. Here, $f_c$ is the carrier frequency.
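For the engine sound model, a narrowband filtered component of the high-frequency part is:

$$x_j(t) = B_j \left[1 + m_j e(t)\right] \cos(2\pi f_j t + \theta_j),$$

where $B_j$ is the amplitude, $m_j$ is the modulation factor, $e(t)$ is the modulating signal and $f_j$ is the centre frequency of the band. The envelope and carrier can be obtained by a Hilbert transform as follows. The envelope is:

$$a_j(t) = \sqrt{x_j^2(t) + \hat{x}_j^2(t)},$$

where $\hat{x}_j(t)$ is the Hilbert transform of $x_j(t)$. The carrier is:

$$c_j(t) = \frac{x_j(t)}{a_j(t)} = \cos(2\pi f_j t + \theta_j),$$

where $a_j(t) > 0$ and the amplitude of the carrier is 1.
Finally, the decomposition process is summarised in Fig. 3. The engine sound is decomposed into three groups: the low-frequency part composed of the fundamental frequency and its orders, and the carriers and envelopes of the high-frequency narrowband components. A 1/3 octave filter can be used to obtain the high-frequency narrowband components from the high-frequency part. These components are synthesised by the overlap smoothing algorithm, which is described later.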
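The envelope and carrier extraction can be illustrated with a short sketch. It is not the authors' implementation: it builds the analytic signal with NumPy's FFT (the same idea as a library Hilbert routine), and the function names, the 8 kHz sampling rate and the test tone are assumptions chosen only so that the example is exact and easy to check.

```python
import numpy as np


def analytic_signal(x):
    """Return x + j*H{x}, where H is the Hilbert transform (FFT method)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    spectrum = np.fft.fft(x)
    # Keep DC (and Nyquist), double positive frequencies, zero negative ones.
    gain = np.zeros(n)
    gain[0] = 1.0
    if n % 2 == 0:
        gain[n // 2] = 1.0
        gain[1:n // 2] = 2.0
    else:
        gain[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(spectrum * gain)


def envelope_and_carrier(x):
    """Split a narrowband signal into its envelope and unit-amplitude carrier."""
    z = analytic_signal(x)
    envelope = np.abs(z)
    # Guard against division by zero where the envelope vanishes.
    carrier = np.real(z) / np.maximum(envelope, 1e-12)
    return envelope, carrier


# Illustrative test tone: 1 kHz carrier modulated at 50 Hz (assumed values).
fs = 8000
t = np.arange(fs) / fs
true_envelope = 1.0 + 0.5 * np.sin(2 * np.pi * 50 * t)
true_carrier = np.cos(2 * np.pi * 1000 * t)
envelope, carrier = envelope_and_carrier(true_envelope * true_carrier)
```

Because the modulation is much slower than the carrier, the recovered envelope and carrier match the ones used to build the test tone, which is the narrowband condition in practice.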

Overlap smoothing algorithm
The steps of PSOLA can be described as follows:
(1) The signal is decomposed into separate segments by windowing at particular time instances. These instances are positioned pitch synchronously and are called "pitch markers".
(2) Optional modification of these segments, including pitch and speech rate.
(3) Recombination of the segments by means of overlap-adding.
The proposed synthesis method uses engine sound samples recorded at different engine speeds, so the pitch and playback rate are not modified. The overlap smoothing algorithm therefore consists only of steps (1) and (3).
When two segments are concatenated, the end frame of the first segment and the start frame of the next segment are overlapped. The discontinuity between the frames has to be eliminated [27][28][29]. An example is given in Fig. 4.
The first step of the overlap algorithm is to add pitch synchronisation windows (Hann windows with a length of over one pitch period) to segments 1 and 2. The length of the overlap between the segments then has to be determined, as it affects the phase displacement and amplitude displacement of the segments.
JOURNAL OF VIBROENGINEERING. AUGUST 2021, VOLUME 23, ISSUE 5
As the overlap length becomes longer, the phase difference between the two segments becomes larger. As the overlap length becomes shorter, the amplitude difference between the two segments becomes larger. An overlap length of 50-75 % of the pitch length is appropriate [27].
The second step is processing the overlap. Let the weight of segment $x_1(t)$ be $w_1(t)$, which decreases linearly with time. Let the weight of the next segment $x_2(t)$ be $w_2(t)$, which increases linearly with time. The overlap frame can be described as:

$$y(t) = w_1(t) x_1(t) + w_2(t) x_2(t),$$

where $w_1(t) + w_2(t) = 1$.
The overlap frame is a combination of the segment $x_1(t)$ with a linear decrease over time and the next segment $x_2(t)$ with a linear increase over time, which reflects the smooth transition of the engine speed.
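According to the linearity of the Fourier transform:

$$\mathcal{F}\left[a x_1(t) + b x_2(t)\right] = a X_1(\omega) + b X_2(\omega),$$

where $X_1(\omega)$ and $X_2(\omega)$ are the Fourier transforms of $x_1(t)$ and $x_2(t)$, there is a weighted concatenation in the frequency domain corresponding to the overlap in the time domain. The spectrum of the overlapped frame is therefore a weighted combination of the spectra of the two segments, which ensures that the frequency transition is natural and eliminates the spectral mismatch.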
The overlap smoothing algorithm can be described as follows:
1) Determine the pitch and add pitch synchronisation windows for segments.
2) Determine the length of the overlap.
3) Recombine the segments by weighted concatenation.
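The weighted concatenation in steps 2) and 3) can be sketched as a linear cross-fade. This is a minimal illustration, not the authors' code: the function name is hypothetical, the pitch-synchronous windowing and phase alignment of step 1) are assumed to have been done by the caller, and the example values (44 kHz sampling, 50 Hz fundamental, 4000-point segments, 50 % overlap) are taken from figures quoted elsewhere in the paper.

```python
import numpy as np


def overlap_smooth(seg1, seg2, overlap):
    """Concatenate two segments with a linear cross-fade of `overlap` samples.

    The weight of seg1 falls linearly from 1 to 0 while the weight of seg2
    rises from 0 to 1, so the two weights always sum to 1.
    """
    seg1 = np.asarray(seg1, dtype=float)
    seg2 = np.asarray(seg2, dtype=float)
    if not 0 < overlap <= min(len(seg1), len(seg2)):
        raise ValueError("overlap must be positive and fit inside both segments")
    w2 = np.linspace(0.0, 1.0, overlap)  # rising weight for the next segment
    w1 = 1.0 - w2                        # falling weight for the current segment
    blended = w1 * seg1[-overlap:] + w2 * seg2[:overlap]
    return np.concatenate([seg1[:-overlap], blended, seg2[overlap:]])


# Overlap of 50 % of one fundamental period (assumed: 44 kHz, 50 Hz).
fs, f0 = 44000, 50
overlap = int(0.5 * fs / f0)
# Two constant segments at different levels make the ramp easy to inspect.
joined = overlap_smooth(np.full(4000, 1.0), np.full(4000, 2.0), overlap)
```

With two constant segments the output moves linearly from the first level to the second across the overlap, which makes the effect of the weights easy to see before applying the function to real segments.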

Recording engine sounds
Sound samples were collected from a four-cylinder gasoline engine. A microphone was placed in the engine cabin to record the sounds, and the rotational speed was obtained from the on-board diagnostics. The engine sounds were recorded at intervals of 50 RPM from idle to 2400 RPM. The recording length was 15 s.
The sound samples are decomposed according to Fig. 3. The cut-off frequency of the 1/3 octave filter was 282 Hz, which divides the high and low frequencies. At the highest recorded engine speed of 2400 RPM, the fundamental frequency is 80 Hz and its third harmonic is 240 Hz, so the first three orders lie below the cut-off at every recorded speed. The high-frequency part was further filtered by the 1/3 octave filter to obtain the narrowband components. Then, the Hilbert transform was adopted to obtain the carriers $c_j(t)$ and envelopes $a_j(t)$.
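The 282 Hz cut-off can be related to standard 1/3 octave bands with a short calculation. The sketch assumes the base-10 band definition used in IEC 61260, where band $n$ has centre frequency $10^{n/10}$ Hz; the paper does not state which definition was used, and the function name is illustrative.

```python
def third_octave_band(n: int):
    """Return (lower edge, centre, upper edge) in Hz of base-10 band n.

    Assumes the base-10 convention: centre = 10**(n / 10) and the band
    edges lie a factor of 10**(1 / 20) below and above the centre.
    """
    centre = 10.0 ** (n / 10.0)
    half_band = 10.0 ** 0.05
    return centre / half_band, centre, centre * half_band


# Band 24 is the nominal 250 Hz band, band 25 the nominal 315 Hz band.
_, centre_250, upper_250 = third_octave_band(24)
lower_315, centre_315, _ = third_octave_band(25)
third_harmonic_at_2400_rpm = 3 * 80.0
print(round(upper_250, 1), round(lower_315, 1), third_harmonic_at_2400_rpm)
```

Under this assumption the shared edge of the 250 Hz and 315 Hz bands is about 281.8 Hz, which rounds to the quoted 282 Hz and lies above the 240 Hz third harmonic at 2400 RPM.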

Synthesis of the low-frequency part
The low-frequency part is dominated by the fundamental frequency. It includes the first and second harmonics and a little clutter. We synthesise the sound segments with the overlap smoothing algorithm. When the next sound segment is to be played, we estimate the phase at the end of the current segment $x_1(t)$. The initial phase of the next segment $x_2(t)$ is then determined through pitch synchronisation. The two segments are overlapped smoothly and added together. The length of the overlap is 50 % of the fundamental period. A concatenation of two segments is shown in Fig. 5.

Synthesis of the carrier at high frequencies
The carrier of a high-frequency narrowband component obtained by the Hilbert transform is $c(t) = \cos(2\pi f_c t + \theta)$. We synthesise the sound segments of the carriers with the overlap smoothing algorithm. The amplitude of a carrier is 1, so the phases of the two segments can be aligned simply. The phase at the end of the current segment $x_1(t)$ is estimated and taken as the initial phase of the next segment $x_2(t)$. The overlap length is set to 50 % of the period at the centre frequency of the 1/3 octave band. The overlap should be longer if the centre frequency exceeds 2 kHz, because there are fewer sampling points per period at high frequencies.
Then the two segments are concatenated by the overlap smoothing algorithm, as shown in Fig. 6.
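To see why high centre frequencies leave few samples in the overlap, the overlap lengths can be converted from fractions of a period into sample counts. The helper below is a hypothetical illustration; the 44 kHz sampling frequency and the 50 Hz fundamental are the values reported for the evaluation dataset, and the fractions (50 % for the low-frequency part and the carriers, 150 % for the envelopes) are those stated in the text.

```python
def overlap_samples(fs: float, freq: float, fraction: float) -> int:
    """Overlap length in samples for `fraction` of one period of `freq`."""
    if fs <= 0 or freq <= 0 or fraction <= 0:
        raise ValueError("fs, freq and fraction must all be positive")
    # At least one sample is needed for any cross-fade to take place.
    return max(1, round(fraction * fs / freq))


fs = 44000  # sampling frequency reported for the dataset
low_part = overlap_samples(fs, 50, 0.5)        # 50 % of the fundamental period
carrier_2k = overlap_samples(fs, 2000, 0.5)    # 50 % of a 2 kHz carrier period
envelope = overlap_samples(fs, 50, 1.5)        # 150 % of the fundamental period
print(low_part, carrier_2k, envelope)
```

At 2 kHz the 50 % overlap is only 11 samples, compared with 440 samples for the low-frequency part, which is consistent with the advice to lengthen the overlap above 2 kHz.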

Synthesis of the envelope at high frequencies
The envelope of a high-frequency narrowband component obtained by the Hilbert transform is $a(t)$, which is positive. The amplitude modulation uses the orders of the fundamental frequency. We synthesise the sound segments of the envelopes with the overlap smoothing algorithm. The overlap is longer here to ensure a good match: the overlap length is 150 % of the period of the fundamental frequency. Since the envelope spectrum is still dominated by the fundamental frequency, peak points are set as pitch marks for phase alignment. There are at least two peak points in a fundamental period. Thus, the process is to search backwards for a peak point within 1.5 fundamental periods starting from the end of segment 1, and then to search forwards for a peak point within 1.5 fundamental periods from the start of segment 2. Then the two segments are phase-aligned and smoothly overlapped with the peak points as pitch marks, as shown in Fig. 7.
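The peak search for the envelope pitch marks can be sketched as follows. The text does not say how a peak point is detected, so this sketch simply assumes that the pitch mark is the largest sample inside each 1.5-period search window; the function name and the synthetic envelopes are illustrative only.

```python
import numpy as np


def envelope_pitch_marks(env1, env2, fs, f0):
    """Return (mark1, mark2): peak indices near the end of env1 and start of env2.

    Assumption: the peak is taken as the maximum within 1.5 fundamental
    periods of the end of segment 1 and of the start of segment 2.
    """
    env1 = np.asarray(env1, dtype=float)
    env2 = np.asarray(env2, dtype=float)
    window = int(round(1.5 * fs / f0))
    tail = env1[-window:]
    # Convert the position inside the tail back to an index into env1.
    mark1 = len(env1) - len(tail) + int(np.argmax(tail))
    mark2 = int(np.argmax(env2[:window]))
    return mark1, mark2


# Synthetic envelopes dominated by a 50 Hz fundamental (assumed values).
fs, f0 = 44000, 50
t = np.arange(4000) / fs
env_a = 1.0 + 0.5 * np.cos(2 * np.pi * f0 * t)             # peaks every 880 samples
env_b = 1.0 + 0.5 * np.cos(2 * np.pi * f0 * (t - 0.0125))  # first peak at sample 550
mark_a, mark_b = envelope_pitch_marks(env_a, env_b, fs, f0)
```

Once the two marks are known, segment 1 can be cut at its mark and segment 2 started from its mark so that the peaks coincide before the linear cross-fade is applied.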

Combining components
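The three groups of components (the low-frequency part, the high-frequency carriers and the high-frequency envelopes) are combined to form the synthesised sound: each high-frequency narrowband component is rebuilt from its carrier and envelope, and the components are added to the low-frequency part. As shown in Fig. 8, there is no apparent mismatch in the time domain.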
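The final combination can be sketched in a few lines. The sketch assumes, following the amplitude-modulation model, that each narrowband component is rebuilt as envelope times carrier before the bands are summed with the low-frequency part; the function name and the toy inputs are illustrative.

```python
import numpy as np


def combine_components(low, carriers, envelopes):
    """Sum the low-frequency part and every envelope-modulated carrier.

    `carriers` and `envelopes` are equal-length sequences of arrays, one
    pair per narrowband component, each the same length as `low`.
    """
    if len(carriers) != len(envelopes):
        raise ValueError("each carrier needs exactly one envelope")
    out = np.asarray(low, dtype=float).copy()
    for carrier, envelope in zip(carriers, envelopes):
        out += np.asarray(envelope, dtype=float) * np.asarray(carrier, dtype=float)
    return out


# Toy inputs: a silent low-frequency part and two constant bands.
low = np.zeros(8)
carriers = [np.full(8, 0.5), np.full(8, -1.0)]
envelopes = [np.full(8, 2.0), np.full(8, 0.25)]
mixed = combine_components(low, carriers, envelopes)
```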

Dataset
A four-cylinder gasoline engine was considered. Its speed was 1500 RPM, and the fundamental frequency was 50 Hz. The sampling frequency was 44 kHz. There were 4000 points in one segment, so the duration of a segment was about 90 ms. The proposed method is compared with two others: direct synthesis, which directly connects the end of one segment to the start of the next segment, and PSOLA. To evaluate the differences, each synthesis algorithm was run 100 times in sequence, which results in a synthetic sound lasting about 9 s.
Fig. 9 is the A-weighted short-time Fourier transform (STFT) spectrogram of the non-synthetic sound recorded from the engine. In the spectrogram, the main frequency bands are 500-1000 Hz and 2000-3000 Hz. The distribution is discrete in both the frequency and time domains.
Fig. 10 is the spectrogram of the directly synthesised sound, in which the sound segments are spliced together directly. In the spectrogram, there are few harmonics at low frequencies and many fish-scale patterns at high frequencies because of the phase mismatch.
As shown in Fig. 11, PSOLA can synthesise the low-frequency part well. However, it does not achieve a good match at high frequencies. There are fish-scale patterns in the spectrogram, which differs markedly from Fig. 9.
The spectrogram in Fig. 12, produced by the proposed method, is very similar to Fig. 9 and shows almost no fish-scale patterns.

Objective evaluation
The cepstrum distance is widely used to evaluate the quality of speech synthesis [30][31][32]. We calculate the cepstrum distance of amplitude modulation between the synthetic and non-synthetic engine sounds. The computation is performed through the following steps:
(1) Weight is added to account for the sensitivity of human hearing.
(2) The weighted signal is filtered into frequency bands.
(3) The amplitude envelope of the filtered signal is extracted.
(4) The cepstrum of the amplitude envelope is computed (Hann windows, 512 points per frame).
(5) The distance between the synthetic and non-synthetic sound is calculated from the two cepstra, where $n$ is the sequence number of the point, $c_s(n)$ is the cepstrum of the filtered synthetic sound and $c_r(n)$ is the cepstrum of the filtered non-synthetic sound.
Fig. 13 shows the distance as a function of frequency for the synthetic sounds compared with the non-synthetic sound. A smaller distance means a smaller difference. These results confirm the visual observations in the high-frequency range. The distance for the proposed method is significantly smaller than those for the other methods at frequencies over 500 Hz.
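Steps (4) and (5) can be sketched for a single frequency band as follows. The exact distance formula is not reproduced here, so the sketch assumes the form commonly used in speech synthesis, $d = \frac{10}{\ln 10}\sqrt{2\sum_n (c_s(n) - c_r(n))^2}$ in dB, averaged over frames. The function names, the number of coefficients (20) and the use of non-overlapping frames are also assumptions; only the Hann window and the 512-point frame come from the text.

```python
import numpy as np


def real_cepstrum(frame):
    """Real cepstrum of one Hann-windowed frame."""
    frame = np.asarray(frame, dtype=float)
    windowed = frame * np.hanning(len(frame))
    magnitude = np.abs(np.fft.rfft(windowed))
    # The small offset avoids log(0) for empty spectral bins.
    return np.fft.irfft(np.log(magnitude + 1e-12), n=len(frame))


def cepstral_distance_db(synth_env, ref_env, frame_len=512, n_coeff=20):
    """Mean cepstral distance in dB between two amplitude envelopes.

    Assumed formula: (10 / ln 10) * sqrt(2 * sum((c_s - c_r)**2)) per frame,
    using coefficients 1..n_coeff and non-overlapping frames.
    """
    synth = np.asarray(synth_env, dtype=float)
    ref = np.asarray(ref_env, dtype=float)
    n = min(len(synth), len(ref))
    if n < frame_len:
        raise ValueError("signals must contain at least one full frame")
    distances = []
    for start in range(0, n - frame_len + 1, frame_len):
        cs = real_cepstrum(synth[start:start + frame_len])[1:n_coeff + 1]
        cr = real_cepstrum(ref[start:start + frame_len])[1:n_coeff + 1]
        distances.append(10.0 / np.log(10.0) * np.sqrt(2.0 * np.sum((cs - cr) ** 2)))
    return float(np.mean(distances))


# Illustrative envelopes: a reference and a noisy copy of it.
rng = np.random.default_rng(0)
t = np.arange(4096) / 44000
reference = 1.0 + 0.5 * np.sin(2 * np.pi * 50 * t)
noisy = reference + 0.1 * rng.standard_normal(len(t))
same = cepstral_distance_db(reference, reference)
different = cepstral_distance_db(noisy, reference)
```

Repeating the calculation for each band gives a distance-versus-frequency curve of the kind plotted in Fig. 13.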

Subjective evaluation
Subjective assessment is necessary for evaluating the quality of synthesised engine sound, as it is for synthesised speech [11,12,33]. Naturalness describes the difference between synthetic and natural sounds, and a popular index of it is the mean opinion score (MOS) [34]. The final comparison was therefore a subjective assessment of the sounds. Altogether, 21 people with normal hearing were recruited to evaluate the engine sounds. Each listener was presented with pairs of car noise stimuli in a soundproof room in which the sound level was lower than 38 dBA. The engine sounds were played in a loop for 20 s by a loudspeaker. The listeners rated the sounds on a scale from 1 (extremely unnatural) to 7 (extremely natural), and the mean opinion scores were computed from their ratings. The results of this auditory experiment are summarised in Fig. 14. The non-synthetic sound is the most natural, and the directly synthesised sound is the least natural. The naturalness of the engine sound synthesised by the proposed method is significantly higher than that of the sounds synthesised by the other two methods. The results of this subjective evaluation are consistent with the objective evaluation.

Conclusions
This paper presents an approach for engine sound decomposition and synthesis. The characteristics of engine sound are extracted by analysing the frequencies and amplitude modulation. The low-frequency part is represented by harmonics that depend on the engine speed. The high-frequency part is represented by narrowband carriers and envelopes, which are also harmonics that depend on the engine speed. The high-frequency part of the engine sound is not treated as stochastic noise, which is the key to improving the auralisation. The low-frequency part is obtained by a low-pass filter. The high-frequency part is filtered at 1/3 octave. A Hilbert transform is used to separate the carrier and envelope of the filtered high-frequency signals. The signals are concatenated by the overlap smoothing algorithm, which can reduce the phase and amplitude mismatch. The engine sound is then generated by combining the concatenated signals of the different frequency bands. This approach can synthesise engine sound samples of any length.
Subjective and objective evaluations of the proposed method were performed by comparing a raw engine sound with synthesised sounds. The spectrogram of the sound produced by the proposed method is very similar to that of the natural sound. The proposed method also gives a smaller cepstrum distance and a higher mean opinion score than the other methods. These results show that the sound synthesised by the proposed method is closer to the raw engine sound.