Morphing Stage
This section gives a detailed account of how the morphing
process is carried out. The overall aim is to make a smooth
transition from signal 1 to signal 2. This is partially accomplished by the 2D
array of the match path provided by the DTW (Dynamic Time Warping). At this stage, it was decided
exactly what form the morph would take. The implementation chosen was to
perform the morph over the duration of the longer signal, so that the final morphed speech
signal has the same duration as the longer of the two inputs. In order to accomplish
this, the 2D array is interpolated to provide the desired duration.
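
The interpolation step can be illustrated with a short sketch. It is not the original implementation: it assumes the match path is a list of (signal 1 frame, signal 2 frame) index pairs and that plain linear interpolation along the path is acceptable. The function name `resample_path` and the example values are hypothetical.

```python
def resample_path(path, target_len):
    """Linearly resample a DTW match path to target_len points.

    path is assumed to be a list of (frame_in_signal_1, frame_in_signal_2)
    pairs; target_len is the frame count of the longer signal.
    """
    if target_len < 2 or len(path) < 2:
        raise ValueError("need at least two path points and two output points")
    last = len(path) - 1
    out = []
    for k in range(target_len):
        pos = k * last / (target_len - 1)  # fractional position along the path
        lo = int(pos)
        hi = min(lo + 1, last)
        frac = pos - lo
        i = path[lo][0] + frac * (path[hi][0] - path[lo][0])
        j = path[lo][1] + frac * (path[hi][1] - path[lo][1])
        out.append((i, j))
    return out


# Hypothetical match path of 4 points stretched to 7 morph frames.
path = [(0, 0), (1, 1), (1, 2), (2, 3)]
resampled = resample_path(path, 7)
```

The resampled path keeps the original start and end points, so the morph still begins on the first frames of both signals and ends on their last frames. The fractional indices would have to be rounded or used to interpolate between neighbouring frames.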
Signal Acquisition
Before any processing can begin, the sound signal that is
created by some real-world process has to be captured on the computer by some
method. This is called sampling. Digital signal processing (in
this case of sound) operates on sequences of samples. When a
natural process, such as a musical instrument, produces sound, the signal produced
is analog (continuous-time) because it is defined along a continuum of times. A
discrete-time signal is represented by a sequence of numbers: the signal is
only defined at discrete times. A digital signal is a special instance of a
discrete-time signal in which both time and amplitude are discrete. Each discrete
value of the signal is termed a sample.
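
As an illustration of the two kinds of discreteness, the sketch below samples a sine tone at fixed time steps and rounds each amplitude to a signed integer level. The tone, the 8 kHz sample rate and the 16-bit depth are assumptions chosen for the example, not values taken from the morphing system.

```python
import math


def sample_and_quantize(freq_hz, duration_s, sample_rate_hz, bits):
    """Return a sine tone as a list of integer samples.

    Time is made discrete by evaluating the tone only at n / sample_rate_hz;
    amplitude is made discrete by rounding to a signed integer level.
    """
    n_samples = int(round(duration_s * sample_rate_hz))
    max_level = 2 ** (bits - 1) - 1  # e.g. 32767 for 16 bits
    samples = []
    for n in range(n_samples):
        t = n / sample_rate_hz  # discrete sampling instant
        x = math.sin(2 * math.pi * freq_hz * t)  # "analog" value in [-1, 1]
        samples.append(round(x * max_level))  # discrete amplitude
    return samples


# 10 ms of a 440 Hz tone at 8 kHz, 16 bits per sample (illustrative values).
samples = sample_and_quantize(440.0, 0.01, 8000, 16)
```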
Introduction
Voice morphing means the transition of one speech signal
into another. Like image morphing, speech morphing aims to preserve the shared
characteristics of the starting and final signals, while generating a smooth
transition between them. In image morphing, the in-between images all show one
face smoothly changing its shape and texture until it turns into the target
face. It is this feature that a speech morph should possess: one speech signal
should smoothly change into another, keeping the shared characteristics of the
starting and ending signals but smoothly changing the other properties.
Combination of the Pitch Peak Information
As stated above, a satisfying morph must have just one pitch.
This means that each morph slice must have a single pitch
peak that has characteristics of both signal 1 and signal 2. Therefore, an
'artificial' peak needs to be generated to satisfy this requirement. The
positions of the signal 1 and signal 2 pitch peaks are stored in an array
(created during the pre-processing, above), which means that the desired pitch
peak location can easily be calculated.
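
One plausible way to calculate the artificial peak location is a weighted average of the two stored positions. The sketch below assumes exactly that: the linear weighting, the morph factor `alpha` (0 gives signal 1, 1 gives signal 2) and the function name `morphed_peak_positions` are illustrative assumptions, and the text above does not state the actual formula used.

```python
def morphed_peak_positions(peaks1, peaks2, alpha):
    """Blend per-frame pitch peak positions of two aligned signals.

    peaks1 and peaks2 hold the peak position for each matched frame;
    alpha is an assumed morph factor in [0, 1].
    """
    if len(peaks1) != len(peaks2):
        raise ValueError("peak arrays must cover the same matched frames")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [(1 - alpha) * p1 + alpha * p2 for p1, p2 in zip(peaks1, peaks2)]


# Hypothetical peak positions for three matched frames.
peaks1 = [40.0, 42.0, 45.0]
peaks2 = [60.0, 58.0, 55.0]
midway = morphed_peak_positions(peaks1, peaks2, 0.5)
```

For a morph that moves from signal 1 to signal 2 over time, `alpha` would rise from 0 to 1 across the frames rather than stay fixed.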
Future Scope
There are a number of areas in which further work should be
carried out in order to improve the technique described here and extend the
field of speech morphing in general. The time required to generate a morph is
dominated by the signal re-estimation process. Even a small number of
iterations (for example, 2) takes a significant amount of time, even when
re-estimating signals of approximately one second in duration. Speech
morphing inevitably loses some quality through manipulation, so fewer
iterations are needed, but an improved re-estimation algorithm is still required.
Abstract
Voice morphing means the transition of one speech signal
into another. The new morphed signal will have the same information content as
the two input speech signals but a different pitch, which is determined by the
morphing algorithm. To do this, each signal's information has to be converted
into another representation, which enables the pitch and spectral envelope to
be encoded on orthogonal axes. Individual components of the speech signals are
then matched, and the signals' amplitudes are interpolated to produce a new
speech signal.
Conclusion
The approach we have adopted separates each sound into two
forms: a spectral envelope representation and a pitch and voicing
representation. These can then be independently modified.
The pitch peaks are obtained from the pitch spectrograms to create a pitch
contour for each sound. Dynamic Time Warping of these contours aligns the
sounds with respect to their pitches. At each corresponding frame, the pitch,
voicing and envelope information are separately morphed to produce a final
morphed frame. These frames are then converted back into a time-domain waveform
using the signal re-estimation algorithm.
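
The alignment step can be sketched with a textbook Dynamic Time Warping routine. This is a minimal version under stated assumptions: each pitch contour is a list of one value per frame, the local cost is the absolute difference, and the allowed steps are the three standard ones. The original system may use different costs or path constraints, and the contour values are made up for the example.

```python
def dtw_path(a, b):
    """Align two pitch contours with classic DTW.

    Returns (total_cost, path), where path is a list of
    (index_in_a, index_in_b) pairs from the first to the last frame.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])  # local distance between frames
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack from the end, always moving to the cheapest predecessor.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        steps = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(steps)
        path.append((i - 1, j - 1))
    path.reverse()
    return cost[n][m], path


# Hypothetical pitch contours of different lengths.
contour1 = [100, 110, 120, 120]
contour2 = [100, 120]
total, path = dtw_path(contour1, contour2)
```

The returned path plays the role of the match path: each pair names the frames of the two sounds that are morphed together.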