3. Experiment

The MOCHA-TIMIT corpus, which provides synchronized acoustic-articulatory recordings, is used in the experiment. It includes one male speaker (msak0) and one female speaker (fsew0), each uttering 460 TIMIT sentences. Nine electromagnetic receiver coils are attached in the midsagittal plane: on the velum (V), tongue dorsum (TD), tongue blade (TB), tongue tip (TT), lower jaw (LJ), upper lip (UL) and lower lip (LL), plus two reference coils (REF) on the nose ridge and upper jaw. The x and y coordinates of each coil are recorded, providing a total of 18 channels of articulatory information. Only the data of the male speaker (msak0) are used in this paper. Of the coils, the trajectories of V, TD, TB, TT, LJ, LL and UL are used in our experiment. The sampling frequencies are 16,000 Hz for the acoustic signal and 500 Hz for the articulatory signal.

Figure 2. The positions of the EMA coils on the speaker's articulators

In preparing the experiment, the speech is segmented into frames with a 25 ms Hanning window and a frame shift of 10 ms. Each speech frame is encoded by the log-energy and 12th-order MFCCs, augmented with their delta and delta-delta coefficients. The EMA data are smoothed with a Savitzky-Golay filter of order 3 and frame size 21, and down-sampled to 100 Hz to match the frame rate of the acoustic features.
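The EMA smoothing and rate-matching step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the edge padding at utterance boundaries, and down-sampling by keeping every fifth sample (500 Hz to 100 Hz) are all assumptions. The Savitzky-Golay coefficients are derived directly from a least-squares polynomial fit; a library routine such as `scipy.signal.savgol_filter` could be used instead.

```python
import numpy as np

def savgol_coeffs(window=21, order=3):
    """Smoothing coefficients of a Savitzky-Golay filter (window length, polynomial order)."""
    half = window // 2
    offsets = np.arange(-half, half + 1, dtype=float)
    # Columns are offsets**0 .. offsets**order (least-squares design matrix).
    design = np.vander(offsets, order + 1, increasing=True)
    # Row 0 of the pseudo-inverse maps the window samples to the fitted
    # polynomial's value at offset 0, i.e. the smoothed centre sample.
    return np.linalg.pinv(design)[0]

def smooth_and_downsample(ema, window=21, order=3, factor=5):
    """ema: array (n_samples, n_channels) at 500 Hz -> smoothed array at 100 Hz.

    Assumptions: boundaries are handled by repeating the edge samples, and
    down-sampling keeps every `factor`-th smoothed sample.
    """
    ema = np.asarray(ema, dtype=float)
    coeffs = savgol_coeffs(window, order)
    half = window // 2
    padded = np.pad(ema, ((half, half), (0, 0)), mode="edge")
    # np.convolve flips its kernel, so reverse the coefficients to correlate.
    smoothed = np.stack(
        [np.convolve(padded[:, c], coeffs[::-1], mode="valid")
         for c in range(ema.shape[1])],
        axis=1,
    )
    return smoothed[::factor]
```

Because the filter fits a cubic inside each 21-sample window, any trajectory that is locally cubic passes through unchanged away from the boundaries, while higher-frequency measurement noise is attenuated.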

In our experiments, we use a context window of 7 consecutive frames of acoustic features as the input. As the output, we use the EMA frame at the time instant corresponding to the middle frame of the acoustic context window. The data are randomly partitioned into three sets: a training set (370 utterances), a validation set (45 utterances), and a testing set (45 utterances). Both the EMA and MFCC feature vectors are normalized per dimension by subtracting the global mean and dividing by the global standard deviation.
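The input construction, data split and normalization described above might look like the sketch below. The function names, the random seed, and the edge padding at utterance boundaries are assumptions made for illustration; the text does not say how boundary frames are treated. Assuming 13 static coefficients (log-energy plus 12 MFCCs) with deltas and delta-deltas, each frame has 39 dimensions, so a 7-frame context gives a 273-dimensional input.

```python
import numpy as np

def stack_context(features, context=7):
    """Concatenate each frame with its neighbours: (T, D) -> (T, context * D).

    Assumption: utterance boundaries are padded by repeating the edge frame.
    """
    half = context // 2
    n_frames = len(features)
    padded = np.pad(features, ((half, half), (0, 0)), mode="edge")
    return np.concatenate(
        [padded[i:i + n_frames] for i in range(context)], axis=1
    )

def split_utterances(n_total=460, n_val=45, n_test=45, seed=0):
    """Random partition of utterance indices into train / validation / test."""
    idx = np.random.default_rng(seed).permutation(n_total)
    return idx[n_val + n_test:], idx[:n_val], idx[n_val:n_val + n_test]

def zscore(x, mean=None, std=None):
    """Per-dimension normalization; pass stored statistics to reuse them."""
    mean = x.mean(axis=0) if mean is None else mean
    std = x.std(axis=0) if std is None else std
    return (x - mean) / std, mean, std
```

With this layout, the middle block of each stacked vector (columns 3 × 39 to 4 × 39) is the frame aligned with the target EMA frame.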

To measure accuracy, we adopt the root-mean-squared error (RMSE) and the correlation coefficient, which are the most widely used measures for evaluating articulatory inversion performance. The RMSE indicates the overall distance between two trajectories, while the correlation indicates their synchrony and similarity of shape. They are defined as:

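$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\hat{x}_i - x_i\right)^2}$$

$$r = \frac{\sum_{i=1}^{N}\left(\hat{x}_i - \bar{\hat{x}}\right)\left(x_i - \bar{x}\right)}{\sqrt{\sum_{i=1}^{N}\left(\hat{x}_i - \bar{\hat{x}}\right)^2}\,\sqrt{\sum_{i=1}^{N}\left(x_i - \bar{x}\right)^2}}$$

where $\hat{x}_i$ and $x_i$ are the estimated and actual coordinates of a coil at time instant $i$, respectively, $N$ is the number of frames, and $\bar{\hat{x}}$ and $\bar{x}$ are the means of the estimated and actual coordinates over the $N$ frames.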
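As an illustration, both measures can be computed per articulatory channel with a few lines of NumPy. This is a sketch using the standard definitions of RMSE and the Pearson correlation coefficient; the function names and the (frames, channels) array layout are assumptions.

```python
import numpy as np

def rmse(pred, true):
    """Root-mean-squared error per channel; inputs are (n_frames, n_channels)."""
    pred = np.asarray(pred, dtype=float)
    true = np.asarray(true, dtype=float)
    return np.sqrt(np.mean((pred - true) ** 2, axis=0))

def correlation(pred, true):
    """Pearson correlation coefficient per channel."""
    pred = np.asarray(pred, dtype=float)
    true = np.asarray(true, dtype=float)
    dp = pred - pred.mean(axis=0)   # deviations of the estimate from its mean
    dt = true - true.mean(axis=0)   # deviations of the measurement from its mean
    return (dp * dt).sum(axis=0) / np.sqrt(
        (dp ** 2).sum(axis=0) * (dt ** 2).sum(axis=0)
    )
```

The two measures are complementary: an estimate that is offset from the true trajectory by a constant has a non-zero RMSE but a correlation of 1, since its shape and timing are unchanged.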

Two cost functions are used in training the batch-normalized DNN, where L1 is the least-squares loss and L2 is a weighted version of L1.

where the weighting coefficient wij depends on the velocity vi at the current time instant.
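The relationship between the two losses can be sketched as below. The exact normalization of the losses and the mapping from velocity to weight are not recoverable from the text, so everything here is an assumption: the losses are averaged over frames and channels, the velocity is estimated by finite differences of the target trajectory, and the caller supplies the weight function. The function names are hypothetical.

```python
import numpy as np

def least_squares_loss(pred, target):
    """Unweighted squared error (L1 in the text), averaged over frames and channels."""
    return np.mean((pred - target) ** 2)

def weighted_loss(pred, target, weights):
    """Weighted squared error (L2 in the text); weights has the same shape as target."""
    return np.mean(weights * (pred - target) ** 2)

def velocity_weights(target, weight_fn):
    """Per-frame, per-channel weights computed from the target trajectory's speed.

    Assumption: velocity is the finite-difference derivative along time (axis 0).
    `weight_fn` is a hypothetical mapping from speed to weight chosen by the caller.
    """
    speed = np.abs(np.gradient(target, axis=0))
    return weight_fn(speed)
```

For example, `velocity_weights(target, lambda v: 1.0 / (1.0 + v))` is one purely illustrative choice that emphasizes slow-moving frames; it is not the weighting used in the paper. With all weights equal to 1, the weighted loss reduces to the least-squares loss.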