Title: Chinese Journal of Phonetics (Vol. 11)
Author: sponsored by the Institute of Linguistics, Chinese Academy of Social Sciences

3. Experiment

The MOCHA-TIMIT corpus, which provides synchronized acoustic and articulatory recordings, is used in the experiment. It includes one male speaker (msak0) and one female speaker (fsew0), each uttering 460 TIMIT sentences. Electromagnetic receiver coils are attached at 9 positions in the midsagittal plane: the velum (V), tongue dorsum (TD), tongue blade (TB), tongue tip (TT), lower jaw (LJ), upper lip (UL), lower lip (LL), and two reference coils (REF) on the nose ridge and upper jaw. The x and y coordinates of each coil are recorded, giving a total of 18 channels of articulatory information. Only the data of the male subject (msak0) are used in this paper, and of the coils, only the trajectories of V, TD, TB, TT, LJ, LL, and UL are used in our experiment. The sampling frequencies are 16,000 Hz for the acoustic signal and 500 Hz for the articulatory signal.

Figure 2 The positions of the EMA coils on the speaker's articulators

In preparing the experiment, the speech is segmented into frames with a 25 ms Hanning window and a frame shift of 10 ms. Each speech frame is encoded as the log-energy and 12th-order MFCCs, augmented with their deltas and delta-deltas. The EMA data are smoothed with a Savitzky-Golay filter of order 3 with a frame size of 21, and down-sampled to 100 Hz to match the frame rate of the acoustic features.
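The framing and EMA preprocessing described above can be sketched roughly as follows. This is a minimal illustration using NumPy and SciPy, not the authors' code: the function names are hypothetical, and down-sampling by keeping every fifth smoothed sample is an assumption, since the text does not say how the 500 Hz trajectories are reduced to 100 Hz.

```python
import numpy as np
from scipy.signal import savgol_filter

ACOUSTIC_RATE_HZ = 16000  # acoustic sampling rate (from the text)
EMA_RATE_HZ = 500         # articulatory sampling rate (from the text)
FRAME_RATE_HZ = 100       # implied by the 10 ms frame shift


def frame_signal(signal, rate=ACOUSTIC_RATE_HZ, win_ms=25, shift_ms=10):
    """Cut a 1-D signal into Hanning-windowed frames, one row per frame."""
    win = rate * win_ms // 1000      # 400 samples at 16 kHz
    shift = rate * shift_ms // 1000  # 160 samples at 16 kHz
    n_frames = 1 + (len(signal) - win) // shift
    # Row r holds samples r*shift ... r*shift + win - 1.
    idx = np.arange(win)[None, :] + shift * np.arange(n_frames)[:, None]
    return signal[idx] * np.hanning(win)


def smooth_and_downsample_ema(ema, polyorder=3, window_length=21):
    """Savitzky-Golay smoothing along time, then 500 Hz -> 100 Hz.

    `ema` has shape (samples, channels). Plain decimation after
    smoothing is an assumption of this sketch.
    """
    smoothed = savgol_filter(ema, window_length=window_length,
                             polyorder=polyorder, axis=0)
    step = EMA_RATE_HZ // FRAME_RATE_HZ  # keep every 5th sample
    return smoothed[::step]
```

Because the acoustic frames are 25 ms long while the EMA stream has no window length, the two streams can differ by a frame or two at the end of an utterance; truncating both to the shorter length is one simple way to align them.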

In our experiments, we use a context window of 7 consecutive frames of acoustic features as the input. The output is the EMA frame at the time instant corresponding to the middle frame of the context window. The data are randomly partitioned into three sets: a validation set (45 utterances), a testing set (45 utterances), and a training set (370 utterances). Both the EMA and the MFCC feature vectors are normalized per dimension by subtracting the global mean and dividing by the standard deviation.
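As a rough illustration of this input/output construction, the sketch below stacks 7 frames of 39-dimensional acoustic features (13 static coefficients plus deltas and delta-deltas) into a 273-dimensional input and pairs it with the EMA frame at the window centre. All function names are hypothetical. Dropping the three edge frames on each side rather than padding, and the random seed, are assumptions; the text also does not say over which subset the global statistics are computed.

```python
import numpy as np


def stack_context(acoustic, width=7):
    """Concatenate `width` consecutive frames.

    Input shape (T, D); output shape (T - width + 1, width * D).
    Edge frames without a full context are dropped (an assumption).
    """
    n_out = acoustic.shape[0] - width + 1
    return np.hstack([acoustic[k:k + n_out] for k in range(width)])


def center_targets(ema, width=7):
    """EMA frames aligned with the middle frame of each context window."""
    half = width // 2
    return ema[half:ema.shape[0] - half]


def zscore(features, mean=None, std=None):
    """Per-dimension normalization; pass stored statistics to reuse them."""
    mean = features.mean(axis=0) if mean is None else mean
    std = features.std(axis=0) if std is None else std
    return (features - mean) / std, mean, std


def split_utterances(n_utts=460, n_val=45, n_test=45, seed=0):
    """Random partition of utterance indices into train/validation/test."""
    order = np.random.default_rng(seed).permutation(n_utts)
    train = order[n_val + n_test:]
    val = order[:n_val]
    test = order[n_val:n_val + n_test]
    return train, val, test
```

Splitting at the utterance level, as the text describes, keeps overlapping context windows from the same sentence out of different sets.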

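To measure accuracy, we adopt the root-mean-squared error (RMSE) and the correlation coefficient, the two most widely used measures for evaluating articulatory inversion performance. The RMSE indicates the overall distance between two trajectories, while the correlation indicates their synchrony and similarity of shape. They are defined as:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\hat{x}_i - x_i\right)^2}$$

$$r = \frac{\sum_{i=1}^{N}\left(\hat{x}_i - \bar{\hat{x}}\right)\left(x_i - \bar{x}\right)}{\sqrt{\sum_{i=1}^{N}\left(\hat{x}_i - \bar{\hat{x}}\right)^2 \sum_{i=1}^{N}\left(x_i - \bar{x}\right)^2}}$$

where $\hat{x}_i$ and $x_i$ are the estimated and actual coordinates of a coil at time instant $i$, respectively, $N$ is the number of frames, and $\bar{\hat{x}}$ and $\bar{x}$ are the corresponding means over the $N$ frames.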
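The two measures can be computed per articulatory channel as in the following sketch, which assumes the standard definitions of RMSE and the Pearson correlation coefficient. The function names are illustrative, and both arrays are assumed to have shape (frames, channels).

```python
import numpy as np


def rmse(estimated, actual):
    """Root-mean-squared error of each channel over all frames."""
    estimated = np.asarray(estimated, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return np.sqrt(np.mean((estimated - actual) ** 2, axis=0))


def correlation(estimated, actual):
    """Pearson correlation coefficient of each channel over all frames."""
    estimated = np.asarray(estimated, dtype=float)
    actual = np.asarray(actual, dtype=float)
    e = estimated - estimated.mean(axis=0)
    a = actual - actual.mean(axis=0)
    denom = np.sqrt((e ** 2).sum(axis=0) * (a ** 2).sum(axis=0))
    return (e * a).sum(axis=0) / denom
```

A constant offset between the estimated and actual trajectories raises the RMSE but leaves the correlation at 1, which is why the two measures are reported together.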

Two cost functions are used to train the batch-normalized DNN: $L_1$ is the least-squares loss and $L_2$ is a weighted version of $L_1$, in which the weighting coefficient $w_{ij}$ depends on the velocity $v_i$ at the current time instant.
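To make the relation between the two losses concrete, the sketch below computes a plain and a velocity-weighted squared error. It is an illustration only: the loss normalization, the finite-difference velocity estimate, and in particular `example_weight` are hypothetical choices, since the form of the weighting is not recoverable from the text. The indices are read here as frame i and channel j, which is also an assumption.

```python
import numpy as np


def least_squares_loss(pred, target):
    """Unweighted sum of squared errors over frames and channels."""
    return np.sum((pred - target) ** 2)


def velocity(target, frame_rate_hz=100.0):
    """Finite-difference velocity of each channel (an assumed estimator)."""
    return np.gradient(target, axis=0) * frame_rate_hz


def weighted_least_squares_loss(pred, target, weight_fn):
    """Squared errors scaled by weights derived from the target velocity.

    `weight_fn` maps a velocity array of shape (frames, channels) to
    weights of the same shape; its form is left to the caller.
    """
    weights = weight_fn(velocity(target))
    return np.sum(weights * (pred - target) ** 2)


def example_weight(v):
    """Hypothetical weighting that emphasizes slow-moving frames."""
    return 1.0 / (1.0 + np.abs(v))
```

With unit weights, for example `weight_fn=np.ones_like`, the weighted loss reduces to the plain least-squares loss, matching the description of the second loss as a weighted version of the first.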
