This semester I took a Functional Programming course. The content is great, and Corky (Robert Cartwright) lectures with real passion (he has deep insights into both FP and OOP, and the class favorite is his sharp takes on various programming languages). The only problem is that Corky talks fast, his volume swings up and down, and he loves telling CS history stories (packed with jargon and names), which is a serious challenge for my brain's NLP capability. So I looked into it and put together a speech-to-text pipeline for lecture recordings; here are some notes on the implementation.
Google Speech-to-Text API
Everyone is building speech-to-text models these days; the two widely regarded as best are OpenAI's Whisper and Google's Chirp. Whisper's problem is that it doesn't support transcribing long recordings (and you have to write code against the API to use it), so it's an immediate pass for the lecture-recording use case. Chirp's output quality is not as good as Whisper's, but it has a web console where you can upload recordings directly, and it supports long-form transcription.
The Google Speech-to-Text API requires uploads in lossless WAV or high-quality lossy FLAC, so the choice of recording software deserves some attention. On macOS you can record M4A with the built-in QuickTime and then transcode with Permute.
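Permute is a GUI step; if you'd rather script the conversion, a minimal sketch using ffmpeg could look like the following (this assumes ffmpeg is installed and on PATH; the 16 kHz mono settings are a common choice for speech recognition, not an API requirement):

```python
import subprocess
from pathlib import Path

def flac_command(src: str) -> tuple[list[str], str]:
    """Build an ffmpeg command that transcodes `src` to FLAC for upload."""
    dst = str(Path(src).with_suffix('.flac'))
    # 16 kHz mono keeps files small and suits speech models; adjust as needed.
    cmd = ['ffmpeg', '-y', '-i', src, '-ar', '16000', '-ac', '1', dst]
    return cmd, dst

def transcode(src: str) -> str:
    cmd, dst = flac_command(src)
    subprocess.run(cmd, check=True)  # raises CalledProcessError if ffmpeg fails
    return dst
```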
Transcription is fast, but the output quality still leaves room for improvement (the low recording volume is partly to blame).
Here is an example:
that actually was the version of software engineering that existed when i was an undergraduate and software engineering was basically craft people would throw code together and uh it it seemed to work that was all that mattered and then people gradually learn, developers gradually learn, this is terrible, we end up with software that's unmodifiable, unmaintainable, uh flawed with with deeply flagged with errors in the design, and that's because we don't really think about the semantics very much, we just throw something together and then look to see if it superficially behaves at least in a few test cases the way we want it, and so there was a big buildment in computer science to understand what programs mean and to be able to reason systematically about what they do, but
so over the last 15 or 20 years, so so we'll have people come in and give a talk on, oh, I'm dealing with operating system security, we want to prove certain things about uh operating systems of kernels and then I ask them basic questions about program semantics and programs, it's kind of like in their sub community, they they have a few monsters and tools.
In short, this text is barely readable: the sentence breaks, capitalization, repeated words, and filler words are all a mess.
Which brings us to post-processing.
GPT
The idea of batch-processing text with GPT is straightforward; the main obstacle is the token limit on a conversation's context. Even though OpenAI shipped the 128K-token GPT-4 model a few days ago, replies are still capped at 4K tokens, which is clearly not enough for a full lecture transcript.
The obvious approach is to split the text and feed it to GPT one chunk at a time. I couldn't find an existing script for this after digging through GitHub, so I wrote one myself (with GPT's help, of course). Python has plenty of NLP libraries, so the splitting is easy to implement; generating completions via the API just follows OpenAI's documentation.
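To illustrate the splitting idea (a sketch, not my actual script, which uses an NLP library for sentence segmentation), a dependency-free version that greedily packs sentences into chunks under a rough word budget might look like this; the `max_words` budget is a hypothetical stand-in for real token counting:

```python
import re

def chunk_text(text, max_words=2500):
    """Split text into sentences, then pack them into chunks under a word budget."""
    # Naive sentence split: break after terminal punctuation followed by whitespace.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks, current, count = [], [], 0
    for sent in sentences:
        words = len(sent.split())
        # Flush the current chunk if adding this sentence would exceed the budget.
        if current and count + words > max_words:
            chunks.append(' '.join(current))
            current, count = [], 0
        current.append(sent)
        count += words
    if current:
        chunks.append(' '.join(current))
    return chunks
```

A real pipeline would count tokens with the model's tokenizer rather than words, but the packing logic stays the same.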
The core of the script looks roughly like this:
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open(output_file, 'w') as f:
    for chunk in chunks:
        # Send each chunk with the same system prompt and seed
        completion = client.chat.completions.create(
            model="gpt-3.5-turbo-1106",
            messages=[
                {"role": "system", "content": system_text},
                {"role": "user", "content": chunk}
            ],
            seed=94703
        )
        edited_chunk = completion.choices[0].message.content
        print(edited_chunk)
        f.write(edited_chunk + '\n')
print(f"Output file saved as {output_file}")
One thing worth mentioning: this update added a seed parameter. Running multiple chat completions with the same seed makes the model's output approximately deterministic, so text processed chunk by chunk keeps a consistent style and format.
Finally, the prompt. After several rounds of tuning, the current version is:
As a Computer Science professor, your task is to proofread and correct a raw transcript of your course. The text has been transcribed using Google's Speech-to-Text API, resulting in grammar mistakes and recognition errors. Your goal is to recover the original lecture transcript and provide the entire corrected text. To successfully complete this task, please consider the following guidelines:
- Error correction: Carefully examine the transcript and correct any grammar mistakes and recognition errors. Ensure that the corrected text accurately reflects the content of the lecture.
- Maintain tone and voice: While correcting errors, it is important to preserve the original tone and voice of the lecture. Pay attention to the professor's style of delivery, ensuring that the corrected text captures the same essence.
- Improve readability: Separate the transcript into paragraphs of appropriate length.
- Basic fact-checking: Use your knowledge in Computer Science to fact-check the transcript. For example, if the transcript mentions an operation in Java called 'instance of', you should know that the correct name is 'instanceof'.
- Never add any response other than the corrected text, such as "Here's the entire corrected transcript:".
Instructions like "maintain tone and voice" and "preserve humor" are quite important here; without them, GPT tends to treat this as a summarization task, producing highly condensed notes and dropping most of the history stories told in class. Asking GPT to break the text into paragraphs also does a lot for the readability of the processed text.
Here is the sample above after processing:
That actually was the version of software engineering that existed when I was an undergraduate. Software engineering was basically a craft; people would throw code together, and if it seemed to work, that was all that mattered.
Then people gradually learned, developers gradually learned, that this is terrible. We end up with software that's unmodifiable, unmaintainable, and flawed with errors in the design because we don't really think about the semantics very much. We just throw something together and then look to see if it superficially behaves, at least in a few test cases, the way we want it to. Consequently, there was a big movement in computer science to understand what programs mean and to be able to reason systematically about what they do.
So, over the last 15 or 20 years, we've had people come in and talk about operating system security and wanting to prove certain properties about operating system kernels. But when I ask them basic questions about program semantics, it's kind of like, in their sub-community, they have a few models and tools they use but no foundation.
Most of the content is actually unchanged, but with proper sentence and paragraph breaks, and with the ambiguous repetitions and fillers removed, the readability improvement is striking; it fully meets the bar for fluent reading.
The speed gains from this GPT update are also very noticeable: the new gpt-3.5-turbo-1106 model can process a full lecture's transcript in a minute or two, which makes for a nice upgrade in user experience.
Finally, the script is open-sourced here; feel free to play with it. It also ships a translation prompt, so it should work well for long-form translation jobs too (though there may be better GPT-based solutions for that).