This semester I took a Functional Programming course. The content is great, and Corky (Robert Cartwright) lectures with real passion (he has deep insights into both FP and OOP, and the class favorite is his sharp takes on various programming languages). The only problem is that Corky talks fast, his volume swings up and down, and he loves telling CS history stories (packed with jargon and names), which is a serious challenge for my brain's NLP capability. So I looked into it and put together a speech-to-text pipeline for lecture recordings; here are some notes on the implementation.
Google Speech-to-Text API
Everyone is building speech-to-text models these days; the two widely regarded as best are OpenAI's Whisper and Google's Chirp. Whisper's problem is that it doesn't support transcribing long recordings (and you have to write code against the API to use it), so it's an immediate pass for the lecture-recording use case. Chirp's output quality is not as good as Whisper's, but it has a web console where you can upload recordings directly, and it supports long-form transcription.
The Google Speech-to-Text API requires uploads in lossless WAV or high-quality lossy FLAC, so the choice of recording software deserves some attention. On macOS you can record M4A with the built-in QuickTime and then transcode with Permute.
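Permute is a GUI step; if you'd rather script the conversion, a minimal sketch using ffmpeg could look like the following (this assumes ffmpeg is installed and on PATH; the 16 kHz mono settings are a common choice for speech recognition, not an API requirement):

```python
import subprocess
from pathlib import Path

def flac_command(src: str) -> tuple[list[str], str]:
    """Build an ffmpeg command that transcodes `src` to FLAC for upload."""
    dst = str(Path(src).with_suffix('.flac'))
    # 16 kHz mono keeps files small and suits speech models; adjust as needed.
    cmd = ['ffmpeg', '-y', '-i', src, '-ar', '16000', '-ac', '1', dst]
    return cmd, dst

def transcode(src: str) -> str:
    cmd, dst = flac_command(src)
    subprocess.run(cmd, check=True)  # raises CalledProcessError if ffmpeg fails
    return dst
```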
Transcription is fast, but the output quality still leaves room for improvement (the low recording volume is partly to blame).
Here is an example:
that actually was the version of software engineering that existed when i was an undergraduate and software engineering was basically craft people would throw code together and uh it it seemed to work that was all that mattered and then people gradually learn, developers gradually learn, this is terrible, we end up with software that's unmodifiable, unmaintainable, uh flawed with with deeply flagged with errors in the design, and that's because we don't really think about the semantics very much, we just throw something together and then look to see if it superficially behaves at least in a few test cases the way we want it, and so there was a big buildment in computer science to understand what programs mean and to be able to reason systematically about what they do, but
so over the last 15 or 20 years, so so we'll have people come in and give a talk on, oh, I'm dealing with operating system security, we want to prove certain things about uh operating systems of kernels and then I ask them basic questions about program semantics and programs, it's kind of like in their sub community, they they have a few monsters and tools.
In short, this text is barely readable: the sentence breaks, capitalization, repeated words, and filler words are all a mess.
Which brings us to post-processing.
GPT
The idea of batch-processing text with GPT is straightforward; the main obstacle is the token limit on a conversation's context. Even though OpenAI shipped the 128K-token GPT-4 model a few days ago, replies are still capped at 4K tokens, which is clearly not enough for a full lecture transcript.
The obvious approach is to split the text and feed it to GPT one chunk at a time. I couldn't find an existing script for this after digging through GitHub, so I wrote one myself (with GPT's help, of course). Python has plenty of NLP libraries, so the splitting is easy to implement; generating completions via the API just follows OpenAI's documentation.
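To illustrate the splitting idea (a sketch, not my actual script, which uses an NLP library for sentence segmentation), a dependency-free version that greedily packs sentences into chunks under a rough word budget might look like this; the `max_words` budget is a hypothetical stand-in for real token counting:

```python
import re

def chunk_text(text, max_words=2500):
    """Split text into sentences, then pack them into chunks under a word budget."""
    # Naive sentence split: break after terminal punctuation followed by whitespace.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks, current, count = [], [], 0
    for sent in sentences:
        words = len(sent.split())
        # Flush the current chunk if adding this sentence would exceed the budget.
        if current and count + words > max_words:
            chunks.append(' '.join(current))
            current, count = [], 0
        current.append(sent)
        count += words
    if current:
        chunks.append(' '.join(current))
    return chunks
```

A real pipeline would count tokens with the model's tokenizer rather than words, but the packing logic stays the same.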
The core of the script looks roughly like this:
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open(output_file, 'w') as f:
    for chunk in chunks:
        # Send each chunk with the same system prompt and seed
        completion = client.chat.completions.create(
            model="gpt-3.5-turbo-1106",
            messages=[
                {"role": "system", "content": system_text},
                {"role": "user", "content": chunk}
            ],
            seed=94703
        )
        edited_chunk = completion.choices[0].message.content
        print(edited_chunk)
        f.write(edited_chunk + '\n')
print(f"Output file saved as {output_file}")
One thing worth mentioning: this update added a seed parameter. Running multiple chat completions with the same seed makes the model's output approximately deterministic, so text processed chunk by chunk keeps a consistent style and format.
Finally, the prompt. After several rounds of tuning, the current version is:
As a Computer Science professor, your task is to proofread and correct a raw transcript of your course. The text has been transcribed using Google's Speech-to-Text API, resulting in grammar mistakes and recognition errors. Your goal is to recover the original lecture transcript and provide the entire corrected text. To successfully complete this task, please consider the following guidelines:
- Error correction: Carefully examine the transcript and correct any grammar mistakes and recognition errors. Ensure that the corrected text accurately reflects the content of the lecture.
- Maintain tone and voice: While correcting errors, it is important to preserve the original tone and voice of the lecture. Pay attention to the professor's style of delivery, ensuring that the corrected text captures the same essence.
- Improve readability: Separate the transcript into paragraphs of appropriate length.
- Basic fact-checking: Use your knowledge in Computer Science to fact-check the transcript. For example, if the transcript mentions an operation in Java called 'instance of', you should know that the correct name is 'instanceof'.
- Never add any response other than the corrected text, such as "Here's the entire corrected transcript:".
Instructions like "maintain tone and voice" and "preserve humor" are quite important here; without them, GPT tends to treat this as a summarization task, producing highly condensed notes and dropping most of the history stories told in class. Asking GPT to break the text into paragraphs also does a lot for the readability of the processed text.
Here is the sample above after processing:
That actually was the version of software engineering that existed when I was an undergraduate. Software engineering was basically a craft; people would throw code together, and if it seemed to work, that was all that mattered.
Then people gradually learned, developers gradually learned, that this is terrible. We end up with software that's unmodifiable, unmaintainable, and flawed with errors in the design because we don't really think about the semantics very much. We just throw something together and then look to see if it superficially behaves, at least in a few test cases, the way we want it to. Consequently, there was a big movement in computer science to understand what programs mean and to be able to reason systematically about what they do.
So, over the last 15 or 20 years, we've had people come in and talk about operating system security and wanting to prove certain properties about operating system kernels. But when I ask them basic questions about program semantics, it's kind of like, in their sub-community, they have a few models and tools they use but no foundation.
Most of the content is actually unchanged, but with proper sentence and paragraph breaks, and with the ambiguous repetitions and fillers removed, the readability improvement is striking; it fully meets the bar for fluent reading.
The speed gains from this GPT update are also very noticeable: the new gpt-3.5-turbo-1106 model can process a full lecture's transcript in a minute or two, which makes for a nice upgrade in user experience.
Finally, the script is open-sourced here; feel free to play with it. It also ships a translation prompt, so it should work well for long-form translation jobs too (though there may be better GPT-based solutions for that).