Lol lmao whenever I visit another country I get reminded how much the UX for translation apps sucks. There's definitely a reward of at least $100M waiting for whoever solves this.
Not handing over a device to the other person is absolutely critical for good UX.
Keep all the mics, screens, etc. on your person. You should not have to physically put anything on the other person's body or in their hand.
You should not require the other person to take a phone out of their pocket.
When you meet strangers, you don't have the negotiating power to inconvenience them in this way.
This actually means options 3 and 7 seem like the best options. Also, I haven't listed VR glasses as an option, but I think they should be one.
For input, both people should definitely speak, not type.
You definitely need a mic sensitive enough that it can stay clipped to your T-shirt (not the other person's) and still catch the other person's voice, ideally even if they are 1 or 2 metres away.
For output, I am still unsure about screen text output versus audio output.
I personally prefer screen output, and so do some other people I know. But some people don't have much practice with reading and would prefer audio output. I think education level is maybe the biggest predictor of this: uneducated people usually also know how to read, but they lack practice and it is not the default for them. If you are comfortable with reading, screen output is definitely faster communication (and speed is extremely important).
I don't know how to set up screens such that all the screens are on my body (not the other person's body), I don't have to take my phone out of my pocket, and both people can see the screens. The only solution that comes to mind is to have one screen be literally VR glasses (for me to read) and one screen literally hanging on my chest (for the other person to read). LMFAO. This solution is so ridiculous I now want to go build it, for the lolz.
For audio output, I think literally option 3 listed below could work?
AAAA now I really want to build this. Maybe I will
2026-04-03
Update
I haven't gone through this writeup so it might be completely incorrect by now
Important point I didn't realise when I made this writeup
No matter what method you use, there will always be a 2-3 second delay, because the speaker needs to say at least 5-10 words before translation can start.
If the source language uses Subject-Verb-Object word order (example: English) and the destination language uses Subject-Object-Verb word order (example: Hindi), then the speaker has to finish speaking the full sentence before the translator can play the audio translation for the Object and then the Verb.
This delay alone is fatal in contexts where the speaker isn't okay with a delay.
(Update 2026-04-28) To be more precise, assume a speaking speed of 150 words/min, or 2.5 words/second. Assume the speaker needs to speak 5 words before the grammatical equivalent can be translated into the other language. This is a delay of 2 seconds. The exact delay will vary between 0 and maybe 4 seconds depending on speaking speed and the structure of the sentence being spoken.
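A quick sketch of that arithmetic in Python. The function name and defaults are made up for illustration, and it only models the time spent waiting for the speaker's words, not model latency or TTS playback.

```python
def translation_delay_seconds(words_needed: int, words_per_minute: float = 150.0) -> float:
    """Seconds the speaker takes to say `words_needed` words at the given pace.

    Assumption: translation can only begin once `words_needed` words are spoken.
    """
    words_per_second = words_per_minute / 60.0
    return words_needed / words_per_second


print(translation_delay_seconds(5))   # 2.0 (5 words at 2.5 words/second)
print(translation_delay_seconds(10))  # 4.0 (upper end of the range above)
```

A faster speaker shrinks the delay and a longer clause grows it, which is why the number moves around between roughly 0 and 4 seconds.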
Other important point I realised but underestimated when I made this writeup
Mobile phone microphones universally suck for some reason.
(Update 2026-04-28) In particular, they are bad at catching audio at distances above 1 metre.
2024-03-31
UX possibilities for realtime translate app
Problems
Interruption problem for the person speaking:
If the loudspeaker is loud, it can disturb the person who is speaking. They will have to wait for the loudspeaker to finish its TTS output after each sentence before they speak the next sentence. This reduces the benefit of the app working in real time in the first place.
Noise cancellation for the person listening:
Person listening doesn’t want to hear translated voice and original voice overlapping at the same time.
Person listening doesn’t want to wait for the person speaking to complete each sentence, then hear the translated voice, and only then let the person speak the next sentence. This interrupts the person speaking. It should be possible for one person to speak many sentences continuously and for the other person to follow along.
Background noise might contain useful information such as somebody else calling for both of you, don’t want to silence it completely.
If using earphone speakers, app can dynamically control volume of translated voice, volume of original voice and volume of background noise. So the app can silence original voice and only play translated voice, for example.
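A minimal sketch of that volume control, assuming the three streams (translated voice, original voice, background) have already been separated into equal-length lists of samples in the range -1.0 to 1.0. That separation is the hard part and is not shown here. All names (`Gains`, `mix`) and the default gain values are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Gains:
    """Per-stream volume multipliers (illustrative defaults, not tuned values)."""
    translated: float = 1.0   # play the translation at full volume
    original: float = 0.0     # silence the other person's original voice
    background: float = 0.3   # keep some background so you still hear your name called


def mix(translated: Sequence[float],
        original: Sequence[float],
        background: Sequence[float],
        gains: Gains) -> List[float]:
    """Weighted sum of the three streams, clipped to [-1.0, 1.0]."""
    mixed = []
    for t, o, b in zip(translated, original, background):
        sample = gains.translated * t + gains.original * o + gains.background * b
        mixed.append(max(-1.0, min(1.0, sample)))
    return mixed


earphone_output = mix([0.5, -0.2], [0.9, 0.9], [0.2, 0.1], Gains())
```

The app could change `Gains` on the fly, for example raising `original` again when nobody is being translated.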
Noise cancellation
How good is the noise cancellation of high-quality earphones versus headphones? With headphones it’s good, I’ve not tried expensive earphones yet.
Setup cost
Initial UX hassle required to start a conversation with someone new
Reading versus listening to translations
Can read translations, or use TTS and listen to them. I personally prefer reading because it’s faster than listening, plus I read a lot anyway so I’m comfortable with this.
Maybe this is a personal preference, I’m unsure. Some users might want to read and others might want to listen.
Chunking (idk what this is called lmao)
Will need to decide how to make chunks for the TTS when translating in real-time. For instance, imagine a person is speaking a few sentences rapidly.
For text translation, every time the person speaks another word, you can rerun a forward pass over the whole sentence so far (assuming sufficient GPU) and edit the text shown on the screen. Each new word can change the context and the meaning of the previous words, and hence the translation of the entire sentence or clause.
For translation + TTS, you can still rerun the whole forward pass every time a new word comes in. But you have to decide when to send the translation to the TTS, because once it has been narrated aloud you can’t correct it as quickly.
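One possible way to make that decision, sketched in Python: only send words to the TTS once two consecutive re-translations agree on them. This is a stable-prefix heuristic I am swapping in as an illustration, not something settled in the notes above, and the class and method names are hypothetical. The translation model itself is assumed to exist elsewhere and to hand over each new hypothesis as a list of words.

```python
from typing import List


def common_prefix(a: List[str], b: List[str]) -> List[str]:
    """Longest run of leading words that both lists share."""
    out = []
    for x, y in zip(a, b):
        if x != y:
            break
        out.append(x)
    return out


class StablePrefixCommitter:
    """Releases words to TTS only when two consecutive hypotheses agree on them."""

    def __init__(self) -> None:
        self.previous: List[str] = []   # last full re-translation seen
        self.committed: List[str] = []  # words already sent to TTS

    def update(self, hypothesis: List[str]) -> List[str]:
        """Take the latest re-translation; return words that are now safe to speak."""
        stable = common_prefix(self.previous, hypothesis)
        self.previous = list(hypothesis)
        n = len(self.committed)
        # Spoken audio cannot be retracted, so only ever extend what was committed.
        if len(stable) > n and stable[:n] == self.committed:
            self.committed = stable
            return stable[n:]
        return []

    def flush(self) -> List[str]:
        """At the end of a sentence, speak whatever is left of the last hypothesis."""
        n = len(self.committed)
        if self.previous[:n] != self.committed:
            return []
        rest = self.previous[n:]
        self.committed = list(self.previous)
        return rest


committer = StablePrefixCommitter()
print(committer.update(["the", "bank"]))                 # []
print(committer.update(["the", "river", "bank"]))        # ['the']
print(committer.update(["the", "river", "bank", "is"]))  # ['river', 'bank']
print(committer.flush())                                 # ['is']
```

The trade-off is extra latency: every word waits for at least one more re-translation before it is spoken, in exchange for fewer spoken words that later turn out to be wrong.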
Group setting
More than two people are part of the conversation.
I have not yet thought through this at all
Possible Solutions
Releasing proprietary hardware (like a pendant) is not ideal imo:
Cons
Harder to get users to buy a device than to install an app
If someone else releases a good app (no device needed), they will win
Pros
From a profit pov, once someone buys a device, they’re more sticky and won’t switch platforms easily
UX possibilities without buying separate hardware:
1. Both people wear wireless earphones (mic+speaker)
Pros
No interruption for either person
Noise cancellation
Can extend to group setting with N earphones
Cons
Setup cost: if they’re a stranger and not wearing earphones, bad UX to ask them to wear earphones on-the-spot
Setup cost: if they’re wearing earphones, mild UX annoyance to pair them
Ignoring setup cost this is definitely the best option.
2. You wear a mic+speaker hanging from your neck or clipped to your T-shirt. Both people share it
Pros
No setup cost
Cons
Interruption for both people
3. Wear wireless earphone (mic+speaker) and wear another mic+speaker on your T-shirt or neck for the other person to use. Other person wears nothing
Pros (unsure if technically feasible)
No interruption for either person
Noise cancellation for you but not other person.
No setup cost
Cons
Wearing two things slightly annoying UX? Unsure
Unsure if technically feasible:
When you are speaking, the neck speaker should be louder than you to drown out your voice so the other person hears only the translation, and your earphone speaker should cancel it so you can’t hear the translation.
When the other person is speaking, your earphone speaker should cancel their voice so you hear only the translation, not their original voice
4. Wear wireless earphone (mic+speaker) and hand your phone to the other person for separate mic+speaker
Pros (unsure if technically feasible, same as previous point 3)
No interruption for either person
Noise cancellation for you but not other person
No setup cost
Cons
Increasing physical distance between both people will make it easier to ensure they hear the translated voice but not your original voice
5. Both people take out phones and use their respective phones for mic+screen. No speakers, no TTS.
Pros:
Both people read translations instead of hearing, this can be faster
No interruption for either person
Cons:
Both people’s eyes focused on screens not other person or environment
Setup cost: need to take out phone from pocket and ask the other person to, then pair them, slightly annoying UX
Doesn’t work if other person does not own smartphone. (Unless you carry a second device with you for this situation)
6. Just use your phone for screen+mic. Shared by both people. Pass your phone around, or keep it in the centre of the table, or in front of you. No earphones, no speakers or TTS.
Pros:
Both people read translations instead of hearing
No interruption for either user
No setup cost
Cons:
Both people’s eyes focused on screen
Screen will not be at a good angle of viewing for both people at the same time unless they sit side by side
Mic pickup may not be good depending on where the phone is placed. May need to pass it around (this is as bad as interruption)
7. (Proprietary device) Sell a device with a screen that displays in all directions, and a really good mic that can pick up from a large distance (I need to check how much this costs and if it’s technically feasible). Place it on a table in the centre. Both users can speak into it and read translations
Pros
Both people read translations not hear
No interruption for either user
No setup cost for each conversation
Can extend to group setting easily
Cons
Both people’s eyes focused on screen
Proprietary device, might be expensive - user needs to buy
Conclusion
1 seems best among friends?
4 seems best for new interactions? Does 4 actually work?
6 seems best for some users?
Super unsure about all this
For group setting I don’t know yet