How Zoom Can Make Videoconferencing More Human-Friendly

The popular videoconferencing platform could be the world’s first example of neuro-safe technology.
The good news about COVID-19 is that being forced into physical social separation and remote interaction is teaching people how precious real life is, and which remote technologies preserve reality best. Of those, videoconferencing has the most potential to do good, or harm, because it merges our highest-bandwidth external senses of sight and sound. The four biggest platforms — Skype, GoToMeeting, Google Hangouts and Zoom — are all tempted more by making money than by connecting human beings. Yet any technology will benefit humans only if it obeys the laws of nature governing how nervous systems interact.

I’m familiar with Zoom, and I believe it might pull this off. From what I’ve heard, Zoom is almost unique in Silicon Valley in having a corporate culture, founder and workforce that are more people-centered than money-centered. So Zoom might be able to resist the siren song of giving customers what they say they want and instead give humans what nature says we need. Implementing that principle will require saying “no” to the short-term wishes of both customers and investors, and saying “yes” to nature’s long-term plans.

Making the Best of Social Distancing


In particular, operating Zoom as a public utility optimally connecting human beings with each other — as opposed to optimally extracting revenue from them — will require principled commitments to audio fidelity, remote resonance, algorithmic neutrality, non-adversarial business models and videoconference etiquette. Lucky for us, Zoom has already started on some of those projects. If this works, people will look forward to Zoom calls as “special,” the way they used to look forward to long-distance phone calls back in the day. And global loneliness might finally, finally decrease.

Zoom is on the right track. Because of global work-from-home and school-from-home rules due to the novel coronavirus, Zoom’s user base recently grew twenty-fold, from 10 million to 200 million, most of whom aren’t even business customers. 

I’m one of them. In the last few weeks, I’ve participated in Zoom-enabled parties, yoga classes and meditations. Serving as a real-time gathering spot makes Zoom the closest thing we have to a global social lifeline, and the technology best poised to reconnect human nervous systems according to the laws of nature. (This conclusion might seem odd, given that I’ve spent the last several years stumping for non-screen human connection.)

Audio Fidelity: Stereo and Microtime

The challenge: Humans connect emotionally through unconscious timing signals that can’t be noticed, digitized or monetized.

It is beyond question that the human nervous system creates perception and trust from ultra-high-precision interactions (see: Sensory Metrics of Neuromechanical Trust). Likewise, humans’ remarkable ability to hear where a sound came from depends on microsecond sound signals, as does our ability to read emotional nuance. Those “microtime” signals are why LPs and copper-wire phones create so much better emotional experiences than CDs and digital audio.

These facts create three problems for Zoom. First, Zoom’s core brand is not audioconferencing but videoconferencing, so people using Zoom naturally pay more attention to screens than sound, although they should do the opposite because sound is wired deeper into us than screens. Second, computer sound as digitized by cheap built-in microphones is nothing like the sound from a good freestanding microphone. Third, while the sound from a good stereo microphone pair has much higher quality than the sound from just one microphone, Zoom’s most recent software release paradoxically makes stereo sound harder to use. I hope that decision is reversed soon, because audio connection synchronizes people better than video, and stereo synchronizes better than monaural.

My back-of-the-envelope calculations suggest that the single improvement of using stereo microphones, all on its own, would increase human re-synchronization at least tenfold, merely due to better audio signal quality. That solution is available to anyone for about $20. There is one other semi-secret sauce solution — a proprietary analog circuit that approximately reconstitutes the microtime structure of the original source, even after that structure has been erased by digitization.

I have been experimenting with one such circuit courtesy of the patent owners (US 7,564,982). Most simply, this circuit measures the left-right channel microtime difference, amplifies it and re-inserts it into the headphones or speaker pair. To me, it sounds like the source is a living breathing person nearby, as if whispering next to me in the dark. That personal experience, along with biophysical understanding, tells me that such microtime amplification could improve remote connection dramatically.
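For readers who think in code, the measure-amplify-reinsert idea can be sketched digitally. This is not the patented analog circuit, whose internals are not described here. It is a minimal illustration under stated assumptions: the names `estimate_lag`, `delay` and `amplify_lag` are hypothetical, the lag is found by brute-force cross-correlation, and delays are whole samples. At a 48 kHz sample rate, one sample is about 20.8 microseconds, so this is far coarser than the microtime resolution discussed above.

```python
import math

def estimate_lag(left, right, max_lag):
    """Return the lag (in samples) by which `right` trails `left`,
    found by brute-force cross-correlation over [-max_lag, max_lag]."""
    best_lag, best_score = 0, float("-inf")
    n = len(left)
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                score += left[i] * right[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

def delay(signal, samples):
    """Delay a signal by a whole number of samples, zero-padding the start
    and keeping the original length."""
    if samples <= 0:
        return list(signal)
    return [0.0] * samples + list(signal[:len(signal) - samples])

def amplify_lag(left, right, gain, max_lag):
    """Exaggerate the measured left-right time difference by `gain`
    (hypothetical digital stand-in for the analog approach)."""
    lag = estimate_lag(left, right, max_lag)
    extra = round((gain - 1) * lag)  # additional delay to re-insert
    if extra >= 0:
        return list(left), delay(right, extra)
    return delay(left, -extra), list(right)

# Toy stereo pair: a short bump that reaches the right channel 3 samples late.
left = [0.0] * 64
for k in range(10):
    left[20 + k] = math.sin(math.pi * (k + 1) / 11)
right = delay(left, 3)

wide_left, wide_right = amplify_lag(left, right, gain=2, max_lag=10)
print(estimate_lag(left, right, 10), estimate_lag(wide_left, wide_right, 10))
```

With a gain of 2, the measured three-sample difference is doubled to six. A real implementation would need sub-sample precision and would work on a continuous stream instead of a fixed buffer.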

Algorithmic Transparency: No Tracking or Photoshopping

The challenges: Enhanced self-presentation undermines communication, while eliminating tracking improves communication.

The baseline protocol for human communication was burned into our nervous systems way back in paleo times, before clothes and words. Everyone could see every inch of your body and hear your every grunt, and you couldn’t do anything to stop it. Contrast that case of “too much information” with Apple’s technology called “FaceTime effects,” the image-processing trickery providing extraordinarily unnatural control over users’ appearance, all the way to replacing oneself with a boring but attractive cartoon avatar.

The problem is that if everyone gets to hide parts of themselves, then no one gets any honest information, and authenticity degrades into mere performance, absent genuine signals. Cartoon communication isn’t human communication, even if it’s what each separate individual might like to do. 

There was no privacy in paleo times, but also no recording and tracking. Paleo people didn’t even have words or cave paintings to record anything, much less up-to-the-millisecond biometric data including your gaze, heartbeat, skin temperature and anxiety level. Humans communicate most naturally, and trustingly, when they know they are not being recorded. Zoom has already been in trouble over privacy concerns, and it has responded by disabling invisible data-tracking and attention-tracking technologies.

On the visible user interface, Zoom is doing two things right and one wrong. On self-photoshopping, for example, Zoom allows only modest airbrush-like “touch up” effects, powerful enough to let someone feel comfortable in close-up video under bright lights without worrying about makeup. Minor algorithmic makeup makes real facial expressions easier for everyone to see, so it’s just the right amount. But self-photoshopping could go too far, for example, if customers were offered a powerful “attractive and engaged” appearance via paid algorithmic trickery. (Once a platform starts monetizing fakery, it’s game over for an ecosystem of authentic communication.)

Zoom users can also airbrush their backgrounds, using a virtual green-screen to block views of messy kitchens. That means you don’t need to clean up the house before your call, which is also just the right amount of user-control. Unfortunately, Zoom allows users to replace messy kitchens with moving backgrounds, such as flames, which on the Zoom interface distract horribly from the grid of tiny, barely-visible human faces (in front of the flames) that I’m trying to look at. Gratuitous moving backgrounds are a perfect example of how a legitimate preference of one user undermines communication for everyone.

Remote Resonance: Winner-Takes-All Audio vs. Symmetry

The challenge: Unlike “presentations” (such as webinars) in which one person talks and everyone else listens, human social resonance requires all-to-all transmission of subconscious signals.

Zoom’s current platform is designed for broadcast. When one person speaks, that sound stream is automatically selected for everyone to hear, while all other microphones are automatically muted. That’s the perfect solution for one-way communications.

But humans are two-way because we resonate. Or at least we try to. In my Zoom-enabled “group meditation,” I attempted to lead a minute’s worth of what primatologists call co-vocalizing, or what yoga people call “OM-ing.” I would chant a long vowel like “ahhh… ohhh… mmmm,” and in principle the others would hum along. But it didn’t work. First, I couldn’t hear them because, of course, their microphones had been turned off while I was humming the sound.

But, weirdly, they couldn’t hear me either. It turns out that Zoom’s audio algorithm detected only a long, boring hum from my own microphone, decided the hum was background noise and then canceled it. So, my fellow meditators saw me with eyes closed and mouth open, yet they heard nothing. My own humming sound had been automatically erased. So much for interpersonal resonance.

A solution promoting resonance would be for Zoom to include a “resonance mode,” in which everyone’s microphone is on just a little bit, with no single sound stream dominating. That is the exact opposite of the current default, and it serves the exact opposite purpose: unifying and synchronizing vibrations instead of separating spoken words.
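The contrast between the two modes can be made concrete with a toy mixer. This is only a sketch of the behavior described above, not Zoom’s actual audio pipeline. The function names `winner_takes_all` and `resonance_mix` are hypothetical, and each participant is assumed to contribute one short frame of audio samples of equal length.

```python
def frame_energy(frame):
    """Sum of squared samples, used here as a crude loudness measure."""
    return sum(sample * sample for sample in frame)

def winner_takes_all(frames):
    """Broadcast-style mixing: pass only the loudest participant's frame
    and drop everyone else's."""
    loudest = max(range(len(frames)), key=lambda i: frame_energy(frames[i]))
    return list(frames[loudest])

def resonance_mix(frames, gain=None):
    """Hypothetical 'resonance mode': sum every participant's frame at the
    same low gain, so no single stream dominates."""
    if gain is None:
        gain = 1.0 / len(frames)  # assumed default: equal share per person
    length = len(frames[0])
    return [gain * sum(frame[i] for frame in frames) for i in range(length)]

# Two participants humming at different volumes.
frames = [[1.0, -1.0], [0.5, 0.5]]
print(winner_takes_all(frames))  # only the louder voice survives
print(resonance_mix(frames))     # both voices are present in the mix
```

In the first mode the quieter hum vanishes entirely. In the second, every voice contributes to what everyone hears, which is the property a group chant depends on. A real system would also have to manage echo and network delay, which this sketch ignores.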

I am collaborating with one team dedicated to human sonic resonance, the people running the Integratron “sound bath” center in the California desert. We are hoping to find ways to link resonant experiences like their sound baths remotely using stereo audio, Zoom and the microtime amplifier circuit.

A Non-Adversarial Business Model

The challenge: When carriers like Zoom pay for variable bandwidth but collect fixed subscription revenue, perverse financial incentives reduce the bandwidth customers receive and thus damage human communication.

Communication doesn’t need to be so bad. Over 40 years ago, even long-distance calls connected people well because voices were carried by dedicated copper wires the entire way, with an implicit service-level agreement of microtime phase fidelity. That was expensive, so Ma Bell invented computers to digitize and packetize voices, thus birthing much of the computer revolution. I was there: In 1985, during “divestiture,” I worked at AT&T Bell Labs in Murray Hill.

Once human bandwidth could be compressed into more cheaply recognizable packets, the race was on to minimize network bandwidth costs by ever-more-efficient voice compression. Unfortunately, that dynamic creates perverse network incentives to reduce bandwidth between communicating humans, although the humans themselves need as much bandwidth as possible. That incentive structure nearly guarantees that our (expensive) need for high-bandwidth interaction will fall victim to the network’s ever-present need for lower costs.

To operate in the best long-term interests of human communication — as opposed to any short-term metrics, especially monetary ones — Zoom needs to establish a long-term revenue model designed to enhance human communication. That is, a model which provides as much bandwidth as people need, in the form they need it, with transparently auditable metrics to prove it’s working. No one knows the structure of such a business yet, but that’s what innovation is for.

Better Videoconference Etiquette

The challenge: Human conversational habits evolved for in-person interaction and fail in various ways through screens.

Attending to screens for hours on end is really hard on us. It also doesn’t work very well because screen interaction is so unnatural. The thousands-fold discrepancy between our high-bandwidth 3D needs and the puny trickle of pixelated “content” is why telecommuting is so hard. Our social instincts need to know who said what, who laughed and who stayed silent. On video calls, it’s hard enough just to hear the words at all.

Here’s one example of rules of the road (aka “etiquette”) that might keep our conversations from crashing: stop looking at faces and concentrate on audio. 

Here’s why. At first, the video image of someone talking is the perfect way to recognize their face, mannerisms and mood, and to prove to yourself that this is a real live person talking. But once that truth is established, and you trust them, it makes more sense to close your eyes and listen to the words than to look at their face, because our circuits synchronize much faster on audio frequencies (milliseconds) than on screen refresh rates (tens of milliseconds).

Nature’s rules for optimum communication tell us to start with video, then move to audio while checking a face only occasionally. As long as everyone agrees on that solution, no one will even worry if you’re not looking at them on-screen. And that reduced expectation of on-screen “performance,” more than anything, will let people relax during video exchanges, which are among the weirdest interactions humans have ever invented.

Let’s hope we learn how to use these weird tools right and that their makers make them right for us to use.

*[Big tech has done an excellent job telling us about itself. This column, dubbed Tech Turncoat Truths, or 3T, goes beyond the hype, exploring how digital technology affects human minds and bodies. The picture isn’t pretty, but we don’t need pretty pictures. We need to see the truth of what we’re doing to ourselves.]*

