Secure Remote VoIP Phones: TLS, SRTP, and VPN Design Choices

Dawn Phillips
1 Jul 2026

You have a team working from home. They use IP phones or softphones to talk to clients. The connection goes through their home Wi-Fi, the public internet, and then back to your office or cloud PBX. It sounds simple until you realize that, unless you encrypt it, every word spoken travels in plain text across networks you do not control. Attackers can listen in. Competitors can intercept calls. This is why securing remote VoIP (Voice over Internet Protocol), which carries voice as digital packets over the internet rather than over traditional phone lines, is no longer optional; it is a baseline requirement.

The problem? There are three main ways to secure these calls: Transport Layer Security (TLS), Secure Real-time Transport Protocol (SRTP), and Virtual Private Networks (VPNs). Each one works differently. Each one has trade-offs in speed, cost, and complexity. If you pick the wrong combination, your calls will sound like they are coming from the bottom of a well. Or worse, they will drop entirely. Let’s break down exactly how these technologies work together so you can build a secure, high-quality voice network without guessing.

How VoIP Calls Actually Travel

To understand security, you first need to understand what is moving. A VoIP call is not one single stream of data. It is split into two distinct parts: signaling and media. Think of signaling as the handshake. It sets up the call, tells the other phone who is calling, and handles hang-ups. Media is the actual audio: the sound of voices.

SIP (Session Initiation Protocol), an application-layer protocol standardized in RFC 3261 that initiates, maintains, and terminates real-time sessions such as voice and video calls, handles the signaling. RTP (Real-time Transport Protocol), a standard format for delivering audio and video over IP networks that supports timing reconstruction and loss detection, carries the media. Without protection, both travel openly. Anyone sitting on the same network segment, or further upstream at the ISP level, can sniff these packets. They see the phone numbers dialed, the duration of calls, and yes, the audio itself.
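
To make the exposure concrete, here is a minimal sketch of what an eavesdropper can pull out of one unencrypted SIP request. The INVITE below is a made-up sample (the host `pbx.example.com`, extension `201`, and the dialed number are all hypothetical), and the parser is deliberately simplistic rather than a full SIP implementation.

```python
# A hypothetical plaintext SIP INVITE, as a sniffer would capture it on port 5060.
SAMPLE_INVITE = (
    "INVITE sip:+15550100@pbx.example.com SIP/2.0\r\n"
    "Via: SIP/2.0/UDP 192.0.2.10:5060;branch=z9hG4bK776asdhds\r\n"
    'From: "Alice" <sip:201@pbx.example.com>;tag=1928301774\r\n'
    "To: <sip:+15550100@pbx.example.com>\r\n"
    "Call-ID: a84b4c76e66710@192.0.2.10\r\n"
    "CSeq: 314159 INVITE\r\n"
    "Content-Length: 0\r\n"
    "\r\n"
)


def extract_call_metadata(raw_sip: str) -> dict:
    """Return the caller, callee, and method visible in a plaintext SIP request."""
    head = raw_sip.split("\r\n\r\n", 1)[0]      # headers end at the first blank line
    lines = head.split("\r\n")
    method, request_uri, _version = lines[0].split(" ", 2)

    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")    # split at the first colon only
        headers[name.strip().lower()] = value.strip()

    return {
        "method": method,
        "dialed": request_uri,
        "from": headers.get("from"),
        "to": headers.get("to"),
    }


if __name__ == "__main__":
    print(extract_call_metadata(SAMPLE_INVITE))
```

No keys, no cracking, no special tooling: the metadata is simply there in the packet, which is the point of the next section.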

This separation is critical because it dictates how we secure them. You cannot protect signaling with the same tool you use for media. That is where TLS and SRTP come in. They operate at different layers of the network stack, targeting specific threats.

TLS: Locking Down the Signaling

If someone wants to spy on your business, they start by looking at metadata. Who are you talking to? How often? What extensions exist in your system? TLS (Transport Layer Security), the cryptographic protocol that also secures web browsing and email, encrypts the SIP signaling layer. By default, unencrypted SIP runs on port 5060, usually over UDP. When you enable TLS, signaling moves to TCP, by convention on port 5061. This creates an encrypted tunnel for all call setup messages.

TLS does more than just hide numbers. It authenticates the server. When your remote phone connects to your PBX, TLS ensures that the phone is actually talking to your legitimate server and not a fake one set up by an attacker to steal credentials. Modern deployments rely on TLS 1.2 or 1.3 with cipher suites built on strong primitives, such as AES-256 for encryption and SHA-256 or SHA-384 for integrity. Older SSL protocols and early TLS versions are dead weight and should be disabled everywhere.
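
As a rough illustration of those two properties (server authentication and a modern protocol floor), the sketch below builds a client-side TLS context with Python's standard `ssl` module. The function name `make_sip_tls_context` and the optional `ca_file` parameter are my own; a real phone or softphone exposes the same choices through its provisioning settings rather than through code.

```python
import ssl
from typing import Optional


def make_sip_tls_context(ca_file: Optional[str] = None) -> ssl.SSLContext:
    """Build a client TLS context suitable for a SIP-over-TLS connection.

    ca_file is an optional path to a CA bundle (for example, an internal CA);
    when omitted, the system trust store is used.
    """
    # create_default_context() enables certificate verification and
    # hostname checking, which is what stops a fake PBX.
    ctx = ssl.create_default_context(cafile=ca_file)
    # Refuse anything older than TLS 1.2.
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx


if __name__ == "__main__":
    ctx = make_sip_tls_context()
    print(ctx.minimum_version, ctx.verify_mode, ctx.check_hostname)
```

To use it, you would wrap a TCP socket connected to your PBX on port 5061 with `ctx.wrap_socket(sock, server_hostname="pbx.example.com")`, where the hostname is a placeholder for your own server. The handshake fails if the certificate does not match that name.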

Here is the catch: TLS only protects the signaling. It does not touch the audio. If you use TLS but leave the media open, hackers still cannot see who you called, but they can record the conversation. That is why TLS alone is insufficient for full security.


SRTP: Encrypting the Voice

Now let’s look at the audio. SRTP (Secure Real-time Transport Protocol) was designed specifically for this job. Defined in RFC 3711 in 2004, SRTP wraps around RTP to provide confidentiality, integrity, and replay protection. It uses AES encryption, usually AES-128 in Counter Mode, to scramble the audio payload. It also uses HMAC-SHA1 to verify that packets have not been tampered with in transit.

SRTP is lightweight. Unlike heavier encryption methods, it adds very little overhead. AES in Counter Mode does not pad the payload, so the only growth is the authentication tag: 10 bytes per packet with the default 80-bit tag. For a typical G.711 call, that is a small amount of extra bandwidth compared to plain RTP, and the per-packet CPU cost is negligible on modern hardware. It keeps jitter low and latency minimal.
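
The bandwidth cost is easy to work out by hand. The sketch below does the arithmetic for G.711 at the common 20 ms packetization interval, assuming IPv4 without options, a plain 12-byte RTP header, and the default 80-bit SRTP authentication tag with no MKI field. Those are my assumptions; other codecs, IPv6, or a 32-bit tag will shift the numbers.

```python
IP_HEADER = 20       # bytes, IPv4 without options
UDP_HEADER = 8       # bytes
RTP_HEADER = 12      # bytes, no CSRC list or header extension
SRTP_AUTH_TAG = 10   # bytes, 80-bit tag (AES_CM_128_HMAC_SHA1_80)

G711_BYTES_PER_SECOND = 8000  # 8000 samples/s at 8 bits each = 64 kbit/s


def g711_packet_sizes(ptime_ms: int = 20) -> tuple:
    """Return (rtp_bytes, srtp_bytes) per IP packet for a G.711 stream."""
    payload = G711_BYTES_PER_SECOND * ptime_ms // 1000
    rtp = IP_HEADER + UDP_HEADER + RTP_HEADER + payload
    # Counter Mode adds no padding, so only the auth tag is appended.
    srtp = rtp + SRTP_AUTH_TAG
    return rtp, srtp


def bitrate_kbps(packet_bytes: int, ptime_ms: int = 20) -> float:
    """One-way IP-layer bitrate for a stream of fixed-size packets."""
    packets_per_second = 1000 / ptime_ms
    return packet_bytes * 8 * packets_per_second / 1000


if __name__ == "__main__":
    rtp, srtp = g711_packet_sizes()
    print(rtp, srtp)                                # 200 210
    print(bitrate_kbps(rtp), bitrate_kbps(srtp))    # 80.0 84.0
```

Under these assumptions SRTP costs about 4 kbit/s per direction, a 5% increase over plain RTP.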

However, SRTP has a major dependency: keys. To decrypt the audio, the receiver needs the encryption key. Where does that key come from? Usually, it is exchanged inside the SIP signaling messages, in the SDP (Session Description Protocol) body that describes the media. Here is the big mistake people make: if you send SRTP keys over unencrypted SIP (no TLS), those keys travel in plain text. An attacker can easily grab the key from the SDP body and then decrypt the entire SRTP audio stream. So, SRTP without TLS is a false sense of security. You must use both.
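
Here is a small sketch of how little effort that key theft takes when the key is carried in an SDP `a=crypto` attribute (the SDES format from RFC 4568). The SDP body and key string below are illustrative sample values, not taken from a real call, and the parser only handles the first inline key on each line.

```python
import base64

# Hypothetical SDP body from an unencrypted SIP INVITE.
SAMPLE_SDP = "\r\n".join([
    "v=0",
    "o=alice 2890844526 2890844526 IN IP4 192.0.2.10",
    "s=-",
    "c=IN IP4 192.0.2.10",
    "t=0 0",
    "m=audio 49170 RTP/SAVP 0",
    "a=crypto:1 AES_CM_128_HMAC_SHA1_80 "
    "inline:WVNfX19zZW1jdGwgKCkgewkyMjA7fQp9CnVubGVz|2^20|1:4",
    "",
])


def leaked_srtp_keys(sdp: str) -> list:
    """Return (suite, master_key, master_salt) for each inline key found."""
    found = []
    for line in sdp.splitlines():
        if not line.startswith("a=crypto:"):
            continue
        _tag, suite, key_params = line[len("a=crypto:"):].split(" ", 2)
        first_param = key_params.split(" ")[0]       # ignore session parameters
        method, _, key_info = first_param.partition(":")
        if method != "inline":
            continue
        key_salt = base64.b64decode(key_info.split("|")[0])
        # For the AES_CM_128 suites: 16-byte master key, then 14-byte master salt.
        found.append((suite, key_salt[:16], key_salt[16:]))
    return found


if __name__ == "__main__":
    for suite, key, salt in leaked_srtp_keys(SAMPLE_SDP):
        print(suite, key.hex(), salt.hex())
```

Everything needed to derive the session keys is sitting in one line of text. With TLS on the signaling leg, that line never appears on the wire in readable form.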

The VPN Trap: Why Tunneling Everything Hurts Quality

Many IT administrators reach for a VPN (Virtual Private Network), which creates an encrypted connection over a less secure network such as the public internet, as soon as they hear "secure." It makes sense. A VPN encrypts everything: signaling, media, web traffic, emails. It treats the remote user as if they were sitting at a desk in the office. But for VoIP, this approach often backfires.

Voice is sensitive to delay. Conversations start to feel awkward once one-way latency exceeds roughly 150 milliseconds. Jitter (variation in packet arrival time) above 30 milliseconds causes choppy audio. VPNs add encapsulation headers to every packet. They require encryption and decryption at both ends. This processing takes time. In real-world tests, placing VoIP traffic inside a standard SSL/TLS VPN can increase latency from 50 ms to over 140 ms. Jitter can jump from 10 ms to 150 ms or more. On congested connections, this results in robotic voices, dropped words, or complete call failures.
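
If you want to measure jitter yourself rather than trust a vendor dashboard, the running estimator defined in RFC 3550 is simple to reproduce. The sketch below applies it to lists of send and receive times in milliseconds. Real RTP stacks work in RTP timestamp units instead, and the sample timings in the demo are invented for illustration.

```python
def interarrival_jitter(send_ms: list, recv_ms: list) -> float:
    """Running interarrival jitter estimate, following the RFC 3550 formula.

    For each consecutive packet pair, D is the change in transit time:
        D = (recv[i] - recv[i-1]) - (send[i] - send[i-1])
    and the estimate is smoothed with a gain of 1/16:
        J = J + (|D| - J) / 16
    """
    if len(send_ms) != len(recv_ms):
        raise ValueError("send and receive lists must be the same length")

    jitter = 0.0
    for i in range(1, len(send_ms)):
        d = (recv_ms[i] - recv_ms[i - 1]) - (send_ms[i] - send_ms[i - 1])
        jitter += (abs(d) - jitter) / 16
    return jitter


if __name__ == "__main__":
    sent = [0, 20, 40, 60]              # one packet every 20 ms
    steady = [50, 70, 90, 110]          # constant 50 ms transit time
    bursty = [50, 70, 120, 130]         # third packet held up by 30 ms
    print(interarrival_jitter(sent, steady))
    print(interarrival_jitter(sent, bursty))
```

Note that a constant delay, however large, produces zero jitter. It is the variation in transit time that a VPN's extra processing and TCP retransmissions tend to make worse.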

IPsec VPNs perform better than SSL VPNs for voice because they operate at Layer 3 and support QoS (Quality of Service) markings. However, they still add overhead. Unless you have a dedicated, high-bandwidth link, forcing all VoIP media through a VPN tunnel is risky. It trades call quality for security, and often loses on both fronts if the quality degrades too much.

Comparison of VoIP Security Methods
Feature	TLS + SRTP	Full-Tunnel VPN
Signaling Encryption	Yes (via TLS)	Yes (encapsulated)
Media Encryption	Yes (via SRTP)	Yes (encapsulated)
Latency Impact	Low (negligible)	High (adds 50-100+ ms)
Jitter Risk	Low	High (especially with SSL VPNs)
Complexity	Moderate (PKI management)	High (client config, firewall rules)
Best Use Case	Standard remote workers	High-security zones, legacy apps


When Should You Actually Use a VPN?

Does this mean VPNs are useless for VoIP? No. They have specific roles. You should consider a VPN when:

Legacy Equipment: Your older IP phones do not support SRTP or TLS. Wrapping them in an IPsec tunnel hides the insecure traffic from the public internet.
Network Segmentation: You need the remote phone to access internal resources beyond just the PBX, such as internal databases or intranet sites, requiring full LAN visibility.
Compliance Requirements: Certain industries mandate that all traffic, regardless of type, must traverse an encrypted corporate gateway before touching the public internet.
NAT Traversal Issues: Some complex home networks block UDP ports required for RTP. A VPN can bypass these restrictions by tunneling traffic over TCP or UDP port 443.

In these cases, choose an IPsec VPN over an SSL VPN. IPsec preserves DSCP (Differentiated Services Code Point) markings on the tunneled packets, which allows routers along the path to prioritize voice. SSL VPNs typically strip these marks, leaving routers with nothing to prioritize on. If you must use an SSL VPN, ensure it supports DTLS (Datagram TLS) over UDP, which performs significantly better for real-time media than TLS over TCP.
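
For reference, this is roughly what DSCP marking looks like from the sending application's side. The sketch marks a UDP socket with Expedited Forwarding (DSCP 46), the class conventionally used for voice. The helper names are mine, and `IP_TOS` behavior is platform dependent: it works as shown on Linux, while some operating systems ignore or restrict it.

```python
import socket

DSCP_EF = 46  # Expedited Forwarding, the usual class for voice media


def dscp_to_tos(dscp: int) -> int:
    """Convert a 6-bit DSCP value to the 8-bit TOS / Traffic Class byte."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must be between 0 and 63")
    # DSCP occupies the upper six bits; the lower two bits are ECN.
    return dscp << 2


def make_voice_socket(dscp: int = DSCP_EF) -> socket.socket:
    """Create an IPv4 UDP socket whose outgoing packets carry the given DSCP."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, dscp_to_tos(dscp))
    return sock


if __name__ == "__main__":
    print(hex(dscp_to_tos(DSCP_EF)))   # 0xb8
```

The marking only helps if every hop, including the VPN gateway, keeps it on the outer packet. Otherwise routers see unmarked traffic and queue it with everything else.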

Building the Right Architecture for 2026

The best design for most organizations today is a hybrid approach. Start with TLS and SRTP as your baseline. Enable SIP over TLS on your PBX and configure all endpoints to negotiate SRTP for media. This encrypts both signaling and media between the phone and the server with minimal performance impact.

Then, evaluate your remote users individually. For employees on stable broadband connections, direct TLS/SRTP is perfect. For those in high-risk environments or using outdated hardware, deploy a site-to-site IPsec VPN or a client-based IPsec solution. Avoid forcing every single user into a generic SSL VPN portal unless absolutely necessary.
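
One way to keep that per-user evaluation consistent is to write the criteria down as a rule. The sketch below encodes the conditions from this article as a small function. The field names and the two return labels are hypothetical, and your own policy may weigh these factors differently.

```python
from dataclasses import dataclass


@dataclass
class RemoteEndpoint:
    """Facts about one remote user's phone and network (hypothetical fields)."""
    supports_tls_srtp: bool
    needs_lan_access: bool = False            # must reach intranet resources
    compliance_requires_tunnel: bool = False  # all traffic via corporate gateway
    udp_media_blocked: bool = False           # home network blocks RTP ports


def choose_protection(endpoint: RemoteEndpoint) -> str:
    """Return "vpn" when any VPN criterion applies, else "tls+srtp"."""
    needs_vpn = (
        not endpoint.supports_tls_srtp
        or endpoint.needs_lan_access
        or endpoint.compliance_requires_tunnel
        or endpoint.udp_media_blocked
    )
    return "vpn" if needs_vpn else "tls+srtp"


if __name__ == "__main__":
    print(choose_protection(RemoteEndpoint(supports_tls_srtp=True)))
    print(choose_protection(RemoteEndpoint(supports_tls_srtp=False)))
```

The default path is direct TLS and SRTP. A VPN is the exception that has to be justified by one of the listed conditions, not the starting point.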

Don’t forget the last mile. Even if your internal network is fully encrypted, the call eventually hits the PSTN (Public Switched Telephone Network) via a carrier trunk. Most carriers do not support SRTP on the final leg to landlines. Your call can be secured from the phone to your PBX and, where the carrier supports it, on the trunk, but it may be exposed after that. Understand this limitation. Focus on protecting the segments you control.

Finally, manage your certificates. TLS relies on PKI (Public Key Infrastructure). Expired certificates cause immediate call failures. Automate certificate renewal, whether you use a public CA such as Let’s Encrypt or an internal CA. Test your setup regularly. Use Wireshark to verify that SIP is indeed running over TLS on port 5061 and that RTP payloads are encrypted. Security is not a one-time configuration; it is an ongoing process.
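
A simple expiry check is worth scripting alongside the automated renewal, so a failed renewal is noticed before phones start dropping registrations. This sketch uses Python's standard `ssl` module to turn a certificate's `notAfter` string into days remaining. The 30-day threshold and the function names are my own choices, not a standard.

```python
import ssl
import time
from typing import Optional


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days left before a certificate expires.

    not_after uses the format found in SSLSocket.getpeercert()["notAfter"],
    for example "Jan  5 09:34:43 2018 GMT".
    """
    expires = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    return (expires - now) / 86400


def needs_renewal(not_after: str, threshold_days: float = 30,
                  now: Optional[float] = None) -> bool:
    """True when the certificate expires within threshold_days."""
    return days_until_expiry(not_after, now) <= threshold_days


if __name__ == "__main__":
    # This sample date is long past, so the result is negative.
    print(days_until_expiry("Jan  5 09:34:43 2018 GMT"))
```

In practice you would feed it the `notAfter` value read from a live TLS connection to your PBX on port 5061 and alert when `needs_renewal` returns true.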

Can I use SRTP without TLS?

Technically yes, but you should never do it in production. SRTP keys are usually exchanged within the SIP signaling messages. If SIP is not encrypted with TLS, those keys are sent in plain text. An attacker can easily capture the keys and then decrypt the SRTP audio stream, rendering the encryption useless.

Does SRTP add significant latency to calls?

No. SRTP is designed to be lightweight. It adds negligible processing overhead and minimal bandwidth expansion. Modern processors encrypt a voice packet with AES in far less time than the packet takes to cross the network. You will not notice any difference in call quality due to SRTP alone.

Why does my voice sound robotic when using a VPN?

This is likely caused by increased jitter and latency. VPNs add encryption overhead and can introduce variable delays in packet delivery. If you are using an SSL/TLS VPN over TCP, the performance is often poor for real-time voice. Switch to an IPsec VPN or ensure your SSL VPN supports DTLS over UDP for better results.

Do all VoIP providers support TLS and SRTP?

Most modern UCaaS and CPaaS providers support TLS and SRTP as standard features. However, always verify with your provider. Some older legacy systems or budget carriers may only offer unencrypted SIP. Check their documentation for mentions of SIPS URIs or SRTP negotiation.

Is VoIP completely secure end-to-end?

Rarely. While you can encrypt the call from your phone to your PBX and to your carrier, the final leg to a traditional landline (PSTN) is often unencrypted. True end-to-end encryption requires both parties to use compatible VoIP systems that support SRTP throughout the entire path.
