There is a common problem when auditing staging enterprise sites inside corporate networks.
If you work in-house, you first connect to the corporate network using a VPN client. Then, you need to run auditing tools to review the pages.
The only tools that work are the ones you can run directly from your computer, such as the Screaming Frog SEO Spider, which is a downloadable program.
However, many enterprise sites have millions of pages, which makes crawling from your computer impractical because of time constraints or limited machine resources.
Enterprise cloud-based crawlers like DeepCrawl, Ryte, Oncrawl, etc. are better suited for this type of work, but they cannot audit sites inside private networks.
If you work agency-side, you have the extra complication that security and privacy compliance is now a requirement to work with enterprises. It is common to have to complete extensive security questionnaires before you are even considered as a vendor.
The content in the staging site inside the private network might not be ready to be opened to the public.
Introducing network admin tools for SEO
In previous articles, I’ve mentioned the importance of being aware of tools and techniques used in the development and IT industries. In this article I’m going to continue to make the case for that.
Let me introduce a couple of tools that are familiar to network and system administrators: ngrok and mitmproxy.
We can use ngrok to turn private (VPN required) URLs into temporary and public ones. We can use mitmproxy to make changes to the pages and hide and/or obfuscate the content and preserve its privacy. This requires writing simple Python scripts.
Proxies and HTTP Tunnels
Before I dive in and play with the tools, let me go over their underlying concepts.
“When navigating through different networks of the Internet, proxy servers and HTTP tunnels are facilitating access to content on the World Wide Web. A proxy can be on the user’s local computer, or anywhere between the user’s computer and a destination server on the Internet. This page outlines some basics about proxies and introduces a few configuration options.”
Proxies and HTTP tunnels are standard approaches to relay requests and pages and make them available from one site to another. Please review the linked article to learn more about the topic.
Ngrok creates HTTP tunnels, and mitmproxy can act as a reverse proxy.
Each covers a different use case, and together they are a good fit for the problems I mentioned at the start.
Let’s say your staging site is https://staging.internal-network.net:8080 and you are only able to open the page after you connect using the VPN client.
You could expose this site temporarily so you could verify Google Search Console and Bing Webmaster Tools, and run the URL inspection tools (or enterprise crawlers) on the exposed URLs.
Here is how you do that:
- Download and install ngrok for your Mac or Windows PC.
- Open a terminal window and launch ngrok.
Ngrok is a command line tool, so you need to run it in a shell and pass parameters to make it work.
Now let’s create the HTTP tunnel and temporary URL.
./ngrok http staging.internal-network.net:8080 > ngrok.log 2>&1 &
Here I am asking ngrok to expose the web server that is only accessible from my computer at port 8080. The shell redirection (> ngrok.log 2>&1) sends ngrok's output and any errors to ngrok.log, and the trailing & runs the process in the background so I can keep typing commands.
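If you would rather launch the tunnel from a script, the same step can be sketched in Python. This is only a minimal sketch, assuming the ngrok binary sits in the current directory as in the command above. The helper names build_ngrok_command and start_tunnel are my own illustrations and are not part of ngrok.

```python
import subprocess


def build_ngrok_command(target, binary="./ngrok"):
    """Return the argument list equivalent to: ./ngrok http <target>."""
    return [binary, "http", target]


def start_tunnel(target, log_path="ngrok.log"):
    """Start ngrok in the background and send stdout and stderr to a log file.

    This mirrors the shell's `> ngrok.log 2>&1 &`: Popen returns right away,
    so the calling script can continue while the tunnel stays up.
    """
    log_file = open(log_path, "w")
    return subprocess.Popen(
        build_ngrok_command(target),
        stdout=log_file,
        stderr=subprocess.STDOUT,
    )


# Example usage (assumes ./ngrok exists and the VPN is connected):
# process = start_tunnel("staging.internal-network.net:8080")
# print(process.pid)
```

Calling process.terminate() later would shut the tunnel down, which is handy if you only want the staging site exposed for the duration of a crawl.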
I check that the log is empty, which means the tunnel should be working fine. Next, I need to get the public URL that ngrok generated.
I need to make an API call to the service, which returns a JSON response that I need to parse. We are going to simplify this part by downloading another handy command line tool, jq.
Assuming you also have curl, you can get the temporary URL with this command.
curl -s http://localhost:4040/api/tunnels | jq -r ".tunnels[0].public_url"
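If you prefer to stay in Python rather than install jq, here is a rough equivalent. It is a sketch that assumes the local ngrok API is listening on port 4040, as in the curl command, and that the response holds a list under "tunnels" where each entry has a "public_url" field. The function names are hypothetical helpers of mine.

```python
import json
import urllib.request

# Local ngrok API endpoint, the same one used in the curl command.
NGROK_API = "http://localhost:4040/api/tunnels"


def extract_public_urls(payload):
    """Return the public_url of every tunnel in a parsed API response."""
    return [tunnel["public_url"] for tunnel in payload.get("tunnels", [])]


def fetch_public_urls(api_url=NGROK_API):
    """Query the local ngrok API and return the public tunnel URLs."""
    with urllib.request.urlopen(api_url) as response:
        return extract_public_urls(json.load(response))


# Example usage (requires a running ngrok process):
# for url in fetch_public_urls():
#     print(url)
```

Keeping the parsing in its own function means you can check it against a saved response without having a tunnel open.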
You should get a URL that you can open in your web browser like this:
After you open it, you will see the internal site. Try using the Rich Results Test on it (the URL you get, not this example) and it should work. How cool is that?
As you don’t own the ngrok.io domain, you need to take an extra step in order to register with Google Search Console and Bing Webmaster Tools.
Before you create the tunnel, you need to authenticate.
./ngrok authtoken <token>
Then, you add another parameter to specify the custom domain while you create the tunnel.
./ngrok http -hostname=dev.yourdomain.com staging.internal-network.net:8080 > ngrok.log 2>&1 &
You will be able to register this subdomain and run the URL inspection tools (or your favorite enterprise crawler).
So, we learned to expose staging sites inside the corporate network using temporary public URLs. But what if we couldn’t risk making the content public and inadvertently revealing unannounced news that could hurt a publicly listed company?
One option is to layer in a reverse proxy and use it to hide or obfuscate any private information in the HTML and/or images to preserve the company’s privacy.
Mitmproxy is an awesome HTTPS proxy that, among many other things, allows you to modify the traffic going through it on the fly, even HTTPS traffic, which is encrypted!
You can make simple text replacements in the command line or any arbitrary modifications by writing simple Python scripts.
Mitmproxy can operate in several modes; we are interested in its reverse proxy mode.
It is a Python package, so you can install it using:
pip install mitmproxy
Then call it using:
mitmproxy -p 8081 --mode reverse:https://staging.internal-network.net:8080
Let me illustrate this powerful technique with one example.
I’m going to reverse-proxy StackOverflow and change the text in their H1 from “people” to “SEOs”.
mitmproxy -p 8081 --mode reverse:https://stackoverflow.com/ --modify-body '/ people who code/ SEOs who code'
Let’s open the browser on http://localhost:8081 and see if it works.
Kaboom! Now tell me this isn’t exciting stuff 🙂
The idea is to replace any text or images that shouldn’t be exposed publicly.
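For anything beyond a single text swap, a small Python script gives you more control. Below is a minimal sketch of a mitmproxy script, assuming you only want to rewrite HTML responses. The replacement rules, including the "Project Falcon" name, are made-up examples, and obfuscate_html is a hypothetical helper of mine. The response function is the event hook that mitmproxy calls for each server response.

```python
import re

# Hypothetical replacement rules: regex pattern -> safe placeholder text.
# Swap these for the terms your company needs to keep private.
REPLACEMENTS = {
    r"Project Falcon": "Project [redacted]",
    r"\$\d[\d,.]*\s*(?:million|billion)": "[amount withheld]",
}


def obfuscate_html(html, replacements=REPLACEMENTS):
    """Apply every replacement rule to the HTML and return the result."""
    for pattern, placeholder in replacements.items():
        html = re.sub(pattern, placeholder, html, flags=re.IGNORECASE)
    return html


def response(flow):
    """mitmproxy event hook: rewrite HTML responses before they leave the proxy."""
    content_type = flow.response.headers.get("content-type", "")
    if "text/html" in content_type and flow.response.text:
        flow.response.text = obfuscate_html(flow.response.text)
```

If you save this as, say, obfuscate.py, you should be able to load it with the -s option, for example: mitmdump -p 8081 --mode reverse:https://staging.internal-network.net:8080 -s obfuscate.py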
You would need to run ngrok afterwards instructing it to connect to this reverse proxy at port 8081 instead of directly to the source server.
./ngrok http -hostname=dev.yourdomain.com localhost:8081 > ngrok.log 2>&1 &
MITM stands for man-in-the-middle, an information security concept that describes an intercepting device or element in a two-way conversation. This device can sniff or tamper with the information transmitted.
As you can imagine, this could be used for nefarious purposes. Fortunately, in our case, we want to use it for good. We want to hide/obfuscate sensitive information from internal pages before exposing them publicly with ngrok.
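Before sharing the public URL with any tool, it is worth checking that nothing sensitive slipped through. Here is a minimal sketch of such a check, assuming you keep a list of terms that must never appear publicly. The function names and the sample terms are hypothetical.

```python
import urllib.request


def find_leaks(html, sensitive_terms):
    """Return the sensitive terms that still appear in the HTML (case-insensitive)."""
    lowered = html.lower()
    return [term for term in sensitive_terms if term.lower() in lowered]


def check_public_url(url, sensitive_terms):
    """Fetch a page through the public tunnel URL and report any leaked terms."""
    with urllib.request.urlopen(url) as response:
        html = response.read().decode("utf-8", errors="replace")
    return find_leaks(html, sensitive_terms)


# Example usage (replace with the URL ngrok gives you and your own terms):
# leaks = check_public_url("https://dev.yourdomain.com/", ["Project Falcon"])
# print(leaks or "No sensitive terms found")
```

This is a plain substring match, so it only catches exact terms. It will not notice sensitive text inside images or text that was reworded.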
Opinions expressed in this article are those of the guest author and not necessarily Search Engine Land.