Zühlke – Empowering Ideas

Accelerate time to market with an automated Augmented Reality model

With recent advancements in hardware performance and technology, Augmented Reality (AR) applications have become more mainstream. Beyond immersive applications such as gaming, AR technology is also making its way into industries such as manufacturing, retail, and healthcare.

In the retail industry, AR technology makes shopping more immersive and satisfying. From the customers’ perspective, AR supports their purchase decisions and enhances their shopping experience by letting them ‘see and touch’ the products. In manufacturing, AR helps to reduce prototyping costs and waste. For example, product designers can visualise the final product before committing resources to mass production. One potential use case for the medical community is allowing clinicians to train themselves to use new medical equipment.

The major pain point with the current AR workflow

For AR to work, 3D models of actual products need to be created, curated, and uploaded to the organisation’s servers. An AR code is then generated for users to scan with their mobile devices. For OEMs, creating AR models can be trivial since they already have in-house design documents and/or existing 3D design files for their products.

However, non-OEMs would need to hire 3D modellers and use modelling software such as 3ds Max, AutoCAD or Blender to create the 3D model from images of a given product. Here is what a manual AR workflow could look like for non-OEMs:

  • Capture the images and send them to a 3D modeller.
  • The 3D modeller creates the AR model using 3D modelling tools.
  • Export the model to a common AR format, mainly glTF and USDZ.
  • Stakeholders review the model with the modeller(s).
  • Import/upload the model to the platform for customers to view.

Manual AR Workflow

From the steps above, we identified the stages with room for improvement. The most obvious one is the creation of the model. Those familiar with 3D modelling would recognise that it is both manual and time-consuming. Creating, reviewing and iterating on the design for a single product or object could take several days, even for a highly skilled 3D modeller. Organisations without an in-house modeller would also need time to hire one.

Automating the process requires integration with the modelling software's Application Programming Interface (API), which may not be available because these 3D modelling tools are usually proprietary. Even where APIs exist, software engineers still need to build custom tools for the integration.

For non-OEMs with limited resources or operating in a highly competitive environment, there is a pressing need to improve the existing workflow to enhance cost and labour efficiency.

How can the existing process be improved?

A different approach is needed to automate the current workflow. You might be wondering whether there is a way to convert images into actual 3D models. The good news is, yes, there is!

Photogrammetry [1] is a technique that generates a 3D model from images. By capturing several pictures of the target object and applying a computational process to them, we can reconstruct the object as a 3D model.

With the release of macOS 12, Apple extended the RealityKit framework [2] to support photogrammetry. Developers can now generate AR models from images programmatically through an API, without lengthy manual image processing or integration with complex 3D modelling software.

Automated AR workflow

API AR Generation

With Photogrammetry, the whole AR model generation workflow now looks like this:

  1. Capture the images with an app
  2. Upload the images to the Object Capture API
  3. Trigger image processing
  4. Render the generated model using the <model-viewer> component in an HTML page, interfacing with native capabilities

Compared with the manual modelling process, this is much easier. Anyone with decent photography skills can take photos of a product to generate an AR model. The AR model is available within a few minutes, which allows a quick turnaround whenever a new model is needed.

We developed a proof of concept (PoC) to illustrate the above approach.

The development of a mobile app

Given the workflow above and the fact that the RealityKit framework is only available on Apple platforms, we need a solution that runs on Apple’s hardware and can do the following:

  1. Capture and hold a series of images of the object
  2. Guide the user on how to take the pictures
  3. Generate the AR model from the images

The most logical solution we came up with was an iOS app that meets all three criteria.

After capturing the images, users upload them to a service via the app and trigger the AR model generation. As AR model generation is computationally expensive, the process runs asynchronously from the mobile app's point of view: the app does not wait for the result, and the client receives a notification once the process completes.

Users can then verify the result, and if it is not up to their expectations, they can tweak the generation parameters to regenerate better results.

The development of a backend application for AR model generation

After the user uploads the images, the solution uses the Object Capture feature of the RealityKit Framework to generate the AR model. However, based on the information and sample code provided by Apple, the solution requires the following:

  • macOS 12.0+
  • A discrete GPU on the machine
  • At least 4 GB of dedicated video RAM and ray tracing capability

As the iPhone and iPad do not meet all of these requirements, we developed a microservice in Swift that runs on Apple’s personal computers and workstations with either an AMD GPU or Apple silicon.
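The Object Capture step inside such a microservice can be sketched roughly as follows. This is an illustrative sketch rather than the PoC's actual code: it assumes macOS 12+ on supported hardware, the function name generateModel(from:to:) and the helper progressPercentage(_:) are hypothetical, and the configuration values and the .medium detail level are arbitrary choices. PhotogrammetrySession and its Configuration, Request and Output types come from RealityKit.

```swift
import Foundation
#if canImport(RealityKit)
import RealityKit
#endif

/// Converts a 0.0...1.0 progress fraction into a whole percentage (illustrative helper).
func progressPercentage(_ fraction: Double) -> Int {
    let clamped = min(max(fraction, 0.0), 1.0)
    return Int((clamped * 100.0).rounded())
}

#if canImport(RealityKit) && os(macOS)
/// Hypothetical wrapper: turns a folder of captured images into a USDZ file.
@available(macOS 12.0, *)
func generateModel(from imagesFolder: URL, to outputFile: URL) async throws {
    // Generation parameters a user could tweak before regenerating a model.
    var configuration = PhotogrammetrySession.Configuration()
    configuration.sampleOrdering = .unordered
    configuration.featureSensitivity = .normal

    let session = try PhotogrammetrySession(input: imagesFolder,
                                            configuration: configuration)

    // Start listening for outputs before the request is submitted.
    let listener = Task {
        for try await output in session.outputs {
            switch output {
            case .requestProgress(_, let fraction):
                print("Progress: \(progressPercentage(fraction))%")
            case .requestComplete(_, .modelFile(let url)):
                print("Model written to \(url.path)")
            case .requestError(_, let error):
                throw error
            case .processingComplete:
                return
            default:
                break  // skipped samples and other messages are ignored in this sketch
            }
        }
    }

    // Ask for a single USDZ model file at medium detail.
    try session.process(requests: [.modelFile(url: outputFile, detail: .medium)])
    try await listener.value
}
#endif
```

On a supported Mac, this could be called as try await generateModel(from: imagesFolder, to: outputFile). Other detail levels, such as .reduced or .full, trade model quality against processing time.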

For our PoC, we used an on-premise MacBook to run the microservice, since hardly any major cloud provider offered M1 Macs. The mobile app communicated with our backend via a set of REST APIs, which we developed using the Vapor web framework.

(As of now, AWS only provides non-GPU-enabled Intel VMs, and M1 is only in preview. Few cloud services provide M1 machines; MacStadium is one possible alternative.)

An in-depth look at the implementation process

How should organisations go about their implementation? In the second half of this article, we explore the design and implementation details.

The importance of AR file format

As part of the AR model generation, we also need to consider file format support. There are many file formats for AR modelling, and none is as universal as JPEG or MPEG are for images and video, but these two are the most popular:

  • glTF - universally recognised, except for iOS
  • USDZ - iOS specific


The GL Transmission Format (glTF) is a runtime asset delivery format for GL APIs: WebGL, OpenGL ES, and OpenGL. By providing an efficient, extensible, interoperable format for the transmission and loading of 3D content, glTF bridges the gap between 3D content creation tools and modern GL applications. [3]


USDZ is a 3D file format that displays 3D and AR content on iOS devices and offers portability and native support, so users do not need to download a new app to open the file. In addition, iPhone and iPad users on iOS 12 can easily view and share this portable format, and developers can exchange it between applications in their 3D-creation pipeline.

Currently, only iOS supports USDZ; its equivalents on other platforms are glTF and GLB. [4]

Support comparison

glTF vs USDZ support:







Based on the table comparison, neither format stood out. While glTF is used more frequently, USDZ allows for high-quality rendering on iOS. In addition, Apple's RealityKit makes model generation faster, with the limitation that it creates only USDZ model files.

Note: iOS devices can automatically convert glTF to USDZ, but with reduced quality. Where possible, it is still better to provide the USDZ file.

The intricate details of our backend microservice

The backend microservice does the heavy lifting of receiving the images from the clients and processing them to generate the AR model. The diagram below illustrates what we imagine the overall architecture to look like:

Overall architecture

API Software component

Through further analysis, we identified the following interaction flow:

  1. Start a session and upload related metadata
  2. Upload all the files to the session
  3. Trigger the processing of the files
  4. Validate the model
Object Capture API

API interaction sequence

We know there is a need for RESTful APIs to receive images from the client. The client should then be able to instruct the microservice, through the APIs, to generate the model and save it for subsequent download and rendering.
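The interaction flow listed above can be modelled as a small state machine that the API layer consults before accepting a call. This is a minimal sketch under stated assumptions: the type CaptureSession, its phases and its error cases are hypothetical names, not part of our PoC or of any framework, and the default of 20 images reflects the minimum image count we observed in our testing.

```swift
/// Illustrative model of one capture session moving through the four steps.
struct CaptureSession {
    enum Phase: Equatable { case started, uploading, processing, readyForValidation }

    enum FlowError: Error, Equatable {
        case wrongPhase(Phase)
        case tooFewImages(have: Int, need: Int)
    }

    let objectName: String                      // step 1: session metadata
    let minimumImages: Int                      // assumption: configurable threshold
    private(set) var imageNames: [String] = []
    private(set) var phase: Phase = .started

    init(objectName: String, minimumImages: Int = 20) {
        self.objectName = objectName
        self.minimumImages = minimumImages
    }

    /// Step 2: register an uploaded image with the session.
    mutating func addImage(named name: String) throws {
        guard phase == .started || phase == .uploading else {
            throw FlowError.wrongPhase(phase)
        }
        imageNames.append(name)
        phase = .uploading
    }

    /// Step 3: trigger processing once enough images have arrived.
    mutating func triggerProcessing() throws {
        guard phase == .uploading else { throw FlowError.wrongPhase(phase) }
        guard imageNames.count >= minimumImages else {
            throw FlowError.tooFewImages(have: imageNames.count, need: minimumImages)
        }
        phase = .processing
    }

    /// Called when generation finishes; step 4 (validation) can then begin.
    mutating func markModelGenerated() throws {
        guard phase == .processing else { throw FlowError.wrongPhase(phase) }
        phase = .readyForValidation
    }
}
```

Rejecting out-of-order calls early, for example an upload that arrives after processing has started, keeps the expensive generation step from running on an incomplete or changing set of images.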

Here is where a web framework comes into play. For this task, we chose Vapor.

Vapor Web Framework

Vapor is our lightweight web framework of choice. It enables faster development of web applications with RESTful APIs and can be deployed on Docker and Linux machines. However, RealityKit requires macOS-specific APIs, which eliminated both Docker and Linux as deployment options for us.

We also considered the Kitura web framework, but did not choose it because its development has not been very active.

Vapor, on the other hand, is still under active development and supported by its developers. Furthermore, we needed to build a multi-architecture application, and Vapor can be built for both the x86-64 and arm64 architectures on macOS.

The importance of concurrency

The model generation process requires a significant amount of CPU and GPU power, taking several minutes on Apple silicon such as the M1 SoC and even longer on Intel processors. This affects the overall performance and response time of the system. As a result, the operation cannot run on the same thread as the RESTful API that receives the images, or on the main server event loop.

So, what can we do?

Use concurrency and async processing.

The AR model generation must be a parallel task triggered by the API or by another mechanism such as a timer job. This allows the APIs (such as the image upload API) to behave in a fire-and-forget manner.
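One way to get this fire-and-forget behaviour is to hand the heavy work to a background queue and return a job identifier straight away. The following is a minimal sketch using Grand Central Dispatch; the class JobRegistry and its methods are hypothetical names, and the serial queue is an assumption that keeps a single generation job running at a time so that jobs do not compete for the GPU.

```swift
import Foundation
import Dispatch

/// Illustrative registry that runs submitted jobs off the request thread.
final class JobRegistry: @unchecked Sendable {
    enum State: Equatable {
        case queued, running, finished
        case failed(String)
    }

    private let lock = NSLock()
    private var states: [UUID: State] = [:]
    // Serial background queue: heavy jobs run one after another.
    private let worker = DispatchQueue(label: "model-generation")

    /// Thread-safe read of a job's current state.
    func state(of id: UUID) -> State? {
        lock.lock()
        defer { lock.unlock() }
        return states[id]
    }

    private func update(_ state: State, for id: UUID) {
        lock.lock()
        defer { lock.unlock() }
        states[id] = state
    }

    /// Returns immediately with a job ID; the work runs later in the background.
    func submit(_ work: @escaping @Sendable () throws -> Void) -> UUID {
        let id = UUID()
        update(.queued, for: id)
        worker.async { [self] in
            update(.running, for: id)
            do {
                try work()
                update(.finished, for: id)
            } catch {
                update(.failed(String(describing: error)), for: id)
            }
        }
        return id
    }

    /// Blocks until every job submitted so far has completed (useful for tests or shutdown).
    func waitUntilIdle() {
        worker.sync {}
    }
}
```

An API handler would call submit and respond to the client at once, while the model generation itself proceeds on the background queue.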

However, this comes with challenges.

With parallel processing, the client no longer receives processing status updates through the API call that triggered the work. Instead, we report the progress percentage of the activities via notifications. Our solution involves implementing a notification system, as shown in the diagram below.

Notification system


With this notification system, the AR generation process independently notifies the client about the progress and status. The supported statuses are:

  • absent
  • inProgress - with a percentage
  • created
  • error - with a message
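These statuses map naturally onto a Swift enum with associated values. The sketch below is illustrative only: ModelStatus mirrors the four statuses listed above, while StatusNotification, its field names and encodeNotification(_:) are hypothetical and do not describe the PoC's actual payload format.

```swift
import Foundation

/// The four supported statuses, with their associated data.
enum ModelStatus: Equatable {
    case absent
    case inProgress(percentage: Int)
    case created
    case error(message: String)
}

/// Hypothetical JSON payload pushed to the client for one session.
struct StatusNotification: Codable, Equatable {
    let sessionID: String
    let status: String
    var percentage: Int? = nil
    var message: String? = nil

    init(sessionID: String, status: ModelStatus) {
        self.sessionID = sessionID
        switch status {
        case .absent:
            self.status = "absent"
        case .inProgress(let percentage):
            self.status = "inProgress"
            // Keep the reported value within 0...100.
            self.percentage = min(max(percentage, 0), 100)
        case .created:
            self.status = "created"
        case .error(let message):
            self.status = "error"
            self.message = message
        }
    }
}

/// Encodes a notification as JSON with a stable key order.
func encodeNotification(_ notification: StatusNotification) throws -> String {
    let encoder = JSONEncoder()
    encoder.outputFormatting = [.sortedKeys]
    return String(decoding: try encoder.encode(notification), as: UTF8.self)
}
```

Flattening the enum into optional fields keeps the payload easy to consume from clients that do not share the Swift type.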

Service Deployment

Our CI pipeline generates a multi-architecture executable that can run on M1 and Intel chips.

To generate a multi-architecture executable, run the following command in the terminal:

swift build --configuration release --arch arm64 --arch x86_64

Further Improvements: Scaling up and optimising the solution

As mentioned, generating an AR model from images is an intensive process that requires specific hardware and a lot of hardware resources. Imagine a scenario where batches of images for multiple objects are captured and uploaded to the backend within a short timeframe. This would overwhelm the processing pipeline.

Therefore, a more scalable design is to separate the API from the processing pipeline, place a message queue (Kafka, SQS, etc.) between them, and automatically balance the load across the processing workers.

As our approach is still in the Proof of Concept (PoC) phase, solution scalability is not a big issue. However, it could be a significant constraint in the real world and would require a carefully designed architecture and deployment.


We tested the application with various objects in different environments. On average, the resulting models were good, with good object recognition and decent accuracy. However, the result depends heavily on the environment. For example, with poor lighting or a noisy background, the results will not be ideal.

During our testing, we discovered that RealityKit needs a minimum of 20 images of the object to generate a reasonably good model. Higher quality models or more complex objects require 25 to 40 images.

The model generation process takes only a few minutes on average, and many images can be captured in a relatively short time, which makes it easy to iterate by trial and error.

The RealityKit framework from Apple makes automated generation possible with predictable results and comprehensible errors, enabling wider access to AR technology. However, one of its most significant limitations is that it cannot programmatically generate a glTF model from a USDZ model, nor output the generated model in glTF directly.

We strongly believe that our approach of working with the Photogrammetry framework will help to democratise AR modelling and create exciting new business opportunities for this technology.

Contact person for Singapore

Kiky Shannon

Head of Market Unit, Cross Markets APAC

Kiky is passionate about developing trusting relationships with clients and partners, accomplished through innovative problem-solving. With more than 20 years of track record, he has gained extensive experience and valuable insights through various roles that allowed him to have an authentic perspective on the technical, business and people aspects. At Zühlke, his current focus is on driving the strategic market growth for Cross Markets in the Asia Pacific. As a learning enthusiast, he mindfully balances building business and being an engineer at heart.