Thursday, August 8, 2013

Architecture principles...


Introduction

When looking at an architecture, it is very important to have a set of well-defined principles that can be used to assess and review that architecture, and that also give developers the rules of engagement they need to consider in their design and implementation. The principles cannot be overwhelming and need to be constantly assessed to measure their effectiveness. So here it is: I will first give a summary of each principle, followed by a more detailed description. There are 11 principles, which have proven quite useful when they are actually applied. There is no order of priority, but some of the principles should be considered mandatory for the success of a project.

Abbreviated principle list

  1. Managing Failure (Mandatory): Embrace failure: don't try to prevent it, but manage it. The presence of failures is the rule, don't treat them as exceptions. Assume everything fails (a minimal retry sketch follows this principle), for example:
    • Power faults: plan for them, they can happen with unexpected frequency.
    • Double faults happen: expect them, don't just assume single independent failures when creating scenarios of doom.
    • Partial failure: handle it, never assume partitioning can't happen in a data center.
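To make the principle concrete, here is a minimal sketch (in Python, with hypothetical names such as `call_with_retry` and `fetch_config`) of treating failure as the rule: every call may fail, so it is wrapped in bounded retries with exponential backoff and jitter rather than assumed to succeed.

```python
import random
import time

def call_with_retry(operation, max_attempts=5, base_delay=0.2):
    """Call an operation that is expected to fail now and then.

    Failure is the rule, not the exception: each attempt may raise, so the
    caller gets bounded retries with exponential backoff plus jitter instead
    of an assumption that the first call succeeds.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:                      # real code would catch specific errors
            if attempt == max_attempts:
                raise                          # the failure is managed, then surfaced
            # full jitter avoids synchronized retry storms after a partial outage
            time.sleep(random.uniform(0, base_delay * (2 ** attempt)))

# Hypothetical usage: fetch_config is any remote call that can fail.
# config = call_with_retry(fetch_config)
```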

  2. Instrumentation and Logging (Mandatory): Know what's happening before you try to improve anything. We will need a great deal of runtime information, either in real time or in batch mode (a minimal logging sketch follows this principle):
    • Systems designed to expose runtime information and measure performance (including latency).
    • Data rich enough to provide meaningful information about systems and users.
    • Events carry correlation information to provide a big-picture view of high-level and complex events. The correlation can be:
      • Horizontal: between system elements of the same type, hardware or software.
      • Vertical: going through the infrastructure, OS, application and service levels.
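As a minimal sketch of the kind of instrumentation meant here (assuming JSON-formatted events and a hypothetical `emit_event` helper), each event carries a correlation id that can later be joined horizontally and vertically:

```python
import json
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("service")

def emit_event(event, correlation_id, **fields):
    """Emit one structured event carrying a correlation id.

    The same id is propagated across infrastructure, OS, application and
    service layers so events can be correlated into a big-picture view.
    """
    record = {"ts": time.time(), "event": event,
              "correlation_id": correlation_id, **fields}
    log.info(json.dumps(record))

# One request flowing through two layers, tied together by the same id.
cid = str(uuid.uuid4())
emit_event("request.received", cid, component="frontend", latency_ms=3)
emit_event("db.query", cid, component="backend", latency_ms=12)
```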

  3. Design for the cloud (Mandatory): Design applications/software elements for a dynamic cloud environment, don't make assumptions about your infrastructure.
    • Partition: avoid funnels or single points of failure. The only aggregation point should be the network itself.
    • Plan on resources not being there for short periods of time: break the system apart into pieces that work together but can keep working in isolation, at least for several minutes.
    • Plan on any machine going down at any time: build mechanisms for automated recovery and reconfiguration of the cluster (see Principle 1).
    • Implement elasticity (a bootstrap sketch follows this principle):
      • Automate the deployment process and streamline the configuration and build process (ensure the system can scale without any human intervention).
      • Every instance should have a role to play in the environment (DB, FE, BE, ...), which could be passed as an argument that instructs the machine image (e.g. grab the necessary resources).
    • Rely on parallelization and multi-threading when accessing (retrieving/storing) data and consuming resources, and handle deadlocks (2-way, n-way, phantom) with proper detection, prevention and avoidance algorithms, building a wait-for graph.
    Note: most of these principles are classic internet design principles and are not specific to cloud development.
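As an illustration of the elasticity point, here is a minimal bootstrap sketch (the role names and handlers are hypothetical) where the role passed to the machine image decides, without human intervention, which resources the instance grabs:

```python
import os

# Hypothetical role handlers; each instance grabs the resources its role needs.
def start_frontend():
    print("front-end: attach to the load balancer, warm the caches")

def start_backend():
    print("back-end: connect to queues and worker pools")

def start_database():
    print("database: mount volumes, join the replication group")

ROLES = {"FE": start_frontend, "BE": start_backend, "DB": start_database}

def bootstrap():
    """Configure the instance from its role, with no human intervention."""
    role = os.environ.get("INSTANCE_ROLE", "FE")   # passed to the machine image
    ROLES[role]()                                  # unknown roles fail fast

if __name__ == "__main__":
    bootstrap()
```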

  4. Divide and Conquer (Mandatory): It's all about boundaries and... Rome went bust when it became too big...
    • Divide big, complex problems & systems into smaller, simpler components.
    • Choose the best solution, tools & technology available for each component.
    • Optimize each component for the most frequent tasks it will need to do.
    • Aim for the minimum overlap of functionality between the components.
    • Avoid tight dependencies between components.

  5. Latency is not zero: It exists! Embrace it: latency is the mother of interactivity. But try hard to reduce it, because latency hurts (customers AND our revenue!):
    • Amazon: every 100ms of latency cost them 1% in sales.
    • Google: an extra 0.5 seconds in search page generation dropped traffic by 20%.
    • The less interactive a site becomes, the more likely users are to click away and do something else, e.g. use a competitor's site (latency in games is a make-or-break factor).
    • Every API should be cache-able, and the cache should be closer to the end-user (a minimal sketch follows this principle).
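A minimal sketch of a cache-able API response (using only the Python standard library; the catalog payload is hypothetical): Cache-Control lets caches closer to the end-user keep a copy, and ETag lets them revalidate without re-downloading.

```python
import hashlib
from http.server import BaseHTTPRequestHandler, HTTPServer

CATALOG = b'{"products": ["a", "b", "c"]}'   # hypothetical, rarely-changing payload

class CacheableHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        etag = hashlib.sha1(CATALOG).hexdigest()
        if self.headers.get("If-None-Match") == etag:
            self.send_response(304)           # not modified: no body, minimal latency
            self.end_headers()
            return
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Cache-Control", "public, max-age=60")  # cacheable downstream
        self.send_header("ETag", etag)
        self.end_headers()
        self.wfile.write(CATALOG)

# HTTPServer(("0.0.0.0", 8080), CacheableHandler).serve_forever()
```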

  6. Almost infinite scale (Mandatory): Assume the number of objects will grow significantly, and more than you thought. Almost-infinite scaling is a deliberately loose way to motivate us to:
    • Be clear about when and where we can know something fits on one machine.
    • Decide what to do if we cannot ensure it fits on one, two or three... machines.
    • Aim to scale almost linearly with the load (both data and computation).
    • Develop a plan and documentation for how to extend capacity on demand.

  7. Relaxed Consistency: Trade some consistency for availability in partitioned databases.
    • BASE is diametrically opposed to ACID:
      • ACID (Atomicity, Consistency, Isolation, Durability): pessimistic, forces consistency at the end of every operation.
      • BASE (Basically Available, Soft state, Eventual consistency): optimistic, accepts that consistency will be in flux but eventual.
    • While this sounds impossible to cope with, it is quite manageable and leads to levels of scalability that cannot be obtained with ACID.
    • BASE (a minimal sketch follows this principle):
      • Basically Available: it will always be there, even if some nodes aren't; reads for all data will likely still be served by caches.
      • Soft state: writes and updates return quickly but may take time to propagate; reads will never block but might return a previous version.
      • Eventually consistent: if you wait long enough, the data will eventually be consistent.
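The following toy sketch (a hypothetical in-memory primary/replica pair, not any specific product) shows the BASE behavior described above: writes return quickly, reads may return a previous version, and the data converges once propagation runs.

```python
import time

class EventuallyConsistentStore:
    """Toy replica pair illustrating BASE semantics."""

    def __init__(self):
        self.primary = {}
        self.replica = {}
        self.pending = []

    def write(self, key, value):
        self.primary[key] = (value, time.time())
        self.pending.append(key)          # returns immediately, propagation deferred

    def read(self, key):
        # Basically available: the replica always answers, possibly with stale data.
        return self.replica.get(key, (None, None))[0]

    def propagate(self):
        # In a real system this runs asynchronously (log shipping, anti-entropy...).
        for key in self.pending:
            self.replica[key] = self.primary[key]
        self.pending.clear()

store = EventuallyConsistentStore()
store.write("user:1", "alice")
print(store.read("user:1"))   # may be None: soft state, previous version
store.propagate()
print(store.read("user:1"))   # "alice" once the data has converged
```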

  8. Design for Security (Mandatory): Treat security requirements as first-class citizens in your design:
    • Define ownership of data and take the data life-cycle into account.
    • Define the multi-tenancy level and handle it within all system components:
      • Infrastructure: separate network, separate blades/racks (up to separate locations or locked cages) -> expensive at the whole stack level.
      • OS: separate OS, with or without virtualized environment/container, on shared infrastructure -> a good level to start thinking about multi-tenancy.
      • Application: multiple instances on the same OS -> expensive: licensing model.
      • Service: logical separation at the database level within the same application -> dangerous: the application may not fully understand the logical separation via data.
    • Handle delegated administration as part of the basic design:
      • Individual user with self-administration or administration on behalf.
      • Cascading notion of groups of users (company, departments within a company, groups within a department or group, etc.), with administration policies at each level.

  9. Don't reinvent the wheel: Did somebody else already solve your problem? You're likely not trying to solve a totally new problem:
    • Look at Open Source, the commercial market or SaaS solutions (the good ones follow similar architecture principles and have exposed APIs) for things you can reuse.
    • When using Open Source, use truly managed Open Source (check whether the company that initiated the project still stands behind it; there is a big chance it offers an enterprise version).
    • There is also a big chance the component has already been used in a production environment. This will save you time you can use to focus on the business logic, with all the architecture principles applied!
  10. Worse is better: Solve only 80% of a problem: that's usually good enough.
    • Worse-is-better solutions not only get done faster, they actually get done at all.
    • Think about a design that:
      • must be simple,
      • is correct in all the observable aspects and not overly inconsistent,
      • covers as many important situations as is practical.


  11. Solid Design Patterns: Good design enables faster delivery and better management of change and complexity.
    • Base the design on well-proven patterns and anti-patterns that help you solve problems in an efficient and elegant way and make the software less complex and cheaper to maintain.
    • Learning and understanding those patterns must be paramount for any developer who joins the project.

This list does not pretend to be exhaustive, but it covers the major pitfalls I have seen in different projects at the different companies I have worked for.

Detailed principle list

This section will cover each principle with the following structure:
  • Statement: what has already been described in the previous list.
  • Rationale: a description of why the principle is important, by looking at what happens if it is not followed.
  • Implication: the consequences for what development and other teams should or need to do.
  • Industry best practices: when applicable, a non-exhaustive list of examples of what has been done at the industry level.
  • Detailed recommendations: when applicable, a non-exhaustive list of practical actions which cover the principle.
The detailed principle list is a pdf document.

Conclusion

As usual for all of these principle lists, it will always be possible to find missing items, and it should always be possible to add or remove a principle, or a detail of a principle, to better fit the demands and constraints of a specific environment.
It is also important to understand that there will always be exceptions; however, since the principles are documented, it is easy to assess the risk of not following a specific principle.

Friday, July 12, 2013

API Exposure Platform

Introduction

APIs have become the new hot topic and, while still quite immature, may have reached the tipping point to full recognition as a key component in any software project. It is important to recognize the three different environments involved in the API supply chain:
  1. On the bottom there is the API implementation environment which represents the set of systems/platforms/frameworks that actually makes the APIs work.
  2. In the middle there is the API exposure environment which represents the system/platform that helps the APIs to be available to developers (short tail, long tail) and also handles all the aspects associated with making an API commercially viable.
  3. Finally, on top there is the API consumption environment which represents the set of systems/platforms/frameworks that facilitate the consumption of exposed APIs in order to create new solutions.
What makes this segmentation particularly complex is the fractal nature of APIs: when a solution is created by developing a piece of software that consumes APIs, it is possible for that piece of software to itself implement an API, so the same piece of software and the systems/platforms/frameworks that support it are simultaneously part of the implementation and consumption environments. For the sake of simplicity, one approach is to work only at the first level of the fractal while understanding that this pattern exists. How many times the recursion happens may be bounded by three constraints/observations:
  • not every solution actually implements an API, 
  • not every software solution that implements an API is made by composing API's (which may then be called a canonical API),
  • composing API's recursively may lead to complexities like functionality echo.
This paper will describe the different environments and focus on the functionality and specific characteristics of the exposure environment by describing the API exposure platform.

The implementation environment

This environment is represented by a wide variety of technologies/architectures, since implementing a canonical API may follow a session-driven paradigm, an enterprise architecture paradigm, use an application server, or be based on a completely homegrown solution depending on the domain in which the API is implemented. Looking at the telco industry for example, we can identify four implementation domains:
  1. Network,
  2. BSS (Business Support Systems),
  3. Products and Services,
  4. Infrastructure,
and each of these domains generally follows a different paradigm, well represented by the framework/platform used to implement the solutions within it. IMS (IP Multimedia Subsystems) in the Network domain has a session-centric architecture, while SDP (Service Delivery Platform) in the BSS domain follows a more classic enterprise architecture with an enterprise service bus that mediates all the events/messages exchanged between the different sub-systems attached to the bus. The Products and Services domain follows a more internet-centric architecture where the base platform, called the service platform, is actually a set of individual tools specialized in federating many other internal or external components. Finally, the Infrastructure domain follows a broker-based approach (analytic or policy driven) to handle the different IT/network resources available either within the data center or in the network.

The consumption environment

This environment is where an exposed API is being consumed. There are many ways to consume an API: directly, or via an intermediary (SDK, adapter, broker, platform, framework), and this with a full IDE or a simple text editor. The consumption can happen at multiple levels: the device that delivers an experience (user-facing or not), the front end technologies that support the device, or the back end systems (brokers, platforms, frameworks, application servers...) used to develop software services.

This consumption domain has followed the natural evolution of software development, which started with local subroutines, moved to libraries (local then distributed), to remote objects accessible via local stubs, and finally to software services that implement APIs. This evolution has introduced multiple new levels of complexity inherent to distributed systems, like concurrency, deadlock, service latency, network failure, availability constraints, elasticity, and on-demand access to base resources, which have generally led to the following realizations:
  • Never hide distribution from the consumer of an API: latency and network errors will happen, and API-consuming solutions should have a way to handle these problems (see Jim Waldo's interview on distributed systems, and the introduction of DTN (Delay Tolerant Network) by Vint Cerf).
  • Never hide elasticity, but expose it via API and continuously expand the criteria to trigger it. Expand the criteria for refining the on-demand access to base resources and expose this mechanism via API. The API consumers will have the logic to deal with the elasticity triggers and will make the requests to get or release resources.
  • Dealing with an API controlled by an independent party implies (especially for mission-critical solutions) more discipline around system management (the FCAPS (Fault, Configuration, Accounting, Performance, Security) model will apply to an API or to its implementation).
Today most API consumption is best effort and some of the previously described issues are not yet addressed, but it will become more and more important to handle system management issues (one of the roles of the exposure environment) and ultimately develop fluid solutions.

The exposure environment

Also known as API exposure, the exposure environment is an important step in the API supply chain since it allows a consumer (developer, partner, etc.) to find the available APIs and interact with them in order to build simple best-effort applications or complex mission-critical systems.

Platform functionality

While there are many ways to implement and consume an API, exposing an API can and should be handled through a single platform (possibly with many logical instances), as long as constant attention is paid to keeping the platform agnostic to the API semantics.

Beyond the classic aspects of a community-based platform, including blogs, reviews, forums and even some form of gamification to recognize members of the community, from a top-down perspective the API exposure platform must address the following:
  • Developer registration: a developer can be a single person consuming an API as a hobby or can belong to a large corporation developing enterprise-grade solutions. It is possible to have many levels of developers (gold, silver, bronze...) and different structures (single developer to groups of developers). While many aspects of the platform should be accessible without registration, it is important to create a structured community of developers by registering and categorizing them.
  • Solution registration: based on its state of development, a solution (application, software service, etc.) that consumes an API needs to be registered, both for analytics purposes and from a business model perspective. The developer identity could be used for that, but it is important to know more details about the solution that actually consumes the API. This is borderline with the consumption domain, but at least the metadata stored during solution registration gives the API exposure platform enough information about the context in which the API is being used.
  • API authentication/authorization: the developer and/or the solution have to be authenticated in order to be allowed to use an API (possibly only at scale). This may actually depend on the business model and on the stage of development the solution is at. A more complex aspect that also needs to be handled is the authentication of the user who has downloaded the application, or the consents that user has given for a solution to act on their behalf.
  • API description and usage management: the facility that makes understanding and accessing the API easy and quick: hypertext documentation, code repositories (optionally with the associated binaries) and a sandbox for understanding the API by trying it are tools being used to make an API successful. Based on the state of the solution that consumes the API and on the developer, throttling rules may be applied to the API calls, either to protect the implementation or to avoid misuse of the API (a sketch of authentication plus throttling follows this list).
  • API business model: is the commercial part of the API exposure since it describes what are the commercial conditions to use an API like:
    • usage volume limit (per time period for example),
    • usage volume before contract,
    • cost model on API, on assets,
    • revenue share if any.
    There are many business models that have been already described by John Musser.
  • API analytics: provides the ability to feed many other functionalities (like the business model) but is also a way to adapt the APIs to demand by understanding the patterns of usage or non-usage. API exposure can become an important source of information since it describes the popularity of the business assets.
  • API exposure state: while the API implementation can be at different states of development (simulated (downloadable code), simulated end point, alpha, beta...), the exposure itself can also be at different states. For example the API may not have a business model yet, or the API may be just a specification. It is very important to explicitly and independently describe to the developer the state of development of the API implementation and the state of the API exposure.
  • API for management: a new aspect that the API exposure platform should enforce. It represents the different management aspects an API should have in order to be usable in enterprise-grade solutions. Normalization around common management semantics and the ability to assess the health of the API are examples of concepts an API must have to be part of a controlled environment. This concept is not new: in distributed systems (networks for example) there are models like FCAPS to cover the different attributes network elements need to have in order to make a network, composed of a large variety of these elements, manageable. A similar problem exists for APIs, and while this may influence the API implementation, it can be enforced/facilitated by the API exposure platform.
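As a minimal sketch of the authentication and throttling functions mentioned above (the key, the bronze profile and the token-bucket parameters are all hypothetical), the platform authenticates the API key and then applies the throttling rule attached to it before forwarding the call:

```python
import time

# Hypothetical profiles produced by developer and solution registration.
API_KEYS = {"key-bronze-123": {"developer": "dev-42", "rate_per_minute": 30}}

class TokenBucket:
    """Per-key throttle: tokens refill over time, calls are rejected when empty."""

    def __init__(self, rate_per_minute):
        self.capacity = float(rate_per_minute)
        self.tokens = float(rate_per_minute)
        self.refill_per_second = rate_per_minute / 60.0
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_second)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

buckets = {}

def gate_request(api_key):
    """Authenticate the key, then apply the throttling rule attached to it."""
    profile = API_KEYS.get(api_key)
    if profile is None:
        return 401, "unknown API key"
    bucket = buckets.setdefault(api_key, TokenBucket(profile["rate_per_minute"]))
    if not bucket.allow():
        return 429, "rate limit exceeded"
    return 200, "forward to the API implementation"

print(gate_request("key-bronze-123"))
```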
All of these operations can be done independently of the API semantics and therefore can be handled by a single platform. However, due to the inherent complexity and lack of flexibility of some of the implementation platforms, which makes changes difficult or expensive, real implementations generally lead to API transformations which may leak into the exposure environment. These transformations must be clearly decoupled from the core functionality, since they may create scalability and maintenance issues on the API exposure platform itself, and ultimately they should be migrated into the proper environment. Here is a list of transformations addressed by the API exposure platform, with the environment in which each transformation should happen:
  • API adaptation (implementation): often the API is not ready to be exposed as is and needs to be adapted to better fit consumer demand. Protocol adaptation (SOAP to REST for example) and resource adaptation (grouping multiple methods) are examples of the types of adaptations that can be performed on APIs before exposure.
  • API customization (consumption): dealing with a large partner may lead to a negotiation between the different parties in order to define what information will be exchanged and how. An API customization may need to happen in order to comply with the interfaces negotiated between the different parties.
  • API specialization (consumption): while an API should be as agnostic as possible to the API consumers (such an API can then be called a wholesale API), it may be necessary to specialize the API in order to fit the specific requirements of an API consumer (that API is then called a retail API). Protocol optimization, usage of legacy protocols, and message grouping to minimize chattiness are examples of activities that specialize a wholesale API into a retail API.
  • API profiling (implementation): an interesting case of transformation is modifying the response of the API based on the level of the developer/application. For example a bronze-level developer may have access to a different precision than a gold developer when using a location API. While the definition of the levels is in the exposure domain, this transformation of the API should be in the implementation domain but often leaks into the exposure domain.
  • API versioning (implementation): it is important not to hide versioning and to use the proper design patterns as described by Brian Mulloy. From a system point of view the versioning should always be addressed by the implementation environment, both when it is a significant change and when it is a revision which changes the behavior of the API. A policy-based revision may be handled in the exposure environment (a routing sketch follows this list).
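As a small illustration of explicit versioning (the URL layout, handlers and payloads are hypothetical, loosely following the version-in-the-URL pattern), two versions of the same resource coexist and the consumer always states which one it calls:

```python
# Hypothetical handlers for two versions of the same resource.
def get_location_v1(device_id):
    return {"device": device_id, "lat": 48.85, "lon": 2.35}

def get_location_v2(device_id):
    # v2 is a significant change, handled by the implementation environment.
    return {"device": device_id,
            "position": {"lat": 48.85, "lon": 2.35}, "accuracy_m": 50}

ROUTES = {"v1": get_location_v1, "v2": get_location_v2}

def dispatch(path):
    """Route /v1/devices/<id>/location style paths to the matching version."""
    version, _, device_id, _ = path.strip("/").split("/")
    return ROUTES[version](device_id)

print(dispatch("/v1/devices/dev-7/location"))
print(dispatch("/v2/devices/dev-7/location"))
```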

API exchange

By providing an intermediary endpoint in front of the real endpoint of the implementation, the API exposure platform allows a more dynamic binding between the API consumer and the API implementation itself. This property ensures the API consumer is not affected even when conditions at the API implementation level change.

However, there are cases where it may be necessary to set up an API context at the API consumer level in order to make sure that the correct API is being called. This is particularly useful when a group of companies has agreed on a specific API semantic: instead of creating an API aggregator that acts as the endpoint for the API on behalf of each member of the group, an API exchange (a good example has been implemented by Apigee) is used to implicitly or explicitly make sure that the correct API of one member of the group is discovered and called (a sketch of the explicit case follows below).
  • explicit discovery: very much like DNS, the API consumer makes a call to the API exchange and receives in return a context containing the specific API of one of the group members to call.
  • implicit discovery: the API consumer makes a call to the API of one of the group members. If that group member cannot successfully answer the call, it calls the API exchange and redirects the API consumer to the proper API of another group member based on the response of the API exchange.
The API exchange works between group members that have commonly agreed on a specific standard, either for the functional APIs or, more probably, for the APIs for management.
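A minimal sketch of the explicit discovery case (the registry content, capability names and endpoints are hypothetical): the consumer asks the exchange, DNS-style, which member endpoint serves a given capability.

```python
# Hypothetical registry held by the API exchange for a group of members
# that agreed on a common API semantic.
EXCHANGE_REGISTRY = {
    ("sms.send", "subscriber-of-operator-A"): "https://api.operator-a.example/v1/sms",
    ("sms.send", "subscriber-of-operator-B"): "https://api.operator-b.example/v1/sms",
}

def explicit_discovery(capability, subscriber):
    """DNS-like lookup: return the context telling the consumer which member API to call."""
    endpoint = EXCHANGE_REGISTRY.get((capability, subscriber))
    if endpoint is None:
        raise LookupError("no group member exposes this capability for this subscriber")
    return {"capability": capability, "endpoint": endpoint}

print(explicit_discovery("sms.send", "subscriber-of-operator-A"))
```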

Other aspects that are resolved by the API exchange are:
  • the settlement aspects between the members of the group in case of cross billing handling,
  • the developer terms and conditions, which will actually allow the API discovery and/or the API redirection,
  • part of the user consent to allow restricted information to be used by the API consumer.
The API exchange can either be defined as a stand-alone service (like DNS) or as an add-on to the API exposure platform.

Conclusion

Many companies are jumping into the exposure environment, either because they already cover specific aspects of API exposure or because it is a natural evolution of their existing products. However, the tendency to merge different environments, like implementation and exposure, or exposure and consumption, in order to grab a bigger part of the API supply chain, leads to over-complex systems that look more like a new silo than a proper architecture and limits the richness of each of the merged environments. The worst, today, is to merge implementation and consumption to create tightly coupled end-to-end solutions, preventing the creation of a rich ecosystem of solutions on top of a platform business, expressed by a series of exposed APIs, which is mandatory for any modern business.

API analytics is a core aspect of the exposure environment and should always be treated as a priority since:
  • it allows the measurements needed to define success,
  • it acts as a tool to refine the APIs and the business models to fit the demand,
  • it prioritizes the core assets for implementation improvements,
  • it can become a source of revenue in indirect business models,
but to reach full maturity, an increased focus on API for management is necessary.

Creating a proper API exposure platform that:
  • enables the creation of a rich ecosystem of internal/external solutions that consume the exposed APIs,
  • allows a rich ecosystem of API implementations, representing the core assets of a business, to be used,
has become, in the digital world, the key to the success of any business and is the core transformation mechanism for struggling incumbent businesses.

Saturday, June 22, 2013

SDN: which approach, network or IT?

Introduction


After many SDN presentations by different types of vendors, I arrived at the conclusion that there are two distinct approaches to SDN: the network approach and the IT approach. Interestingly, the two approaches lead to the same results from the network perspective, but completely differ in the implementation roadmap and in the impact on the position of network operators in the ecosystem of digital service providers.

Approaches


Network

The network approach generally explains that by using SDN, costs (CAPEX and OPEX) will be reduced for the following reasons:
  • commoditization of the hardware through virtualization = CAPEX reduction
  • automation of tasks by having widespread control of the forwarding plane by the software = OPEX reduction
When asked what equipment is included in SDN, the vendors taking the network approach give the classic list of all the existing network elements: routers, switches, firewalls, load balancers, protocol converters, base stations, etc. However, they consider the data center resources and the devices as merely attached to the network, and therefore not part of the SDN story.

These vendors are also very quick to indicate the flaws of the approaches taken by the non-network people or vendors who have an SDN story:
  • the controller will not scale,
  • the forwarding plane to controller round-trip is too slow if applied to each packet.
and therefore they are very convincing in introducing a new layer and new equipment (hopefully on commoditized hardware) in order to cope with these flaws.

IT

The IT approach looks at SDN to solve a data center problem: if quick availability of new IT resources (computing, storage...) is needed, the network has to be agile and quick to configure. The network must also be very cheap to build and maintain since, for IT, the network is a cost, meaning that OPEX and CAPEX have to be as low as possible. And since the dynamic allocation of the IT resources is done by software, it makes complete sense to actually let the software control the network.

The equipment included in the SDN story is of course the network elements, but also the IT resources themselves.

The IT approach generally considers approaching SDN by creating/implementing an overlay network with distributed controllers, controller-to-controller connections, and controllers linked with the already existing distributed brokers (acting as controllers) of IT resources in cloud or hybrid cloud solutions. Implemented in this overlay network is the ability to cache forwarding logic, which is a familiar pattern already used in many other places (the web would never work if caching didn't exist...).

The vendors taking the IT approach generally don't have much of an opinion on the network approach, or see it as an opportunity, since once SDN is implemented in the network it will be possible to expand the overlay network.

Implementation results 


What, then, are the different results of implementing SDN using the two approaches?
  • From a network perspective, there is not much difference since it is clear that the network will be cheaper to implement and maintain and will be more agile and quick to configure no matter which approach is taken.
  • From a roadmap perspective, the IT approach clearly pushes an outside-in roadmap (creation of an overlay -> to the core network), while the network approach generally pushes to address the most expensive components first, hence an inside-out roadmap. However, the network approach also implies the implementation of a new layer of distributed controllers. The two approaches lead to two different roadmaps, but it may be possible to create a hybrid roadmap to mitigate the risks.
  • From a provider perspective, the nature of the provider and the existence of the two approaches lead to a dramatic difference of positioning:
    • An OTT (Over The Top) provider will always consider the IT approach with a top-down view including the IT resources, the network and the devices (home/enterprise gateways, terminals), since it addresses key IT problems and opportunities (cost, agility, distribution of the load, distributed software...) and because the network group is generally either non-existent or a minority within the IT organization. Since an OTT will generally create an overlay network, it is always possible to opportunistically expand it on top of the SDN the network operators will implement.
    • A network operator will generally consider both approaches from a bottom-up view, unfortunately in separate groups: two groups (network and IT) or even three (core network, access and IT), and will generally not include the attached terminals at all. Most likely the network approach will be picked since it addresses the current problem of doing more with less and because the network vendors are delivering a comfortable message with acceptable disruption.

    This situation is actually very similar to the one created by the same network vendors a few years ago with IMS, but instead of cost reduction (the new problem "du jour"), it was about innovative services (the old problem "du jour"). However, once implemented, these technologies have left and will leave the network operators in the same bad spot: witnessing a continuously expanding ecosystem of high-value OTT services (including the innovative services IMS was promising) while only seeing an ever-increasing traffic of low-level packets.

    In the SDN case, network operators may be able to save costs (although the constant increase of traffic may offset the savings), but they will not be able to be part of the IT supply chain and therefore will miss the massively distributed cloud movement and its related solutions.

Call to action


As described in my previous blog, implementing SDN will give network operators an opportunity to change the game and be part of the IT supply chain. However this means:
  1. eliminating silos (IT, network, access, terminal) within the network operator,
  2. while keeping a bottom-up view, network operators need to understand and embrace the top-down view, which includes the distributed software part. Today software solutions are more distributed and dynamic/elastic than the network itself (in the 7-layer OSI model, layer 7 has sub-layers that define a logical network where the nodes are software elements implementing APIs, the links are the relationships between these elements and the packets are the messages/events passed between these elements),
  3. making sure that the vendors (including the software only vendors) who have an IT approach of SDN are considered and evaluated in network environment,
  4. pushing the vendors who have a network approach of SDN to embrace the IT approach by:
    • treating any network element as an IT resource (maybe we should have more memory and computing pre-installed in network elements):
      • be capable of running multiple workloads (network workloads and IT/software solution workloads),
      • must be linked to the IT resource broker (cloud or hybrid cloud),
    • treating any IT resource (within data centers or on premise (home, enterprise)) as an active SDN network element, which therefore must be linked to the distributed controllers either to provide or to receive control information,
  5. being more prescriptive on the solutions vendors/SI need to provide,
  6. considering any gateway or box installed on premise (home, enterprise) by the network operator as an active SDN network element and an IT resource,
  7. working with the device manufacturers to treat any subsidized devices as an active SDN network element and an IT resource with specific constraints, 
  8. partnering with IT resource providers to expand the type of IT resources available for the software solution developers and providers.

Monday, June 10, 2013

M2M Platform

Creating a platform is a complex and delicate task; the platform needs to serve multiple purposes, one of the most important being to support and simplify the work of the developers who will be using/consuming it. This is particularly true for an M2M platform, which is a very specialized platform.

Looking at the evolution of device to back-end infrastructure communication and integration, it has always been an issue, especially when dealing with "small" devices, and most of the time dedicated, domain-specific solutions were developed, since both the front end (device related) and the back end (digital service related) had to know about each other's specifics in order to create a proper solution.

One of the main goals of an M2M platform is to remove these dependencies and create a proper set of boundaries which allow a looser coupling between the devices and the digital services, provide a scalable solution (technically and economically) and create an appealing environment for developers that does not overlap with existing platforms. In the modern approach to integration, a platform implements and exposes APIs that allow a consumer of these APIs (other digital services, apps, etc.) to use functionalities without having to worry about how these functionalities are implemented.

Core Description of the Platform

Taking a top-down approach to an M2M solution, what are the core problems a digital service does not need to deal with and therefore will delegate to an M2M platform?

Device connectivity

It is key that the device connectivity is abstracted and that only the properties of the connection are exposed, instead of describing the connection itself. In a way, the fact that the device is using a wireless or wired connection, or that a wireless connection is over WIFI or over a cell network, is actually irrelevant; however, the facts that the connection has variable bandwidth, that the device may or may not lose connectivity (a disconnection is not an error but is treated as an exception), that the connection is chargeable instead of being free (notion of rating and account associated with the connection) and that the connection is secure are examples of properties that the associated digital services may be interested in.

It is also key to handle how the logical connection is established and maintained: client centric (client polling/keep alive), server centric (long-running connection, going through a third party (e.g. pusher), etc.) or maintained by the network. Each of these solutions has its own merits/limitations and results in different behaviors that need to be exposed, not explicitly but as a set of attributes of the connectivity (e.g. real time...) to the consuming digital services, which do not need to understand how the logical connection is established and maintained.

Due to the network versus application separation, it is interesting to see that there are two approaches to device connectivity: an OTT approach which basically considers the network as an IP fabric, and the network approach which allows some optimizations based on network knowledge. A less polarized view should be taken, and opportunistic optimizations should be applied to benefit from both worlds.

Implementing a proper device connection abstraction is not a trivial task and may lead to the implementation of sophisticated software elements either in the device, at the network level or in the server components which actually implement the device connectivity API themselves. The distribution of such elements may also vary with the intelligence of the device (its ability to store and execute a pre-installed payload or a payload coming from a server), and may therefore be device heavy, server heavy or a combination of both. The network and the different systems involved will need to be flexible enough to handle (physically, economically...) the scale of the solution. On the server side, the major difficulty of device connectivity is actually to handle a large number of long-running connections which individually may not carry much load; because of the sheer number of connections, a server handling them with one thread per connection will have to spawn many threads, making efficient thread management an issue (a minimal asynchronous sketch follows). It is also possible to have an elastic cloud solution using a large number of small computing instances, which can be created and released quickly.
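A minimal sketch of that server-side concern (standard-library asyncio, with a hypothetical line-based protocol): one coroutine per connection instead of one thread per connection, so a small instance can hold a very large number of mostly idle, long-running device connections.

```python
import asyncio

async def handle_device(reader, writer):
    """One coroutine per device connection; the connection is long-running but mostly idle."""
    try:
        while True:
            line = await asyncio.wait_for(reader.readline(), timeout=300)
            if not line:
                break                     # device disconnected: expected, not an error
            writer.write(b"ack\n")        # tiny payload; the load is the connection count
            await writer.drain()
    except asyncio.TimeoutError:
        pass                              # idle too long: release the connection
    finally:
        writer.close()

async def main():
    server = await asyncio.start_server(handle_device, "0.0.0.0", 9000)
    async with server:
        await server.serve_forever()

# asyncio.run(main())
```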

Device management

Digital services don't want to directly deal with the device management aspects but may want to invoke the device management capabilities via APIs, either to make sure that the device has the proper configurations/profiles and/or the proper binaries/payload running in the device, or to assess the static/dynamic properties of the device (screen size, battery status).

The device management relies on the fact that a device is a connected device and therefore will consume the exposed device connectivity API.

Requests for downloading a specific payload to a single device or a group of devices, which may require working on a specific schedule based on the number of devices or on the dynamic characteristics of the network at the time of the actual download, are an important aspect of proper device management, but they need to be fully abstracted in order not to put the burden of the tasks' complexity on the consumer.

Similar to device connectivity, how the management capabilities are implemented must be hidden from the consumer of these capabilities, and may imply software elements in the device, on a server, or a combination of both; the distribution of these software elements will vary based on how smart the device is, how good the device connectivity is and what the consumer of the information actually needs. For example, some of the dynamic device properties may be pre-fetched (a sort of reverse DNS) and cached on the server side, and that cached view is what the consumer of the device management will see instead of accessing the device itself (a minimal sketch follows).
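A minimal sketch of that pre-fetch/caching idea (the `query_device` call and the property names are hypothetical): consumers of device management read a cached, recently refreshed view instead of touching the device.

```python
import time

def query_device(device_id):
    """Hypothetical call that reaches the physical device via the connectivity API."""
    return {"screen": "320x240", "battery_pct": 77, "fetched_at": time.time()}

class DevicePropertyCache:
    """Serve device-management reads from a server-side cache with a short TTL."""

    def __init__(self, ttl_seconds=60):
        self.ttl = ttl_seconds
        self.cache = {}

    def get_properties(self, device_id):
        entry = self.cache.get(device_id)
        if entry and time.time() - entry["fetched_at"] < self.ttl:
            return entry                  # cached view, device left untouched
        entry = query_device(device_id)   # refresh only when the entry is stale
        self.cache[device_id] = entry
        return entry

cache = DevicePropertyCache()
print(cache.get_properties("meter-001"))
```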

In order to improve performance and scalability, the device management needs to use existing utilities like CDN, caching and a repository of static device properties, and again, based on the variability of the load, it must use distributed elasticity (up and down) of IT resources (computing, storage) in order to be scalable (economically and technically).

Device abstraction

This is the final but most tricky part of the M2M platform since it sits on top of the device connectivity and device management (it is one of the consumers of these components) and handles most of the true functional aspects of the M2M platform, which is to provide a proper description of the device/group of devices, independent of the consumer of the platform. It is actually key to protect that independence, otherwise we are not dealing with a platform but with a framework, which generally has scalability (technical, operational and economical) issues. The device abstraction also has to work two ways, from the device to the digital service and vice-versa, in order to provide the full scope of the platform:
  • On the device to digital service way, abstracting the data/events generated/handled by the device is the main task, since it is key to provide the proper meaning of the data to the digital service that will consume it. An electric meter may send a number for the consumption which will be meaningless if the unit (Watt) is not added; a more extreme case of abstraction is handled for some very low level devices which only give information by doing a memory dump: the abstraction in this case will have to filter the memory dump to extract the right information described in the API. Since the digital service may have requested the download of a specific payload into the device in order to establish a high level relationship between the device and the digital service, the abstraction may become a pass-through or a stage of the payload execution; however this is generally completely opaque to the platform, given the specific functionalities embedded in the payload (again, focus on the independence between the M2M platform and its consumers).
  • On the digital service to device way, the actions may be handled in stages in order to cope with the limited capabilities of the device to handle the request or the consequences of the request. A digital service may need a specific payload to be rendered by the device (for example a user experience) and the device abstraction may perform some pre-processing before actually invoking the physical device itself (Opera Mini style). A payload needed by the digital service may define its own API, which may or may not be accessible by other digital services.
Depending on how smart the device is, the device abstraction may be pushed to the device itself, in which case the server side will only be a way to discover where the device abstraction end point is.

The grouping of devices is important but needs to be approached with caution, since it may not be very useful (and potentially confusing/limiting) to perform grouping within the M2M platform when this grouping could be done by a non-specific computing platform, which is most likely what the developer is used to, instead of depending on the M2M platform to do it. Since the M2M platform is represented by a series of APIs, it is very important to understand that many existing platforms (either as off-the-shelf technologies or as "as a service" components) are fully capable of aggregating APIs exposed by the M2M platform and other platforms, and will most likely be used by developers instead. The definition of the grouping performed by the M2M platform is therefore important and will cover the following:
  • expression of complex device models: a device may be described as a unique entity called a "simple" device (because it is simple enough, or because the device itself, no matter its level of complexity, does not allow going deeper than one level); however other models of devices may need to be described as a set:
    • a physically bounded set of devices that share the same connectivity or are bound together by a specific physical condition (a mobile phone (SIM, phone), a VCR (scheduler, recorder), a car, a plane, etc.),
    • a gateway device representing a collection of devices that are or are not directly addressable (a home gateway, a mesh network hub...),
    • (it would be interesting to have an exhaustive list of device models, or better, a canonical set which by composition identifies each model).
These sets can be cascading, since a complex device is a device and therefore can be added to another complex device. This means that to access the complex device, the device connectivity, device management and device abstraction APIs will have to be defined, either on their own or as delegated to the APIs of one of the devices of the set. Each of these forms of grouping is important since the digital service that accesses the set of devices will be capable of understanding its level of complexity (constraints, static/dynamic properties of the sets...) and acting on it. How far the modelling provided by the M2M platform should go will have to be assessed, since a key purpose of a platform is to simplify, and we also have to take into account that this modelling, or part of it, can be defined outside of the M2M platform (aggregator business).
  • grouping within a specific context (geography, political, administrative, topic, etc.), mostly for the digital service to device way: a digital service may want to update all the devices of a specific region, so instead of sending as many requests as there are devices in the group, only one request is sent. This form of grouping is not a device but has a specific API to describe the properties of the group. When dealing with a REST architecture (which should be the way we handle this type of platform), a device is a REST resource and it is convenient to have the notion of a group of devices as a REST resource too, implemented by the platform; but in this case the grouping is not meant to aggregate information but to provide a mechanism to easily discover the resources contained in the group. This is similar to the notion of a group when dealing with files, where the associated group notion is the folder; with photos, the album; and with contacts, the address book (a resource-layout sketch follows this list).
No matter the form of grouping, it is very important to define either the device model that the grouping implies or the context which explains why the grouping exists.
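A minimal resource-layout sketch (the paths, device ids and fields are hypothetical): the device is a REST resource, and the group is a resource too, but one whose job is only to expose the context of the grouping and links to its members.

```python
# Hypothetical resource layout for devices and groups.
DEVICES = {
    "meter-001": {"kind": "simple", "unit": "Watt", "last_reading": 1320},
    "gw-12":     {"kind": "gateway", "members": ["sensor-a", "sensor-b"]},
}
GROUPS = {"region-north": {"context": "geography", "devices": ["meter-001", "gw-12"]}}

def get_resource(path):
    """Resolve /devices/<id> and /groups/<id> style paths."""
    kind, resource_id = path.strip("/").split("/")
    if kind == "devices":
        return DEVICES[resource_id]
    if kind == "groups":
        group = GROUPS[resource_id]
        # The group is not a device: it only helps discover the resources it contains.
        return {"context": group["context"],
                "links": ["/devices/" + d for d in group["devices"]]}
    raise KeyError(path)

print(get_resource("/groups/region-north"))
print(get_resource("/devices/meter-001"))
```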

Extended View of the Platform

In order to complete the M2M platform, it is necessary to handle other domains that the M2M platform will either use for its implementation or rely on for defining solutions:

API exposure

Once the APIs for the devices ("simple" or complex) and groups of devices are implemented, it is important to expose these APIs in a way that is compelling to developers and also to perform the tasks that are specific to exposing APIs, like:
  • developer registration,
  • authc/authz on API,
  • application registration (not download: API exposure is not synonymous with an application store),
  • API business model (even freemium is a business model),
  • API throttling (per developer/per application),
  • API metering,
  • API sandbox,
  • API documentation presentation,
  • code samples, SDK, etc.
All of these tasks are very generic and should not be tied to the semantics of the APIs themselves, and off-the-shelf platforms (generally part of the API management domain) exist to handle them. It will be important for the M2M platform not to replicate such functionalities but to rely on a logically centralized (physically very distributed, for scaling reasons) facility, shared with other platforms, providing a complete set of APIs that can be used independently or as a composite.

API Consumption

The final step, which is out of scope from a platform perspective but in scope for defining an end-to-end solution, is the API consumption. This step defines what solutions are being developed on top of the set of exposed APIs, and how. There are many ways to consume APIs in order to create a solution:
  • Client mashups (all the logic of the solution runs in a client and direct calls are made from the client to the APIs).
  • Front end aggregation (all the logic runs on a front end server on behalf of the client and calls to the APIs are made from the front end server).
  • Cloud mashups (the logic of the solution runs on a server and the calls to the APIs are made from the cloud).
  • API adaptation (APIs from the platform are adapted to serve a specific purpose within another platform on top of the M2M platform).
  • API aggregator/broker (aggregate different APIs to create a new API, or aggregate many APIs that expose the same operation behind just one end point; see the sketch below).
  • ...
And of course many solutions are generally a combination of all of these ways. There is also a large number of platforms helping the development of solutions, either as software components (products from software vendors or off-the-shelf managed open source components) or as a service (force.com, App Engine, Azure...). Therefore it will be important to let the solution developer use the best platform for what is needed and what the developer is familiar with, instead of imposing a specific model that forces a change of behavior. It is also very important to define a clear scope for the M2M platform as a set of exposed APIs around the device (connectivity, management, abstraction; "simple" or complex; single or within a group).
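As a small sketch of the aggregator/broker style (the two wrapped APIs and the field names are hypothetical), several exposed APIs are combined behind a single call, and this combination lives in the consumption environment, not in the M2M platform:

```python
# Hypothetical wrappers around two APIs exposed by the M2M platform.
def get_device_state(device_id):
    return {"device": device_id, "battery_pct": 63}

def get_device_location(device_id):
    return {"device": device_id, "lat": 59.33, "lon": 18.07}

def fleet_status(device_ids):
    """Aggregator-style consumption: combine several exposed APIs behind one call."""
    report = []
    for device_id in device_ids:
        state = get_device_state(device_id)
        state.update(get_device_location(device_id))
        report.append(state)
    return report

print(fleet_status(["meter-001", "meter-002"]))
```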

Defining vertical solutions is key to the success of the platform; however the solutions should really use the platform instead of taking a silo approach. As many start-ups describe it: "develop horizontal, sell vertical", which shows the tension between the platform and the solutions but also indicates that solutions are Trojan horses for the platform. More and more, the APIs/platform are as important as the solutions themselves and should be delivered at the same time.

Cloud environment

The different components of the system (device connectivity, device management and device abstraction) must have separate levels of scalability and may also have to be elastic (up and down) in order to cope with the variability of the load.

  • The device connectivity has to handle a large number of active connections without necessarily handling a heavy computing load, and for that reason it may be necessary at the server level to have a massively distributed elastic pool of small instances.
  • The device management and abstraction look more like a classic service solution with a variable load and potential high load (payload download), and therefore could be handled by an elastic pool of medium/large instances, but with a reduced level of I/O.

This platform should therefore not look like a monolithic system but like a set of cloud-based sub-systems whose cardinality decreases from the edge (device connectivity) to the center (device abstraction), converging on an API exposure system that presents what the platform is about. Each of these cloud-based sub-systems has specific scalability requirements that are handled via elasticity and the instantiation of specific IT resources.

With the emergence of edge cloud IT resources, it is clear that the M2M platform could be a prime consumer of such IT resources, since it is important to offload the centralized data center IT resources as much as possible and since it improves latency, which is a key aspect of M2M solutions.

Analytics


On top of handling specific aspects via APIs, the nature of a platform is to generate relevant data about the activities within the platform. Many levels of information can be produced, at the infrastructure (including network and virtualized OS), application and service levels, and each subsystem must be treated as a source of information. While it is not necessarily possible or practical to specify a single format for all the data generated, some common tags need to exist for correlation (horizontal or vertical) to be possible. The data is dumped into an analytics environment, either in real time or in batch mode, and knowledge extraction (value) is performed. The results of this knowledge extraction must be used by the platform itself (self-improvement, feedback mechanisms, analytics-based elasticity...) and by other systems (monetization...).

Conclusion

As a summary, the actual focus of an M2M platform is to handle devices as a service (synonymous with device as a resource in REST terminology) and therefore to implement, in a cloud environment, three types of API per device: connectivity, management and abstraction. The device can be a "simple"/complex device or a grouping of devices based on a specific semantic, which by itself will have a specific API.
The platform should then use off-the-shelf software/platforms to expose the APIs.

Once the APIs are exposed, the consumption can take many forms, and it should be clear that while we need to implement end-to-end solutions, the solutions will use many other components/platforms than the M2M platform itself.

Thursday, May 30, 2013

SDN and Network abstraction take 2

Ok, a while ago, wearing a network hat, I arrived at a view of network abstraction on top of network virtualization as a consequence of SDN (see previous blog). Some of the results looked more like a classic approach of exposing network assets as APIs, and can actually be achieved without using SDN. But looking at the first category of abstraction, I said to myself that there must be something disruptive and game changing in implementing SDN, so wearing an IT hat I arrived at a more complete view of network abstraction:

Major internet service providers like Google already have a very distributed core network of IT resources (based on commoditized blades) and they have realized that this network has to be very dynamic, programmable and low cost; using SDN/OpenFlow or an equivalent is a way to handle these problems.

They are also spending a lot of effort expanding this core network with remote IT resources by:
  • giving SDN racks to be installed in other networks (viewed as a CDN solution for Youtube),
  • providing home/enterprise gateways to be attached to operator networks or within enterprise networks (viewed as a TV solution (GoogleTV) or as a search optimization solution (Google Appliance)),
  • providing downloadable code to end user devices (viewed as a browser/device based service platform (Chrome) or as search expansion (Google Desktop)),
  • and adding OpenFlow-like functionality in Android, which means that any Android device can be viewed as an IT resource...
Each of these activities provides a service to the user/enterprise (TV, CDN, optimization, etc.) and at the same time drastically expands their core network of IT resources. This pushes the network providers towards becoming a forwarding plane managing opaque tunnels (I prefer the term canals, see why later) carrying high level/value signalling/payload between the remote IT resources and their core network.

While this situation is basically unavoidable and could appear as yet another form of disintermediation, network operators can also change the game as a consequence of implementing SDN. Network operators are implementing SDN to reduce cost by performing two activities:
  1. Virtualize the hardware of each network element (base stations, firewalls, routers, switches, etc.).
  2. Create distributed controllers to make the network easy to configure (ultimately software based) and cheap to manage.
The second activity is core since there is a need to quickly and dynamically establish paths (route, VPN, etc.) in the network, but this will be completely hidden from developers. The virtualization activity can be used to create IT resources (using a de facto abstraction standard like the AWS specs) that can be used by developers to run their machine images or implement storage buckets. The abstraction is implemented as an extra workload on the virtualized hardware of the network element, and a broker (similar to the brokers for hybrid cloud implementations) is needed to find the IT resources, either in an explicit way or based on rules/policies or even analytics. A CloudWatch equivalent must also be implemented as part of the abstraction to give IT resource feedback to the broker.

This means that the result of network abstraction built on SDN is a massively distributed data center with very interesting IT resources: edge IT resources. Of course the network-based IT resources will have constraints that make them different from the data-centre-based IT resources, but having different kinds of IT resources is not a problem for developers (e.g. AWS has more than 17 EC2 instance types).

The network-based IT resources, and more specifically the edge IT resources, are game changing because:
  • Only a network operator can implement this type of edge IT resource.
  • We are now dealing with IT resources that are 1 ms away from the devices, which reduces the need to run workloads on the device itself while still maintaining a very responsive user experience, and thereby avoids unnecessary power and bandwidth consumption.
  • They offload from the core network of IT blades the small workloads that are created to pre-empt activities performed on mobile devices. These small workloads carry a similar amount of overhead as the large ones, and with the proliferation of mobile devices (e.g. Kindle for Amazon) the ratio of "total overhead" to "total effective workload" increases significantly, reducing the value generated by the core network of IT resources. This is similar to the header-versus-payload ratio problem well known to network providers (a back-of-the-envelope illustration follows this list).
  • Network operators are no longer just canal handlers but part of the IT supply chain, which increases their visibility into the high-value activities at the software service level.
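To make the overhead argument concrete, here is a back-of-the-envelope calculation; the millisecond figures are invented purely for illustration and are not measurements of any real platform.

```python
# Illustrative only: the fixed per-workload overhead (scheduling, VM/container start,
# I/O setup) is roughly the same whether the effective work is large or tiny.
overhead_ms = 50      # assumed fixed cost per workload
large_job_ms = 5000   # effective work of a typical back-end workload
small_job_ms = 20     # effective work of a small device pre-emption workload

print(overhead_ms / large_job_ms)   # 0.01 -> overhead is negligible in the core
print(overhead_ms / small_job_ms)   # 2.5  -> overhead dominates as small jobs proliferate
```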

From an infrastructure perspective, developers will see a continuum of IT resources from the back end to the device, including the network, instead of having to deal with the network as an IT no man’s land, which today has a very negative impact, in particular on mobile and M2M solutions.

The way software services need to be developed in order to fully benefit from the continuum of IT resources should actually not be a problem for cloud developers, since they are already handling the following:
  • APIs: developers are already familiar with consuming APIs (internal or third party) in order to develop their solutions, and therefore they know how to build solutions based on loosely coupled software elements.
  • Elasticity: cloud developers are already aware of the complexity/benefits of using elasticity when dealing with a cloud infrastructure. There appear to be different levels of sophistication in elasticity (a sketch follows this list):
    • Level 1: requesting or releasing IT resources to match the load (available services).
    • Level 2: using elasticity to handle IT resource failures while still coping with the load (resilient services).
    • Level 3: broadening the criteria that trigger the elasticity process (dynamic services).
    • Level 4: relying on a hybrid cloud broker to obtain IT resources in different parts of the accessible continuum of IT resources (nomadic services).
    • Level 5: analytics-based criteria to trigger the elasticity process and an analytics-based broker to obtain the IT resources (liquid services).
    • Many cloud developers are already past level 3.
  • API discovery: when software elements move (e.g. a software element following roaming devices), an explicit or implicit DNS-like system pattern may be needed to discover where the new API endpoint is. This is similar to what has been developed and deployed with Apigee API Exchange.
  • Resource discovery: data access also needs to be handled appropriately in order to make sure that the data can follow the software elements. An XRD-based index can be used for that. This may not completely solve the data consistency problem, but developers are already familiar with dealing with inconsistent or only eventually consistent data as long as the level of consistency is known.
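As a minimal sketch of the lower elasticity levels, the toy controller below implements level 1 (scale to load) and level 2 (replace failed instances). The thresholds, metric names and the provision/release callables are assumptions made for illustration, not part of any particular cloud API.

```python
# Toy elasticity controller covering roughly levels 1-2 of the list above.
def elasticity_step(metrics: dict, instances: int, provision, release) -> int:
    # Level 2: replace failed instances first so the service stays resilient.
    failed = metrics.get("failed_instances", 0)
    if failed:
        provision(failed)
        instances += failed
    # Level 1: match capacity to load using a simple CPU threshold.
    cpu = metrics.get("avg_cpu", 0.0)
    if cpu > 0.80:
        provision(1)
        instances += 1
    elif cpu < 0.20 and instances > 1:
        release(1)
        instances -= 1
    # Levels 3-5 would widen the criteria (business events, broker-based placement,
    # analytics-driven prediction) rather than change this basic loop.
    return instances

# Example tick with fake monitoring data and no-op provisioning callbacks.
count = elasticity_step({"avg_cpu": 0.91, "failed_instances": 1},
                        instances=3,
                        provision=lambda n: None,
                        release=lambda n: None)
print(count)  # 5
```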
Today, because of the IT no man’s land, canals have to be built through the network in order to let the liquid services move closer to the devices. Making the network (operator/enterprise) a massively distributed data center, as an abstraction consequence of implementing SDN, will be like opening the levees and letting existing and future liquid services expand up to an edge that is very close (1 ms) to the devices, and therefore to the user, providing the human-touch-like user experiences advertised for 5G networks. This may imply that 5G will actually not be delivered only by improving the radio/network, but also by a completely different IT approach and a maturing of distributed software towards liquid services.

SDN and Network abstraction

SDN and network abstraction is a very interesting topic, since it will have the same effect that Amazon Web Services had on IT virtualization.

I believe that network virtualization is clearly on its way and is gaining the maturity needed for wider adoption; the acquisition of Nicira by VMware is a good example of that.

For network abstraction it will be tricky to define the equivalent of the AWS specifications for cloud computing, but we can try by first decomposing the different resources that network virtualization implements. Three different aspects of the network can guide this decomposition (a small sketch follows the list):
  • Network as a set of computing resources: roughly the same resources as those exposed by the Amazon specifications (computing, storage, queuing, ...), but with different attributes: more distributed, lower latency, limited quotas, etc. Using base station computing to recreate a massively distributed EC2 implementation will have a clear effect on latency to mobile devices.
  • Network as a facilitator for accessing/supporting cloud computing resources: how current cloud solutions can use the network to work better. E2E QoS (including control of bandwidth and latency) and connection (a mix of network and OTT approaches to handling push/pull events to and from devices) are the types of resources that need to be exposed. Better cloud-service-to-cloud-service interactions on one side, and better power management on devices on the other, are two aspects that will be impacted by the availability of such resources.
  • Network as specific resources: the more classic view of what the network has to offer. Communication intent (between entities: users/devices, ...), identity and metering/rating are the types of resources the network is used, or should be used, to expose. I am deliberately saying communication intent instead of messaging or call control (see previous posts...).
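To make the decomposition concrete, here is a tiny hypothetical catalogue of the three aspects; the names and example resources are assumptions meant only to illustrate the grouping, not a proposed specification.

```python
# Hypothetical catalogue of the three decomposition aspects discussed above.
from enum import Enum

class NetworkAspect(Enum):
    COMPUTING_RESOURCES = "network as a set of computing resources"
    CLOUD_FACILITATOR = "network as a facilitator of cloud resources"
    SPECIFIC_RESOURCES = "network as specific resources"

CATALOGUE = {
    NetworkAspect.COMPUTING_RESOURCES: ["edge compute", "edge storage", "edge queuing"],
    NetworkAspect.CLOUD_FACILITATOR: ["E2E QoS (bandwidth/latency)", "push/pull connection"],
    NetworkAspect.SPECIFIC_RESOURCES: ["communication intent", "identity", "metering/rating"],
}

for aspect, resources in CATALOGUE.items():
    print(f"{aspect.value}: {', '.join(resources)}")
```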

The combination of these three aspects will finally allow us to really implement device as a service, and not only for user-to-service-via-device interactions (mobile or not), but also for device-to-device and device-to-service interactions (two different forms of M2M).

Now, what we need to avoid is confusion between network virtualization and network abstraction (cloud), since they are two different approaches to the situation. Virtualization is generally a bottom-up approach, while abstraction (cloud) is a top-down approach. Both need to work together to achieve the same goal...