DPDM Architecture

From the Assothink Wiki

DPDM among three possible architectures

DPDM stands for "Distributed Processing/Dispatching/Monitoring".

The 3 possible architectures are:

  • Java emulation of Assothink on 1 machine, 1 process, multiple threads. This is the first tested architecture of Assothink; it has run since 2010, on the Matscape machines.
  • The dream: a dedicated set of millions of microchips, each working autonomously. This might be proposed to a company producing wide-scale IC (integrated circuit) products. It would be the most accomplished and best-performing version of Assothink.
  • Intermediate: DPDM. Many machines connected in a TCP/IP network, with one or several small processes per machine, as a Java emulation.

It is important to note that the passive jelly construction has nothing to do with these 3 architectures: the construction of the passive jelly is a prior step.

Why DPDM?

The idea of DPDM comes from the new availability of very small and very cheap (less than $50) Linux computers. These small computers are the size of a credit card or less. They mainly have a network interface and some USB ports. They run Ubuntu and have 0.5 GB of RAM. (This is written in October 2012!)

It then becomes possible to organize .. 8 ... 256 ... connected small machines to emulate Assothink at a reasonable price.

Besides the very small computers, there is also the possibility of using obsolete cheap computers with Linux & a JVM installed. However, this would have 2 drawbacks: (1) heterogeneity in the network; (2) power consumption possibly significantly more expensive than the hardware itself.

Conclusion: DPDM offers reasonable pricing and scaling capabilities as an intermediate solution.

DPDM components

The DPDM components are

  • 1 ADU (associative dispatching unit)
  • n (... 8 ... 256 ...) APUs (associative processing units)
  • 0 or more AMUs (associative monitoring units)

All components are able to send messages through a message delivery system (MDS).

MDS (message delivery system)

Instead of the expected request-answer scheme, a message delivery system was chosen. Request-answer exchanges are avoided because in most cases the time required to produce the answer would leave the requester blocked and waiting, or force it to use multi-threading to handle all the concurrent dialogs.

In practice the MDS is realized this way:

  • all APUs and the ADU are socket servers (each has a thread dedicated to this)
  • all answers given by the socket servers are immediate and minimal
  • the full answer formally arrives in a delayed message made available (and sent) later
  • message transmission is asynchronous

A specific class, netdaemon.java (the Assothink network daemon), is the parent of the ADU and APU classes. Instances of this class have 2 threads, an abstract run method, a message input queue, etc. The listening thread reads and processes incoming messages, and uses a ServerSocket. The writing thread performs most computations and sends messages using a client Socket. Most of the technical documentation is available in the java.net package documentation (http://docs.oracle.com/javase/7/docs/api/java/net/package-summary.html).
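The two-thread pattern described above might look like the following sketch (class and method names are hypothetical; the actual netdaemon.java sources are not reproduced here):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch of the netdaemon two-thread pattern: a listening
// thread sends a minimal immediate answer and queues the message; a worker
// thread processes messages later and would send full answers as a client.
abstract class NetDaemonSketch {
    private final BlockingQueue<String> inbox = new LinkedBlockingQueue<>();
    private final int port;

    NetDaemonSketch(int port) { this.port = port; }

    // Listening thread, built on a ServerSocket.
    void startListener() {
        Thread t = new Thread(() -> {
            try (ServerSocket server = new ServerSocket(port)) {
                while (true) {
                    try (Socket s = server.accept();
                         BufferedReader in = new BufferedReader(
                                 new InputStreamReader(s.getInputStream()));
                         PrintWriter out = new PrintWriter(s.getOutputStream(), true)) {
                        String msg = in.readLine();
                        out.println("ACK"); // immediate, minimal answer
                        if (msg != null) inbox.put(msg);
                    }
                }
            } catch (Exception e) { e.printStackTrace(); }
        }, "listener");
        t.setDaemon(true);
        t.start();
    }

    // Worker thread: computations and delayed full answers (via client Sockets).
    void startWorker() {
        Thread t = new Thread(() -> {
            try {
                while (true) process(inbox.take());
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }, "worker");
        t.setDaemon(true);
        t.start();
    }

    abstract void process(String message);
}
```

The key design point is the same as in the MDS description: the socket server never computes while a peer is waiting; it acknowledges immediately and does the real work on the other thread.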

ADU

The ADU receives, sends, and dispatches excitation signals between all the APUs.

It does not compute much.

It keeps in memory a copy of

  • the signals between nodes
  • the excitation levels
  • the list of APUs (network address, node set)
  • the global variables

The associative dispatching unit requires

  • significant memory
  • very efficient network I/O
  • average CPU capability

APU

The APU computes the excitation levels of its nodes, and receives and sends excitation signals from and to the ADU.

The APU requires

  • some memory, proportional to the number of nodes (maybe not so much)
  • efficient socket networking as a server
  • intensive CPU resources: integer and floating-point computations as fast as possible

The APU process runs 2 threads.

  • The listening thread receives and processes all messages.
  • The computation thread actively updates excitation levels and sends excitation output. 

The messages received by the APUs are various:

  • input excitation
  • permeability updates (possibly)
  • directives 
  • a set of monitoring data requests to be reported (speed measures, memory measures, cycle counts...)

The messages sent by the APUs include

  • output excitation: the excitation signals to be propagated
  • the monitor data to be delivered according to the request

The typical directives are

  • halt (force)
  • restart (force)
  • hibernate (force)
  • resume from hibernation (force)
  • updates of technical parameters interactively set in the UI of the AMU (for instance the number of computing cycles to process between message sendings, smoothing constants, etc.)
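As an illustration, the directive set could be encoded as a simple enum (a sketch; the names and the wire format are assumptions, not the actual Assothink message protocol):

```java
// Hypothetical encoding of the APU directives as an enum. The wire format
// "DIRECTIVE:<name>" is an illustrative assumption.
enum Directive {
    HALT, RESTART, HIBERNATE, RESUME;

    // Parse a directive from a message line such as "DIRECTIVE:HALT".
    static Directive parse(String message) {
        return valueOf(message.substring(message.indexOf(':') + 1).trim());
    }
}
```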

The only message partner of any APU is the ADU.

The APU is I/O oriented, much more than computation oriented!

AMU

The AMU is the only interactive component of the architecture.

Its main purpose is the graphical display of the DPDM Assothink components.

The AMU sends messages to, and receives messages from, the ADU.

It is also used to

  • start, stop, and configure the parameters
  • view excitation levels and signals
  • monitor all kinds of stats for the DPDM components
  • create specific excitation inputs
  • update permeability figures

Overview of DPDM component relationships

To summarize the DPDM relationships:

  • APUs and the ADU are required daemons
  • AMUs are optional, intermittent interactive programs
  • APUs and the ADU act as servers for the MDS
  • the ADU speaks to all APUs
  • an APU only speaks to the ADU
  • the AMU has a user interface and dialogs with the ADU

Hardware

All computers are on the same LAN, preferably on the same switch.

The ADU and the AMU(s) run on standard computers. The ADU should receive significant resources (network I/O and memory).

The ADU machine also runs an Apache server.

The AMUs run within browsers (GWT app).

The numerous APUs run on very small computers without screens, accessible through remote shells for administration, and through sockets. Regarding software, these machines just need a Linux OS, a JVM, network connectivity, and a mount of a shared disk hosted on the ADU (the DPDM shared disk).

Network bandwidth analysis

Two options have been considered for the transmission of excitation:

  • during any cycle, every APU sends excitation levels to all other APUs (all-to-all). With this option the number of packets sent per cycle is roughly Nu².
  • during any cycle, every APU receives excitation levels from the ADU and sends excitation levels to the ADU (all-to-1-to-all). With this option the number of packets sent per cycle is roughly 2 × Nu.

The second option has been selected. It is probably slower for small installations, but as Nu increases it performs better (considering also that the per-unit computing time decreases inversely with Nu).
But a critical point with all-to-1-to-all is: when should the ADU send the excitation data back to the APUs? The answer is complex and depends on many factors (Nu, Nn, unit CPU capacity, switch and network capacity...). It will be settled later. In the meantime, a first answer is: as quickly and as frequently as possible, as long as the network bandwidth is not saturated.
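The packet-count trade-off between the two options can be checked with a couple of lines (the formulas are the ones given above; note the two options break even at Nu = 3):

```java
// Packets per cycle under the two dispatch options discussed above.
class PacketCount {
    // all-to-all: every APU sends to every other APU (roughly Nu² packets).
    static long allToAll(long nu) { return nu * (nu - 1); }

    // all-to-1-to-all: every APU sends to the ADU, the ADU sends back to each.
    static long allToOneToAll(long nu) { return 2 * nu; }
}
```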

Network controller requirement

The network controller of the APUs and the ADU is critical.
A 100 Mbit controller is not enough (it sends and receives 100 Mbit/sec, i.e. 100 Kbit/msec, or about 12 KB/msec).
A 1 Gbit controller is good (it sends and receives 1000 Mbit/sec, i.e. 1000 Kbit/msec, or about 120 KB/msec).

Power consumption

The power consumption of a typical APU (Raspberry Pi model B) is 3.5 watts. With a price of electricity at 0.20 €/kWh, the daily cost of an Assothink DPDM set (APU part) is:
3.5 × 24 × Nu × 0.20 / 1000 €/day = 0.0168 × Nu €/day
[As a side consideration, note that for a device priced at, say, $35, the cumulative power cost equals the investment cost after 35 / 0.0168 days, thus around 6 years.]
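The cost formula above can be written as a small helper (a sketch; the parameter names are illustrative):

```java
// Daily electricity cost of the APU set, following the formula above:
// watts * 24 h * Nu * price-per-kWh / 1000.
class ApuPowerCost {
    static double euroPerDay(int nu, double wattsPerApu, double euroPerKwh) {
        return wattsPerApu * 24 * nu * euroPerKwh / 1000.0;
    }
}
```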

APU memory requirement

We assume that:

  • the emulator should handle Nn Assothink concept nodes
  • the DPDM structure includes Nu APU computers
  • there is an average of Nln links per node (Nln is close to ... 30 ...)
  • the process requires much more memory for the nodes than for the code and the working variables (code + working variables < 10 MB)
  • the OS consumption is below 20 MB

The number of nodes handled by one APU is Nnu = Nn / Nu.
The input memory (excitations) requires 4 bytes per local node, thus 4 × Nnu bytes.
The excitation state also requires 4 bytes per local node, thus 4 × Nnu bytes.
The output channels require 8 bytes per link, thus 8 × Nln × Nnu bytes.
The output memory (outgoing excitation levels) requires 4 bytes per global node, thus 4 × Nn bytes.

So globally the required memory (in bytes) is M = 3×10⁷ + 8 × Nnu + 8 × Nln × Nnu + 4 × Nn.
Taking 30 as the value of Nln, this becomes M = 3×10⁷ + 248 × Nnu + 4 × Nn,
or equivalently
M = 3×10⁷ + (248 / Nu + 4) × Nn
or
M = 3×10⁷ + (248 + 4 × Nu) × Nnu.

For a small installation (Nn = 10⁶, Nu = 10, Nnu = 10⁵), the memory consumption is around 60 MB.
For an average installation (Nn = 10⁷, Nu = 10², Nnu = 10⁵), it is around 100 MB.
For a wider installation, the last term is critical: if Nn reaches 10⁸, the memory consumption exceeds 450 MB, for any number of APUs.
So in practice 512 MB should be comfortable up to 10⁷ nodes, and the memory size of the available small computers is OK.
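The memory estimate can be coded directly from the derivation above (a sketch; the 3×10⁷ constant bundles the OS and code/working-variable budgets):

```java
// Per-APU memory estimate from the derivation above:
// M = 3e7 (OS + code) + 8*Nnu + 8*Nln*Nnu + 4*Nn bytes.
class ApuMemory {
    static long requiredBytes(long nn, long nu, long nln) {
        long nnu = nn / nu;                 // nodes handled by one APU
        return 30_000_000L                  // OS (< 20 MB) + code and vars (< 10 MB)
             + 8 * nnu                      // input buffer + excitation state
             + 8 * nln * nnu                // output channels (links)
             + 4 * nn;                      // outgoing excitation, per global node
    }
}
```

Evaluating it at the "small" and "average" configurations reproduces the ~60 MB and ~100 MB figures quoted above.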

Disks

The APUs do not need significant disk usage. The ADU needs at least the /DPDM disk.

Shared disk /DPDM

The shared disk (physically on the ADU, mounted on the APUs) contains at least:

  • /DPDM/class : all class files
  • /DPDM/wake : wake-up, wake-down, and pid files
  • /DPDM/cfg : the DPDM config file
  • /DPDM/pkFile : the pk-formatted Assothink data
  • /DPDM/pkWorks ...
  • /DPDM/trace ...

APUs: suggested hardware

A good candidate (but neither the only nor the best one) is the Raspberry Pi model B:

  • Linux, but NOT Ubuntu
  • 512 MB RAM, not more
  • 100 Mbit Ethernet (NO gigabit!)
  • various JVMs available, the best from Oracle
  • $35 (!)

More details on the Raspberry Pi are provided by Wikipedia.

Alternatives: BeagleBoard, BeagleBone, PandaBoard... (generally more expensive).

Wake-up daemon

The Java classes used by the APUs are present on the DPDM disk.

The APUs permanently run a wake-up daemon, which performs quite simple tasks (every second):

  • check whether the local APU process runs (with a minimal alive request)
  • check the DPDM disk for the presence of a wake-up signal file (/DPDM/wake/<host>.up) or a wake-down signal file (/DPDM/wake/<host>.down). Exactly one of the 2 files should exist at any time; they are created and deleted by the ADU.
  • in case of a discrepancy between the APU status and the file status, launch the APU or kill it.

(The need to separate the APU from the wake-up daemon comes from the fact that in development mode the Java classes are frequently updated, and the updated APU daemon must frequently restart with the modified sources.)

When the APU starts, it creates a pid file (/DPDM/wake/<host>.pid) on the shared disk to announce its PID. This is necessary to allow the wake-up daemon to kill it when needed.

The wake up daemon should be designed to consume minimal resources.
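One pass of the daemon's per-second check could be sketched as follows (the decision logic follows the description above; the actual launch and kill commands are installation-specific and therefore omitted):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Sketch of one pass of the wake-up daemon's check, run once per second.
// File names follow the /DPDM/wake layout described above.
class WakeCheck {
    // Decide the action from the APU state and the two signal files.
    static String decide(boolean apuRunning, boolean upFileExists, boolean downFileExists) {
        if (upFileExists && !apuRunning) return "launch";
        if (downFileExists && apuRunning) return "kill";
        return "nothing";
    }

    // Same decision, reading the actual signal files for a given host.
    static String decideFor(String host, boolean apuRunning) {
        Path wake = Paths.get("/DPDM/wake");
        return decide(apuRunning,
                Files.exists(wake.resolve(host + ".up")),
                Files.exists(wake.resolve(host + ".down")));
    }
}
```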

DPDM config file

This file contains a set of Java properties.

The most important properties are:

  • the ADU host name & IP address
  • the set of APU host names
  • the node range of each APU
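Since the config file is a set of Java properties, loading it is straightforward with java.util.Properties; in this sketch the property keys (dpdm.adu.host, dpdm.apu.hosts) are assumptions, as the actual key names are not specified:

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Sketch of reading the DPDM config file as java.util.Properties.
// The key names are hypothetical.
class DpdmConfig {
    final String aduHost;
    final String[] apuHosts;

    DpdmConfig(Properties p) {
        aduHost = p.getProperty("dpdm.adu.host");
        apuHosts = p.getProperty("dpdm.apu.hosts", "").split(",");
    }

    // Convenience parser for a config given as text (e.g. read from /DPDM/cfg).
    static DpdmConfig fromString(String text) throws IOException {
        Properties p = new Properties();
        p.load(new StringReader(text));
        return new DpdmConfig(p);
    }
}
```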

DPDM cycle speed

A DPDM cycle consists of several computations performed on the APUs, and numerous IP packets sent between the various machines working together.

During one cycle, more than 2 × Nu data packets are transmitted.

The DPDM architecture is designed to achieve improved efficiency, i.e. 1 cycle per msec (1000 cycles per second) with 10 APUs, and hopefully .. 5 .. 10 .. cycles per msec (up to 10 000 cycles per second) with 100 APUs.

This is to be compared with the basic architecture, which delivers (after string optimization) a computation cycle of around ... 10 ... msec.

But this question is critical: would network bandwidth and capabilities be a limiting factor for the DPDM architecture (more than the APUs' CPU speed)?

It is assumed here that the LAN is able to propagate 2 × Nu packets per msec. For 100 APUs, it should thus handle 200 000 packets per second. See for instance the Gigabit controller performance figures at http://wiki.networksecuritytoolkit.org/nstwiki/index.php/LAN_Ethernet_Maximum_Rates,_Generation,_Capturing_%26_Monitoring (this is still to be explored!).

If network bandwidth becomes a limiting factor, several optimizations might be considered:

  • organize the APUs into K subsets, with K switches and K network controllers on the ADU (traffic division)
  • rework and optimize the emission process of the APUs to send fewer (but bigger) packets; this in turn has a negative effect on signal propagation speed.

Network : packet number and packet size analysis

The key question is: is it necessary to use APUs with gigabit Ethernet ports?

The table below summarizes the figures (symbols as defined above).

| Quantity | Symbol / formula | Typical target | Remark |
|---|---|---|---|
| Number of nodes | Nn | 80 000 | Later 10 times more? |
| Number of APUs | Nu | 40 | Depends on unit cost and CPU performance |
| Frequency | F | 1000 Hz | 1 msec / cycle |
| Active node ratio | Ar | 0.005 | Active (excited) node number / total node number |
| Active node total number | Nn × Ar | 400 | |
| Nodes per APU | Nnu = Nn / Nu | 2000 | |
| Links per node | Nln | 20 | Average value |
| Memory: signal-out buffer | 4 × Nn bytes | 320 KB | 1 integer per target |
| Memory: signal-in buffer | 4 × Nnu bytes | 8 KB | 1 integer per local node |
| Active nodes / APU | Nau = Nnu × Ar | 10 | |
| Signals out generated / cycle | Sout = Nau × Nln | 200 | Actually less (redundancy is possible) |
| Bytes / signal | | 4 | 32 bits (20 to identify the target, 10 to specify signal strength) |
| Bytes-signal-out / cycle | Bout = Sout × 4 | 0.8 KB | |
| Packets-out (on 1 APU) / sec | F | 1000 / sec | |
| Bytes-out (on 1 APU) / sec | Bsec = F × Bout | 0.8 MB (6.4 Mbit) | of data |
| Bytes-in (on 1 APU) / sec | 4 × Nnu × F × Ar | 0.016 MB (0.128 Mbit) | Maybe optimistic? |
| Total data bandwidth (per APU) | Bapu | < 8 Mbit | Should be covered by a 100 Mbit connection for the APUs (but data framing affects performance) |
| Total data bandwidth (per ADU) | Badu = Bapu × Nu | ≈ 100 Mbit | |
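The per-APU outgoing figures in the table can be recomputed from the symbols above (a sketch of the arithmetic, not production code):

```java
// Per-APU outgoing bandwidth from the table above:
// per cycle, each APU emits Nau * Nln signals of bytesPerSignal bytes,
// and there are F cycles per second.
class ApuBandwidth {
    static long bytesOutPerSecond(long nau, long nln, long bytesPerSignal, long hz) {
        return nau * nln * bytesPerSignal * hz;
    }
}
```

With Nau = 10, Nln = 20, 4 bytes per signal, and F = 1000 Hz this gives 0.8 MB/s, i.e. 6.4 Mbit/s, matching the table.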

Limits

The DPDM architecture works within a LAN. Using distant computers through the web would produce poor results, because the transmission delay would be too long (a cycle lasting less than 1 msec). However, the AMU may work through the web.

Cost analysis

The goal of the DPDM architecture is to deploy Assothink in a cost-effective way.

  • With the basic architecture, one full computer ($1000) delivers around 100 cycles per second. And besides the computing time, there is UI drawing time (cf. the performance figures in the browser page). The price ratio is $10 per Hz.
  • With the DPDM architecture, 10 APUs (10 × $35 + $500 for switches and monitoring equipment) would provide at least 1000 cycles per second, and possibly 10 000 cycles per second with 100 APUs. This is based on (a) a slower individual CPU, but also (b) a smaller number of nodes to process per unit. The price ratio is less than $1 per Hz, thus 10 times more cost-effective than the basic architecture.

In conclusion, if the cost analysis and the technical analysis are correct, DPDM certainly deserves development effort.

Development

PG assumes that 2 months of intensive work would produce the conversion from the basic architecture to the DPDM architecture. That is costly!

APU software summary

The APU software is written in Java (or in C???).

The APU software includes 2 threads.

One thread is mainly a socket server (built on ServerSocket), answering various kinds of requests (see above) and idle most of the time.

The other thread is CPU intensive and works on excitations and signals. 

The APU memory mainly contains Nnu nodes (mainly an excitation level each), Nnu × Nln half-links, Nn output values, and Nnu input values; also various parameters, and various variables to be reported to the ADU and AMU. Realistic values are computed above.

During the initialization phase, the link structure is loaded, and possibly previously saved excitation states.

The computing cycle performs simple tasks:

  • injection of the input signals into the excitations (the signals are then reset to 0)
  • computation of the output signal values
  • sending (as a socket client) of the output messages
  • reading of the socket answers, loading of the input signals
  • excitation decrease (exponential decay)

The code is very simple and very small. Probably the I/O consumes much more time than the computations (about Nln × Nnu × Ar {multiplication, addition} operations per cycle, thus much less than a msec).
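The steps of the cycle can be sketched on local node state as follows (the message I/O steps are omitted; the array layout and the decay constant are illustrative assumptions):

```java
// Sketch of the APU computing cycle on local node state, following the
// five steps listed above. Steps 2-4 (output computation and socket I/O)
// are omitted; sizes and the decay factor are illustrative.
class ApuCycleSketch {
    final double[] excitation;   // one excitation level per local node
    final double[] inputSignals; // incoming signals, reset after injection
    final double decay;          // exponential decay factor per cycle

    ApuCycleSketch(int nnu, double decay) {
        excitation = new double[nnu];
        inputSignals = new double[nnu];
        this.decay = decay;
    }

    void cycle() {
        // 1. inject the input signals into the excitations, then reset them to 0
        for (int i = 0; i < excitation.length; i++) {
            excitation[i] += inputSignals[i];
            inputSignals[i] = 0;
        }
        // 2-4. computing, sending, and reading output signals would happen here
        // 5. exponential decrease of the excitation levels
        for (int i = 0; i < excitation.length; i++) excitation[i] *= decay;
    }
}
```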

Interesting Links

http://www.southampton.ac.uk/~sjc/raspberrypi/pi_supercomputer_southampton.htm

http://westcoastlabs.blogspot.co.uk/2012/06/parallel-processing-on-pi-bramble.html

http://en.wikipedia.org/wiki/Raspberry_pi