Prototypes for the ETL system
=============================
The prototype proved that it is quite easy and fast to develop a complete ETL.
A simple connector takes about 15 lines of code to develop. The whole prototype,
including the framework and the connectors, takes 150 lines, including the
following connectors:

* CSV input, CSV output, data logger, data logger in one block
The concept used by the prototype:

* Channel: the name of the transition a connector uses to read or write.
  For example, the diff connector reads from two channels (original, modified)
  and writes to four channels: same, added, removed, updated.
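The channel idea can be sketched as follows. This is an illustrative
reimplementation of what a diff connector might do, not the prototype's actual
code; the function name, the `key` parameter, and the dict-based records are
all assumptions.

```python
def diff(original, modified, key='id'):
    """Compare two record lists and route each record to one of the
    four output channels named in the note above."""
    orig = {r[key]: r for r in original}
    mod = {r[key]: r for r in modified}
    out = {'same': [], 'added': [], 'removed': [], 'updated': []}
    for k, rec in mod.items():
        if k not in orig:
            out['added'].append(rec)       # new record
        elif rec == orig[k]:
            out['same'].append(rec)        # unchanged record
        else:
            out['updated'].append(rec)     # same key, new values
    for k, rec in orig.items():
        if k not in mod:
            out['removed'].append(rec)     # gone from the modified set
    return out
```

Each key of the returned dict corresponds to one output channel a downstream
node could read from.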
Summary of improvements to apply
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
* Don't put the logic in the node but in the job execution: it is the job
  that should schedule the node function calls, not the node itself.
* Add a system for triggering and listening to events. Some events can be
  raised by default: start, stop, end, error. Others can be user defined.
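A minimal sketch of such a trigger/listen mechanism, assuming a mixin that
nodes could inherit from; the class and method names are illustrative, not part
of the prototype.

```python
class EventMixin:
    """Hypothetical event support for a node: listeners register
    callbacks per event name, trigger() fires them in order."""

    DEFAULT_EVENTS = ('start', 'stop', 'end', 'error')

    def __init__(self):
        self._listeners = {}

    def listen(self, event, callback):
        self._listeners.setdefault(event, []).append(callback)

    def trigger(self, event, *args):
        # User-defined events work exactly like the default ones:
        # any name can be triggered, listeners for it are called.
        for callback in self._listeners.get(event, []):
            callback(self, *args)
```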
* Add different kinds of transitions. We implemented the data transition; we
  still have to add the event/signal transition.
* Use a class to store the data, not a simple dict. We should be able to
  attach meta information to this class, so that the information can change
  from one node to another. The class should also store the metadata of the
  current information (the list of fields and their types).
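One possible shape for such a class, carrying the payload, the field
definitions, and free-form meta information together; the name `DataRecord`
and its attributes are assumptions for illustration only.

```python
class DataRecord:
    """Hypothetical replacement for the plain dict: the values plus
    the metadata (fields and their types) and per-record annotations."""

    def __init__(self, data, fields=None, meta=None):
        self.data = data              # the actual values, e.g. a dict
        self.fields = fields or {}    # field name -> type
        self.meta = meta or {}        # info nodes can attach along the way

    def annotate(self, key, value):
        """Attach meta information that travels with the record
        from one node to another."""
        self.meta[key] = value
        return self
```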
* Currently, the same data structure is passed to all channels. This is
  efficient, but when we split from one node to two, we have two pointers to
  the same data, so if one node changes the data, it also changes in the split
  branch. On the transition, we should be able to decide whether or not to
  copy() the data.
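The aliasing problem and the proposed transition-level switch can be shown in
a few lines; the `Transition` class here is a sketch under that assumption,
not the prototype's implementation.

```python
import copy

class Transition:
    """Hypothetical transition that decides whether the receiving
    branch gets its own copy of the data or a shared pointer."""

    def __init__(self, do_copy=False):
        self.do_copy = do_copy

    def pass_data(self, data):
        return copy.deepcopy(data) if self.do_copy else data
```

With `do_copy=False` a mutation in one branch is visible in the other; with
`do_copy=True` the branches are isolated, at the cost of the copy.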
* Change the run execution on jobs and nodes so that it processes one element
  at a time and not the complete flow of elements. The job execution can then
  decide to stop running, run one element at a time for tracing, or run until
  it is finished.
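Python generators are a natural fit for this: the node exposes its work as an
iterator and the job decides how far to pull it. The function names and the
`limit` parameter below are illustrative assumptions.

```python
def csv_like_source(rows):
    """Stand-in for a node's run(): yield one element at a time
    instead of returning the complete flow."""
    for row in rows:
        yield row

def run_job(node_iter, limit=None):
    """Job-level loop: the job, not the node, decides whether to
    stop, step for tracing, or drain the iterator to the end."""
    processed = []
    for i, element in enumerate(node_iter):
        processed.append(element)
        if limit is not None and i + 1 >= limit:
            break  # pause here; the generator keeps its position
    return processed
```

Because the generator keeps its position, a paused job can later resume from
exactly where it stopped.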
* If a node sends empty data to its output, the empty data does not go to the
  input of the related nodes. This gives us a way to manage loops and
  recursion.
* Create a new node type which is a sub-job or sub-process. It calls a new
  process.
* We implemented a push mechanism; we should also add a pull mechanism: a node
  can request information from another node, and then receive the requested
  result.
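The contrast between the two mechanisms can be sketched like this; the
`Source`/`Sink` classes and their method names are hypothetical, chosen only
to illustrate push versus pull.

```python
class Source:
    def __init__(self, records):
        self.records = records
        self.downstream = []

    # Push: the source drives execution and feeds the related nodes.
    def push_all(self):
        for rec in self.records:
            for node in self.downstream:
                node.receive(rec)

    # Pull: a downstream node requests records and gets the result back.
    def serve(self, predicate):
        return [r for r in self.records if predicate(r)]

class Sink:
    def __init__(self):
        self.seen = []

    def receive(self, rec):
        self.seen.append(rec)
```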
* What's the best solution to store the data? A list of dicts is easy but may
  take some space in memory. Is it better to use a list of lists? We should
  evaluate the difference in memory occupation. If it's less than 50%, we keep
  the list of dicts.
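One rough way to run that evaluation, using `sys.getsizeof` applied
recursively. The exact numbers vary by Python version and the helper ignores
shared objects, so this only illustrates how the comparison could be made; the
sample field names are made up.

```python
import sys

def deep_size(obj):
    """Approximate deep size in bytes (no shared-object accounting)."""
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_size(k) + deep_size(v) for k, v in obj.items())
    elif isinstance(obj, (list, tuple)):
        size += sum(deep_size(x) for x in obj)
    return size

# Same 1000 records stored both ways, with illustrative fields:
rows_as_dicts = [{'id': i, 'name': 'p%d' % i} for i in range(1000)]
rows_as_lists = [[i, 'p%d' % i] for i in range(1000)]
ratio = deep_size(rows_as_lists) / deep_size(rows_as_dicts)
```

The list-of-lists form drops the per-record dict overhead and the repeated
keys, which is exactly the saving the note asks to measure.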
1. Finish requirements
* List of components to develop

* Implement all objects/menus/views in Open ERP
* Improve the current prototype, integrating the above notes
3. Development of the Open ERP interface

* eTiny: generalisation of the workflow editor to create a new type of view
* etl addons: Integrate prototype logic on Open ERP objects

4. Develop real use cases

* SugarCRM -> Open ERP
First, the job process calls run on each starting node.

The start calls start on all related nodes through outgoing transitions.

Then, run calls input with the data it reads from its sources (CSV, SQL).

The input processes the data and calls output with the resulting data, on a
dedicated channel. A channel of None means all channels.

The output calls input on the related nodes, passing the data through the
outgoing transitions and their channels.

The stop ends the process.

When all related nodes through incoming transitions have stopped, the current
node stops and propagates the stop call.
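The push call chain described above can be condensed into a sketch. All class
and method shapes here are illustrative assumptions; the point is the
run -> input -> output -> related-node chain, including the rule that empty
data does not propagate.

```python
class Node:
    def __init__(self, transform=None):
        self.transform = transform or (lambda d: d)
        self.outgoing = []       # related nodes through outgoing transitions
        self.collected = []

    def run(self, source):
        # run calls input with the data it reads from the source
        for data in source:
            self.input(data)
        self.stop()

    def input(self, data):
        result = self.transform(data)
        # Empty (None) data does not reach the related nodes,
        # which is what lets loops and recursion terminate.
        if result is not None:
            self.output(result)

    def output(self, data):
        self.collected.append(data)
        for node in self.outgoing:
            node.input(data)

    def stop(self):
        # Propagate the stop call downstream.
        for node in self.outgoing:
            node.stop()
```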
List of tests in the prototype
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
data/partner.csv -> sort(name) -> output/partner.csv
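As a rough idea of what this flow does, here is the same pipeline written with
plain functions standing in for the prototype's connectors. The function
names, the in-memory streams, and the sample data are all made up for the
example; the real test reads and writes the files named above.

```python
import csv
import io

def csv_input(stream):
    """Read CSV rows as dicts (stand-in for the CSV input connector)."""
    return list(csv.DictReader(stream))

def sort_node(rows, field):
    """sort(name) step: order the records by one field."""
    return sorted(rows, key=lambda r: r[field])

def csv_output(rows, stream):
    """Write the records back as CSV (stand-in for the CSV output)."""
    writer = csv.DictWriter(stream, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)

# data/partner.csv -> sort(name) -> output/partner.csv, using
# in-memory streams instead of real files:
src = io.StringIO("name,city\nZeta,Paris\nAcme,Brussels\n")
dst = io.StringIO()
csv_output(sort_node(csv_input(src), 'name'), dst)
```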
-> log('PartnerLogger')
-> output/partner.csv
-> log('OutputLogger')

-> log('PartnerLogger')
-> output/partner.csv
-> log('OutputLogger')
* Perform a diff between partner.csv and partner2.csv and store:

  - intermediate/add.csv : added records
  - intermediate/remove.csv : removed records
  - intermediate/update.csv : updated records
  - csv.output(intermediate/add.csv)
  - csv.output(intermediate/remove.csv)
  - csv.output(intermediate/update.csv)
* Apply on partner3.csv to produce output/partner3.csv:

  - add records from intermediate/add.csv
  - del records from intermediate/remove.csv (not yet implemented)
  - update records from intermediate/update.csv (not yet implemented)
-> merge(intermediate/add.csv)
-> filter(intermediate/remove.csv)
-> update(intermediate/update.csv)