Prototypes for the ETL system
=============================
The prototype proved that it is quite easy and fast to develop a complete ETL.
A simple connector takes about 15 lines of code to develop. The whole prototype,
including the framework and the connectors, takes 150 lines, including the
following connectors:

* CSV input, CSV output, data logger, data logger in one block
The concept used by the prototype:

* Channel: the name of the transition a connector uses to read or write.
  For example, the diff connector reads from two channels (original, modified)
  and writes to four channels: same, added, removed, updated.
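The channel idea can be sketched as follows. This is an illustrative
reimplementation of what a diff connector might do, not the prototype's actual
code; the function name, the `key` parameter, and the dict-based records are
all assumptions.

```python
def diff(original, modified, key='id'):
    """Compare two record lists and route each record to one of the
    four output channels named in the note above."""
    orig = {r[key]: r for r in original}
    mod = {r[key]: r for r in modified}
    out = {'same': [], 'added': [], 'removed': [], 'updated': []}
    for k, rec in mod.items():
        if k not in orig:
            out['added'].append(rec)       # new record
        elif rec == orig[k]:
            out['same'].append(rec)        # unchanged record
        else:
            out['updated'].append(rec)     # same key, new values
    for k, rec in orig.items():
        if k not in mod:
            out['removed'].append(rec)     # gone from the modified set
    return out
```

Each key of the returned dict corresponds to one output channel a downstream
node could read from.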
Summary of improvements to apply
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
* Don't put the logic in the node but in the job execution: it is the job
  that should schedule the node function calls, not the node itself.
* Add a system for triggering and listening to events. Some events can be
  raised by default: start, stop, end, error. Others can be user defined.
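A minimal sketch of such a trigger/listen mechanism, assuming a mixin that
nodes could inherit from; the class and method names are illustrative, not part
of the prototype.

```python
class EventMixin:
    """Hypothetical event support for a node: listeners register
    callbacks per event name, trigger() fires them in order."""

    DEFAULT_EVENTS = ('start', 'stop', 'end', 'error')

    def __init__(self):
        self._listeners = {}

    def listen(self, event, callback):
        self._listeners.setdefault(event, []).append(callback)

    def trigger(self, event, *args):
        # User-defined events work exactly like the default ones:
        # any name can be triggered, listeners for it are called.
        for callback in self._listeners.get(event, []):
            callback(self, *args)
```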
* Add different kinds of transitions. We implemented the data transition; we
  still have to add the event/signal transition.
* Use a class to store the data, not a simple dict. We should be able to
  attach meta information to this class, so that the information can change
  from one node to another. The class should also store the metadata of the
  current information (the list of fields and their types).
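One possible shape for such a class, carrying the payload, the field
definitions, and free-form meta information together; the name `DataRecord`
and its attributes are assumptions for illustration only.

```python
class DataRecord:
    """Hypothetical replacement for the plain dict: the values plus
    the metadata (fields and their types) and per-record annotations."""

    def __init__(self, data, fields=None, meta=None):
        self.data = data              # the actual values, e.g. a dict
        self.fields = fields or {}    # field name -> type
        self.meta = meta or {}        # info nodes can attach along the way

    def annotate(self, key, value):
        """Attach meta information that travels with the record
        from one node to another."""
        self.meta[key] = value
        return self
```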
* Currently, the same data structure is passed to all channels. This is
  efficient, but when we split from one node to two, we have two pointers to
  the same data, so if one node changes the data, it also changes in the split
  branch. On the transition, we should be able to decide whether or not to
  copy() the data.
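The aliasing problem and the proposed transition-level switch can be shown in
a few lines; the `Transition` class here is a sketch under that assumption,
not the prototype's implementation.

```python
import copy

class Transition:
    """Hypothetical transition that decides whether the receiving
    branch gets its own copy of the data or a shared pointer."""

    def __init__(self, do_copy=False):
        self.do_copy = do_copy

    def pass_data(self, data):
        return copy.deepcopy(data) if self.do_copy else data
```

With `do_copy=False` a mutation in one branch is visible in the other; with
`do_copy=True` the branches are isolated, at the cost of the copy.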
* Change the run execution on jobs and nodes so that it processes one element
  at a time and not the complete flow of elements. The job execution can then
  decide to stop running, run one element at a time for tracing, or run until
  it is finished.
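Python generators are a natural fit for this: the node exposes its work as an
iterator and the job decides how far to pull it. The function names and the
`limit` parameter below are illustrative assumptions.

```python
def csv_like_source(rows):
    """Stand-in for a node's run(): yield one element at a time
    instead of returning the complete flow."""
    for row in rows:
        yield row

def run_job(node_iter, limit=None):
    """Job-level loop: the job, not the node, decides whether to
    stop, step for tracing, or drain the iterator to the end."""
    processed = []
    for i, element in enumerate(node_iter):
        processed.append(element)
        if limit is not None and i + 1 >= limit:
            break  # pause here; the generator keeps its position
    return processed
```

Because the generator keeps its position, a paused job can later resume from
exactly where it stopped.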
* If a node sends empty data to its output, the empty data does not go to the
  input of the related nodes. This gives us a way to manage loops and
  recursion.
* Create a new node type which is a sub-job or sub-process. It calls a new
  process.
* We implemented a push mechanism; we should also add a pull mechanism: a node
  can request information from another node, and then receive the requested
  result.
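The contrast between the two mechanisms can be sketched like this; the
`Source`/`Sink` classes and their method names are hypothetical, chosen only
to illustrate push versus pull.

```python
class Source:
    def __init__(self, records):
        self.records = records
        self.downstream = []

    # Push: the source drives execution and feeds the related nodes.
    def push_all(self):
        for rec in self.records:
            for node in self.downstream:
                node.receive(rec)

    # Pull: a downstream node requests records and gets the result back.
    def serve(self, predicate):
        return [r for r in self.records if predicate(r)]

class Sink:
    def __init__(self):
        self.seen = []

    def receive(self, rec):
        self.seen.append(rec)
```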
* What's the best solution to store the data? A list of dicts is easy but may
  take some space in memory. Is it better to use a list of lists? We should
  evaluate the difference in memory occupation. If it's less than 50%, we keep
  the list of dicts.
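One rough way to run that evaluation, using `sys.getsizeof` applied
recursively. The exact numbers vary by Python version and the helper ignores
shared objects, so this only illustrates how the comparison could be made; the
sample field names are made up.

```python
import sys

def deep_size(obj):
    """Approximate deep size in bytes (no shared-object accounting)."""
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_size(k) + deep_size(v) for k, v in obj.items())
    elif isinstance(obj, (list, tuple)):
        size += sum(deep_size(x) for x in obj)
    return size

# Same 1000 records stored both ways, with illustrative fields:
rows_as_dicts = [{'id': i, 'name': 'p%d' % i} for i in range(1000)]
rows_as_lists = [[i, 'p%d' % i] for i in range(1000)]
ratio = deep_size(rows_as_lists) / deep_size(rows_as_dicts)
```

The list-of-lists form drops the per-record dict overhead and the repeated
keys, which is exactly the saving the note asks to measure.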
1. Finish requirements
* List of components to develop

* Implement all objects/menus/views in Open ERP
* Improve the current prototype, integrating the above notes
3. Development of the Open ERP interface

* eTiny: generalisation of the workflow editor to create a new type of view
* etl addons: Integrate prototype logic on Open ERP objects

4. Develop real use cases

* SugarCRM -> Open ERP
First, the job process calls run on each starting node.

The start calls start on all related nodes through outgoing transitions.

Then, run calls input with the data it reads from its sources (CSV, SQL).

The input processes the data and calls output with the resulting data, on a
dedicated channel. A channel of None means all channels.

The output calls input on the related nodes, passing the data through the
outgoing transitions and their channels.

The stop ends the process.

When all related nodes through incoming transitions have stopped, the current
node stops and propagates the stop call.
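The push call chain described above can be condensed into a sketch. All class
and method shapes here are illustrative assumptions; the point is the
run -> input -> output -> related-node chain, including the rule that empty
data does not propagate.

```python
class Node:
    def __init__(self, transform=None):
        self.transform = transform or (lambda d: d)
        self.outgoing = []       # related nodes through outgoing transitions
        self.collected = []

    def run(self, source):
        # run calls input with the data it reads from the source
        for data in source:
            self.input(data)
        self.stop()

    def input(self, data):
        result = self.transform(data)
        # Empty (None) data does not reach the related nodes,
        # which is what lets loops and recursion terminate.
        if result is not None:
            self.output(result)

    def output(self, data):
        self.collected.append(data)
        for node in self.outgoing:
            node.input(data)

    def stop(self):
        # Propagate the stop call downstream.
        for node in self.outgoing:
            node.stop()
```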
List of tests in the prototype
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
data/partner.csv -> sort(name) -> output/partner.csv
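As a rough idea of what this flow does, here is the same pipeline written with
plain functions standing in for the prototype's connectors. The function
names, the in-memory streams, and the sample data are all made up for the
example; the real test reads and writes the files named above.

```python
import csv
import io

def csv_input(stream):
    """Read CSV rows as dicts (stand-in for the CSV input connector)."""
    return list(csv.DictReader(stream))

def sort_node(rows, field):
    """sort(name) step: order the records by one field."""
    return sorted(rows, key=lambda r: r[field])

def csv_output(rows, stream):
    """Write the records back as CSV (stand-in for the CSV output)."""
    writer = csv.DictWriter(stream, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)

# data/partner.csv -> sort(name) -> output/partner.csv, using
# in-memory streams instead of real files:
src = io.StringIO("name,city\nZeta,Paris\nAcme,Brussels\n")
dst = io.StringIO()
csv_output(sort_node(csv_input(src), 'name'), dst)
```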
-> log('PartnerLogger')
-> output/partner.csv
-> log('OutputLogger')

-> log('PartnerLogger')
-> output/partner.csv
-> log('OutputLogger')
* Perform a diff between partner.csv and partner2.csv and store:

  - intermediate/add.csv : added records
  - intermediate/remove.csv : removed records
  - intermediate/update.csv : updated records
  - csv.output(intermediate/add.csv)
  - csv.output(intermediate/remove.csv)
  - csv.output(intermediate/update.csv)
* Apply on partner3.csv to produce output/partner3.csv:

  - add records from intermediate/add.csv
  - del records from intermediate/remove.csv (not yet implemented)
  - update records from intermediate/update.csv (not yet implemented)
-> merge(intermediate/add.csv)
-> filter(intermediate/remove.csv)
-> update(intermediate/update.csv)