What class name should we use for the unit of work?


Dave Syer
Jun 28th, 2007, 09:41 AM
There is a central concept in Spring Batch (in all batch processing) of the unit of work. Usually an item is read from an input source and processed. The item might be a message, line from a file, record from a database table, etc. The combination of the input and output is common, and there are also jobs which combine the two into a single operation. Transaction boundaries are usually set at the level of several of these units of work (a "batch", or a "chunk"). The interface is pretty much fixed:


public boolean execute();


implementations return false to inform the framework that there is no more processing to be done (data are exhausted).
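
As a rough illustration of the contract described above (not actual Spring Batch code): the interface name Unit, the LineCountingUnit class and the runToCompletion driver are all made up for this sketch, since the real name is exactly what this thread is trying to decide. Only the execute() signature and the meaning of a false return come from the post.

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical name; the thread has not settled on one.
interface Unit {
    // Returns false once there is no more processing to be done.
    boolean execute();
}

// Illustrative implementation: consumes one line per call.
class LineCountingUnit implements Unit {
    private final Iterator<String> lines;
    private int processed = 0;

    LineCountingUnit(List<String> input) {
        this.lines = input.iterator();
    }

    @Override
    public boolean execute() {
        if (!lines.hasNext()) {
            return false; // data are exhausted
        }
        lines.next(); // read one item; real business logic would go here
        processed++;
        return true;
    }

    int getProcessed() {
        return processed;
    }
}

class ExecuteLoopDemo {
    // Stand-in for the framework: call execute() until it reports exhaustion.
    static int runToCompletion(Unit unit) {
        int calls = 0;
        while (unit.execute()) {
            calls++;
        }
        return calls;
    }

    public static void main(String[] args) {
        LineCountingUnit unit = new LineCountingUnit(List.of("a", "b", "c"));
        System.out.println(runToCompletion(unit)); // prints 3
    }
}
```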

Some pros and cons: "unit of work" has definite connotations of a transaction (which we are only a part of). "Module" is the name used at the moment, but it is a) bland, b) clashes with other uses, principally in OSGi. So we hate it, but it's on the list in case anyone actually disagrees.

Alarmnummer
Jul 10th, 2007, 03:29 AM
Module: not clear

UnitOfWork: hmmm.. difficult... record vs chunk.. what does it mean :)

WorkUnit: same as unit of work. And using UnitOfWork for a chunk and WorkUnit for a 'record' would be confusing.

LogicalUnitOfWork: not logical!! I always hate it when I see logical stuff :)

Action: not clear.

Executable: gives me the feeling I'm working with a concurrency library instead of a batch framework.

BusinessOperation: hmmm.. not clear

BusinessTask: not clear

My favorite is UnitOfWork although I completely agree with it being overloaded.. Maybe record, item, message, workatom, atomofwork..

Andreas Senft
Jul 10th, 2007, 04:03 AM
I also do not like Module, as it says everything and nothing. UnitOfWork sounds good, but is indeed associated with transactions somehow. Action and Executable are too generic in my opinion. I agree with Peter that LogicalXxx is not a good name.
BusinessTask and BusinessOperation are quite OK. So from the given list I would choose BusinessTask.

May I, though, propose another alternative? If we are talking about executing a batch in one transaction and fragmenting this, how about naming one fraction of such a batch a BatchPart?

Regards,
Andreas

pgras
Jul 10th, 2007, 06:58 AM
I would suggest Operation or OperationUnit...

In the given list I vote for BusinessOperation...

-Patrick

Dave Syer
Jul 10th, 2007, 11:51 AM
I like all the suggestions. Keep them coming.

We prefer not to use Batch* on balance because it is the name for something that is supposed to be free of the framework (OK so it's a framework interface, but you get my drift). The developer is going to say "OK, where do I implement my business logic?"

So I'm pleased that people are on the whole in favour of Business*. Let's hear a few more arguments...

Andreas Senft
Jul 10th, 2007, 03:16 PM
Hello Dave,

I think you have a point with Batch* names. I'm still pondering the fact that the <not named yet> is just a part of a larger whole. Maybe that should be reflected somehow. What do you think of "BusinessTaskPart" or "BusinessOperationFragment"? It's a bit more unwieldy, but it makes explicit the fact that there might be more of it before the task/operation is complete.

Still thinking...
Andreas

Dave Syer
Jul 11th, 2007, 03:34 AM
One of the constraints is actually that the name shouldn't be too long, or should be capable of being abbreviated. We find that we use the <name of thing> quite often in discussions about the framework, and about batch processes that are being developed. You have to be able to say something like "StepExecutor uses a RepeatTemplate to iterate over the XXXX executions", and not get your tongue tied in a knot. That is actually my only objection to Business*, but mostly the proposals we have there can be shortened to the * part and they make perfect sense.

Andreas Senft
Jul 11th, 2007, 03:46 AM
Thanks for refining the driving forces in the naming process, Dave. I just wonder why you didn't include "Task" in the poll options then.
Actually I would refrain from choosing a longer name when in conversation it is always expected to be abbreviated. It might then be better to name it in its short form in the first place.

So I'll throw in another two suggestions: "Task" and "TaskPart" (which I like, for the reasons I explained above).

BTW: I really appreciate your effort of finding fitting names. There are too many poorly named classes around.

blackstar
Jul 11th, 2007, 02:36 PM
Job? WorkItem?

Alarmnummer
Jul 12th, 2007, 04:15 AM
Something with Atom could be a possible solution: atom is the smallest amount of work that can't be divided (within a certain context).

BatchAtom.. ChunkAtom.. WorkAtom.

Dave Syer
Jul 12th, 2007, 10:41 AM
Hmm. The problem with that is that it often can be subdivided (e.g. into a read plus a write). Then again so can a real atom be subdivided (nucleus and electrons). I don't think I'll continue with that analogy...

sdmiski
Jul 18th, 2007, 10:10 AM
How about BatchTask ?

gmatthews
Jul 18th, 2007, 05:34 PM
Thingy or Job are front runners for me.

Dave Syer
Jul 19th, 2007, 02:20 AM
Does anyone have any strong opinions about Tasklet? It's easy to say, and doesn't clash with other frameworks (as far as I know).

gmatthews
Jul 19th, 2007, 03:38 AM
Hi Dave,

Tasklet sounds a bit lame.

"I'm just going to run a Tasklet to process 50 billion tax returns..." doesn't sound quite right. If you go with Tasklet, they'll need to have a built in constraint to only be able to process 2 or 3 fairly simple things :-)

I wondered why Job didn't appear in the survey list. Are you trying to avoid clashing with Quartz?

Job is fairly well understood, is quick to type, and doesn't over or undersell itself on the scope of what it might be doing.

Dave Syer
Jul 19th, 2007, 03:42 AM
We already have a Job - it's the opposite end of the spectrum - the most divisible object in the domain. A Job is a list of Steps, and each Step is an iteration over a <insert your favourite name here>.

Tasklet makes some sense to me, but we are still open to suggestion. I don't understand the comment about the "built in constraint". What would be the point of that?

gmatthews
Jul 19th, 2007, 03:49 AM
I don't understand the comment about the "built in constraint". What would be the point of that?

It was a (bad) joke.

Seriously though, can Step be recursive or nested indefinitely?

--- edit ---

or can you just nest Job objects indefinitely.

I was just having a conversation the other day with someone about how to go about introducing dependencies between Quartz Jobs.

It seems like a potential design flaw to name each level of nesting differently. I'm likely wrong but initial impressions are that everything is a Job.

Why do you want to have different names for each level?

Alarmnummer
Jul 19th, 2007, 04:17 AM
We already have a Job - it's the opposite end of the spectrum - the most divisible object in the domain. A Job is a list of Steps

What is the difference between a Job and a Batch?


and each Step is an iteration over a <insert your favourite name here>.

And Steps sounds to me like a Chunk. A bunch of 'records' that are going to be processed.


Tasklet makes some sense to me, but we are still open to suggestion.

I don't like the name Tasklet very much either. It sounds like something Sun would coin when they run out of imagination ;)

But what about BatchRecord? BatchUnit, UnitAtom, ChunkPart, ChunkUnit, OWN (Object Without a Name.. we use the same approach in Holland for music bands).

Andreas Senft
Jul 19th, 2007, 04:28 AM
Hmm. Unit seems not bad. Or just Task without "-let".

lucasward
Jul 19th, 2007, 05:50 PM
Just to clarify quickly: a Job has one to many Steps, each of which has a one-to-one relationship with a Module/Tasklet/Unit, etc. The main reason for the separation between a Step and a Module is to provide a clear touch-point between the framework and the developer. The Step controls transactions, stores status, etc., and delegates to the developer via the Module, giving them an opportunity to execute their logic within the constraints set forth by the Step. I think, overall, some form of 'Task' is the frontrunner, name-wise. For me, Task seems like one unit; I call a task, it runs and finishes. Using the analogy provided in a previous example: "Call a task to process 50 million financial records". However, that's not what's happening: you're calling a Step to process 50 million records, which delegates to the Task 50 million times. This is where adding the '-let' to Task helps differentiate that it's not the whole thing, but rather a part.
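
To make the Step/Tasklet split concrete, here is a minimal sketch, assuming a Step that commits after a fixed number of Tasklet executions. ChunkingStep, commitInterval and CountdownTasklet are invented names, and the "transaction" is only a counter; none of this is the framework's real implementation.

```java
// Name under discussion in this thread; the signature is the one from the first post.
interface Tasklet {
    boolean execute();
}

// Hypothetical tasklet that pretends to process a fixed number of records.
class CountdownTasklet implements Tasklet {
    private int remaining;

    CountdownTasklet(int items) {
        this.remaining = items;
    }

    @Override
    public boolean execute() {
        if (remaining <= 0) {
            return false; // nothing left to process
        }
        remaining--; // business logic for one record would go here
        return true;
    }
}

// Hypothetical Step: owns the (simulated) transaction boundaries and
// delegates the actual work to the Tasklet once per record.
class ChunkingStep {
    private final Tasklet tasklet;
    private final int commitInterval;
    private int executions = 0;
    private int commits = 0;

    ChunkingStep(Tasklet tasklet, int commitInterval) {
        if (commitInterval < 1) {
            throw new IllegalArgumentException("commitInterval must be >= 1");
        }
        this.tasklet = tasklet;
        this.commitInterval = commitInterval;
    }

    void run() {
        boolean more = true;
        while (more) {
            int inChunk = 0; // a real step would begin a transaction here
            while (more && inChunk < commitInterval) {
                more = tasklet.execute();
                if (more) {
                    inChunk++;
                    executions++;
                }
            }
            if (inChunk > 0) {
                commits++; // and commit it here
            }
        }
    }

    int getExecutions() { return executions; }
    int getCommits() { return commits; }
}

class ChunkingStepDemo {
    public static void main(String[] args) {
        ChunkingStep step = new ChunkingStep(new CountdownTasklet(7), 3);
        step.run();
        // prints "7 executions in 3 commits"
        System.out.println(step.getExecutions() + " executions in " + step.getCommits() + " commits");
    }
}
```

The developer only writes the Tasklet; everything about chunk size and commit timing stays in the Step.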

However, at the end of the day it's all just semantics. It's no different than the decision to use 'Servlet' as a name.

The discussion regarding 'Why split a job up into steps?' is an interesting one, and I can understand there being a little confusion at first. In most batch applications there always end up being 'pseudo work streams' that exist solely because 3 or 4 jobs must be executed in sequence before any other action can be taken. For example, Job A loads data, Job B validates, Job C performs an important set of calculations, and Job D outputs or creates reports, etc. This *could* be done in one job, but usually there are important reasons why they are separated out. Now the scheduler (and thus the operators) must know about and manage all four jobs, when realistically nothing further can happen until Job D finishes. They are effectively one job, with a lot of unnecessary complexity moved into the run tier (i.e. the scheduler). Instead, having one Job with 4 Steps makes much more sense. There has been talk about creating 'collapsible' or nested Jobs that would allow you to have only the Job if there wasn't any separation needed. However, when we looked at implementing it, there was a lot of added architecture and configuration complexity, just to keep from having a Job with one Step.
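
The load/validate/calculate/report example can be sketched as one Job that runs its Steps in order and stops at the first failure, so a scheduler would only need to know about a single job. This is an illustration only: NamedStep and SequentialJob are hypothetical names, and each step body is reduced to a success flag.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;

// Hypothetical step: a name plus a body that returns true on success.
class NamedStep {
    final String name;
    final BooleanSupplier body;

    NamedStep(String name, BooleanSupplier body) {
        this.name = name;
        this.body = body;
    }
}

// Hypothetical job: an ordered list of steps run in sequence.
class SequentialJob {
    private final List<NamedStep> steps;

    SequentialJob(List<NamedStep> steps) {
        this.steps = List.copyOf(steps);
    }

    // Runs the steps in order and returns the names of those that completed.
    // Later steps are skipped as soon as one step fails.
    List<String> run() {
        List<String> completed = new ArrayList<>();
        for (NamedStep step : steps) {
            if (!step.body.getAsBoolean()) {
                break;
            }
            completed.add(step.name);
        }
        return completed;
    }
}

class SequentialJobDemo {
    public static void main(String[] args) {
        SequentialJob job = new SequentialJob(List.of(
                new NamedStep("load", () -> true),
                new NamedStep("validate", () -> true),
                new NamedStep("calculate", () -> true),
                new NamedStep("report", () -> true)));
        System.out.println(job.run()); // prints [load, validate, calculate, report]
    }
}
```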

I hope this sheds some light on the terms we're using and the motivations behind them.

gmatthews
Jul 19th, 2007, 06:32 PM
I guess having definite hierarchy levels is ok if you assume that TaskLet, Chunk, Step, etc, are going to be delegating to Spring services, and that each time you want to create a new batch job, you're going to be creating new TaskLet, Step, Job, etc, implementations.

For example,

Batch Job A might have a Step that delegates to Spring Services X and Y, whereas Batch Job B might have a Step that delegates to Spring Service X only.

In this case, the different Step implementations in Batch Jobs A and B are little more than wrappers that allow correct handling of transactions.
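
A sketch of what such a thin wrapper might look like, assuming the services are plain Runnables and the transaction is simulated with a log. WrapperStep is a made-up name and is not a Spring Batch class.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical step that adds nothing but a (simulated) transaction
// boundary around calls to existing services.
class WrapperStep {
    private final List<Runnable> services;
    private final List<String> log = new ArrayList<>();

    WrapperStep(Runnable... services) {
        this.services = List.of(services);
    }

    void execute() {
        log.add("begin");
        try {
            for (Runnable service : services) {
                service.run();
            }
            log.add("commit");
        } catch (RuntimeException e) {
            log.add("rollback");
            throw e;
        }
    }

    List<String> getLog() {
        return log;
    }
}

class WrapperStepDemo {
    public static void main(String[] args) {
        List<String> calls = new ArrayList<>();
        Runnable serviceX = () -> calls.add("X");
        Runnable serviceY = () -> calls.add("Y");

        new WrapperStep(serviceX, serviceY).execute(); // Batch Job A: X then Y
        new WrapperStep(serviceX).execute();           // Batch Job B: X only

        System.out.println(calls); // prints [X, Y, X]
    }
}
```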

If the expectation was that each Job or Step implementation was going to contain any real logic, then I'd stick by my point that the Spring Batch team choosing an interface-based approach to define where steps begin was still a bad idea IMO, and that the "everything is a (nested) Job" and "use annotations or Spring configuration to control transaction/retry demarcation" approach is more flexible.

So, another question...how does the @Transactional annotation fit into Spring Batch?

Dave Syer
Jul 20th, 2007, 01:42 PM
The low-level infrastructure components that I blogged about (http://blog.interface21.com/main/2007/05/07/spring-batch/) provide the "everything is a (nested) Job" and "use annotations or Spring configuration to control transaction/retry demarcation" features that you mention. I expect them to be quite valuable in a range of optimisation situations. And @Transactional fits into that world in the same way it fits into any other Spring application.

Job/Step/Tasklet are higher level domain concepts, where the value the framework is adding is more to do with reporting, management (launching, stopping, etc.) and monitoring specific job executions. They do a little more than correctly handle transactions, but if that stuff isn't valuable to you then the infrastructure is available. Or we'd be happy to consider new use cases for the framework.