CSV file process - multi-thread; which class to use?


epleisman
Sep 5th, 2007, 10:29 AM
Here is what I need to do:

- Receive file ( large CSV file )
- Import records using one TX but multiple threads ( split file? )
- The import process uses existing Spring/Hibernate objects.
- Commit or rollback depending on success
- Do NOT need restart.

QUESTION:
Which class, method, or strategy is best for this using Spring Batch?
Please advise.

Thanks!

lucasward
Sep 5th, 2007, 12:01 PM
Before diving into the specific points, I'm curious why multiple threads are needed to load this file. What is the average file size? Is the long run time related to processing? If so, using a staging table would be a good approach. In general, I would try loading the file using spring-batch without multi-threading (but using a huge commit interval, assuming the data is fairly clean) and, if it isn't performing, start thinking about splitting the file, etc.

To move on to specific points, you could kick off multiple threads from the ItemProvider, and there has been some discussion about this, but we don't have any concrete examples to refer you to. A TaskExecutorRepeatTemplate could be used at the Chunk level, or a CompositeItemProvider could be used, but there would be issues synchronizing the file with the transaction, since Spring's TransactionSynchronizationManager stores the classes it notifies in a thread local.

epleisman
Sep 5th, 2007, 02:53 PM
Processing time, in answer to your above question.
If I split the files - why run serially?
Prior to your reply I was considering using a queue and letting several threads work on it.

Thanks for your thoughts on this.

lucasward
Sep 5th, 2007, 07:48 PM
Are the files arriving split, or do you have to split them? If they're already split, then I agree that it makes sense to try to process them in parallel. If you want to do so within one job, you could use a queue in between the provider and the processor to help, but there would still be issues in synchronizing the disparate file input sources with the transaction.

If it's processing time that takes a while, and not the I/O, I would still recommend loading the file directly into a staging table, then doing the processing you need once the data has been loaded into the database.

epleisman
Sep 6th, 2007, 08:44 AM
The file arrives UN-split.

If each process/thread can have its own transaction, then which approach do you see best? What is the advantage of the staging table?

I apologize for the questions. There are a myriad of classes in the API, so I need to understand the intended use of each.

thanks!

sotretus
Sep 6th, 2007, 10:51 AM
Can each thread in a chunkOperations have its own transaction (out-of-the-box)? I'm no expert, but from my (small) knowledge you may need to make sure the simpleStepExecutor's transaction manager is a "dummy" and add transaction support around each repeatIterator (this could be done with a RepeatInterceptor or with AOP around the chunkOperations Tasklet).

Just my 2 cents.
Regards
AB

lhotari
Sep 6th, 2007, 06:04 PM
lucasward wrote: "If so, using a staging table would be a good approach."

A staging table is a good idea if transactional integrity is required.
You might want to use database temporary tables for staging. Here's one article about this approach for DB2:
High performance inserts using JDBC Type 4 in a constrained environment: Leverage DB2 declared global temporary tables (http://www.ibm.com/developerworks/db2/library/techarticle/dm-0708calio/)

epleisman
Sep 7th, 2007, 10:07 AM
Ok, I am on board with that line of thinking.
I am thinking through the pros/cons of using a Queue as the repository (aka database).

lucasward
Sep 7th, 2007, 10:53 AM
epleisman wrote: "I am thinking through the pros/cons of using a Queue as the repository (aka database)."

Do you mean, using a Queue as your 'staging table'?

epleisman
Sep 7th, 2007, 12:30 PM
Yes - Queue as staging table.

lucasward
Sep 7th, 2007, 02:45 PM
There have been discussions about this type of approach, but I have yet to hear of an actual implementation. Unlike the approach of staging with a database table, which would require a single staging step first, then a second step to process using the staging table as input, you would need to write a special tasklet that would take the returned item from the provider, and put it in a queue, and each ItemProcessor would need to get the item to process from said queue. You would also need to make sure that there was some way to throttle the producer (ItemProvider) so that it doesn't accidentally add too much data and cause the queue to fail.

Again, I would only try this approach if you absolutely have to because of load issues. You could easily write an ItemProvider and ItemProcessor that could stay the same regardless of the solution, and try it without any additional threads.
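
The throttled producer/consumer idea described above can be sketched with plain java.util.concurrent. This is only an illustration under stated assumptions: it does not use any Spring Batch class, the names BoundedQueueSketch and process are made up for the example, and the upper-casing stands in for whatever the real ItemProcessor would do. The bounded queue is what throttles the producer, because put() blocks while the queue is full.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: plain JDK only, not a Spring Batch API.
class BoundedQueueSketch {

    // Sentinel object compared by identity; tells a worker that input is finished.
    private static final String END_OF_INPUT = new String("__END__");

    static List<String> process(List<String> lines, int workers, int capacity) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(capacity);
        ConcurrentLinkedQueue<String> results = new ConcurrentLinkedQueue<>();
        ExecutorService pool = Executors.newFixedThreadPool(workers);

        // Consumers: each worker takes records from the queue until it sees the sentinel.
        for (int i = 0; i < workers; i++) {
            pool.execute(() -> {
                try {
                    while (true) {
                        String line = queue.take();
                        if (line == END_OF_INPUT) {
                            return;
                        }
                        results.add(line.toUpperCase()); // placeholder for real processing
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        try {
            // Producer: put() blocks when the queue is full, which throttles reading.
            for (String line : lines) {
                queue.put(line);
            }
            // One sentinel per worker so that every worker terminates.
            for (int i = 0; i < workers; i++) {
                queue.put(END_OF_INPUT);
            }
            pool.shutdown();
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while processing", e);
        }
        return new ArrayList<>(results);
    }

    public static void main(String[] args) {
        List<String> out = process(Arrays.asList("a,1", "b,2", "c,3"), 2, 1);
        System.out.println("processed " + out.size() + " records");
    }
}
```

Note that this sketch does nothing about transactions. Each worker runs on its own thread, so the thread-local synchronization issue mentioned earlier in the thread still applies if everything must commit or roll back together.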

sotretus
Sep 8th, 2007, 11:29 AM
We implemented our batch using only files (XML files, one record per file) and it is working fine. We did this partly because the existing Business Processes already handled transactions (and it was not easy to handle them from the batch right now) and partly because uploading XML to the database was not straightforward. If the transaction is controlled in the service and not in the batch, it doesn't make any difference either way.

We created some classes to manipulate files (renaming, moving, and the like) and we manage the state of the file by adding or changing names. The (big) problem with this approach is that the process may commit and the batch VM may fail before renaming the file being processed, leaving an inconsistent state (though it is also easily recoverable).

We are planning on contributing these classes for manipulating files.
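
To make the rename-based state tracking concrete, here is a minimal sketch using java.nio.file. It is not the set of classes mentioned above, which are not shown in this thread. The class name FileStateSketch and the ".processing" and ".done" suffixes are assumptions chosen purely for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical sketch: tracks a record file's state through its file name.
class FileStateSketch {

    // Assumed suffixes; the thread does not say which names were actually used.
    static final String PROCESSING = ".processing";
    static final String DONE = ".done";

    // Marks a file as picked up, e.g. record.xml becomes record.xml.processing.
    static Path claim(Path file) {
        return rename(file, file.getFileName() + PROCESSING);
    }

    // Marks a claimed file as finished, e.g. record.xml.processing becomes record.xml.done.
    static Path complete(Path claimed) {
        String name = claimed.getFileName().toString();
        if (!name.endsWith(PROCESSING)) {
            throw new IllegalArgumentException("not a claimed file: " + name);
        }
        String base = name.substring(0, name.length() - PROCESSING.length());
        return rename(claimed, base + DONE);
    }

    // Renames within the same directory; ATOMIC_MOVE asks the file system for an atomic rename.
    private static Path rename(Path source, String newName) {
        try {
            return Files.move(source, source.resolveSibling(newName),
                    StandardCopyOption.ATOMIC_MOVE);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("batch-sketch");
        Path record = Files.createFile(dir.resolve("record.xml"));
        Path claimed = claim(record);
        // ... the business process would run and commit here ...
        Path done = complete(claimed);
        System.out.println(done.getFileName());
    }
}
```

The gap described above sits between the business commit and the call to complete(): if the VM dies there, the file still carries the ".processing" name even though its record was committed, so a restart has to decide how to treat such files.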

Just wanted to show another scenario where the staging table may not be as useful.

Lucas, do we have an example with multiple chunks? How would the stepOperations and ItemProvider be configured?

lucasward
Sep 13th, 2007, 12:37 PM
sotretus wrote: "because uploading XML to the database was not straightforward."

I'm assuming this means importing XML files with spring-batch. If so, keep an eye out for some upcoming changes to XMLInputSources that should be committed tomorrow. Hopefully, they make that scenario a little more straightforward and extensible.

sotretus wrote: "We created some classes to manipulate files (renaming, moving, and the like) and we manage the state of the file by adding or changing names. The (big) problem with this approach is that the process may commit and the batch VM may fail before renaming the file being processed and hence getting an inconsistent state (it is also easily recoverable). We are planning on contributing these classes for manipulating files."


I would still not recommend manipulating files within your batch processes unless you absolutely have to. Instead, an EAI solution that would rename/move, or upload when a file is completed, would be a better solution. I say this because, in my experience, file moving can cause a lot of issues that could needlessly hold up a batch stream, even though it generally has nothing to do with whether or not processing was actually successful.


sotretus wrote: "Lucas, do we have an example with multiple chunks? How would the stepOperations and ItemProvider be configured?"

I'm not sure I understand what you mean here. Do you mean an example of kicking off an ItemProvider in multiple threads? If so, we don't have an example yet. It's still strictly theoretical.

sotretus
Sep 13th, 2007, 04:06 PM
Lucas

Thanks for all the suggestions; we will make sure to take them into account.
What I meant by the "multiple chunks" stuff is whether we have an example where a chunkOperation RepeatTemplate is executed several times by the stepOperations RepeatTemplate. In more functional terms, this is when you want to commit several records together in chunks, but have multiple chunks of records to commit.

I.e.: I need to transform and update 1000 rows, but I want to process them 100 at a time.
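
The outer/inner loop being asked about can be sketched in plain Java as follows. This is only an illustration of the chunking idea, not the RepeatTemplate API. The names ChunkLoopSketch and processInChunks are made up, and the callback stands in for "transform, update, and commit one chunk in its own transaction".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch of chunked commits; not a Spring Batch class.
class ChunkLoopSketch {

    // Hands consecutive chunks of at most chunkSize items to commitChunk.
    // Returns the number of chunks processed.
    static <T> int processInChunks(List<T> items, int chunkSize,
                                   Consumer<List<T>> commitChunk) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        int chunks = 0;
        // Outer loop: one iteration per chunk (the "step" level).
        for (int start = 0; start < items.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, items.size());
            // Everything in this copy would be processed and committed together.
            commitChunk.accept(new ArrayList<>(items.subList(start, end)));
            chunks++;
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            rows.add(i);
        }
        int chunks = processInChunks(rows, 100,
                chunk -> System.out.println("commit " + chunk.size() + " rows"));
        System.out.println(chunks + " chunks"); // prints "10 chunks"
    }
}
```

With 1000 rows and a chunk size of 100, the callback runs ten times, once per group of 100 rows.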