1 million rows

[got a few posts in quick succession this week, so please don’t be put off, and keep commenting on the old ones] 

This has come up a few times here and there so I thought I would come clean.

I have always said this is a bad idea. I would much rather have seen better db integration (maybe like the structured sheets I was on about). I know it can help in quick and dirty jobs, but actually most of those would be quicker and less dirty in Access or your db of choice. If you are dealing with >64k/1 million records and are unable to use basic database features, that would represent a significant operational risk in my view.

Around the time the 1M row limit was announced, one of the Excel team asked what my number 1 feature for Excel 12 would be. His eyes nearly popped out when I said reduce rows to 10,000 and improve db integration. More rows was the number 1 requested feature (by an order of magnitude, I think). (This may explain why the ribbon is such a dumbed-down interface!)

To my mind the whole spreadsheet paradigm falls down at large row counts and you have to move to a more structured approach. If you have the same formula in every cell in a column, why not define it once, as per a calculated field in Access, or a query in any db?
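
To make that concrete: the calculation lives once in the query rather than in a million cells. A minimal ADO sketch of the idea – the database path, table and column names below are invented purely for illustration:

    Sub PullCalculatedColumn()
        ' The LineTotal calculation is defined once, in the SQL, not copied down a column.
        ' File path, table and column names are illustrative only.
        Dim cn As Object, rs As Object
        Set cn = CreateObject("ADODB.Connection")
        cn.Open "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=C:\data\sales.mdb"
        Set rs = cn.Execute("SELECT OrderID, Qty, Price, Qty * Price AS LineTotal FROM Orders")
        ActiveSheet.Range("A2").CopyFromRecordset rs
        rs.Close
        cn.Close
    End Sub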

That said, I haven’t heard of any million-row disaster stories yet, have you?

What do you think? 10k? 64k? 1M? A screen’s worth? A user-definable value?

Gnumeric has for years (maybe forever?) supported any maximum number of rows and columns, just by changing a couple of constants and recompiling.

As Excel gets more and more features for the lower-ability user, I think it gets harder and harder to develop on, and less appropriate as a development platform. What do you think?

cheers

Simon


34 Responses to “1 million rows”

  1. sam Says:

    Simon,

    I have nothing against a million rows… or for that matter any number of rows or columns… as long as you upgrade the other features to handle huge data… filters… calculation engine… Pivot Table calcs etc… But this never happens… the other features always play catch-up…

    Here’s an example…
    The limit of selecting a maximum of 8192 discontinuous areas has not been increased in XL 12.

    So if you have a large table in XL 12 – say 80k rows – and you filter on a criterion that produces more than 8192 discontinuous areas, you can neither copy from this table nor paste onto it…

    Ideally all such parameters should be settable…
    Don’t we set the default number of sheets in a new workbook to 3…
    So why can’t we set the number of rows and cols…
    Why can’t we set the number of levels of undo…
    Why can’t we set the cache memory of a pivot…
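
    (For what it’s worth, only the first of these is exposed through the object model today – a one-line VBA illustration; the others have no equivalent setting:)

        Application.SheetsInNewWorkbook = 3   ' this one is settable
        ' There is no equivalent property for the number of rows/cols, undo levels,
        ' or pivot cache memory – those limits are fixed.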

    Sam

  2. MikeC Says:

    I can’t help remembering a comment from Friday about setting VLOOKUP formulae against a million rows of data, then wondering why it takes forever to calc… ;-)

    I personally find 1 million rows inappropriate for a spreadsheet, because the majority of average users of spreadsheets are going to be trying to run formulae against them. 65,536 was too many for this, and I’ve personally only once in the last 3 years used more than 50,000 – which seems like a good limit to me. (Hey, it was a one-off!!!)

    Improved db integration (particularly with Access, as it’s part of the same suite) would definitely make more sense in my mind, as this would allow the use of a far larger number of records where that is what’s needed. Currently this is a little complicated and scary for a novice user, so I understand why they try and put it all in Excel (and thus want a million rows!)
    MS have in this case, I think, taken the easy option by appeasing the novice users with more rows, rather than making it more accessible for them to use the right tools for the job. Hmm, I could be talking about a few areas of Office 2007 there… ;-)

    I haven’t heard any 1-million-row horror stories as yet, but over the years I’ve heard enough about workbooks with dozens of sheets, each with 200-column, 65,000-row tables, with a formula in every field…!
    *cringe*

  3. Stephane Rodriguez Says:

    sam said “The limitation of selecting a maximum of 8192 discontinuous areas has not been increased in XL 12”

    Isn’t it doable to create a defined name to factor out a large chunk of those discontinuous areas? Then define the original range as a union of such defined names. I reckon it’s kind of doing manual nesting…

  4. Stephane Rodriguez Says:

    Regarding the one million row extension, one can easily see people importing large CSV files into Excel and doing some analysis there. But frankly I think Excel should warn them that perhaps their use of the software is inappropriate. (Excel could even replace a CSV import with a data source connection, but then it would be doing what Access does…)

    I can’t understand why there is still a limit, though. Why does Excel not accept as many rows and as many columns as you want? After all, the internal memory structure of Excel 2007 is pretty much revealed in the new BIFF12 way of storing (row,col) formula tokens. In BIFF12 they made the change to use variable-length encoding for storing record ids and record lengths; they also increased the row slot to 4 bytes instead of 2; but they could have used variable-length encoding for rows and columns as well. I don’t understand the limitation, technically speaking.

    65,536 certainly sounded a bit cheap; maybe that’s why they added a few zeros…

  5. Dennis Wallentin Says:

    Excel is a spreadsheet tool and not a flat database tool or an RDBMS; therefore the available number of rows in Excel 2007 will turn out to be a ‘disaster’… All in all, it will only ‘stimulate’ dysfunctional design and use of workbooks.

    It would have been of more interest if we had the option to set up individual worksheets where we could limit the number of rows etc.

    BTW, I take this opportunity to explicitly thank You, Stephane, for Your work on the BIFF 12 format.

    Kind regards,
    Dennis

  6. Dick Kusleika Says:

    A bad idea? Inappropriate? For whom? I don’t think MS should create artificial constraints in an attempt to force users to use the tools “properly”. They provide the tool, we use it. Using the proper tool is my responsibility, not the software vendor’s. I’ll bet there are millions of misused spreadsheets being used every day, and yet I don’t lose an ounce of sleep over it. Why would I care if people use Excel as a database or a word processor or a project scheduler? Why do you?

    Proper spreadsheeting technique is a problem of education, not technical constraints.

  7. Simon Says:

    Dick – I care because I’m a professional developer; if we do not actively participate in the education you mention, then who will?

    Vendors have a choice of where they invest their development resources; personally I wish they had done more for quality/compliance. Your priorities may be different, and that’s fine.

    I’m sure you are right on the millions of misuses daily, and sadly I think that has a negative impact on the image of spreadsheet developers.

  8. MikeC Says:

    Dick – I don’t lose sleep over it… apart from when I have to clean up spreadsheets where they’ve been misused!! I’m thankful I’m not likely to have to do that with 2007 for some time, as we won’t be getting it anytime soon. I have enough to do cleaning up sheets with 30,000 rows, thank you!

    Also, as Simon says, the more a product is mis-used, the more that mis-use is seen as “that’s the way it works”, and the less the product (and the people who work with that product) is trusted overall.

  9. Simon Says:

    No, I don’t lose sleep over it either (we have a young family, so hardly get any anyway!)
    My other concern, from looking at the compliance burden in the drug industry, is that SOX gets extended and spreadsheets end up almost outlawed in business/finance. A bit more self-regulation may protect our powerful, flexible friend.

    It seems many people are happy to make the logical jump from ‘all the s/s I’ve seen are bad’ to ‘all s/s are bad, therefore all s/s devs must be bad’, and that includes some of the less well informed regulators.

  10. Dennis Wallentin Says:

    I can live with my solutions being questioned due to not using ‘all the available rows’ and more, as well as with “…all s/s devs must be bad”.

    What I find difficult to accept is that business decisions are being made with ‘poor’ information.

    One major cause of the existence of ‘poor’ information is that workbooks are used as ‘databases’ with outdated and irrelevant data.

    Kind regards,
    Dennis

  11. Jon Peltier Says:

    Could “better DB integration” take this form?

    Add a new class of sheet to Excel, a datasheet, which is essentially a DB table. Allow no formulas and no embedded objects in these datasheets.

    Use formulas and queries in worksheets to access the datasheets, or allow one datasheet to be a subset (filtered/sorted/joined) of other datasheets.

    We can do this now anyway with DBs and CSVs and tables and other XLS files, but it would make for some more robust Excel database usage.

    I’ve never needed more than 10k rows that wouldn’t be covered by a datasheet approach as outlined above.

  12. Harlan Grove Says:

    Excel (and all other widely used spreadsheets) has always been misused as a system administration tool. I dealt with a newsgroup posting this past weekend about using Excel to find matches/nonmatches in two different lists of IP addresses. There are much better tools for this, even some better tools that come bundled with Windows. The problem is that most people know a little about Excel but nothing about FINDSTR or CMD.

    What Excel needs is a new type of sheet, maybe called a TableSheet, that provides unlimited rows (well, limited by system resources) and only the new Table structured references. It’d be VERY USEFUL if Microsoft would expand structured reference syntax to include row selection based on criteria matching in possibly multiple fields. Nicer still would be the ability for such TableSheets to be LINKS only (maybe read-only links to begin with) to external data sources like RDBMS tables or even CSV files.

    FWIW, TableSheets as I see them would be a bit different from Jon Peltier’s datasheets. TableSheets would allow additional columns, and those columns could contain arbitrary formulas, but the TableSheet wouldn’t support cell/range addresses. The only means of referring to anything within TableSheets (even in formulas in added columns in the TableSheet) would be through structured references, which is why I see the need for robust row/record selection syntax in structured references. Further, multiple ‘cell’ results from structured references would be arrays, not Range references, and not groups of arrays (unless Excel provides new functions to handle groups of arrays).
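
    (To make the idea a little more concrete: the first formula below is real Excel 2007 structured-reference syntax, assuming a Table named Sales with an Amount column; the second is purely hypothetical row-selection syntax of the kind described above – it does not exist in any version of Excel:)

        =SUM(Sales[Amount])
        =SUM(Sales[Amount, Region="West", Year=2006])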

  13. Harlan Grove Says:

    Re Stephane Rodriguez’s response to sam:

    An autofilter applied to a table spanning all rows could produce, in theory, 32768 discontinuous areas (all the even numbered rows). Try copying the filtered rows to the clipboard. Now expand the number of rows to a million. See the problem now?

    Regardless of how this may be taken, some Excel/VBA developers seem to lack broad practical experience working in Excel itself WITHOUT VBA. To me, those people constitute the most dangerous class of Excel/VB* developers. Right after knowledge of the business segment for which a particular spreadsheet model applies comes a practical understanding of what the spreadsheet itself WITHOUT VB* can or can’t do, and VB* should only be used for those things the spreadsheet can’t do well. And in the case of filtered ranges, even VB* won’t work all the time without changing how the task is performed. Using the example above, if there are more than 8192 records matching a given filter, the robust approach would be sorting rather than filtering, but that likely wouldn’t occur to VB* developers who lack practical experience from periodic use of worksheet filters.

  14. Stephane Rodriguez Says:

    “Re Stephane Rodriguez’s response to sam:

    An autofilter applied to a table spanning all rows could produce, in theory, 32768 discontinuous areas (all the even numbered rows). Try copying the filtered rows to the clipboard. Now expand the number of rows to a million. See the problem now?”

    I wonder how this answers the question being asked.

    Give me a break.

  15. Harlan Grove Says:

    Generic comments re database-like functionality.

    Anyone else reading this blog ever use Lotus 123 Release 5 or later? 123R5 could run @VLOOKUP, @DSUM etc. against tables stored as Approach databases or other ODBC data sources. There was some setup necessary to make such tables appear as pseudo-range names, but that was it. The 10+ year old 123R5 is still way ahead of Excel in terms of database interface functionality, sad to say.

  16. Harlan Grove Says:

    Stephane, sam made the point that multiple disjoint range selection falls further behind with the new grid size. You chose to respond either too sarcastically for me to notice or out of fairly thorough ignorance of practical use of autofilters. Which was it?

  17. Harlan Grove Says:

    Stephane Rodriguez wrote…
    > . . . but they could have used variable-length encoding for rows and
    >columns as well. I don’t understand the limitation, technically speaking. . . .

    The Gnumeric documentation explores this in some detail. Variable-length encoding is also needed in memory, and processing variable-length fields takes more processing time than processing fixed-length fields. Since cell addressing is a very frequent task in spreadsheets, the extra processing time for variable-length cell addressing would be noticeable.

  18. Guru Says:

    Simon :)

    unsigned 16-bit: 2^16 = 65,536;

    unsigned 32-bit: 2^32 = 4,294,967,296;

    From a computer’s perspective, 2^32 would be the most efficient.

    From a human (had to ask one) perspective, how long is a piece of string?

    From my perspective, 1M is artificial; give them enough rope (2^32) to hang themselves.

  19. Simon Says:

    Plenty of us are bumping up against the 8192-area limit in 2003. I hit it with Select Special quite often, and filtering any number of cols over 20k or so rows, as Sam says, can catch you out. Sam’s point, I assume, is that this is likely to catch more people out now, and as far as I can see we all pretty much agree on that, right?
    Harlan – my 123 experience started and ended in DOS, so it’s interesting to hear it has/had such potentially useful features. Do you happen to have a link to the Gnumeric stuff? That would be an interesting read; I know Jody, he’s a smart guy. (And I think Stephane was just suggesting a workaround rather than having a go?)
    Stephane – thanks for the heads-up on the file formats; I’m sure you’re right on the file-to-memory structure point. And I think you are right about being able to nest named ranges to get around the 8192 limit. I seem to remember from DG’s blog that they sort of ‘missed’ it rather than deliberately limiting it?
    Jon – I hadn’t thought of the ‘linked’ datasheets idea; that makes loads of sense. I could live with 1 formula per column, but if you had joins like you suggest we probably wouldn’t need them. No objects – agreed.

    I have a few clients now that are operating on 100k-plus rows; it’s such a relief to get it into Access and do it the easy way rather than twist and contort Excel to cope.
    cheers
    Simon

  20. Simon Says:

    Guru – shouldn’t you be at work?
    ;-)

  21. Simon Says:

    To be brutal, from a business perspective, the bigger and the uglier the better. That increases the chances of it going titsup, which increases the chance they’ll call one of us in to fix it.
    The longer it takes to find and fix the problems, the more we can charge.
    Of course the other side is that the (potential) client may never spot a problem if it’s buried in one of several million populated cells.
    Maybe I should pull my finger out and revamp xlanalyst for 2007.

  22. Marcus Says:

    One of my personal pet hates is when someone writes a VLOOKUP that references an entire column like this: =VLOOKUP(B5,G:H,2,FALSE).

    The performance is bad enough with 65k rows – I dread to think the impact of 1M rows. People are going to have to take up a hobby while they wait for their spreadsheets to recalculate.
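
    (For illustration, the same lookup with the range kept tight – the 60,000-row extent and the LookupData name are made up:)

        =VLOOKUP(B5,G:H,2,FALSE)              whole-column: the entire column is scanned on every recalc
        =VLOOKUP(B5,$G$2:$H$60001,2,FALSE)    bounded to the rows that actually hold data
        =VLOOKUP(B5,LookupData,2,FALSE)       a named range (or a 2007 Table) that grows with the data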

    While spreadsheets up to 100 MB are not uncommon, I’ve encountered one 425 MB monster at a current client. These primarily contain data, VLOOKUPs and PivotTables.

    “Why does Excel not accept as many rows and as many columns as you want?”
    Because, people being people, we push our luck far too often and use things in ways they were never intended or designed for. Many things have intentional limitations in their design to prevent misuse or abuse.

    “Why would I care… Why do you?”
    Humans have a nasty habit of blaming others when things go wrong. People have attempted to sue McDonald’s for their own obesity. I’ve seen spreadsheets used to manage deals worth hundreds of millions in banks. If the abuse of a software tool results in the bankruptcy of the bank my savings are in, you’ll see my care factor rise dramatically.

    If SOX, Basel and APRA (the Australian banking regulator) recognise the inherent dangers of spreadsheet misuse, then as a professional spreadsheet developer I think I need to be concerned too. Deferring or abdicating responsibility simply isn’t sufficient.

    “One major cause of the existence of ‘poor’ information is that workbooks are used as ‘databases’…”
    The other cause is that spreadsheets inherently lack enough control and security to protect the quality of the data from inadvertent or malicious manipulation. The separation between data, business logic and presentation just isn’t there.

    TableSheets
    Sorry Harlan; while it’s a reasonable proposition I still have major issues with it. One of the other issues with data integrity is redundancy. As soon as that same data is needed in another report, people are going to copy the entire dataset to a new workbook. How will changes to any of the many versions of the data be managed? Which one is right? Or worse, we could start linking and end up with a labyrinth of linked spreadsheets.

    Could TableSheets perhaps be in a separate file that Excel could natively reference? Anyone who needed that data would only need to reference the same file. This would do more to ensure there was only one version of the truth and avoid (you knew it was coming) versionitis.

    Ironically, I’ve been doing a lot of work recently reverse engineering spreadsheet models in order to migrate them to a ‘production’ environment.

    P.S. Great thread. It’s good to see such diverse bantering.

    Cheers – Marcus

  23. Simon Says:

    Marcus – I have some spreadsheet monster migrations coming up too.
    BTW ask on Excel-l about an Aus/Asia-based Excel conf; I’m sure that’s where I saw the interest last time.
    Oh, and you win the prize so far – 425 MB, niiice! The biggest I have worked with was 170, and that was hard enough.
    cheers
    Simon

  24. Ken Puls Says:

    Hi guys,

    I’ve hit my head on the 8192 discontinuous areas limit in Excel as well, but I’ve found it’s a fairly easy thing to work around. I just add an index column in the last column of the sheet that goes from 1 to the number of data rows, and also another column with formulas that return a true/false if the condition I’d filter on is met. From there, sort on the last column, and you only have one contiguous region to deal with. At the end of whatever process, you re-sort by the index column, delete the last two columns, and your data is back in the original order as well. It’s a bit of extra work, but it has become such a standard way to deal with it that I couldn’t really care whether MS fixes that issue or not.
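
    (A rough VBA rendering of the above, for anyone who wants to automate it – it assumes contiguous data starting at A1 with a header row, and the condition in the helper column is just an arbitrary example:)

        Sub CopyFilteredRowsViaSort()
            ' Sketch of the sort-instead-of-filter workaround described above.
            ' Assumes contiguous data starting at A1 with a header row; the
            ' condition in the Match column (col A > 100) is purely illustrative.
            Dim ws As Worksheet, lastRow As Long, lastCol As Long
            Dim idxRng As Range, matchRng As Range
            Set ws = ActiveSheet
            lastRow = ws.Cells(ws.Rows.Count, 1).End(xlUp).Row
            lastCol = ws.Cells(1, ws.Columns.Count).End(xlToLeft).Column

            ' 1. Index column to remember the original row order
            ws.Cells(1, lastCol + 1).Value = "Index"
            Set idxRng = ws.Range(ws.Cells(2, lastCol + 1), ws.Cells(lastRow, lastCol + 1))
            idxRng.Formula = "=ROW()"
            idxRng.Value = idxRng.Value   ' freeze as constants so sorting keeps them

            ' 2. TRUE/FALSE column holding the condition you would otherwise filter on
            ws.Cells(1, lastCol + 2).Value = "Match"
            Set matchRng = ws.Range(ws.Cells(2, lastCol + 2), ws.Cells(lastRow, lastCol + 2))
            matchRng.Formula = "=A2>100"

            ' 3. Sort descending on Match so the TRUE rows form one contiguous block
            ws.Range(ws.Cells(1, 1), ws.Cells(lastRow, lastCol + 2)).Sort _
                Key1:=ws.Cells(1, lastCol + 2), Order1:=xlDescending, Header:=xlYes

            ' ...copy or otherwise work with the single contiguous block of TRUE rows here...

            ' 4. Re-sort on Index to restore the original order, then drop the helpers
            ws.Range(ws.Cells(1, 1), ws.Cells(lastRow, lastCol + 2)).Sort _
                Key1:=ws.Cells(1, lastCol + 1), Order1:=xlAscending, Header:=xlYes
            ws.Range(ws.Cells(1, lastCol + 1), ws.Cells(1, lastCol + 2)).EntireColumn.Delete
        End Sub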

    Re the database functionality, what I think would be nice would be to add a table/query sheet to an Excel document that simply linked to an external DB or Query, and displayed its results. (Much like looking at table or query results in their raw form in Access.) Honestly, for most situations that I work with, it could be read-only, but must be linked live to the DB. Give it as many columns/rows as it needs to display the data, and we’re good to go.

    Ideally, we’d also see optimized functions to use against those sheets, as well.
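
    (The nearest thing available today is probably a QueryTable – a minimal sketch, where the DSN and the SQL are placeholders, not recommendations:)

        Sub AddLinkedQuery()
            ' Rough sketch of a sheet that just displays the results of an external query.
            ' The DSN and SQL below are placeholders.
            Dim qt As QueryTable
            Set qt = ActiveSheet.QueryTables.Add( _
                Connection:="ODBC;DSN=MyDatabase", _
                Destination:=ActiveSheet.Range("A1"))
            qt.CommandText = "SELECT * FROM Orders"
            qt.Refresh BackgroundQuery:=False
            ' The results still land in the grid, so the sheet's row limit still applies –
            ' which is exactly the gap a true linked table/query sheet would close.
        End Sub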

    On the original question, I could fairly easily work within a limit of 10,000 rows myself, although it would be nice to have the option to increase beyond that if needed. But honestly, I can’t ever see using more than the original 65,536 available. If I did, I’d be spending some time PDQ to get it into the appropriate app.

  25. MikeC Says:

    Ken: nice workaround for the discontinuous areas. Hope you don’t mind, but I’m going to steal and mercilessly abuse that concept, as it’ll save me headaches now and again!
    Marcus: 425MB???? Is this the ss used to create the world in 7 days??? =;-)

  26. Marcus Says:

    Hi MikeC:
    I ran a quick search on the network here. There are plenty of s/s in the 30 to 100 meg range, a few up to 170, and then I found ‘Gigantor’, which I’ve yet to be able to open (insufficient memory).

    Many users are employing Excel almost strictly as a database, with huge numbers of VLOOKUPs slowing the beasts down.

  27. Charles Says:

    I did suggest to MS that they expand the functionality of Named Tables to automate lookups efficiently (implicit lookup syntax, and Tables that know when they are sorted, etc), but there was comment about the danger of duplicating Access functionality… I think that missed the point.

    My view is that if you are going to have >65K rows then you should improve calculation efficiency. There are some steps in this direction in 2007, but, for instance, they really should improve the recalc efficiency of VLOOKUP on unsorted data; it’s easy enough to do.
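
    (One user-side workaround in the meantime, for what it’s worth: if the lookup data can be kept sorted, the approximate-match form of VLOOKUP is a binary search, so it is dramatically faster on big tables, and wrapping it in an exact-match check guards against wrong matches. The ranges below are purely illustrative:)

        =IF(VLOOKUP(B5,$G$2:$G$100001,1)=B5,VLOOKUP(B5,$G$2:$H$100001,2),"not found")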

  28. Ken Puls Says:

    MikeC, no worries. Copy away.

    I actually have an article on my site that uses this method to delete duplicates from a column, although the application can be used for much more than that, obviously. The article, if you’re interested, can be found here: http://www.excelguru.ca/node/24

    Cheers,

    Ken

  29. MikeC Says:

    Ken – cheers for that. I actually use a very similar method for deleting duplicates (a frequent task) already, though without inserting an index column to preserve the original order. Normally speaking, the order doesn’t matter in the tasks I do this in, but should I have need for it in the future, I will be sure to credit you as the source of my not having to stop and think of a way to do it! =;-)

    Marcus – imagine if they’d had a million rows to play with….!

  30. Harlan Grove Says:

    OK, maybe TableSheets as I originally imagined them are not a good idea. It looks like others would prefer views into databases appearing in Excel as additional sheets, but with no data stored in the file.

    That’d be OK with me AS LONG AS Excel provided a built-in function similar to the old add-in SQL.REQUEST that would allow me to extract portions of such tables (and any worksheet range with a header row containing distinct labels) with less fuss (e.g., providing a range reference rather than a connection string) OR automatically made it possible to refer to such datasheets as Tables, thus allowing structured referencing into them. I’d still like to see row selection syntax (i.e., field criteria that act as implicit filters) added to structured referencing.

    Seems like this would be heading toward something like the old Excel 4 workbooks which could contain worksheets that would be saved as part of the workbook AND links to worksheets saved as separate files.

  31. Ragnar Says:

    For manipulating columns to prepare bad data (e.g. from Lotus Notes or old DOS systems) that’s in a poor state for import into databases – hell, I need the million rows.
    Recently I worked with an Excel 2007 sheet with nearly 280,000 rows.
    I had to mark the ending of each line before reading it into Oracle, and then I converted CR and LF in long text columns into other characters (I did that with a stream editor) and changed the column delimiter.
    It was quick and easy to do in Excel 2007 with a roughly 300MB file. It would have been hell to do by any other means. In Excel it took a few minutes.

  32. Jerry Says:

    Most users who use Excel for handling a large number of rows have no idea how to use a database application. Most of them wouldn’t know how to create a query in MS Access, let alone in an SQL implementation. This is why I think it is important for a spreadsheet to be capable of handling a large number of rows. If they knew how to handle such data in an SQL database, they would. But that is precisely the problem: they don’t, so offering them something they can work with is important.

  33. sarfaraz Says:

    I use MS Excel 2003 and tried to import 1M rows of data, but Excel stops at row 65,536. That means MS Excel 2003 is not capable of importing large numbers of rows of data, but the same data does import in MS Excel 2007.

  34. Beverley Says:

    I have a spreadsheet with 86,000 rows. Do you know if there is a maximum number of rows that you can filter on in Excel 2007?

    I want to replace the blank cells in a certain column with text. When I try to filter on the blanks, I get an error message: “Excel cannot create or use the data range reference because it is too complex. Try one of the following: Use data from one contiguous rectangle or use data from the same sheet.” If I cut the same spreadsheet down to about 50,000 rows, filtering works fine, but it doesn’t work at 86,000.
