Reading & Writing Data#


Storing data and the results of your calculations is common practice in scientific programming: it disentangles data creation from data analysis. There are various ways to do so. Here, we discuss specifically the ASCII (TXT) and HDF5 formats.

ASCII Data#


Loading ASCII Data#

We will use NumPy’s genfromtxt() function for this:

import numpy as np

myData = np.genfromtxt(fname='./data/01_xy.dat', 
                       delimiter=' ')

print( myData )

Or without defining the delimiter:

myData = np.genfromtxt('./data/01_xy.dat')

print( myData )
[[0.298495 0.535602]
 [0.26463  0.450345]
 [0.328381 0.364419]
 ...
 [0.851956 0.623053]
 [0.296291 0.341577]
 [0.271582 0.307172]]

Unpack the columns directly into two variables:

myCol1, myCol2 = np.genfromtxt('./data/01_xy.dat',
                               unpack=True)

print( myCol1 )
print( myCol2 )
[0.298495 0.26463  0.328381 ... 0.851956 0.296291 0.271582]
[0.535602 0.450345 0.364419 ... 0.623053 0.341577 0.307172]

Or via a simple transpose():

myCol1, myCol2 = np.genfromtxt('./data/01_xy.dat').transpose()

print( myCol1 )
print( myCol2 )
[0.298495 0.26463  0.328381 ... 0.851956 0.296291 0.271582]
[0.535602 0.450345 0.364419 ... 0.623053 0.341577 0.307172]

Sometimes we want to skip a few lines in the beginning, as they might be header lines, etc.:

myCol1, myCol2 = np.genfromtxt('./data/01_xy.dat',
                               unpack=True,
                               skip_header=2)

print( myCol1 )
print( myCol2 )
[0.328381 0.189954 0.422187 ... 0.851956 0.296291 0.271582]
[0.364419 0.438772 0.442186 ... 0.623053 0.341577 0.307172]
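genfromtxt() can also skip comment lines automatically: by default, everything after a # is ignored (comments='#'). A minimal sketch, using a small made-up inline dataset instead of a file:

```python
import io

import numpy as np

# genfromtxt() also accepts file-like objects; lines starting with '#'
# are treated as comments and skipped by default (comments='#')
text = io.StringIO("# x y\n0.1 0.2\n0.3 0.4\n")
myCol1, myCol2 = np.genfromtxt(text, unpack=True)

print(myCol1)  # [0.1 0.3]
print(myCol2)  # [0.2 0.4]
```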

Saving ASCII Data#

Generate some 1D test data:

testData1D = np.arange(0,20)

print( testData1D )
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]

Save the 1D test data using NumPy’s savetxt() function:

np.savetxt('./data/test/01_test_data_1D.dat',
           testData1D)

… or in a gzip-compressed way (savetxt() compresses automatically when the filename ends in .gz):

np.savetxt('./data/test/01_test_data_1D.gz',
           testData1D)

… which can also be read in again:

testData1DZipped = np.genfromtxt('./data/test/01_test_data_1D.gz')

print( testData1DZipped )
[ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9. 10. 11. 12. 13. 14. 15. 16. 17.
 18. 19.]

Simple data sanity (or “data integrity”) check:

print("Difference between original and saved data: ", 
      np.sum( testData1D - testData1DZipped ) )
Difference between original and saved data:  0.0
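Note that summing the difference is a rather weak check, since positive and negative deviations could cancel. np.allclose() compares element-wise within a tolerance and avoids this; a minimal sketch:

```python
import numpy as np

testData1D = np.arange(0, 20)

# allclose() compares element-wise within a tolerance, so opposite-signed
# deviations cannot cancel each other out as they could in a plain sum
print(np.allclose(testData1D, testData1D.astype(float)))  # True
```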

This also works with 2D data:

testData2D = np.arange(0,20).reshape(10,2)

print( testData2D )
[[ 0  1]
 [ 2  3]
 [ 4  5]
 [ 6  7]
 [ 8  9]
 [10 11]
 [12 13]
 [14 15]
 [16 17]
 [18 19]]
np.savetxt('./data/test/01_test_data_2D.dat',
           testData2D)
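savetxt() also lets us control the output format: fmt sets the number format and header writes a commented first line. A sketch, writing into an in-memory buffer just to keep it self-contained (the column names are made up):

```python
import io

import numpy as np

testData2D = np.arange(0, 20).reshape(10, 2)

# fmt controls the number format, header writes a (commented) first line
buf = io.StringIO()
np.savetxt(buf, testData2D, fmt='%6.3f', header='col1 col2')

print(buf.getvalue().splitlines()[0])  # # col1 col2
print(buf.getvalue().splitlines()[1])  #  0.000  1.000
```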

Self-Defined ASCII:#

For more information on string formatting, see the Python documentation on formatted string literals (f-strings).

# open a file for writing (replace existing)
fid = open('./data/test/01_test_data_2D.mydat', mode='w')

for col1, col2 in testData2D:

    fid.write( f" {col1:d} | {col2:d} \n" )

fid.close()

Or as floats:

# open a file for writing (replace existing)
fid = open('./data/test/01_test_data_2D.mydat', mode='w')

for col1, col2 in testData2D:

    fid.write(f" {col1:6.3f} | {col2:6.3f} \n")

fid.close()
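To read such a self-defined format back in, we can split each line at the separator ourselves. A sketch that writes to a temporary file first, just to stay self-contained:

```python
import os
import tempfile

import numpy as np

testData2D = np.arange(0, 20).reshape(10, 2)

# write the ' | '-separated format to a temporary file
path = os.path.join(tempfile.mkdtemp(), '01_test_data_2D.mydat')
with open(path, mode='w') as fid:
    for col1, col2 in testData2D:
        fid.write(f" {col1:6.3f} | {col2:6.3f} \n")

# read it back: split each line at '|' and convert both halves to float
myCol1, myCol2 = [], []
with open(path) as fid:
    for line in fid:
        left, right = line.split('|')
        myCol1.append(float(left))
        myCol2.append(float(right))

print(myCol1[:3])  # [0.0, 2.0, 4.0]
```

Alternatively, np.genfromtxt(path, delimiter='|', unpack=True) parses the same file directly.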

Hierarchical Data Format (HDF)#


The following is taken from the h5py documentation; start there for further details:

  • HDF5 lets you store huge amounts of numerical data, and easily manipulate that data from NumPy. For example, you can slice into multi-terabyte datasets stored on disk, as if they were real NumPy arrays.

  • Thousands of datasets can be stored in a single file, categorized and tagged however you want.

  • An HDF5 file is a container for two kinds of objects: datasets, which are array-like collections of data, and groups, which are folder-like containers that hold datasets and other groups.

Note

Groups work like dictionaries, and datasets work like NumPy arrays.

We start with loading the h5py module and with creating some 2D test data:

import h5py

testData2D = np.arange(0,20).reshape(10,2)
print(testData2D)
[[ 0  1]
 [ 2  3]
 [ 4  5]
 [ 6  7]
 [ 8  9]
 [10 11]
 [12 13]
 [14 15]
 [16 17]
 [18 19]]

We can write hdf5 archives with the with ... as statement and create_dataset(...):

with h5py.File('./data/test/01_test.hdf5', 'w') as myFile:
    
    myFile.create_dataset("myData", data=testData2D)
    

And we can load the data in a very similar way:

with h5py.File('./data/test/01_test.hdf5', 'r') as myFile:
    
    testData2DHDF5 = myFile["myData"]
    
    print("Difference between original and saved data: ", 
          np.sum( testData2D - testData2DHDF5 ) )
Difference between original and saved data:  0
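As promised above, datasets support NumPy-style slicing, so we can read just the rows we need without loading the whole dataset. A sketch using a temporary file:

```python
import os
import tempfile

import h5py
import numpy as np

testData2D = np.arange(0, 20).reshape(10, 2)
path = os.path.join(tempfile.mkdtemp(), 'slice_test.hdf5')

with h5py.File(path, 'w') as myFile:
    myFile.create_dataset("myData", data=testData2D)

# slicing a dataset reads only the requested rows from disk
with h5py.File(path, 'r') as myFile:
    firstRows = myFile["myData"][0:3]

print(firstRows)
```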

Note

We can store different kinds of data in our hdf5 archives!

testData2DA = np.arange(0,20).reshape(10,2)
testData2DB = np.arange(0,60).reshape(10,2,3)
testData2DC = np.arange(0,10)

with h5py.File('./data/test/02_test.hdf5', 'w') as myFile:
    
    myFile.create_dataset("myDataA", data=testData2DA)
    
    myFile.create_dataset("myDataB", data=testData2DB)
    
    myFile.create_dataset("myDataC", data=testData2DC)
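Datasets (and groups) can also carry small pieces of metadata, so-called attributes, via .attrs. A sketch using a temporary file; the 'units' key is just an example name:

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), 'attrs_test.hdf5')

# attach a metadata attribute to the dataset at creation time
with h5py.File(path, 'w') as myFile:
    dset = myFile.create_dataset("myData", data=np.arange(0, 10))
    dset.attrs['units'] = 'seconds'

# attributes are read back through the same .attrs mapping
with h5py.File(path, 'r') as myFile:
    units = myFile['myData'].attrs['units']

print(units)  # seconds
```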

For loading we can also use the handle returned by h5py.File(...).

Warning

In this case we should never forget to close the hdf5 archive again.

myFile = h5py.File('./data/test/02_test.hdf5', 'r')

print( myFile )

myFile.close()
<HDF5 file "02_test.hdf5" (mode r)>

How do we know what’s saved in our hdf5 archive?

Method A: using IPython’s ! shell escape to execute commands in the terminal:

!h5dump -H ./data/test/02_test.hdf5
HDF5 "./data/test/02_test.hdf5" {
GROUP "/" {
   DATASET "myDataA" {
      DATATYPE  H5T_STD_I64LE
      DATASPACE  SIMPLE { ( 10, 2 ) / ( 10, 2 ) }
   }
   DATASET "myDataB" {
      DATATYPE  H5T_STD_I64LE
      DATASPACE  SIMPLE { ( 10, 2, 3 ) / ( 10, 2, 3 ) }
   }
   DATASET "myDataC" {
      DATATYPE  H5T_STD_I64LE
      DATASPACE  SIMPLE { ( 10 ) / ( 10 ) }
   }
}
}

Method B: Directly within Python:

myFile = h5py.File('./data/test/02_test.hdf5', 'r')

print( myFile.keys() )

myFile.close()
<KeysViewHDF5 ['myDataA', 'myDataB', 'myDataC']>

As soon as we know the keys, we can access the data directly:

myFile = h5py.File('./data/test/02_test.hdf5', 'r')

print( myFile['myDataA'] )
print( myFile.get('myDataA') )

myFile.close()
<HDF5 dataset "myDataA": shape (10, 2), type "<i8">
<HDF5 dataset "myDataA": shape (10, 2), type "<i8">
myFile = h5py.File('./data/test/02_test.hdf5', 'r')

print( myFile['myDataA'] )
print( myFile['myDataB'] )
print( myFile['myDataC'] )

myFile.close()
<HDF5 dataset "myDataA": shape (10, 2), type "<i8">
<HDF5 dataset "myDataB": shape (10, 2, 3), type "<i8">
<HDF5 dataset "myDataC": shape (10,), type "<i8">

Ok, but how do we really get the data?

myFile = h5py.File('./data/test/02_test.hdf5', 'r')

print( np.array(myFile['myDataA']) )
print( np.array(myFile['myDataB']) )
print( np.array(myFile['myDataC']) )

myFile.close()
[[ 0  1]
 [ 2  3]
 [ 4  5]
 [ 6  7]
 [ 8  9]
 [10 11]
 [12 13]
 [14 15]
 [16 17]
 [18 19]]
[[[ 0  1  2]
  [ 3  4  5]]

 [[ 6  7  8]
  [ 9 10 11]]

 [[12 13 14]
  [15 16 17]]

 [[18 19 20]
  [21 22 23]]

 [[24 25 26]
  [27 28 29]]

 [[30 31 32]
  [33 34 35]]

 [[36 37 38]
  [39 40 41]]

 [[42 43 44]
  [45 46 47]]

 [[48 49 50]
  [51 52 53]]

 [[54 55 56]
  [57 58 59]]]
[0 1 2 3 4 5 6 7 8 9]

Let’s have a look at another example using create_group(), i.e. the “sub-directories”:

# generate some test data
testData2DA = np.arange(0,20).reshape(10,2)
testData2DB = np.arange(0,60).reshape(10,2,3)
testData2DC = np.arange(0,10)

with h5py.File('./data/test/03_test.hdf5', 'w') as myFile:
    
    g1 = myFile.create_group('group1')
    
    g1.create_dataset("myDataA", data=testData2DA)
    g1.create_dataset("myDataB", data=testData2DB)
    
    g2 = myFile.create_group('group2')
    
    g2.create_dataset("myDataC", data=testData2DC)
!h5dump -H ./data/test/03_test.hdf5
HDF5 "./data/test/03_test.hdf5" {
GROUP "/" {
   GROUP "group1" {
      DATASET "myDataA" {
         DATATYPE  H5T_STD_I64LE
         DATASPACE  SIMPLE { ( 10, 2 ) / ( 10, 2 ) }
      }
      DATASET "myDataB" {
         DATATYPE  H5T_STD_I64LE
         DATASPACE  SIMPLE { ( 10, 2, 3 ) / ( 10, 2, 3 ) }
      }
   }
   GROUP "group2" {
      DATASET "myDataC" {
         DATATYPE  H5T_STD_I64LE
         DATASPACE  SIMPLE { ( 10 ) / ( 10 ) }
      }
   }
}
}
myFile = h5py.File('./data/test/03_test.hdf5', 'r')

print( myFile.keys() )

print( myFile['group1/myDataA'] )
print( myFile['group1/myDataB'] )
print( myFile['group2/myDataC'] )

myFile.close()  # don't forget to close!
<KeysViewHDF5 ['group1', 'group2']>
<HDF5 dataset "myDataA": shape (10, 2), type "<i8">
<HDF5 dataset "myDataB": shape (10, 2, 3), type "<i8">
<HDF5 dataset "myDataC": shape (10,), type "<i8">
with h5py.File('./data/test/03_test.hdf5', 'r') as myFile:

    # get keys
    print( 'keys:', myFile.keys() )
    print( 'group 1 keys:', myFile['group1'].keys() )
    print( 'group 2 keys:', myFile['group2'].keys() )

    # access data
    print( myFile['group1/myDataA'] )
    print( myFile['group1/myDataB'] )
    print( myFile['group2']['myDataC'] ) # Note the difference
    
    myDataA = np.array(myFile['group1']['myDataA'])
    myDataB = np.array(myFile['group1']['myDataB'])
    myDataC = np.array(myFile['group2']['myDataC'])
keys: <KeysViewHDF5 ['group1', 'group2']>
group 1 keys: <KeysViewHDF5 ['myDataA', 'myDataB']>
group 2 keys: <KeysViewHDF5 ['myDataC']>
<HDF5 dataset "myDataA": shape (10, 2), type "<i8">
<HDF5 dataset "myDataB": shape (10, 2, 3), type "<i8">
<HDF5 dataset "myDataC": shape (10,), type "<i8">
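When the hierarchy is nested, visit() walks it recursively and calls a given function once per object name, groups and datasets alike. A sketch using a temporary file:

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), 'visit_test.hdf5')

with h5py.File(path, 'w') as myFile:
    g1 = myFile.create_group('group1')
    g1.create_dataset("myDataA", data=np.arange(0, 20).reshape(10, 2))
    myFile.create_group('group2')

# visit() traverses the whole hierarchy in alphanumeric order
with h5py.File(path, 'r') as myFile:
    names = []
    myFile.visit(names.append)

print(names)  # ['group1', 'group1/myDataA', 'group2']
```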

Warning

An HDF5 dataset is only a lazy reference into the file, so your data might not be available anymore after closing the file!

with h5py.File('./data/test/03_test.hdf5', 'r') as myFile:
    
    myDataA = myFile['group1']['myDataA']
    
print(myDataA)
<Closed HDF5 dataset>

One way to solve this is to convert your hdf5 “pointer” to an actual NumPy array:

with h5py.File('./data/test/03_test.hdf5', 'r') as myFile:
    
    myDataA = np.array( myFile['group1']['myDataA'] )
    
print(myDataA)
[[ 0  1]
 [ 2  3]
 [ 4  5]
 [ 6  7]
 [ 8  9]
 [10 11]
 [12 13]
 [14 15]
 [16 17]
 [18 19]]

Other Formats#

Numpy Arrays:#

see also the NumPy documentation on np.save() and np.load():

# generate some 2D test data
testData2D = np.arange(0,20).reshape(10,2)

# save it
np.save( './data/test/01_test_data_2D.npy', testData2D )

# and load it again
testData2DNumpy = np.load( './data/test/01_test_data_2D.npy' )

# compare it
print( "Difference between original and saved data: ", np.sum( testData2D - testData2DNumpy ) )
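Several arrays can also go into a single .npz archive with np.savez(), accessed by keyword when loading. A sketch using a temporary file:

```python
import os
import tempfile

import numpy as np

testData2DA = np.arange(0, 20).reshape(10, 2)
testData2DC = np.arange(0, 10)

path = os.path.join(tempfile.mkdtemp(), 'test_data.npz')

# savez() bundles several named arrays into one .npz archive
np.savez(path, myDataA=testData2DA, myDataC=testData2DC)

# np.load() returns a dict-like object keyed by the names above
loaded = np.load(path)
print(loaded['myDataA'].shape)  # (10, 2)
print(loaded['myDataC'].shape)  # (10,)
```

np.savez_compressed() works the same way but compresses the archive.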

Pickle:#

# load pickle module
import pickle

# generate some 2D test data
testData2D = np.arange(0,20).reshape(10,2)

# save it
with open('./data/test/01_test_data_2D.pickle', 'wb') as fid:
    pickle.dump( testData2D, fid )

# and load it again
with open('./data/test/01_test_data_2D.pickle', 'rb') as fid:
    testData2DPickle = pickle.load( fid )

# compare it
print( "Difference between original and saved data: ", np.sum( testData2D - testData2DPickle ) )