TechAE Blogs - Explore now for new leading-edge technologies

TechAE Blogs - a global platform designed to promote the latest technologies like artificial intelligence, big data analytics, and blockchain.

Data Generation

Generate Bulk Test Data, up to 100 TB, Using the TPC-DS Kit for Big Data Analysis

Step 1: Don't switch to the root user; run the steps below as a regular user
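
Step 1 can be checked before continuing. The snippet below is a minimal sketch, not part of the kit: the helper name is_root_uid is made up for illustration, and it relies only on the standard id command.

```shell
# Hypothetical helper: succeeds (exit status 0) when the given numeric
# user ID is 0, which is the ID of the root user.
is_root_uid() {
    [ "$1" -eq 0 ]
}

# Report the situation without aborting, so the check is safe to paste anywhere.
if is_root_uid "$(id -u)"; then
    echo "Running as root: switch to a regular user before continuing." >&2
else
    echo "Running as a regular user: OK to continue."
fi
```

The commands that need elevated rights in the later steps already use sudo, so a regular user account is enough.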

Step 2: sudo apt-get update

Run this command to refresh the package index, so that the next step installs current package versions.

Step 3: sudo apt-get install gcc make flex bison byacc git

Next, install the build tools the kit needs: gcc, make, flex, bison, byacc, and git.
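
To confirm that the installation worked, you can check that each tool is on your PATH. This is a rough sketch: the helper name missing_tools is made up for illustration, and it assumes every package provides a command with the same name as the package.

```shell
# Hypothetical helper: print the name of every command from the
# argument list that cannot be found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Check the six tools installed in Step 3.
missing=$(missing_tools gcc make flex bison byacc git)
if [ -n "$missing" ]; then
    echo "Not installed yet:" $missing
else
    echo "All build tools found."
fi
```

If anything is reported as missing, rerun the install command from Step 3 and read its output for errors.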

Step 4: git clone https://github.com/gregrahn/tpcds-kit.git

Clone the tpcds-kit repository from GitHub.

Step 5: cd tpcds-kit/tools

Move into the tpcds-kit/tools directory, where the kit is built.

Step 6: make OS=LINUX

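This compiles the kit's tools, including the data generator dsdgen, for your operating system. Nothing is downloaded in this step; the OS value only selects the build settings for your platform.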
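
Before moving on, it can help to confirm that the build produced the executables. The sketch below assumes you are still in tpcds-kit/tools and that the build creates dsdgen and dsqgen there; the helper name built_ok is made up for illustration.

```shell
# Hypothetical helper: succeeds only if every named file exists in the
# given directory and is executable.
built_ok() {
    dir=$1
    shift
    for name in "$@"; do
        [ -x "$dir/$name" ] || return 1
    done
}

# Check the current directory for the data and query generators.
if built_ok . dsdgen dsqgen; then
    echo "Build looks complete."
else
    echo "dsdgen or dsqgen is missing; check the make output for errors."
fi
```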

Step 7: ./dsdgen -scale 5 -force

Finally, this command generates about 5 GB of test data, stored as .dat files, one for each of the 24 tables in the TPC-DS schema.
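
Once the generator finishes, you can count the files it wrote. This is a small sketch that assumes the data landed in the directory where you ran dsdgen; the helper name count_dat_files is made up for illustration.

```shell
# Hypothetical helper: count the .dat files directly inside a directory
# (subdirectories are not searched).
count_dat_files() {
    find "$1" -maxdepth 1 -type f -name '*.dat' | wc -l | tr -d ' '
}

# Report how many data files the current directory holds.
echo "Data files here: $(count_dat_files .)"
```

The count may be slightly higher than the number of tables if the generator also writes a small file describing itself.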

You can generate up to 100 TB of test data just by changing the scale value in the command above. The scale is given in GB, so 100 TB corresponds to -scale 100000. The table below shows row counts per scale factor.
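
At larger scales, a single dsdgen run can take a long time, so the work is often split into chunks. The sketch below only prints the commands and runs nothing. It assumes the -parallel and -child options listed in the dsdgen help output (run ./dsdgen -help to confirm them in your build); the helper name dsdgen_chunk_commands is made up for illustration.

```shell
# Hypothetical helper: print one dsdgen command per chunk; nothing is executed.
# Arguments: scale factor in GB, number of chunks.
dsdgen_chunk_commands() {
    scale=$1
    chunks=$2
    i=1
    while [ "$i" -le "$chunks" ]; do
        echo "./dsdgen -scale $scale -parallel $chunks -child $i -force"
        i=$((i + 1))
    done
}

# Example: a 100 GB data set split into 4 chunks.
dsdgen_chunk_commands 100 4
```

Each printed command can then be started in its own terminal or as a background job, so the chunks are generated at the same time.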


Conclusion

Generating initial test data takes only these 7 steps. You can also get datasets from Kaggle or other websites. For the official documentation, you can refer to this document.
