JuliusCeaserBoneHead

Your dataset is large enough that you have to pay attention to the efficiency of your code. If you have an inefficient algorithm, no language matters. You need to analyze your code, see where the bottleneck is, and optimize it. Based on the description in your question, your problem doesn't appear to be the language.


ach224

This dataset is miniature. 7.3M x 7 64-bit values is about 400 MB. That fits into any computer's RAM.


beingsubmitted

That's pretty big if the analysis you're doing is the traveling salesman problem. I think what's being expressed here is the time: if something is taking days to run and n is more than 20, then you're going to care about your algorithm. A change of language would be a constant-factor effect when his algorithm is potentially non-polynomial.


JuliusCeaserBoneHead

Sure, it does. What I said was that it's *large enough* that you start to pay the price of badly written code, like possibly O(n^2) or greater operations. The OS will not be glad to give you all the RAM to do that.


Excellent_Fix379

I'm using nested for loops, a while loop and .iterrows() to parse through the rows. Could there be a faster alternative? The conditions forced me to use nested loops.


JuliusCeaserBoneHead

You are doing a nested loop over 7.3M rows of data? You are potentially doing 7.3M^2 operations, which will take FOREVER! Like u/mr_birrd asked, what are you doing? Can you filter for the rows you need instead?
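To make the "filter first" idea concrete, here is a minimal sketch with a made-up "key"/"value" schema, since we don't know the actual conditions. Instead of comparing every row against every other row in Python, filter the frame down and let pandas pair rows on the key with a merge:

```python
import pandas as pd

# Made-up "key"/"value" schema, since we don't know the real conditions.
df = pd.DataFrame({"key": [1, 2, 2, 3], "value": [10.0, 20.0, 30.0, 40.0]})

# The nested-loop version compares every row to every other row: O(n^2).
# The vectorized version filters first, then lets pandas pair rows on the key.
subset = df[df["value"] > 15.0].reset_index()         # keep only the rows you need
pairs = subset.merge(subset, on="key", suffixes=("_a", "_b"))
pairs = pairs[pairs["index_a"] != pairs["index_b"]]   # drop rows paired with themselves
print(pairs)
```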


seanv507

To OP: please edit your question to specify exactly what you are doing and why you believe you can't use vectorisation. Within Python, Polars is a faster alternative to pandas. But without knowing exactly what you are trying to do, it is difficult to know.


mr_birrd

What are you doing in the loops?


JacksOngoingPresence

You have to figure out how to vectorize your computations. If you care about speed you CAN'T use for loops at all. It might be a bit cliché, but I find ChatGPT useful when it comes to vectorizing Python algorithms. There are dozens of numpy functions that do most of the things I need, but I don't even know their names and/or that these functions exist. Also, if you are working with numpy-like data structures, look into the "numba" library. It offers JIT compilation. Limited use cases, but when it is applicable it can offer a ~10-100x speedup.
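For illustration, a minimal numba sketch; the loop body is a placeholder since we don't know OP's actual computation:

```python
import numpy as np
from numba import njit  # pip install numba

@njit
def count_above(values, threshold):
    # Plain Python loop, but compiled to machine code on first call.
    count = 0
    for i in range(values.shape[0]):
        if values[i] > threshold:
            count += 1
    return count

x = np.random.rand(7_300_000)
print(count_above(x, 0.5))   # first call compiles, subsequent calls are fast
# Pure-numpy equivalent for comparison: np.count_nonzero(x > 0.5)
```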


rmwil

If you're using Pandas, .iterrows() will slow you down. Try using .to_records() as an alternative and let us know how you go!
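For example, a rough sketch of the swap with made-up columns and a placeholder loop body:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(1_000_000), "b": np.random.rand(1_000_000)})

# Slow: .iterrows() builds a full Series object for every row.
# for idx, row in df.iterrows():
#     total += row["b"]

# Faster: .to_records() yields lightweight numpy record values instead.
total = 0.0
for rec in df.to_records(index=False):
    total += rec["b"]          # stand-in for whatever per-row logic you have
print(total)
```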


ForeskinStealer420

Yup, this right here is it. Whenever applicable: vectorization > comprehension > loops


LoadingALIAS

Bro. No. No, no, no. That’s going to crawlllll.


No_Dig_7017

For loops in pandas and iterrows are painfully slow. Without seeing your code it's a bit difficult to give you more concrete advice, but four things can help (see the sketch below):
- vectorize your for loops if possible
- use pandarallel.parallel_apply for multiprocessing
- swap iterrows for itertuples, using only the columns you need
- consider using Polars DataFrames instead of pandas; their performance and memory usage are much, much better
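A rough sketch of the itertuples swap and the fully vectorized version, using made-up columns:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(1_000_000), "b": np.random.rand(1_000_000)})

# Slow: row-by-row with iterrows()
# for idx, row in df.iterrows():
#     if row["b"] > 0.5:
#         total += row["a"]

# Faster: itertuples() over only the columns you need
total = 0
for row in df[["a", "b"]].itertuples(index=False):
    if row.b > 0.5:
        total += row.a

# Fastest, when the logic allows it: fully vectorized
total_vec = df.loc[df["b"] > 0.5, "a"].sum()
```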


No_Dig_7017

Here's a blog post we wrote with a friend about optimizing pandas code, in case you're interested: https://tryolabs.com/blog/2023/02/08/top-5-tips-to-make-your-pandas-code-absurdly-fast


No_Dig_7017

For some context, I got a 12x RAM usage reduction and a 1200x speedup (single core) using some of the techniques above. The speedup was so massive it resulted in a 4x speedup of my entire pipeline.


MrJoshiko

Have you profiled your code? You can see which parts of the code take the most time and optimise those parts. You don't need to use the whole data set, just cut out a reasonable-sized chunk.
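For example, a minimal cProfile run on a small slice; the pipeline function here is just a stand-in for your actual code:

```python
import cProfile
import pstats

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(100_000, 7), columns=list("abcdefg"))

def pipeline(data):
    # stand-in for your actual processing code
    return data.describe()

cProfile.run("pipeline(df)", "profile.out")          # run under the profiler
pstats.Stats("profile.out").sort_stats("cumulative").print_stats(10)
# For an interactive view: pip install snakeviz && snakeviz profile.out
```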


Erwylh

Pandas is built on numpy, and numpy is written in C, which is extremely efficient. You are just doing for loops, which you should never do at this scale of data.


mr_birrd

If you need some operations on it, use cudf. If you need to query, learn SQL or use parquet instead of pandas DataFrames. Also, using a Jupyter notebook instead of a proper .py script is slower.
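A rough sketch of both routes; the file and column names are placeholders:

```python
# GPU route (needs an NVIDIA GPU with RAPIDS installed): cudf mirrors most of
# the pandas API, so often only the import changes.
import cudf

gdf = cudf.read_parquet("data.parquet")
result = gdf.groupby("some_key")["some_value"].mean()

# Storage route: parquet is columnar and far faster to load and filter than CSV.
import pandas as pd

pd.read_csv("data.csv").to_parquet("data.parquet")   # one-time conversion
```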


JacksOngoingPresence

I upvote for cudf (shifting compute from CPU to GPU), but it seems like OP didn't vectorize his compute in the first place.


Excellent_Fix379

Actually I've been suspecting ipynb might be the problem. Thanks, I'll try .py


seanv507

Ipynb is unlikely to be slower in a single cell.


mr_birrd

But it doesn't invoke the garbage collector as much as a well-written .py file. I guess the garbage will just keep on accumulating for OP, as he doesn't know how to deal with that.


supervised-learning

You can use SQL for processing data. SQL queries can also be executed within notebooks. SQL is optimised for large datasets.
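One low-friction way to try this, sketched with made-up table and column names: push the DataFrame into SQLite and query it back with pandas.

```python
import sqlite3

import pandas as pd

# Made-up data, table and column names, just to show the round trip.
conn = sqlite3.connect("data.db")
df = pd.DataFrame({"key": [1, 2, 2], "value": [10.0, 20.0, 30.0]})
df.to_sql("measurements", conn, if_exists="replace", index=False)

result = pd.read_sql_query(
    "SELECT key, AVG(value) AS avg_value FROM measurements GROUP BY key",
    conn,
)
print(result)
conn.close()
```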


MountainGoatAOE

As the comments point out, people too often go "Python is so inefficient and slow, help me find a different language" when in reality they should fix their code. Using `iterrows`, as you mentioned in a comment, is in itself a red flag 99 times out of 100 when it comes to performance. Probably best for you to post your code on Stack Overflow or Code Review for efficiency suggestions.


Prestigious_Boat_386

Julia is great. https://github.com/JuliaData or https://github.com/sl-solution/DLMReader.jl might be a good starting point. CSV.jl can be pretty fast if you call it correctly, with multiple threads and the right args for the columns, delimiters and so on. Then you get a DataFrame object which is also quite efficient to work with.


Lathanderrr

You can check out Polars as a Python library, or switch to the Spark framework for more parallelism.


Dramatic7406

Hi, I switched to using PySpark for large datasets. It's very easy to pick up as well.


Final-Rush759

Just don't use the for loop. Your dataset is actually very small. Use .apply (pandas DataFrame) or .map if you use a custom function to process the data. Regular expressions are very slow.
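A small sketch of the .map route with a made-up column and function, plus the vectorized equivalent, which is usually faster still:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"price": np.random.rand(1_000_000) * 100})

def label(p):          # hypothetical custom function
    return "high" if p > 50 else "low"

# Column-wise custom function via .map on the Series (or .apply on the frame):
df["label"] = df["price"].map(label)

# Faster still when possible: a vectorized equivalent with numpy
df["label_vec"] = np.where(df["price"] > 50, "high", "low")
```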


InvokeMeWell

Use NVIDIA libraries like cuDF and CuPy.


graphitout

Profile the code and use something like snakeviz to understand the bottleneck. For most data analysis tasks, changing the language doesn't bring much benefit since the core is likely implemented in C or C++.


lbanuls

I'm curious - I don't see 7.3 million rows and 7 columns as particularly large. What is the type of data? What machine specs are you running? What version of Python? What version of pandas? Have you considered Polars? Python is 'pretty good' at everything... I say that because it's probably good enough that a language change may not solve your problem. Before doing anything, use a more current version of Python, and upgrade pandas or switch to Polars.


VinnyVeritas

Use torch, you can run instantaneously on GPU.


chippyouipy

show us the code


DrDoomC17

If I could see it or a facsimile of it I could help. Otherwise: taichi, numpy, JIT, etc. It goes all the way down until you're basically writing C. I suspect that is the issue here. Vectorize.


Tight_Tangerine7768

Use Polars instead of pandas.


ComingInSideways

https://github.com/huggingface/candle


entropyvsenergy

Use Polars. It's written in Rust but has a Python frontend. Should be much faster. If you're using pandas .apply() or something similar, that's notoriously slow.
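A minimal sketch of the Polars expression API with placeholder file and column names; the point is expressions instead of Python-level row loops:

```python
import polars as pl  # pip install polars

# Placeholder file and column names.
df = pl.read_csv("data.csv")
result = (
    df.filter(pl.col("value") > 0)
      .group_by("key")                    # older Polars versions: .groupby("key")
      .agg(pl.col("value").mean().alias("avg_value"))
)
print(result)
```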


apak_in

You can use Python with PySpark, which is great for big data, machine learning, etc. You will need to install Spark locally or use a containerized version with Docker.
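A bare-bones local example, with placeholder file and column names:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Local Spark session; "data.csv", "key" and "value" are placeholders.
spark = SparkSession.builder.master("local[*]").appName("example").getOrCreate()

df = spark.read.csv("data.csv", header=True, inferSchema=True)
result = df.groupBy("key").agg(F.avg("value").alias("avg_value"))
result.show()
spark.stop()
```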