Langdes: A New Approach for Improving the Performance of Prompt-based Image Editing in Interior Design Setting

Authors

  • Victor-Eugen ZARZU, Faculty of Mathematics and Computer Science, Babeș-Bolyai University, Cluj-Napoca, Romania. Email: victor.zarzu@stud.ubbcluj.ro

DOI:

https://doi.org/10.24193/subbi.2024.2.02

Keywords:

Diffusion models, Prompt-based image editing, Deep learning, Attention, Data generation

Abstract

Instruction-based image editing has received considerable attention in recent years, owing to its potential in applications such as removing unwanted details from existing images or otherwise improving them. However, one of the main obstacles in this area is acquiring a dataset for model training. Several methods and variations have been proposed, but all of them rely on already existing data. We propose a method that addresses this problem by creating a context-specific dataset for interior design, with no previously available information, by leveraging the knowledge of large language models (LLMs). Furthermore, we evaluate the generated dataset on InstructPix2Pix and show that, after fine-tuning, the model produces better results in the interior-design setting. Moreover, we propose an alternative solution for improving the localization of the edit region through cross-attention map regularization based on a text-based segmentation mask.

Received by editors: 11 October 2024

2010 Mathematics Subject Classification. 68T05, 68T45.

1998 CR Categories and Descriptors. I.2.6 [Learning]: Subtopic– Connectionism and neural nets; I.2.10 [Vision and Scene Understanding]: Subtopic– 3D/stereo scene analysis.

Published

2025-02-04

How to Cite

ZARZU, V.-E. (2025). Langdes: A New Approach for Improving the Performance of Prompt-based Image Editing in Interior Design Setting. Studia Universitatis Babeș-Bolyai Informatica, 69(2), 23–38. https://doi.org/10.24193/subbi.2024.2.02

Section

Articles
